Bhala AI Team · 10 min read

Silver labels are noisy by design. Bhala's audit catches the worst of them — top-10 precision: 100%.

research · data-quality · cross-lingual · product · how-to

Why this matters

Most production NLP runs on silver-labeled data — automatically tagged at scale by translation pipelines, distant supervision, weak heuristics. Silver is how teams get to enough training examples to make a model work; it's also noisy by design. Auto-translation drift, label-rule edge cases, brittle keyword heuristics — these accumulate into double-digit error rates that are invisible until something downstream breaks.

The contrast is gold-labeled data — human-annotated, multi-annotator, adjudicated. AfriSenti is gold. It's carefully built and considerably cleaner. But gold doesn't scale: a 100K-tweet gold corpus is a year of annotator work; a 100M-sentence silver corpus is a weekend of pipeline runs. Every team has both. And every team's silver corpora are quietly contaminating their training and evaluation.

Two questions follow:

  1. Can a model find the silver mislabels without being trained on the labels it's auditing?
  2. Does the answer transfer across languages without sentiment supervision in each?

Yes to both. With 100 hand-curated Zulu archetypes and Bhala's compositional embedding space — built for operator algebra over meaning, not for sentiment specifically — we hit 100% top-10 precision identifying the real silver errors on a held-out human-verified validation set. The same Zulu-trained sentiment direction transfers cross-lingually to surface clear errors and annotation-policy boundary cases in AfriSenti Swahili gold.

To stress-test the audit, we ran it on a worst-case substrate: a 10K-sentence Bhala-built bulk Zulu sentiment corpus auto-labeled at scale. Random review of 175 sentences from that corpus showed the silver labeling disagreed with human re-review on ~55% of them — near the upper end of published noise rates for poorly supervised auto-labeling in low-resource languages. The audit found the real errors in that noise.

The setup, in one paragraph

Bhala's multilingual encoder produces sentence embeddings in a space where meaning operations are well-defined vectors. Think of word2vec's famous "king − man + woman ≈ queen" — meaningful arithmetic on word vectors. Bhala does that one level up, on whole-sentence meaning: shifting a sentence embedding along a learned axis changes the sentence's meaning along that axis predictably. The encoder was never trained on sentiment — we just need to find the sentiment direction in this space, which is what the seed sets do for us. We hand-curated 50 unmistakably positive and 50 unmistakably negative Zulu sentences, embedded them through the Bhala API, and computed a single direction:

sentiment_axis = unit_norm( mean(embed(positive_seeds))
                          - mean(embed(negative_seeds)) )

That's the entire learned object. One unit vector. No classifier head, no fine-tuning, no labeled training data outside those 100 seeds.

For any new sentence:

score = embed(sentence) · sentiment_axis

inconsistency = -score    if declared_label == "positive"
                +score    if declared_label == "negative"
                |score|   if declared_label == "neutral"

Sort by inconsistency descending. The top of the list is the candidate-mislabel queue.
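In code, the whole object is a few lines of NumPy. A minimal sketch, assuming an embed_batch() helper that wraps your Bhala API client and returns one embedding per sentence (the helper name is illustrative, not an actual SDK call):

```python
import numpy as np

def build_axis(embed_batch, positive_seeds, negative_seeds):
    """Sentiment axis = unit-normed difference of the seed centroids."""
    pos = np.mean(embed_batch(positive_seeds), axis=0)
    neg = np.mean(embed_batch(negative_seeds), axis=0)
    axis = pos - neg
    return axis / np.linalg.norm(axis)

def inconsistency(score, declared_label):
    """Higher = the geometry disagrees more with the declared label."""
    if declared_label == "positive":
        return -score        # positive label, negative-leaning embedding
    if declared_label == "negative":
        return score         # negative label, positive-leaning embedding
    return abs(score)        # neutral label, strongly polarized embedding

def rank_suspects(embed_batch, sentences, labels, axis):
    """Sort sentences by inconsistency, most suspect first."""
    scores = np.asarray(embed_batch(sentences)) @ axis
    return sorted(
        zip(sentences, labels, scores),
        key=lambda item: inconsistency(item[2], item[1]),
        reverse=True,
    )
```

Everything downstream of embed_batch() is plain vector arithmetic; nothing here is trained.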

Three experiments

Exp A — 40 hand-curated archetypes (baseline)

20 positive + 20 negative archetypal Zulu seeds. Validation set: 175 randomly sampled silver-labeled sentences from the bulk corpus, hand-corrected by a human reviewer. Top-10 precision: 90% — of the 10 sentences Bhala flagged loudest, 9 were ones the reviewer corrected. AUROC across all 175: 0.571 — the signal lives at the extremes, not uniformly.

Exp B — 175 hand-labels as seed pool, 5-fold CV

We tried scaling the seed pool by using the 175 hand-corrected sentences themselves as seeds (with cross-validation to avoid leakage). Mean top-10 precision dropped to 58%.

The lesson: more isn't better if more isn't clean. The 175 hand-labels include genuinely-ambiguous sentences (some at conf=2 — annotator unsure). Using messy real-world sentences as seeds creates a noisier sentiment axis. Archetypes are clean prototypes; hand-labeled in-the-wild text is not.

Exp C — 100 clean seeds (40 archetypes + 60 highest-confidence hand-labels)

We took the 40 hand-curated archetypes and added 60 hand-corrected sentences filtered to confidence ≥ 4 and label ∈ {positive, negative} — dropping the borderline conf<4 cases and neutrals. Total: 100 clean seeds. Validate on the held-out portion of the 175 (lower-confidence and neutral cases) — no seed/test contamination.

| Metric | Result |
| --- | --- |
| Held-out N | 80 |
| AUROC | 0.732 (vs 0.571 with 40 seeds) |
| Average precision | 0.925 |
| Top-3 precision | 100% |
| Top-5 precision | 100% |
| Top-10 precision | 100% |
| Top-20 precision | 90% |

More clean seeds did help. AUROC rose +0.16, top-K saturated at 100%. The compositional geometry has more signal as the seed pool gets larger and stays clean.
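For reference, the table's numbers are standard ranking metrics. A sketch of how to compute them from inconsistency scores and reviewer verdicts with scikit-learn, on toy data (all variable names illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def topk_precision(inconsistency_scores, is_mislabel, k):
    """Fraction of the k most-inconsistent items the reviewer actually corrected."""
    top = np.argsort(inconsistency_scores)[::-1][:k]
    return float(np.mean(np.asarray(is_mislabel)[top]))

# Toy example: 8 held-out sentences. Higher score = more suspect;
# is_mislabel = 1 if the human reviewer corrected that label, else 0.
inconsistency_scores = np.array([0.9, 0.8, 0.1, 0.7, 0.05, 0.6, 0.2, 0.3])
is_mislabel = np.array([1, 1, 0, 1, 0, 0, 0, 1])

auroc = roc_auc_score(is_mislabel, inconsistency_scores)
avg_precision = average_precision_score(is_mislabel, inconsistency_scores)
top3 = topk_precision(inconsistency_scores, is_mislabel, k=3)
```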

What this catches in the wild

Top-15 most-inconsistent sentences from the 9,924 bulk Zulu sentences outside the hand-reviewed set (all silver-labeled positive; Bhala says highly inconsistent). Five examples:

| Sentence (translation) | Honest read |
| --- | --- |
| "Sometimes apartheid disrupts our unity" | clearly negative |
| "Each of these three people made real attempts to commit suicide" | clearly negative |
| "We said, 'They imprisoned you for refusing to kill people'" | clearly negative |
| "And if they do, they will be fined" | clearly negative (punishment) |
| "Cursed be the day I was born" (Job's lament) | clearly negative |

A spot-check of all 15 surfaces ~10 obvious errors. The silver auto-labeler appears to have systematically miscategorized somber, biblical, and news content as positive — Bhala caught the pattern.

The cross-lingual finding — Bhala's moat

We took the same sentiment axis built from 100 Zulu seeds and applied it to AfriSenti Swahili gold (3,011 native-speaker-annotated tweets, per Muhammad et al., 2023). Bhala has zero Swahili sentiment supervision. Bhala's encoder transfers across Bantu languages (FLORES Zulu↔Swahili 58.2% on translation) — the question is whether the sentiment direction transfers along with the language coverage.

It does — but the right framing for what gets surfaced is more nuanced than "mislabels." AfriSenti was carefully built (multi-annotator with majority vote and adjudication, native-speaker reviewers per language). What Bhala's cross-lingual probe surfaces is closer to "places where two reasonable annotators or two annotation policies would disagree" — sometimes that is a clear annotation error, sometimes it's a boundary case, sometimes Bhala itself is wrong. We split the top-15 honestly:

Clear annotation errors (Bhala correct, AfriSenti wrong):

| Sentence (Swahili → English) | AfriSenti label | Should be |
| --- | --- | --- |
| "Pole sana kwa changamoto uliopata" — "I'm so sorry for the problems you experienced" | positive | sympathetic apology, not positive |
| "Tunaomba radhi kwa usumbufu uliojitokeza" — "We apologize for the inconvenience" | positive | apology, not positive |
| "Hadi chuo ilivuja hadi interview zilivuja…" — "Even college papers leaked, even interview papers leaked…" | neutral | sarcastic complaint, clearly negative |

Annotation-policy boundary cases (defensible either way):

| Sentence | AfriSenti label | Why both readings are reasonable |
| --- | --- | --- |
| "Habari, pole sana kwa changamoto. Tunaomba namba yako DM kwa msaada" — "Hello, very sorry for the challenges. Please DM us your number for help" | neutral | Brand-voice helpful follow-up = neutral; or sympathy + customer pain = negative. Annotation-guideline call. |
| "Umasikini sio sifa" — "Poverty is not a virtue" | neutral | Moral observation = neutral, or implicit critique of poverty = negative. Both defensible. |

Bhala's own false positives (Bhala wrong, AfriSenti correct):

| Sentence | AfriSenti label | Bhala's mistake |
| --- | --- | --- |
| "TAARIFA KWA VYOMBO VYA HABARI" — "STATEMENT TO MEDIA" | neutral | Bhala over-flagged a short headline as carrying sentiment polarity. Correctly neutral. |
| "UTABIRI WA HALI YA HEWA 12022020" — "WEATHER FORECAST 02-12-2020" | neutral | Same: very short factual headline, no sentiment to detect. Correctly neutral. |

The honest takeaway: a small fraction of the top flags are clear AfriSenti errors (the customer-service-apology mislabels are systematic and worth a heads-up to the maintainers), most are annotation-policy boundary cases that surface where two reasonable annotators would disagree (which is exactly what an auditor wants to know about), and a few are Bhala over-flagging short or context-light strings.

The encoder generalized the meaning of pole sana (sympathy/regret) from Zulu seeds without ever being told what pole sana means in a Swahili sentiment-labeled training pair — that's the cross-lingual transfer claim, and it holds. The downstream value isn't "Bhala will tell you which AfriSenti labels are wrong." It's "Bhala will rank your dataset for human reviewers, and the top of the rank surfaces a productive mix of clear errors, boundary cases, and policy ambiguities."

Why the AUROC is 0.732 and not 0.95

Honest framing: this is a precision-at-top tool, not a uniform classifier. The signal is strong at the extremes (top 5–10% most-inconsistent are 80–100% real mislabels) and weaker in the middle (medium-inconsistency is mostly noise). That's exactly the right shape for a reviewer-prioritization tool:

  • Random review of 1,000 silver labels → 547 confirmed mislabels (54.7% base rate)
  • Bhala-prioritized review of top 100 most-inconsistent → ~80–100 confirmed mislabels (80–100% precision)
  • 5–10× reviewer-time multiplier, depending on the cutoff

That's the product. Not "Bhala will fix your dataset." Not "Bhala will tell you the right label." Bhala will rank your dataset so a human reviewer's first hour finds 5–10× as many real errors as random sampling would.

How to use it on your dataset

The recipe is short (a runnable sketch follows the list):

  1. Pick a 1-D sentiment axis you care about (positive↔negative; or polite↔abusive; or any binary).
  2. Hand-curate 30–50 clear archetypes per side. Short, unambiguous, varied topics. ~30 minutes of human work.
  3. Encode all your data with Bhala's API.
  4. Compute axis = unit_norm( mean(embed(positive_seeds)) - mean(embed(negative_seeds)) ) — one unit vector defining the direction.
  5. Score each sentence by the inconsistency between its position on axis and its declared label.
  6. Sort descending. Review the top 5%. Expect 70–100% of them to be real errors.
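Put together, the recipe is roughly the following sketch, reusing build_axis() and inconsistency() from the earlier snippet and assuming a CSV with text and label columns (file and column names are illustrative):

```python
import csv

import numpy as np

def load_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Steps 1–2: your hand-curated archetypes, one sentence per line.
positive_seeds = load_lines("positive_seeds.txt")
negative_seeds = load_lines("negative_seeds.txt")

# Steps 3–4: embed everything and compute the single unit vector.
axis = build_axis(embed_batch, positive_seeds, negative_seeds)

with open("silver_corpus.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))  # expects 'text' and 'label' columns

scores = np.asarray(embed_batch([r["text"] for r in rows])) @ axis

# Steps 5–6: rank by inconsistency and review the top 5%.
ranked = sorted(zip(rows, scores),
                key=lambda t: inconsistency(t[1], t[0]["label"]),
                reverse=True)
for row, score in ranked[: max(1, len(ranked) // 20)]:
    print(f"{inconsistency(score, row['label']):+.3f}  [{row['label']}]  {row['text']}")
```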

For Bantu-language datasets you can curate seeds in any one Bantu language and apply the resulting axis cross-lingually to other Bantu languages — at least as a strong starting point. The Zulu→Swahili transfer above shows it works empirically.

Where Bhala saves time across the AI lifecycle

This post showed one slice — data curation — but the time-saving pattern repeats at every stage of the AI lifecycle. The through-line is bring your own data, small data: a frozen Bhala backbone plus a tiny customer-supplied input unlocks production-grade outputs at every step.

| Stage | What teams normally do | With Bhala | Time saved |
| --- | --- | --- | --- |
| Data curation | Hand-review every silver label, or accept whatever auto-labeling produces (often 30–60% errors at scale) | Hand-curate 100 seeds (~30 min) → review Bhala's top 100 most-inconsistent → ~80–100 confirmed mislabels surfaced | 5–10× reviewer-time multiplier (this post) |
| Training a downstream classifier | Fine-tune a 7B–70B foundation model on your labeled data: hours-to-days on a GPU cluster, hyperparameter sweeps, checkpoint management | Bhala embeddings + a small classifier head on your labels (sketch below the table) | Classifier head trains in <2 sec on a laptop CPU. No GPU required. No fine-tuning. No catastrophic forgetting. |
| Inference | Production LLM API call: 200ms–2s p50 latency, GPU-bound, expensive at scale | Bhala embedding + operator: ~36ms p50 (under a typical 50ms SLO) | 5–50× latency reduction, CPU-feasible serving, predictable cost per call |
| Adding a new language | Collect labeled examples per language (~1K–10K per task per language), retrain or fine-tune per locale | Curate ~50 seeds in one Bantu language → cross-lingual operator transfer to other Bantu languages emergently (this post: Zulu seeds → Swahili audit) | Eliminates the labeled-data-per-language requirement. New Bantu languages attach with seed curation, not dataset construction. |
| Compliance / audit | Retrofit explainability onto a closed-weight LLM | Per-call signed audit receipts naming exactly which operators ran with what magnitude | Regulator-ready out of the box — EU AI Act / NIST AI RMF artifacts are the API response, not a follow-up project |
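To make the "small classifier head" row concrete, here is a minimal sketch with scikit-learn: a linear head trained on frozen Bhala embeddings. embed_batch() is the same illustrative stand-in as above; train_texts, train_labels, and new_texts stand for your own small labeled set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Frozen Bhala embeddings as features; only this linear head gets trained.
X_train = np.asarray(embed_batch(train_texts))   # shape (n_examples, dim)
head = LogisticRegression(max_iter=1000)
head.fit(X_train, train_labels)                  # seconds on a laptop CPU

# Inference: embed new text, apply the head. No GPU, no fine-tuning.
X_new = np.asarray(embed_batch(new_texts))
predictions = head.predict(X_new)
```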

The composable framing: every Bhala API call gives you (a) the embedding, (b) the operator-shifted version, (c) the audit receipt, (d) cross-lingual reach for free if your axis was learned in any Bantu language. Customers bring small data — seed sets, labeled examples in the dozens to hundreds — and Bhala's frozen geometry does the heavy lifting.

For most teams the bottleneck isn't model quality. It's the data and ops tax at every step: curate, label, train, serve, audit, repeat per language. Bhala's bet is that a strong frozen compositional encoder plus the right tiny manual input compresses each of those steps from days to minutes.

What's next

  • Multi-axis extension: same recipe applied to bias axes (28 dimensions) for fairness audits, intent axes for chatbot training data, hate axes for moderation pipelines. Each axis needs its own ~50 archetypes — that's the only manual cost.
  • Human-in-the-loop API: POST /v1/audit/labels accepting a dataset + 50 seed sentences + a label field, returning a confidence-ranked list of suspect labels with margin scores. Coming.
  • AfriSenti coordination: the customer-service-apology pattern looks systematic enough that we'll reach out to the AfriSenti maintainers with a candidate list — framed as "here's a probe-flagged subset for the maintainers' own review," not as unilateral correction. Their annotation policy is theirs to set.

Reproduce

The full audit ran in ~2 minutes on a single CPU. To reproduce on your own data:

  • 100 Zulu seeds (50 positive + 50 negative, hand-curated archetypes)
  • A Bhala API key — self-serve here
  • Your labeled corpus

Code to compute the sentiment axis and rank your dataset is ~30 lines. We're publishing the seed list and the AfriSenti Swahili audit JSON alongside the API; if you want early access to either, talk to us.


This work was conducted by the Bhala research team in May 2026. The cross-lingual transfer result (Zulu seeds → Swahili audit) is consistent with Bhala's prior published cross-lingual numbers on Bantu sentiment and intent.

Standard fairness audits call Gemma 4 clean. We ran a stronger one and found bias intact in all 28 protected dimensions we tested. Here's what it means for your deployment, how to audit any open model the same way, and a live API you can paste into a terminal right now to flip a biased sentence.