Blog
From the Bhala Team
Language tech, product launches, engineering notes.
A 15M Zulu-only model beats GPT-4o on Swahili — and understands Korean without ever seeing it
A 15M-parameter encoder pretrained on isiZulu, and nothing else, reaches 73.2% on Swahili intent classification (above GPT-4o zero-shot at 70.6%) and 72.5% on Korean, using only a linear probe on the frozen encoder. Korean shares nothing with Zulu: different family, different script, and never seen in pretraining. By the strictest version of the field's gold-standard test (frozen encoder, linear probe, zero target-language data in pretraining), this is the strongest published cross-lingual transfer result we know of.
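For readers unfamiliar with the protocol: the encoder's weights stay fixed, and the only trained component is a single linear classifier on top of its sentence embeddings. A minimal sketch, with `embed()` as a hypothetical stand-in for the frozen pretrained encoder (not Bhala's actual model or API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in: in practice embed() would run the frozen
# pretrained encoder (gradients disabled) and return sentence vectors.
def embed(sentences):
    return rng.normal(size=(len(sentences), 256))

# Labeled task data in the target language (e.g. Swahili intents).
train_X = embed(["sentence"] * 100)
train_y = rng.integers(0, 5, size=100)   # 5 intent classes
test_X = embed(["sentence"] * 20)

# The probe is the only thing trained: one linear layer.
# The encoder itself never sees a gradient.
probe = LogisticRegression(max_iter=1000).fit(train_X, train_y)
preds = probe.predict(test_X)
```

Because the probe is linear, any accuracy it achieves is evidence the frozen embeddings already encode the task, which is what makes the test a strict measure of transfer.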
Silver labels are noisy by design. Bhala's audit catches the worst of them — top-10 precision: 100%.
Every production NLP team is sitting on silver-labeled training data — auto-tagged at scale, noisy by design. Bhala's audit tool surfaces the real mislabels in those corpora using just 100 hand-curated seeds and zero sentiment supervision. Top-10 precision on held-out validation: 100%. AUROC: 0.732. The same seeds curated in one Bantu language transfer cross-lingually to surface clear errors and policy-boundary cases in another Bantu language with no extra supervision. The product play: 5–10× reviewer-time multiplier across the AI lifecycle.
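One way a seed-based audit like this can work (a sketch under assumptions, not Bhala's actual pipeline): embed both the trusted seeds and the silver-labeled corpus, then score each example by how strongly its nearest seeds disagree with its silver label, and send the highest-scoring examples to reviewers first. All names here (`embed`, `mislabel_scores`) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder stand-in; any sentence embedder would do.
def embed(texts):
    return rng.normal(size=(len(texts), 64))

# 100 hand-curated seeds with trusted binary labels,
# plus a silver-labeled corpus to audit.
seed_vecs = embed(["seed"] * 100)
seed_labels = rng.integers(0, 2, size=100)
corpus_vecs = embed(["example"] * 1000)
silver_labels = rng.integers(0, 2, size=1000)

def mislabel_scores(vecs, silver, seeds, seed_y, k=5):
    # Cosine similarity from every corpus example to every seed.
    a = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    b = seeds / np.linalg.norm(seeds, axis=1, keepdims=True)
    sims = a @ b.T
    # Soft label from the k nearest seeds (fraction voting 1).
    topk = np.argsort(-sims, axis=1)[:, :k]
    neighbor_vote = seed_y[topk].mean(axis=1)
    # Disagreement with the silver label = suspicion score.
    return np.abs(neighbor_vote - silver)

scores = mislabel_scores(corpus_vecs, silver_labels, seed_vecs, seed_labels)
review_queue = np.argsort(-scores)[:10]  # top-10 candidates for human review
```

Top-k precision and AUROC then fall out naturally: compare the ranking against held-out gold labels. The reviewer-time multiplier comes from reviewing the ranked queue instead of sampling the corpus uniformly.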
We audited six open LLMs for bias. The 0.4B model beat every 7B we tested — and RLHF wasn't the reason.
Bhala ran the same 28-axis fairness audit we used on Gemma 4 against six popular open-weight LLMs: Llama 2 7B base and chat, Mistral 7B base and instruct, Phi-2 2.7B, and InkubaLM 0.4B. The 0.4B model showed bias on 7 of 28 axes; every 7B model showed bias on 15–18. Size doesn't explain it (Phi-2 sits with the 7Bs). RLHF doesn't either (chat-tuned variants came out marginally worse than their base counterparts). Pretraining-data composition does. If you ship any of these models, the table at the top of the post is your before-deployment cheat sheet.
We audited Gemma 4. The bias didn't go away — it went into hiding.
Standard fairness audits call Gemma 4 clean. We ran a stronger one and found bias intact on all 28 protected axes we tested. Here's what that means for your deployment, how to run the same audit on any open model, and a live API call you can paste into a terminal right now to flip a biased sentence.