Benchmarks
Every claim, verifiable.
Reproducible numbers across bias removal, hate speech, sentiment, intent, and cross-lingual transfer. All benchmarks use public datasets. Every result is reproducible in under 90 seconds on a laptop GPU.
Bias Removal
Detects and removes bias in AI text on demand. You ask the AI to remove a specific bias — gender, race, religion, age, disability, or any of 23 other categories — and it does, with an independent classifier confirming the change worked. No retraining, no per-category tuning.
Technical detail: 100% correction rate across 4 industry-standard fairness benchmarks (BBQ, StereoSet, CrowS-Pairs, WinoBias) covering 28 protected dimensions and 15,966 sentence pairs. A “remove bias” control is applied at inference time; an independently-trained classifier verifies every shift. Reproducible in under 90 seconds on a laptop GPU.
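The verification loop is simple to express. Below is a minimal sketch, not the production implementation: it assumes the control is a fixed "remove bias" vector applied at inference time and that the verifier is a separately trained bias classifier. `generate_with_offset`, `bias_classifier`, and `remove_bias_vec` are illustrative names.

```python
# Minimal sketch of the benchmark loop: apply the "remove bias" control at
# inference, then let an independent classifier confirm the shift. All names
# below are illustrative stand-ins, not the production API.
def correction_rate(pairs, generate_with_offset, bias_classifier, remove_bias_vec):
    """pairs: (prompt, dimension) items drawn from BBQ / StereoSet / CrowS-Pairs / WinoBias."""
    corrected = 0
    for prompt, dimension in pairs:
        baseline = generate_with_offset(prompt, offset=None)            # unmodified output
        steered = generate_with_offset(prompt, offset=remove_bias_vec)  # control applied
        # A pair only counts as corrected if the independent classifier
        # scores the steered output as less biased than the baseline.
        if bias_classifier(steered, dimension) < bias_classifier(baseline, dimension):
            corrected += 1
    return corrected / len(pairs)
```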
| Benchmark | Cue type | Dimensions | Test pairs | Correction rate |
|---|---|---|---|---|
| BBQ (Bias Benchmark for Question Answering) | Demographic stated in disambiguated questions; inferred in ambiguous ones | 7 | 6,864 | 100.0% |
| StereoSet (stereotype measurement dataset) | Direct stereotype associations with named demographic targets | 8 | 6,010 | 100.0% |
| CrowS-Pairs (crowdsourced stereotype pairs) | Paired sentences with explicit demographic contrasts | 9 | 1,508 | 100.0% |
| WinoBias (gender bias in coreference resolution) | Implicit: no demographic terms; gender inferred from occupational stereotype via pronoun coreference | 4 | 1,584 | 100.0% |
| Combined | All cue types | 28 | 15,966 | 100.0% |
Implicit vs explicit — where is the bias actually located?
Bias that's only readable when the demographic group is inferred (not stated) is where most output-only audits fail. Surface-level bias detection — keyword filters, demographic-mention checks — passes WinoBias-style sentences because no protected term appears. The 100% correction rate on the implicit slice means our intervention reads the bias from the encoder's geometry, not from the surface tokens.
12 categories · 28 test dimensions across BBQ, StereoSet, CrowS-Pairs, WinoBias
For compliance teams
- These are published academic benchmarks. Production deployment into your bank or health system requires validation on your own internal text (loan memos, credit decisions, clinical notes) — which we conduct together during pilot.
- Two bias-removal methods were tested on identical data: a statistical baseline (published 2016) and our patented learned method. Both are available in production. The statistical baseline achieved zero failures on all 28 dimensions.
- Results reproducible by anyone with access to our model and the four public benchmarks. Full methodology is available under NDA.
Hate Speech Detection
Detects hate speech across 12 protected groups. Trained to flag hateful content on social media, news comments, and adversarial inputs — and crucially, to NOT flag people discussing hate (counter-speech), in-group reclamation, or news reporting on slurs.
Technical detail: AUROC 0.90 on HateCheck (adversarial templates), 0.77 on TweetEval-hate (Twitter, OOD). Trained jointly on 11 corpora (~134K labeled examples). 15M-param backbone + 528K-param adapter — 7× smaller than HateBERT (110M) or HateXplain BERT (110M). Live in production on bhala-api since 2026-05-02.
Per-group detection rates
One probe per group, trained on HateCheck + CONAN examples for that group. Catch rate = % of hate posts flagged at a 5% false-positive rate. AUROC is the area under the ROC curve — 1.0 is perfect, 0.5 is random.
| Group | Catch rate @ 5% FPR | AUROC | Test posts |
|---|---|---|---|
| Black people | 98.5% | 0.9908 | 66 |
| Disabled people | 98.4% | 0.9875 | 123 |
| Women | 90.0% | 0.9821 | 240 |
| Migrants | 94.3% | 0.9810 | 246 |
| Jewish people | 93.6% | 0.9807 | 109 |
| Muslims | 88.4% | 0.9733 | 319 |
| LGBT+ people | 81.1% | 0.9596 | 297 |
| POC (other) | 76.0% | 0.9305 | 75 |
| Generic (max-pool) | 30.1% | 0.8205 | — |
The generic max-pool row is a single-probe aggregate baseline — per-group probes outperform it on every group.
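Both metrics in the table above can be computed from per-post probe scores with scikit-learn; this is a minimal sketch where `labels` and `scores` are illustrative placeholders for one group's held-out examples, not benchmark data.

```python
# Compute AUROC and the catch rate at a fixed false-positive rate from raw
# probe scores. `labels` / `scores` are toy placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def catch_rate_at_fpr(labels, scores, fpr_target=0.05):
    """TPR (catch rate) at the loosest threshold whose FPR stays <= fpr_target."""
    fpr, tpr, _ = roc_curve(labels, scores)
    ok = fpr <= fpr_target
    return tpr[ok].max() if ok.any() else 0.0

labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])                           # 1 = hateful post
scores = np.array([0.93, 0.81, 0.12, 0.77, 0.40, 0.08, 0.95, 0.22])   # probe P(hate)
print("AUROC:", roc_auc_score(labels, scores))
print("catch rate @ 5% FPR:", catch_rate_at_fpr(labels, scores))
```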
11-corpus production coverage
The production model is trained jointly on 11 hate-speech corpora and evaluated on per-corpus held-out test splits; the 9 evaluation corpora are shown below. Each row is a separate domain — adversarial templates, real-world social media, counter-speech, Twitter, extremist forums.
| Corpus | AUROC | Test n | Domain |
|---|---|---|---|
| CONAN | 0.9278 | 2,996 | counter-speech |
| Civil Comments | 0.9197 | 6,000 | real-world social |
| Berkeley MHS | 0.9090 | 8,996 | real-world social, 135K source |
| HateCheck | 0.9031 | 1,112 | adversarial templates |
| SBIC | 0.8796 | 11,992 | social bias frames |
| Stormfront | 0.8393 | 3,210 | extremist forum |
| TweetEval-hate | 0.7664 | 3,890 | Twitter, out-of-domain (OOD) |
| DynaHate | 0.7274 | 12,342 | human-and-model-in-the-loop adversarial |
| TweetEval-offens. | 0.7076 | 3,972 | Twitter offensive ≠ hate (separate task) |
TweetEval-hate AUROC 0.77 with no Twitter pretraining is the most distinctive result here — it matches HateBERT (110M params, fully fine-tuned on Reddit) with 7× fewer parameters.
vs. published baselines (overall AUROC)
| Model | Params | AUROC | Method |
|---|---|---|---|
| Bhala (ours) | 15M | 0.9031 | Frozen Bhala encoder + small task adapter · 11 corpora joint training · no Twitter pretraining |
| Detoxify (Unitary) | 110M | 0.91 | RoBERTa fine-tuned on Civil Comments + Jigsaw |
| Perspective API | proprietary | ~0.87 | Google Jigsaw, commercial baseline |
| HateBERT (Caselli 2020) | 110M | 0.85-0.88 | BERT-base fully fine-tuned on Reddit hate corpus |
| HateXplain BERT | 110M | 0.83 | BERT-base fully fine-tuned with rationale annotations |
The decisive measurement
'I hate X' (P=0.87) and 'Saying I hate X is bigoted' (P=0.10) share 80% of their surface tokens but receive 9× different hate scores. That use/mention distinction emerged from frozen weights — without a single hate-labeled training example.
HateCheck functional breakdown
Mean P(hate) per statement type — shows what the model distinguishes, not just whether it scores correctly.
| Statement type | Mean P(hate) | True label |
|---|---|---|
| Direct hate ('I hate X') | 0.866 | hateful |
| Slurs (raw) | 0.855 | hateful |
| Threats | 0.835 | hateful |
| Spell attacks (typos, leet) | 0.945 | hateful |
| Counter-speech (saying 'I hate X' is bigoted) | 0.099 | non-hateful |
| Counter-reference (saying hate is wrong, not using slur) | 0.223 | non-hateful |
| Positive identity ('I love X') | 0.296 | non-hateful |
| Slur reclamation (in-group) | 0.308 | non-hateful |
| Slur homonym ('dyke' as sea wall) | 0.268 | non-hateful |
| Profanity not directed at group | 0.230 | non-hateful |
| Hate at non-protected target ('I hate pizza') | 0.419 | non-hateful |
| Negation ('I don't hate X') | 0.415 | non-hateful |
Production threshold calibration
| FPR target | TPR | Use case |
|---|---|---|
| 1% | 64.5% | high-precision review queue |
| 5% | 91.5% | default production threshold |
| 10% | 97.6% | aggressive-recall mode |
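The calibration itself is a standard ROC sweep. Below is a minimal sketch, assuming thresholds are chosen on a labelled validation split; variable names are illustrative.

```python
# For each FPR target, pick the loosest score threshold whose false-positive
# rate on the validation split stays at or below the target, and report the
# true-positive rate (catch rate) that threshold delivers.
import numpy as np
from sklearn.metrics import roc_curve

def calibrate_thresholds(labels, scores, fpr_targets=(0.01, 0.05, 0.10)):
    fpr, tpr, thresholds = roc_curve(labels, scores)
    table = {}
    for target in fpr_targets:
        idx = np.where(fpr <= target)[0][-1]     # last operating point still under the FPR budget
        table[target] = {"threshold": float(thresholds[idx]), "tpr": float(tpr[idx])}
    return table
```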
Sentiment Analysis
Two capabilities for sentiment. Sentiment steering — push AI output toward positive or negative tone using a single fixed vector, in any language, with an audit trail per shift. Sentiment reading — classify how positive or negative a piece of text is, even in languages with little training data.
Technical detail: Steering is inference-time control: 100% sentiment-flip success on every language tested in-family, 77% cross-family transfer to English (Bantu→Indo-European, zero-shot). The same control is applied to any language without per-language tuning. Reading is AfriSenti classification: 77–91% of fine-tuned SOTA on 5 African languages with a frozen 15M-param backbone (vs. 270M+ baselines).
1. Sentiment Steering
Push AI output toward positive or negative tone, predictably, in any language. One vector, estimated once, applied per call with an audit log.
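A minimal sketch of that one-vector idea, under the assumption that the vector is estimated as the mean difference of encoder activations between positive and negative example sentences and then added to hidden states at generation time. `encode`, `generate_with_offset`, and the strength parameter are illustrative hooks, not the production API.

```python
# Estimate a single sentiment direction once, then apply it per call and log
# every shift. All hooks here are illustrative stand-ins for the real system.
import numpy as np

def estimate_sentiment_vector(encode, positive_texts, negative_texts):
    pos = np.mean([encode(t) for t in positive_texts], axis=0)
    neg = np.mean([encode(t) for t in negative_texts], axis=0)
    return pos - neg                          # points from negative toward positive tone

def steer(generate_with_offset, prompt, vector, strength, audit_log):
    output = generate_with_offset(prompt, offset=strength * vector)
    audit_log.append({"prompt": prompt, "strength": strength})    # audit trail per shift
    return output
```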
2. Sentiment Reading
Classify how positive or negative a piece of text is. Tested on AfriSenti (5 African languages) — at 77–91% of fine-tuned state-of-the-art with a frozen backbone.
Frozen backbone, no fine-tuning. 77–91% of fine-tuned SOTA on languages never seen in pretraining.
| Language | Sozisi (frozen, 15M params) | SemEval SOTA (270M+, fine-tuned) | % of SOTA |
|---|---|---|---|
| Swahili | 53.7% wF1 | 60.5% wF1 | 89% |
| Xitsonga | 50.0% wF1 | 54.9% wF1 | 91% |
| Igbo | 65.7% wF1 | 80.8% wF1 | 81% |
| Yoruba | 57.2% wF1 | 68.0% wF1 | 84% |
| Hausa | 61.9% wF1 | 80.9% wF1 | 77% |
Intent Classification
Can the AI tell what the user wants? Book a flight, set an alarm, request a refund, transfer money — these are intents. We tested across 51 languages and 60 different intents, and matched or beat models 100× our size.
Technical detail: Two benchmarks. MASSIVE (FitzGerald et al. 2022) — 51 languages × 60 intents. Injongo — 8-language Bantu intent benchmark with published SOTA comparison. **Linear probe on frozen encoder** (strictest version of the field's gold-standard representation-quality test), no per-language fine-tuning, zero target-language data anywhere in pretraining. 73.2% Swahili (above GPT-4o's 70.6%), 72.5% Korean, 69.7% Hindi, 66.5% Amharic at 15M parameters — none of these languages are in the pretraining data. The linear probe outperforms a 2-layer MLP probe on all three cross-family transfers (Korean / Hindi / Amharic), confirming the structure is fully linearly separable in the encoder's geometry. 38–43× over random across all four languages.
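A minimal sketch of that probing setup: the encoder stays frozen and only a linear classifier (with a small MLP probe as the comparison point) is trained on its embeddings. `embed` stands in for the frozen Sozisi encoder, the (utterance, intent) pairs come from MASSIVE, and scikit-learn is used purely for illustration.

```python
# Linear probe vs. 2-layer MLP probe on frozen embeddings. No encoder weights
# are updated; only the probe heads are trained. `embed` is an assumed hook.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def probe_accuracies(embed, train_pairs, test_pairs):
    X_tr = np.stack([embed(text) for text, _ in train_pairs])
    y_tr = [intent for _, intent in train_pairs]
    X_te = np.stack([embed(text) for text, _ in test_pairs])
    y_te = [intent for _, intent in test_pairs]
    linear = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)                     # linear probe
    mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500).fit(X_tr, y_tr)   # 2-layer MLP probe
    return linear.score(X_te, y_te), mlp.score(X_te, y_te)
```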
MASSIVE Swahili — the commercial case
60-intent benchmark. GPT-4o and InkubaLM both saw Swahili in pretraining; Sozisi was pretrained on isiZulu only (true zero-shot) and still beats GPT-4o.
| Model | Parameters | Score | Method |
|---|---|---|---|
| Sozisi (Bhala AI) | 15M | 73.2% | Language-level zero-shot · pretrained on isiZulu only (true zero-shot) |
| GPT-4o | ≈1.8T | 70.6% | Task-level zero-shot · Swahili in web pretraining corpus |
| InkubaLM | 422M | 79.2% | Pretrained on Swahili (one of 7 African languages) + web |
Injongo — 8 Bantu languages, head-to-head
Sozisi (frozen backbone) vs AfroXLMR-76L (270M, fine-tuned per language). We match or beat them on 4 of 8.
| Language | Sozisi | Public SOTA | SOTA Model | Δ | Status |
|---|---|---|---|---|---|
| isiXhosa | 98.3% | 97.3% | AfroXLMR | +1.0pp | SOTA |
| KiSwahili | 97.9% | 98.1% | AfroXLMR-76L | −0.2pp | Tied |
| Sesotho | 95.1% | 86.8% | AfroXLMR-76L | +8.3pp | SOTA |
| isiZulu | 93.1% | 89.8% | AfroXLMR-76L | +3.3pp | SOTA |
| ChiShona | 90.5% | 95.3% | AfroXLMR | −4.8pp | behind |
| Lingala | 89.5% | 94.6% | AfroXLMR-76L | −5.1pp | behind |
| Luganda | 81.7% | 91.3% | AfroXLMR-76L | −9.6pp | behind |
| Kinyarwanda | 78.3% | 89.4% | AfroXLMR-76L | −11.1pp | behind |
SOTA on 3 of 8 languages, plus a tie on KiSwahili · average across 8 languages: 90.5%
Cross-Lingual Transfer
Trained on one language, works on 40+. We pretrained the model on a single language (isiZulu, ~40M tokens, ~1 hour on a laptop GPU). Without any retraining or fine-tuning, the same model now handles 40+ other languages across 10 different language families — Bantu, Indo-European, Dravidian, Semitic, Austronesian, and more.
Technical detail: Zero target-language training. Next-word prediction transfer: Nguni sister languages 87.5%, broader Bantu 71.1%, non-Bantu African 75.8%. Cross-family zero-shot: Yoruba 80.5%, Igbo 77.7%, Hausa 69.3% (Afro-Asiatic), Swahili 69.1%. Proprietary script handling unlocks 9 writing systems (Arabic, Hangul, Ge'ez, Devanagari, etc.) with no encoder retraining. Per-language adaptation builds in under 2 seconds.
MASSIVE intent accuracy — zero-shot
| Language | Family | Accuracy |
|---|---|---|
| Swahili | Bantu | 73.2% |
| Korean | Koreanic | 72.5% |
| Tagalog | Austronesian | 70.1% |
| Hindi | Indo-Aryan | 69.7% |
| Urdu | Indo-Aryan | 69.2% |
| Amharic | Semitic | 66.5% |
| Mongolian | Mongolic | 65.8% |
| Javanese | Austronesian | 64.6% |
| Telugu | Dravidian | 63.4% |
| Kannada | Dravidian | 61.7% |
| Tamil | Dravidian | 61.1% |
| Japanese | Japonic | 56.5% |
Named Entity Recognition (MasakhaNER, isiZulu)
Frozen backbone + CRF head. 96.6% token accuracy across people, places, organizations, and dates.
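A minimal sketch of that head, assuming the third-party pytorch-crf package and a frozen backbone that returns one embedding per token; the tag set follows MasakhaNER's BIO labels for people, organizations, locations, and dates. Only the linear emission layer and the CRF transition scores are trainable.

```python
# CRF head on frozen token embeddings. The backbone is not shown and is not
# updated; `embed_dim` and `num_tags` depend on the encoder and tag scheme.
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf

class FrozenBackboneNER(nn.Module):
    def __init__(self, embed_dim, num_tags):
        super().__init__()
        self.emit = nn.Linear(embed_dim, num_tags)   # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)   # learned label-transition scores

    def loss(self, token_embeddings, tags, mask):
        emissions = self.emit(token_embeddings)       # (batch, seq_len, num_tags)
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def predict(self, token_embeddings, mask):
        return self.crf.decode(self.emit(token_embeddings), mask=mask)
```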
See it on your data
Most pilots are live in under two weeks via REST API.