Benchmarks
Every claim, verifiable.
Reproducible numbers across counterfactual fairness, bias removal, hate speech, sentiment, intent, and cross-lingual transfer. All benchmarks use public datasets. Every result is reproducible in under 90 seconds on a laptop GPU.
Detect bias by alignment, not absorption
Most AI systems learn to detect bias by training on biased text. They learn the patterns — and in doing so, internalize them. That's why the standard academic probes (WEAT, StereoSet, BBQ) consistently find race, gender, and religion bias inside GPT-4, Gemma, Llama, InkubaLM. Those models had to learn the bias from web-scale data to detect it.
Bhala learns differently. Its encoder was trained on math, logic, and code — not the firehose of internet text that imprints associations like “Black names → loan denial” or “women → less competent” into other models' weights.
When experts hand us labeled examples — Census-coded names, audited fair-lending cases, hate-speech corpora reviewed by domain specialists — Bhala's structural geometry lets us read off the direction those examples define. That direction then fires on any new text you submit. We supply the labels; Bhala supplies the geometry. The encoder doesn't have to believe the examples are biased — it just has to map them consistently.
A microscope resolves what you put under it. It doesn't catch the disease.
Empirical proof — corpus audit
We grep'd the entire training corpus for race-related discourse. Result:
- “hispanic”, “muslim”, “criminal”, “racism”, “slavery” — zero hits in metamathqa (the largest training source).
- “black” appears only as a color (sock, boxcar, cartridge, card suit) — never as a racial category.
- Bertrand-Mullainathan first names (Lakeisha, Tamika, DeShawn, Jamal, Tyrone…) appear in math problems about piggy banks, crayons, distance — neutral commerce contexts only.
- Math problems containing Black-coded names use fewer negative-outcome words than math problems containing White-coded names (−0.13 difference per example in metamathqa). The training data goes opposite the direction an internalized-bias hypothesis predicts.
The encoder cannot have learned what wasn't in the training data. Detection comes from supplied expert labels + structural alignment, not from imprinted prejudice.
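A corpus audit of this kind can be approximated with a whole-word term count. The sketch below is illustrative only: the `count_terms` helper and the two sample lines are hypothetical stand-ins, not the actual audit script or real corpus rows.

```python
import re
from collections import Counter

def count_terms(lines, terms):
    """Count case-insensitive whole-word hits for each term across lines."""
    patterns = {t: re.compile(rf"\b{re.escape(t)}\b", re.IGNORECASE) for t in terms}
    hits = Counter({t: 0 for t in terms})
    for line in lines:
        for term, pattern in patterns.items():
            hits[term] += len(pattern.findall(line))
    return hits

# Hypothetical stand-in lines; not taken from the real training corpus.
sample = [
    "Jamal has 3 black socks and 5 white socks. How many socks does he have?",
    "Tamika saves 4 dollars a week in her piggy bank for 6 weeks.",
]
hits = count_terms(sample, ["hispanic", "muslim", "racism", "black"])
print(dict(hits))
```

A non-zero count (here, "black") still needs a manual read of each hit to decide whether the word is used as a color or as a racial category.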
In practice: The live /demo/audit endpoint computes the EEOC 4/5ths-rule disparate-impact ratio across six regulated domains (criminal sentencing, employment, online moderation, venture capital, police interaction, higher education) for any text you paste. Sub-second per audit, signed receipt. Predictions come from labeled-example projection, not from baked-in associations.
- Algebraic de-racing. We encode your text once, then compute a neutralized control by subtracting the learned race-direction component (compose, not string substitution). No hardcoded name list — works on any text.
- Constructor synthesizes per-domain race direction. A single learned linear constructor takes (T_race, T_domain_context) → T_race_in_domain. Held-out cos to ground-truth = 0.91 across 4 unseen domains — generalizes to any new domain without re-training. Same primitive that won on logic AND with held-out cos 0.999.
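- Disparate-Impact Ratio + 4/5ths rule. Under the EEOC Uniform Guidelines, a DIR below 0.80 is generally treated as evidence of adverse impact, a screen also used in Title VII and ECOA fair-lending audits. We report DIR per domain plus statistical-parity-difference (SPD) and a 0–10 risk score. Verdict cards say PASS / MARGINAL / FAIL · 4/5ths rule.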
- Intersectional via ZFC set intersection. Race × gender per domain via set-intersection on the SCI manifold. The audit cell EEOC explicitly requires.
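The "subtract the learned race-direction component" step can be pictured as an orthogonal projection. The sketch below assumes exactly that; the production compose operator is not specified here, and the vectors and the `remove_direction` helper are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def remove_direction(embedding, direction):
    """Subtract the component of `embedding` that lies along `direction`."""
    norm = math.sqrt(dot(direction, direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = dot(embedding, unit)  # signed length of the projection
    return [e - coeff * u for e, u in zip(embedding, unit)]

# Hypothetical 4-d vectors for illustration; real embeddings are larger.
embedding = [0.8, -0.2, 0.5, 0.1]
race_direction = [1.0, 0.0, 1.0, 0.0]

neutral = remove_direction(embedding, race_direction)
residual = dot(neutral, race_direction)  # close to zero after removal
print(neutral, residual)
```

Because the operation acts on the encoded vector rather than on the string, it needs no name list, which is the property the bullets above describe.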
Methodology: T_race / T_gender / T_religion operators trained on contrastive pairs from Bertrand-Mullainathan first names + US Census 2010 surnames (180K names total) embedded in 18 deployment domains (10 frames per domain, grounded in audit literature). Race-in-domain constructor: 256-d linear over [T_race ⊕ T_dom_ctx] → 128-d L2-normalized; trained on 14 domains, evaluated on 4 held-out. Receipt = SHA-256(text + model_version + scores + timestamp).
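A receipt of the form SHA-256(text + model_version + scores + timestamp) could be built as follows. The canonical-JSON serialization, the field names, and the `make_receipt` helper are assumptions made for illustration; the text above fixes only the hash function and the four inputs.

```python
import hashlib
import json

def make_receipt(text, model_version, scores, timestamp):
    """Hash the audit inputs and outputs into a hex digest.

    Assumption: fields are serialized as canonical JSON (sorted keys,
    compact separators) so the same audit always yields the same digest.
    """
    payload = json.dumps(
        {
            "text": text,
            "model_version": model_version,
            "scores": scores,
            "timestamp": timestamp,
        },
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical values for illustration only.
receipt = make_receipt(
    "example audit text", "v0-hypothetical", {"dir": 0.6}, "2026-01-01T00:00:00Z"
)
print(receipt)
```

Anyone holding the same four fields can recompute the digest and compare it with the stored receipt; any change to the text or the scores produces a different hash.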
Why this matters
EEOC 4/5ths rule, ECOA fair-lending audits, EU AI Act Article 27 high-risk system requirements, and NYC LL144 hiring-tool audits all measure DECISION OUTPUT differentials on counterfactual inputs — not embedding cosines or stereotype-pair classification. This audit provides the actual regulator metric (DIR with 4/5ths threshold) for any text, in < 1 second, with a cryptographic receipt. The constructor primitive (held-out cos 0.91) means we generalize to any new domain you care about — no per-domain re-training.
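For reference, the two headline metrics reduce to simple ratios of favorable-outcome rates. The sketch below uses the standard definitions of DIR and SPD; the MARGINAL band width and the sample decisions are hypothetical, since the text above does not define where MARGINAL starts.

```python
def selection_rate(decisions):
    """Fraction of favorable (1) outcomes in a list of 0/1 decisions."""
    if not decisions:
        raise ValueError("need at least one decision")
    return sum(decisions) / len(decisions)

def disparate_impact_ratio(protected, reference):
    """Protected-group selection rate divided by reference-group rate."""
    ref_rate = selection_rate(reference)
    if ref_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return selection_rate(protected) / ref_rate

def statistical_parity_difference(protected, reference):
    """Protected-group selection rate minus reference-group rate."""
    return selection_rate(protected) - selection_rate(reference)

def verdict(dir_value, threshold=0.80, marginal_band=0.05):
    # marginal_band is a hypothetical choice; only the 0.80 threshold
    # comes from the 4/5ths rule.
    if dir_value < threshold:
        return "FAIL"
    if dir_value < threshold + marginal_band:
        return "MARGINAL"
    return "PASS"

# Hypothetical counterfactual decisions: 30/100 vs 50/100 favorable.
protected = [1] * 30 + [0] * 70
reference = [1] * 50 + [0] * 50

dir_value = disparate_impact_ratio(protected, reference)
spd = statistical_parity_difference(protected, reference)
print(round(dir_value, 2), round(spd, 2), verdict(dir_value))
```

With these sample decisions the ratio is 0.30 / 0.50 = 0.60, which falls below the 0.80 threshold.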
References
- Bertrand & Mullainathan (2004), AER: name-callback discrimination protocol
- Caliskan, Bryson & Narayanan (2017), Science: WEAT methodology
- EEOC Uniform Guidelines on Employee Selection Procedures (1978): 4/5ths rule
- NYC Local Law 144 of 2021 (enforced from July 2023): automated employment decision-tool audit requirements
- IBM AIF360, Microsoft Fairlearn: DIR / SPD / equal-opportunity metric implementations
Bias Removal
Detects and removes bias in AI text on demand. You ask the AI to remove a specific bias — gender, race, religion, age, disability, or any of 23 other categories — and it does, with an independent classifier confirming the change worked. No retraining, no per-category tuning.
Technical detail: 100% correction rate across 4 industry-standard fairness benchmarks (BBQ, StereoSet, CrowS-Pairs, WinoBias) covering 28 protected dimensions and 15,966 sentence pairs. We apply a “remove bias” control at inference time; an independently-trained classifier verifies every shift. Reproducible in under 90 seconds on a laptop GPU.
| Benchmark | Cue type | Dimensions | Test pairs | Correction rate |
|---|---|---|---|---|
| BBQ (Bias Benchmark for Question Answering) | Demographic stated in disambiguated questions; inferred in ambiguous ones | 7 | 6,864 | 100.0% |
| StereoSet (stereotype measurement dataset) | Direct stereotype associations with named demographic targets | 8 | 6,010 | 100.0% |
| CrowS-Pairs (crowdsourced stereotype pairs) | Paired sentences with explicit demographic contrasts | 9 | 1,508 | 100.0% |
| WinoBias (gender bias in coreference resolution) | Implicit: no demographic terms; gender inferred from occupational stereotype via pronoun coreference | 4 | 1,584 | 100.0% |
| Combined | All cue types | 28 | 15,966 | 100.0% |
Implicit vs explicit — where is the bias actually located?
Bias that's only readable when the demographic group is inferred (not stated) is where most output-only audits fail. Surface-level bias detection — keyword filters, demographic-mention checks — passes WinoBias-style sentences because no protected term appears. The 100% correction rate on the implicit slice means our intervention reads the bias from the encoder's geometry, not from the surface tokens.
12 categories · 28 test dimensions across BBQ, StereoSet, CrowS-Pairs, WinoBias
For compliance teams
- These are published academic benchmarks. Production deployment into your bank or health system requires validation on your own internal text (loan memos, credit decisions, clinical notes), which we conduct together during the pilot.
- Two bias-removal methods were tested on identical data: a statistical baseline (published 2016) and our patented learned method. Both are available in production. The statistical baseline achieved zero failures on all 28 categories.
- Results are reproducible by anyone with access to our model and the four public benchmarks. Full methodology is available under NDA.
Hate Speech Detection
Detects hate speech across 12 protected groups. Trained to flag hateful content on social media, news comments, and adversarial inputs — and crucially, to NOT flag people discussing hate (counter-speech), in-group reclamation, or news reporting on slurs.
Technical detail: AUROC 0.90 on HateCheck (adversarial templates), 0.77 on TweetEval-hate (Twitter, OOD). Trained jointly on 11 corpora (~134K labeled examples). 15M-param backbone + 528K-param adapter, about 7× smaller than HateBERT (110M) or HateXplain BERT (110M). Live in production on bhala-api since 2026-05-02.
Per-group detection rates
One probe per group, trained on HateCheck + CONAN examples for that group. Catch rate = % of hate posts flagged at a 5% false-positive rate. AUROC is the area under the ROC curve (true-positive rate against false-positive rate across all thresholds): 1.0 is perfect, 0.5 is random.
| Group | Catch rate @ 5% FPR | AUROC | Test posts |
|---|---|---|---|
| Black people | 98.5% | 0.9908 | 66 |
| Disabled people | 98.4% | 0.9875 | 123 |
| Women | 90.0% | 0.9821 | 240 |
| Migrants | 94.3% | 0.9810 | 246 |
| Jewish people | 93.6% | 0.9807 | 109 |
| Muslims | 88.4% | 0.9733 | 319 |
| LGBT+ people | 81.1% | 0.9596 | 297 |
| POC (other) | 76.0% | 0.9305 | 75 |
| Generic (max-pool) | 30.1% | 0.8205 | — |
The generic max-pool row is a single-probe aggregate baseline — per-group probes outperform it on every group.
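AUROC has a convenient pairwise reading: it is the probability that a randomly chosen hateful post scores higher than a randomly chosen non-hateful one. The sketch below computes it that way on made-up scores; the numbers are hypothetical and unrelated to the table above.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form equals the area under
    the ROC curve.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical probe scores for illustration.
hate_scores = [0.9, 0.8, 0.4]
benign_scores = [0.7, 0.3, 0.2, 0.1]

score = auroc(hate_scores, benign_scores)
print(round(score, 4))
```

Here 11 of the 12 positive-negative pairs are ordered correctly, so the result is 11/12. The quadratic loop is fine for small test sets like the per-group ones above; larger sets usually use a rank-based implementation.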
11-corpus production coverage
The production model is trained jointly on 11 corpora; the 9 listed below are evaluated on per-corpus held-out test splits. Each row is a separate domain: adversarial templates, real-world social media, counter-speech, Twitter, extremist forums.
| Corpus | AUROC | Test n | Domain |
|---|---|---|---|
| CONAN | 0.9278 | 2,996 | counter-speech |
| Civil Comments | 0.9197 | 6,000 | real-world social |
| Berkeley MHS | 0.9090 | 8,996 | real-world social, 135K source |
| HateCheck | 0.9031 | 1,112 | adversarial templates |
| SBIC | 0.8796 | 11,992 | social bias frames |
| Stormfront | 0.8393 | 3,210 | extremist forum |
| TweetEval-hate | 0.7664 | 3,890 | Twitter, adversarial OOD |
| DynaHate | 0.7274 | 12,342 | human-and-model-in-the-loop adversarial |
| TweetEval-offens. | 0.7076 | 3,972 | Twitter offensive ≠ hate (separate task) |
TweetEval-hate AUROC 0.77 with no Twitter pretraining is the most distinctive result here — it matches HateBERT (110M params, fully fine-tuned on Reddit) at 7× fewer parameters.
vs. published baselines (overall AUROC)
| Model | Params | AUROC | Method |
|---|---|---|---|
| Bhala (ours) | 15M | 0.9031 | Frozen Bhala encoder + small task adapter · 11 corpora joint training · no Twitter pretraining |
| Detoxify (Unitary) | 110M | 0.91 | RoBERTa fine-tuned on Civil Comments + Jigsaw |
| Perspective API | proprietary | ~0.87 | Google Jigsaw, commercial baseline |
| HateBERT (Caselli 2020) | 110M | 0.85-0.88 | BERT-base fully fine-tuned on Reddit hate corpus |
| HateXplain BERT | 110M | 0.83 | BERT-base fully fine-tuned with rationale annotations |
The decisive measurement
'I hate X' (P=0.87) and 'Saying I hate X is bigoted' (P=0.10) share 80% of their surface tokens but receive hate scores roughly 9× apart. That use/mention distinction emerged from frozen weights, without a single hate-labeled training example.
HateCheck functional breakdown
Mean P(hate) per statement type — shows what the model distinguishes, not just whether it scores correctly.
| Statement type | Mean P(hate) | True label |
|---|---|---|
| Direct hate ('I hate X') | 0.866 | hateful |
| Slurs (raw) | 0.855 | hateful |
| Threats | 0.835 | hateful |
| Spell attacks (typos, leet) | 0.945 | hateful |
| Counter-speech (saying 'I hate X' is bigoted) | 0.099 | non-hateful |
| Counter-reference (saying hate is wrong, not using slur) | 0.223 | non-hateful |
| Positive identity ('I love X') | 0.296 | non-hateful |
| Slur reclamation (in-group) | 0.308 | non-hateful |
| Slur homonym ('dyke' as sea wall) | 0.268 | non-hateful |
| Profanity not directed at group | 0.230 | non-hateful |
| Hate at non-protected target ('I hate pizza') | 0.419 | non-hateful |
| Negation ('I don't hate X') | 0.415 | non-hateful |
Production threshold calibration
| FPR target | TPR | Use case |
|---|---|---|
| 1% | 64.5% | high-precision review queue |
| 5% | 91.5% | default production threshold |
| 10% | 97.6% | aggressive-recall mode |
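Operating points like these are typically chosen by fixing a false-positive budget on held-out non-hateful posts and reading off the resulting true-positive rate. The sketch below shows one way to do that; the helper names and the scores are hypothetical, and the production calibration procedure may differ.

```python
def threshold_at_fpr(neg_scores, target_fpr):
    """Threshold such that at most target_fpr of negatives score above it.

    Posts are flagged when their score is strictly greater than the
    returned value.
    """
    ranked = sorted(neg_scores, reverse=True)
    # Number of negatives we may flag; the epsilon guards float rounding.
    allowed = int(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]

def rate_above(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

# Hypothetical held-out scores for illustration.
benign_scores = [i / 20 for i in range(20)]  # 0.00, 0.05, ..., 0.95
hate_scores = [0.99, 0.92, 0.88, 0.60]

threshold = threshold_at_fpr(benign_scores, 0.10)
fpr = rate_above(benign_scores, threshold)
tpr = rate_above(hate_scores, threshold)
print(threshold, fpr, tpr)
```

Lowering the FPR target raises the threshold, which is why the 1% row above trades recall for a cleaner review queue.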
Sentiment Analysis
Two capabilities for sentiment. Sentiment steering — push AI output toward positive or negative tone using a single fixed vector, in any language, with an audit trail per shift. Sentiment reading — classify how positive or negative a piece of text is, even in languages with little training data.
Technical detail: Steering is inference-time control: 100% sentiment-flip success on every language tested in-family, 77% cross-family transfer to English (Bantu→Indo-European, zero-shot). We apply the same control to any language without per-language tuning. Reading is AfriSenti classification: 77–91% of fine-tuned SOTA across 5 African languages (89–91% on Swahili and Xitsonga) with a frozen 15M-param backbone (vs. 270M+ baselines).
1. Sentiment Steering
Push AI output toward positive or negative tone, predictably, in any language. One vector, estimated once, applied per call with an audit log.
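One common way to realize a single fixed steering vector is a difference of class means added to the hidden representation at inference time. The sketch below assumes that construction; the estimation method, the audit-log fields, and all numbers are hypothetical and not confirmed as the production design.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def estimate_steering_vector(positive, negative):
    """Assumed estimator: mean(positive) minus mean(negative)."""
    mean_pos, mean_neg = mean_vector(positive), mean_vector(negative)
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def steer(hidden, vector, strength, audit_log):
    """Shift `hidden` along `vector` and record the shift in the log."""
    shifted = [h + strength * v for h, v in zip(hidden, vector)]
    shift_norm = abs(strength) * math.sqrt(sum(v * v for v in vector))
    audit_log.append({"strength": strength, "shift_norm": shift_norm})
    return shifted

# Hypothetical 2-d representations for illustration.
positive = [[1.0, 0.0], [3.0, 2.0]]
negative = [[0.0, 0.0], [0.0, 2.0]]

vector = estimate_steering_vector(positive, negative)  # estimated once
log = []
steered = steer([0.5, 0.5], vector, 0.5, log)           # applied per call
print(vector, steered, log)
```

A negative `strength` pushes in the opposite direction, so the same vector covers both positive and negative steering.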
2. Sentiment Reading
Classify how positive or negative a piece of text is. Tested on AfriSenti (5 African languages): 77–91% of fine-tuned state-of-the-art with a frozen backbone, peaking at 89–91% on Swahili and Xitsonga.
Frozen backbone, no fine-tuning. 77–91% of fine-tuned SOTA on languages never seen in pretraining.
| Language | Sozisi (frozen, 15M params) | SemEval SOTA (270M+, fine-tuned) | % of SOTA |
|---|---|---|---|
| Swahili | 53.7% wF1 | 60.5% wF1 | 89% |
| Xitsonga | 50.0% wF1 | 54.9% wF1 | 91% |
| Igbo | 65.7% wF1 | 80.8% wF1 | 81% |
| Yoruba | 57.2% wF1 | 68.0% wF1 | 84% |
| Hausa | 61.9% wF1 | 80.9% wF1 | 77% |
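The "% of SOTA" column is the frozen-backbone score divided by the fine-tuned score, rounded to the nearest percent. The short check below recomputes it from the wF1 values in the table; the variable names are illustrative.

```python
# (frozen wF1, fine-tuned SOTA wF1) per language, copied from the table.
results = {
    "Swahili": (53.7, 60.5),
    "Xitsonga": (50.0, 54.9),
    "Igbo": (65.7, 80.8),
    "Yoruba": (57.2, 68.0),
    "Hausa": (61.9, 80.9),
}

percent_of_sota = {
    language: round(100 * frozen / sota)
    for language, (frozen, sota) in results.items()
}

lowest = min(percent_of_sota.values())
highest = max(percent_of_sota.values())
print(percent_of_sota, lowest, highest)
```

The per-language ratios span the lowest value (Hausa) to the highest (Xitsonga), so the range across all five languages is wider than the Swahili and Xitsonga figures alone.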
Intent Classification
Can the AI tell what the user wants? Book a flight, set an alarm, request a refund, transfer money — these are intents. We tested across 51 languages and 60 different intents, and matched or beat models 100× our size.
Technical detail: We test the strict way: freeze the model, then train the simplest possible classifier (a single linear layer) on its output. A high score means the model already did the hard work of organizing the concepts; the classifier itself adds almost nothing.
On MASSIVE (FitzGerald et al. 2022 — 51 languages × 60 intents), the frozen 15M-parameter encoder scores 73.2% Swahili (above GPT-4o's 70.6%), 72.5% Korean, 69.7% Hindi, 66.5% Amharic — 38–43× over chance, with none of these languages anywhere in the pretraining data. We also report Injongo (8-language Bantu intent benchmark, with published SOTA comparison).
Why a linear probe, not something stronger? We also tried a more powerful classifier — a 2-layer neural network — and it did worse on all three cross-family transfers (Korean / Hindi / Amharic). That is the result: when the simplest classifier beats the stronger one, the concepts are already separated by straight lines in the model's geometry — genuinely organized, not just buried in there waiting for a powerful classifier to untangle.
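The linear-probe protocol can be sketched in a few lines of NumPy: treat the frozen encoder's outputs as fixed features and fit a single softmax layer on top. The features below are synthetic 2-d clusters standing in for encoder outputs, and the hyperparameters are arbitrary; this illustrates the protocol and is not the evaluation code behind the reported numbers.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit one linear layer + softmax with plain gradient descent."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy
        weights -= lr * features.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def predict(features, weights, bias):
    return (features @ weights + bias).argmax(axis=1)

# Synthetic "frozen features": three well-separated intent clusters.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
labels = np.repeat(np.arange(3), 50)
train_x = centers[labels] + 0.3 * rng.normal(size=(150, 2))
test_x = centers[labels] + 0.3 * rng.normal(size=(150, 2))

weights, bias = train_linear_probe(train_x, labels, num_classes=3)
accuracy = float((predict(test_x, weights, bias) == labels).mean())
print(accuracy)
```

The encoder never receives gradients in this setup, so any accuracy the probe reaches reflects structure already present in the features.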
MASSIVE intent: 13 languages, 9 family groups, linear probe on a frozen encoder
Zero target-language training. Structural transfer from isiZulu to 18 languages across 11 families. Linear probe on the frozen encoder, no per-language fine-tuning. Best-variant per language.
| Language | Family | Accuracy |
|---|---|---|
| English | Germanic | 73.5% |
| Swahili | Bantu | 73.2% |
| Korean | Koreanic | 72.5% |
| Tagalog | Austronesian | 70.1% |
| Hindi | Indo-Aryan | 69.7% |
| Urdu | Indo-Aryan | 69.2% |
| Amharic | Semitic | 66.5% |
| Mongolian | Mongolic | 65.8% |
| Javanese | Austronesian | 64.6% |
| Telugu | Dravidian | 63.4% |
| Kannada | Dravidian | 61.7% |
| Tamil | Dravidian | 61.1% |
| Japanese | Japonic | 57.8% |
MASSIVE Swahili — the commercial case
60-intent benchmark. GPT-4o and InkubaLM trained on Swahili; Sozisi trained on isiZulu only (true zero-shot). Sozisi beats GPT-4o (73.2% vs. 70.6%) and trails InkubaLM (79.2%), which was pretrained on Swahili.
| Model | Parameters | Score | Method |
|---|---|---|---|
| Sozisi (Bhala AI) | 15M | 73.2% | Language-level zero-shot · pretrained on isiZulu only (true zero-shot) |
| GPT-4o | ≈1.8T | 70.6% | Task-level zero-shot · Swahili in web pretraining corpus |
| InkubaLM | 422M | 79.2% | Pretrained on Swahili (one of 7 African languages) + web |
Injongo — 8 Bantu languages, head-to-head
Sozisi (frozen backbone) vs AfroXLMR-76L (270M, fine-tuned per language). We match or beat them on 4 of 8.
| Language | Sozisi | Public SOTA | SOTA Model | Δ | Status |
|---|---|---|---|---|---|
| isiXhosa | 98.3% | 97.3% | AfroXLMR | +1.0pp | SOTA |
| KiSwahili | 97.9% | 98.1% | AfroXLMR-76L | −0.2pp | Tied |
| Sesotho | 95.1% | 86.8% | AfroXLMR-76L | +8.3pp | SOTA |
| isiZulu | 93.1% | 89.8% | AfroXLMR-76L | +3.3pp | SOTA |
| ChiShona | 90.5% | 95.3% | AfroXLMR | −4.8pp | behind |
| Lingala | 89.5% | 94.6% | AfroXLMR-76L | −5.1pp | behind |
| Luganda | 81.7% | 91.3% | AfroXLMR-76L | −9.6pp | behind |
| Kinyarwanda | 78.3% | 89.4% | AfroXLMR-76L | −11.1pp | behind |
SOTA on 3 of 8 languages, plus a tie on KiSwahili · average across 8 languages: 90.5%
Cross-Lingual Transfer (NER and beyond)
The intent results above are the headline cross-lingual finding. Structural transfer also extends to other tasks, such as Named Entity Recognition without per-language fine-tuning.
Technical detail: Zero target-language training. Next-word prediction transfer: Nguni sister languages 87.5%, broader Bantu 71.1%, non-Bantu African 75.8%. Cross-family zero-shot: Yoruba 80.5%, Igbo 77.7%, Hausa 69.3% (Afro-Asiatic), Swahili 69.1%. Morphology-aware script handling unlocks 9 writing systems (Arabic, Hangul, Ge'ez, Devanagari, etc.) with no encoder retraining. Per-language adaptation builds in under 2 seconds.
Named Entity Recognition (MasakhaNER, isiZulu)
Frozen backbone + CRF head. 96.6% token accuracy across people, places, organizations, and dates.
See it on your data
Most pilots are live in under two weeks via REST API.