Benchmarks

Every claim, verifiable.

Reproducible numbers across counterfactual fairness, bias removal, hate speech, sentiment, intent, and cross-lingual transfer. All benchmarks use public datasets. Every result is reproducible in under 90 seconds on a laptop GPU.

Calibrated to expert labels · sub-second · auditable by construction

Detect bias by alignment, not absorption

Most AI systems learn to detect bias by training on biased text. They learn the patterns — and in doing so, internalize them. That's why the standard academic probes (WEAT, StereoSet, BBQ) consistently find race, gender, and religion bias inside GPT-4, Gemma, Llama, InkubaLM. Those models had to learn the bias from web-scale data to detect it.

Bhala learns differently. Its encoder was trained on math, logic, and code — not the firehose of internet text that imprints associations like “Black names → loan denial” or “women → less competent” into other models' weights.

When experts hand us labeled examples — Census-coded names, audited fair-lending cases, hate-speech corpora reviewed by domain specialists — Bhala's structural geometry lets us read off the direction those examples define. That direction then fires on any new text you submit. We supply the labels; Bhala supplies the geometry. The encoder doesn't have to believe the examples are biased — it just has to map them consistently.

A microscope resolves what you put under it. It doesn't catch the disease.

Empirical proof — corpus audit

We grep'd the entire training corpus for race-related discourse. Result:

“hispanic”, “muslim”, “criminal”, “racism”, “slavery” — zero hits in metamathqa (the largest training source).
“black” appears only as a color (sock, boxcar, cartridge, card suit) — never as a racial category.
Bertrand-Mullainathan first names (Lakeisha, Tamika, DeShawn, Jamal, Tyrone…) appear in math problems about piggy banks, crayons, distance — neutral commerce contexts only.
Math problems containing Black-coded names use fewer negative-outcome words than math problems containing White-coded names (−0.13 difference per example in metamathqa). The training data goes opposite the direction an internalized-bias hypothesis predicts.

The encoder cannot have learned what wasn't in the training data. Detection comes from supplied expert labels + structural alignment, not from imprinted prejudice.

In practice: The live /demo/audit endpoint computes the EEOC 4/5ths-rule disparate-impact ratio across six regulated domains (criminal sentencing, employment, online moderation, venture capital, police interaction, higher education) for any text you paste. Sub-second per audit, signed receipt. Predictions come from labeled-example projection, not from baked-in associations.
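For readers who want to see the arithmetic behind a disparate-impact check, here is a minimal sketch. It assumes you already have a favourable-outcome rate for each group; the names `audit` and `AuditResult` are illustrative and are not the endpoint's API. Only PASS and FAIL are shown, because the boundary of a MARGINAL band is not stated on this page.

```python
from dataclasses import dataclass

FOUR_FIFTHS = 0.80  # the 4/5ths threshold quoted in the text


@dataclass
class AuditResult:
    ratio: float    # disparate-impact ratio (DIR)
    spd: float      # statistical parity difference (SPD)
    verdict: str    # "PASS" or "FAIL" against the 4/5ths rule


def audit(rate_protected: float, rate_reference: float) -> AuditResult:
    """Compare favourable-outcome rates (fractions in [0, 1]) for two groups."""
    if not 0.0 < rate_reference <= 1.0:
        raise ValueError("reference rate must be in (0, 1]")
    ratio = rate_protected / rate_reference
    spd = rate_protected - rate_reference
    verdict = "PASS" if ratio >= FOUR_FIFTHS else "FAIL"
    return AuditResult(ratio=ratio, spd=spd, verdict=verdict)


failing = audit(0.30, 0.50)   # DIR 0.60, below the threshold
passing = audit(0.45, 0.50)   # DIR 0.90, above the threshold
```

DIR is a ratio and SPD is a difference, so the two can disagree about severity when base rates are low; reporting both, as the audit does, avoids leaning on either one alone.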

Algebraic de-racing. We encode your text once, then compute a neutralized control by subtracting the learned race-direction component (compose, not string substitution). No hardcoded name list — works on any text.
Constructor synthesizes per-domain race direction. A single learned linear constructor maps (T_race, T_domain_context) → T_race_in_domain. Held-out cosine to ground truth is 0.91 across 4 unseen domains, so a new domain can be covered without re-training. The same primitive reached held-out cos 0.999 on the logical AND task.
Disparate-Impact Ratio + 4/5ths rule. Under the EEOC 4/5ths rule, a DIR below 0.80 is generally regarded as evidence of adverse impact, which is the flag that matters in Title VII / ECOA review. We report DIR per domain plus statistical-parity difference (SPD) and a 0–10 risk score. Verdict cards read PASS / MARGINAL / FAIL · 4/5ths rule.
Intersectional via ZFC set intersection. Race × gender per domain, computed by set intersection on the SCI manifold. This is the audit cell the EEOC explicitly requires.
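The "algebraic de-racing" step above can be pictured as removing one direction from an embedding. The sketch below uses plain orthogonal projection on toy 3-d vectors; that choice, the function name `remove_direction`, and the numbers are assumptions for illustration, not a description of the production operator.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_direction(embedding, direction):
    """Subtract the component of `embedding` that lies along `direction`."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        raise ValueError("direction must be non-zero")
    scale = dot(embedding, direction) / norm_sq
    return [e - scale * d for e, d in zip(embedding, direction)]


embedding = [0.8, -0.2, 0.5]        # toy text embedding
race_direction = [1.0, 0.0, 1.0]    # toy stand-in for a learned direction
neutral = remove_direction(embedding, race_direction)
```

After the subtraction the neutralized vector has no component along the direction, while coordinates orthogonal to it are untouched. Scoring both the original and the neutralized vector gives the counterfactual pair the audit compares.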

Methodology: T_race / T_gender / T_religion operators trained on contrastive pairs from Bertrand-Mullainathan first names + US Census 2010 surnames (180K names total) embedded in frames for 18 deployment domains (10 frames per domain, grounded in audit literature). Race-in-domain constructor: 256-d linear over [T_race ⊕ T_dom_ctx] → 128-d L2-normalized; trained on 14 domains, evaluated on 4 held-out. Receipt = SHA-256(text + model_version + scores + timestamp).
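A receipt of this shape can be reproduced with the standard library. The sketch below is one possible reading: the page does not say how the four fields are serialized before hashing, so the canonical-JSON encoding, the field names, and the sample values are assumptions.

```python
import hashlib
import json


def make_receipt(text, model_version, scores, timestamp):
    """Hash the audit inputs and outputs into a hex SHA-256 digest.

    Canonical JSON (sorted keys, fixed separators) is an assumed encoding;
    it makes the digest independent of dictionary ordering.
    """
    payload = json.dumps(
        {
            "text": text,
            "model_version": model_version,
            "scores": scores,
            "timestamp": timestamp,
        },
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


receipt = make_receipt(
    text="Applicant has ten years of experience.",
    model_version="example-v1",                 # hypothetical version label
    scores={"employment": {"dir": 0.91}},       # hypothetical score payload
    timestamp="2026-04-25T12:00:00Z",
)
```

Anyone holding the same four fields can recompute the digest and compare it with the stored one. A digest alone shows that the record was not altered; it does not show who produced it.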

Constructor generalization (4 held-out domains)

criminal_sentencing	0.9116
pain_management	0.9007
promotion_review	0.9511
public_benefits	0.8900
mean	0.9134

cos(constructor-predicted, ground-truth) on domains never seen during training. ≥ 0.70 = endpoint uses synthesized direction (covers any new domain).
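To make the shapes and the 0.70 gate concrete, here is a small sketch. It assumes T_race and T_dom_ctx are 128-d each (so their concatenation is 256-d) and uses random, untrained weights; only the held-out cosines are taken from the table above.

```python
import math
import random

D_IN, D_OUT = 256, 128
COS_GATE = 0.70  # threshold quoted in the text


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def construct(weights, t_race, t_dom_ctx):
    """Linear map over the concatenation [t_race ; t_dom_ctx], then L2-normalize."""
    x = t_race + t_dom_ctx  # list concatenation: 128 + 128 = 256 dims
    y = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return l2_normalize(y)


rng = random.Random(0)
weights = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_OUT)]  # untrained toy weights
t_race = [rng.gauss(0, 1) for _ in range(128)]
t_dom_ctx = [rng.gauss(0, 1) for _ in range(128)]
predicted = construct(weights, t_race, t_dom_ctx)

# Held-out cosines copied from the table above.
held_out = {
    "criminal_sentencing": 0.9116,
    "pain_management": 0.9007,
    "promotion_review": 0.9511,
    "public_benefits": 0.8900,
}
mean_cos = sum(held_out.values()) / len(held_out)
passes_gate = {name: cos >= COS_GATE for name, cos in held_out.items()}
```

All four held-out domains clear the gate by a wide margin, which is why the endpoint falls back to the synthesized direction for unseen domains.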

Operator quality (held-out)

T_race	cos 0.58	lift +0.28
T_gender	cos 0.58	lift +0.40
T_religion	cos 0.70	lift +0.34

Lift over random pairing on held-out evaluation (kill criterion ≥ 0.25).

Why this matters

EEOC 4/5ths rule, ECOA fair-lending audits, EU AI Act Article 27 high-risk system requirements, and NYC LL144 hiring-tool audits all measure DECISION OUTPUT differentials on counterfactual inputs — not embedding cosines or stereotype-pair classification. This audit provides the actual regulator metric (DIR with 4/5ths threshold) for any text, in < 1 second, with a cryptographic receipt. The constructor primitive (held-out cos 0.91) means we generalize to any new domain you care about — no per-domain re-training.

References

Bertrand & Mullainathan (2004), American Economic Review: name-callback discrimination protocol
Caliskan, Bryson & Narayanan (2017), Science: WEAT methodology
EEOC Uniform Guidelines on Employee Selection Procedures (1978): 4/5ths rule
NYC Local Law 144 (2023): automated employment decision-tool audit requirements
IBM AIF360, Microsoft Fairlearn: DIR / SPD / equal-opportunity metric implementations

100% · 28 protected dimensions · 15,966 pairs · 2026-04-25

Bias Removal

Detects and removes bias in AI text on demand. You ask the AI to remove a specific bias (gender, race, religion, age, disability, or any other of the 28 protected dimensions tested) and it does, with an independent classifier confirming the change worked. No retraining, no per-category tuning.

Technical detail: 100% correction rate across 4 industry-standard fairness benchmarks (BBQ, StereoSet, CrowS-Pairs, WinoBias) covering 28 protected dimensions and 15,966 sentence pairs. We apply a “remove bias” control at inference time; an independently trained classifier verifies every shift. Reproducible in under 90 seconds on a laptop GPU.

Benchmark	Cue type	Dimensions	Test pairs	Correction rate
BBQ (Bias Benchmark for Question Answering)	Mixed: demographic stated in disambiguated questions, inferred in ambiguous ones	7	6,864	100.0%
StereoSet (stereotype measurement dataset)	Explicit: direct stereotype associations with named demographic targets	8	6,010	100.0%
CrowS-Pairs (crowdsourced stereotype pairs)	Explicit: paired sentences with explicit demographic contrasts	9	1,508	100.0%
WinoBias (gender bias in coreference resolution)	Implicit: no demographic terms; gender inferred from occupational stereotype via pronoun coreference	4	1,584	100.0%
Combined	All cue types	28	15,966	100.0%

Implicit vs explicit — where is the bias actually located?

Implicit · 4 dims · 1,584 pairs

Demographic group is inferred from role / pronoun, never stated. WinoBias coreference only.

100.0%

Explicit · 17 dims · 7,518 pairs

Demographic group is named in the text. StereoSet + CrowS-Pairs combined.

100.0%

Mixed · 7 dims · 6,864 pairs

BBQ contains both explicit (disambiguated) and implicit (ambiguous) question types.

100.0%

Bias that's only readable when the demographic group is inferred (not stated) is where most output-only audits fail. Surface-level bias detection — keyword filters, demographic-mention checks — passes WinoBias-style sentences because no protected term appears. The 100% correction rate on the implicit slice means our intervention reads the bias from the encoder's geometry, not from the surface tokens.

12 categories · 28 test dimensions across BBQ, StereoSet, CrowS-Pairs, WinoBias

Age
Disability
Gender identity
Gender · occupational stereotype
Gender · pronoun coreference
Nationality
Physical appearance
Profession
Race / color
Religion
Sexual orientation
Socioeconomic status

For compliance teams

· These are published academic benchmarks. Production deployment into your bank or health system requires validation on your own internal text (loan memos, credit decisions, clinical notes), which we conduct together during the pilot.
· Two bias-removal methods were tested on identical data: a statistical baseline (published 2016) and our patented learned method. Both are available in production. The statistical baseline achieved zero failures on all 28 dimensions.
· Results are reproducible by anyone with access to our model and the four public benchmarks. Full methodology is available under NDA.

Companion audit: we ran the same 8-lens probe across all 60 layers of Gemma 4 31B-IT — bias is suppressed at the output, not removed from the geometry.


11 corpora · HateCheck 0.90 · TweetEval-hate 0.77 · 15M params · live 2026-05-02

Hate Speech Detection

Detects hate speech across 12 protected groups. Trained to flag hateful content on social media, news comments, and adversarial inputs — and crucially, to NOT flag people discussing hate (counter-speech), in-group reclamation, or news reporting on slurs.

Technical detail: AUROC 0.90 on HateCheck (adversarial templates), 0.77 on TweetEval-hate (Twitter, OOD). Trained jointly on 11 corpora (~134K labeled examples). 15M-param backbone + 528K-param adapter, about 7× smaller than HateBERT (110M) or HateXplain BERT (110M). Live in production on bhala-api since 2026-05-02.

Corpora trained simultaneously

HateCheck · CONAN · Civil · MHS · SBIC · TweetEval · DynaHate · Stormfront · HS18 · HateXplain · MLMA

0.90

HateCheck AUROC

Adversarial template benchmark — matches 110M-param baselines at 7× fewer params

0.77

TweetEval-hate AUROC

Twitter generalization — without any Twitter pretraining

528K

Trainable parameters

Frozen 15M backbone + 528K-param adapter · ~13 min/epoch on a laptop

Per-group detection rates

One probe per group, trained on HateCheck + CONAN examples for that group. Catch rate = % of hate posts flagged at a 5% false-positive rate. AUROC is the area under the ROC curve (true-positive rate against false-positive rate across all thresholds): 1.0 is perfect, 0.5 is random.

Group	Catch rate @ 5% FPR	AUROC	Test posts
Black people	98.5%	0.9908	66
Disabled people	98.4%	0.9875	123
Women	90.0%	0.9821	240
Migrants	94.3%	0.9810	246
Jewish people	93.6%	0.9807	109
Muslims	88.4%	0.9733	319
LGBT+ people	81.1%	0.9596	297
POC (other)	76.0%	0.9305	75
Generic (max-pool)	30.1%	0.8205	—

The generic max-pool row is a single-probe aggregate baseline — per-group probes outperform it on every group.
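For readers less familiar with AUROC, it can be read as the probability that a randomly chosen hate post scores higher than a randomly chosen non-hate post. The sketch below computes it that way on made-up scores; the function name and the toy numbers are illustrative, not taken from the evaluation.

```python
def auroc(positive_scores, negative_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


hate = [0.9, 0.8, 0.4]             # toy scores for hateful posts
benign = [0.7, 0.3, 0.2, 0.1]      # toy scores for non-hateful posts
toy_auroc = auroc(hate, benign)    # 11 of 12 pairs are ordered correctly
```

The pairwise loop is quadratic, which is fine for a few hundred test posts per group. For larger sets a rank-based implementation gives the same number faster.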

11-corpus production coverage

The production model is trained jointly on 11 hate-speech corpora and evaluated on per-corpus held-out test splits; nine splits are reported below. Each row is a separate domain: adversarial templates, real-world social media, counter-speech, Twitter, extremist forums.

Corpus	AUROC	Test n	Domain
CONAN	0.9278	2,996	counter-speech
Civil Comments	0.9197	6,000	real-world social
Berkeley MHS	0.9090	8,996	real-world social, 135K source
HateCheck	0.9031	1,112	adversarial templates
SBIC	0.8796	11,992	social bias frames
Stormfront	0.8393	3,210	extremist forum
TweetEval-hate	0.7664	3,890	Twitter, adversarial OOD
DynaHate	0.7274	12,342	human-and-model-in-the-loop adversarial
TweetEval-offens.	0.7076	3,972	Twitter offensive ≠ hate (separate task)

TweetEval-hate AUROC 0.77 with no Twitter pretraining is the most distinctive result here — it matches HateBERT (110M params, fully fine-tuned on Reddit) at 7× fewer parameters.

vs. published baselines (overall AUROC)

Model	Params	AUROC	Method
Bhala (ours)	15M	0.9031	Frozen Bhala encoder + small task adapter · 11 corpora joint training · no Twitter pretraining
Detoxify (Unitary)	110M	0.91	RoBERTa fine-tuned on Civil Comments + Jigsaw
Perspective API	proprietary	~0.87	Google Jigsaw, commercial baseline
HateBERT (Caselli 2020)	110M	0.85-0.88	BERT-base fully fine-tuned on Reddit hate corpus
HateXplain BERT	110M	0.83	BERT-base fully fine-tuned with rationale annotations

The decisive measurement

'I hate X' (P=0.87) and 'Saying I hate X is bigoted' (P=0.10) share 80% of their surface tokens but receive hate scores roughly 9× apart. That use/mention distinction emerged from frozen weights, without a single hate-labeled training example.

HateCheck functional breakdown

Mean P(hate) per statement type — shows what the model distinguishes, not just whether it scores correctly.

Statement type	Mean P(hate)	True label
Direct hate ('I hate X')	0.866	hateful
Slurs (raw)	0.855	hateful
Threats	0.835	hateful
Spell attacks (typos, leet)	0.945	hateful
Counter-speech (saying 'I hate X' is bigoted)	0.099	non-hateful
Counter-reference (saying hate is wrong, not using slur)	0.223	non-hateful
Positive identity ('I love X')	0.296	non-hateful
Slur reclamation (in-group)	0.308	non-hateful
Slur homonym ('dyke' as sea wall)	0.268	non-hateful
Profanity not directed at group	0.230	non-hateful
Hate at non-protected target ('I hate pizza')	0.419	non-hateful
Negation ('I don't hate X')	0.415	non-hateful

Production threshold calibration

FPR target	TPR	Use case
1%	64.5%	high-precision review queue
5%	91.5%	default production threshold
10%	97.6%	aggressive-recall mode
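Operating points like these are usually chosen by fixing a false-positive budget on known non-hateful examples and reading off the score threshold. The sketch below shows one way to do that; the scores are toy values and the helper names are hypothetical, so the resulting rates do not match the table.

```python
def threshold_at_fpr(negative_scores, target_fpr):
    """Return a threshold such that flagging scores strictly above it
    flags at most `target_fpr` of the negatives."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(target_fpr * len(ranked) + 1e-9)  # negatives we may flag
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def tpr_at(positive_scores, threshold):
    """Share of positives flagged at the given threshold."""
    return sum(s > threshold for s in positive_scores) / len(positive_scores)


negatives = [round(0.05 * i, 2) for i in range(20)]    # toy scores 0.00 .. 0.95
positives = [0.99, 0.97, 0.93, 0.91, 0.88, 0.60]       # toy hate-post scores

operating_points = {
    fpr: tpr_at(positives, threshold_at_fpr(negatives, fpr))
    for fpr in (0.01, 0.05, 0.10)
}
```

Thresholds calibrated this way hold only for traffic that resembles the calibration negatives, so they are worth re-estimating on your own data before choosing a mode.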

Caveats

· HateCheck + CONAN are adversarial benchmark corpora: a strong signal on robustness, but not a substitute for live firehose validation. Bluesky production eval is in progress.
· Per-group probes each use a separate linear head trained on that group's examples. The generic max-pool probe (AUROC 0.82) shows per-group specialization is worth the marginal cost.
· Probe training takes ~5 minutes on a laptop CPU. Re-training against your own labeled data requires no GPU.

100% sentiment flip · one operator, any language · zero-shot

Sentiment Analysis

Two capabilities for sentiment. Sentiment steering — push AI output toward positive or negative tone using a single fixed vector, in any language, with an audit trail per shift. Sentiment reading — classify how positive or negative a piece of text is, even in languages with little training data.

Technical detail: Steering is inference-time control: 100% sentiment-flip success on every language tested in-family, 77% cross-family transfer to English (Bantu→Indo-European, zero-shot). We apply the same control to any language without per-language tuning. Reading is AfriSenti classification: 77–91% of fine-tuned SOTA across 5 African languages (89–91% on Swahili and Xitsonga) with a frozen 15M-param backbone (vs. 270M+ baselines).

1. Sentiment Steering

Push AI output toward positive or negative tone, predictably, in any language. One vector, estimated once, applied per call with an audit log.

100%

Sentiment flip

Across every language tested in-family

77%

Cross-family transfer

Bantu-estimated operator applied to English, zero-shot

Operator vector

Estimated once, applied to any language without re-training
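One common way to obtain a single steering vector is the difference between the mean positive and mean negative embedding. The sketch below illustrates that idea on 2-d toy vectors; it is an assumption about how such an operator could be estimated, not the estimation procedure used here, and the stand-in classifier is hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def estimate_operator(positive, negative):
    """Difference of class means: points from negative toward positive."""
    return [p - q for p, q in zip(mean(positive), mean(negative))]


def steer(embedding, operator, strength=1.0):
    """Shift an embedding along the operator vector."""
    return [e + strength * o for e, o in zip(embedding, operator)]


def flip_rate(embeddings, operator, classifier):
    """Share of steered embeddings that an independent classifier calls positive."""
    flipped = sum(classifier(steer(e, operator)) for e in embeddings)
    return flipped / len(embeddings)


positive = [[1.0, 0.2], [0.8, -0.1]]    # toy positive-sentiment embeddings
negative = [[-1.0, 0.1], [-0.9, 0.0]]   # toy negative-sentiment embeddings
operator = estimate_operator(positive, negative)

held_out_negative = [[-0.7, 0.3], [-1.2, -0.4]]


def is_positive(vector):
    return vector[0] > 0                # stand-in for the independent verifier


rate = flip_rate(held_out_negative, operator, is_positive)
```

Keeping the verifier separate from the data used to estimate the operator is what makes the flip rate meaningful. A classifier fitted on the same examples would mostly confirm its own training signal.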

2. Sentiment Reading

Classify how positive or negative a piece of text is. Tested on AfriSenti (5 African languages): 77–91% of fine-tuned state-of-the-art with a frozen backbone, and 89–91% on Swahili and Xitsonga.

Frozen backbone, no fine-tuning. 77–91% of fine-tuned SOTA on languages never seen in pretraining (per-language figures in the table that follows).

Language	Sozisi (frozen, 15M params)	SemEval SOTA (270M+, fine-tuned)	% of SOTA
Swahili	53.7% wF1	60.5% wF1	89%
Xitsonga	50.0% wF1	54.9% wF1	91%
Igbo	65.7% wF1	80.8% wF1	81%
Yoruba	57.2% wF1	68.0% wF1	84%
Hausa	61.9% wF1	80.9% wF1	77%
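The "% of SOTA" column is simply the frozen-backbone score divided by the fine-tuned score. A short sketch that recomputes it from the table above (the variable names are illustrative):

```python
# (frozen 15M score, fine-tuned SOTA score), weighted F1 in percent, from the table.
afrisenti = {
    "Swahili": (53.7, 60.5),
    "Xitsonga": (50.0, 54.9),
    "Igbo": (65.7, 80.8),
    "Yoruba": (57.2, 68.0),
    "Hausa": (61.9, 80.9),
}

pct_of_sota = {
    language: round(100 * ours / sota)
    for language, (ours, sota) in afrisenti.items()
}
lowest, highest = min(pct_of_sota.values()), max(pct_of_sota.values())
```

Across the five languages the ratio spans 77% to 91%, with Swahili and Xitsonga at the top of the range.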

Inference-time control — flip accuracy

A named control applied at inference time. An independent classifier verifies the shift took effect.

Task	Zulu	Swahili	English	Test cases
Sentiment shift (negative → positive)	100%	100%	—	263
Intent redirect (12 categories)	—	94%	77%	1,969

Today's model was pretrained almost entirely on isiZulu. English results come from generalization — applying learned structure to a language the model never saw at scale. We are now training the English-native version, and expect English to match or exceed the 94% Swahili number.

Beats GPT-4o · matches or beats SOTA on 4 of 8 Bantu · 18× smaller

Intent Classification

Can the AI tell what the user wants? Book a flight, set an alarm, request a refund, transfer money — these are intents. We tested across 51 languages and 60 different intents, and matched or beat models 100× our size.

Technical detail: We test the strict way: freeze the model, then train the simplest possible classifier (a single linear layer) on its output. A high score means the model already did the hard work of organizing the concepts; the classifier itself adds almost nothing.

On MASSIVE (FitzGerald et al. 2022 — 51 languages × 60 intents), the frozen 15M-parameter encoder scores 73.2% Swahili (above GPT-4o's 70.6%), 72.5% Korean, 69.7% Hindi, 66.5% Amharic — 38–43× over chance, with none of these languages anywhere in the pretraining data. We also report Injongo (8-language Bantu intent benchmark, with published SOTA comparison).

Why a linear probe, not something stronger? We also tried a more powerful classifier — a 2-layer neural network — and it did worse on all three cross-family transfers (Korean / Hindi / Amharic). That is the result: when the simplest classifier beats the stronger one, the concepts are already separated by straight lines in the model's geometry — genuinely organized, not just buried in there waiting for a powerful classifier to untangle.
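A linear probe is small enough to write out in full. The sketch below trains a single softmax layer on fixed feature vectors; the 2-d synthetic clusters stand in for frozen encoder outputs, and every name and hyperparameter is an illustrative assumption rather than the evaluation code.

```python
import math
import random


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def logits(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def train_linear_probe(features, labels, n_classes, rng, lr=0.1, epochs=50):
    """Fit one linear layer with softmax cross-entropy; features stay fixed."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            probs = softmax(logits(W, b, x))
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)
                b[k] -= lr * grad
                for j in range(dim):
                    W[k][j] -= lr * grad * x[j]
    return W, b


def predict(W, b, x):
    z = logits(W, b, x)
    return z.index(max(z))


rng = random.Random(0)
centers = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0)]  # three toy "intents"


def sample(n_per_class):
    xs, ys = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            xs.append([cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)])
            ys.append(label)
    return xs, ys


train_x, train_y = sample(30)
test_x, test_y = sample(30)
W, b = train_linear_probe(train_x, train_y, n_classes=3, rng=rng)
accuracy = sum(predict(W, b, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
```

Because the probe can only draw straight decision boundaries, its accuracy mostly reflects how the features are laid out. That is the property the paragraph above relies on.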

MASSIVE intent: 13 languages, 9 families, linear probe on a frozen encoder

Zero target-language training. Structural transfer from isiZulu to 18 languages across 11 families. Linear probe on the frozen encoder, no per-language fine-tuning. Best-variant per language.

Language	Family	Accuracy
English	Germanic	73.5%
Swahili	Bantu	73.2%
Korean	Koreanic	72.5%
Tagalog	Austronesian	70.1%
Hindi	Indo-Aryan	69.7%
Urdu	Indo-Aryan	69.2%
Amharic	Semitic	66.5%
Mongolian	Mongolic	65.8%
Javanese	Austronesian	64.6%
Telugu	Dravidian	63.4%
Kannada	Dravidian	61.7%
Tamil	Dravidian	61.1%
Japanese	Japonic	57.8%

MASSIVE Swahili — the commercial case

60-intent benchmark. GPT-4o and InkubaLM trained on Swahili; Sozisi trained on isiZulu only (true zero-shot). Sozisi beats GPT-4o (73.2% vs. 70.6%) and trails InkubaLM (79.2%).

Model	Parameters	Score	Method
Sozisi (Bhala AI)	15M	73.2%	Language-level zero-shot · pretrained on isiZulu only (true zero-shot)
GPT-4o	≈1.8T (unofficial estimate; not disclosed by OpenAI)	70.6%	Task-level zero-shot · Swahili in web pretraining corpus
InkubaLM	422M	79.2%	Pretrained on Swahili (one of 7 African languages) + web

Injongo — 8 Bantu languages, head-to-head

Sozisi (frozen backbone) vs AfroXLMR-76L (270M, fine-tuned per language). We match or beat them on 4 of 8.

Sozisi (ours)

15M params

Frozen backbone, isiZulu pretraining only

AfroXLMR-76L

270M

Fine-tuned per target language

Efficiency

18×

Smaller model, matches or beats on 4 of 8 languages

Language	Sozisi	Public SOTA	SOTA Model	Δ	Status
isiXhosa	98.3%	97.3%	AfroXLMR	+1.0pp	SOTA
KiSwahili	97.9%	98.1%	AfroXLMR-76L	−0.2pp	Tied
Sesotho	95.1%	86.8%	AfroXLMR-76L	+8.3pp	SOTA
isiZulu	93.1%	89.8%	AfroXLMR-76L	+3.3pp	SOTA
ChiShona	90.5%	95.3%	AfroXLMR	−4.8pp	behind
Lingala	89.5%	94.6%	AfroXLMR-76L	−5.1pp	behind
Luganda	81.7%	91.3%	AfroXLMR-76L	−9.6pp	behind
Kinyarwanda	78.3%	89.4%	AfroXLMR-76L	−11.1pp	behind

SOTA on 3 of 8 languages, plus a tie on KiSwahili · average across 8 languages: 90.5%
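The summary line can be checked directly from the table. The sketch below recomputes the deltas, the status counts, and the average; the 0.5-point band used to call a result "tied" is an assumption chosen so that KiSwahili's −0.2pp counts as a tie.

```python
TIE_BAND_PP = 0.5  # assumed: gaps smaller than this count as a tie

# (language, Sozisi accuracy, public SOTA accuracy) from the table above.
injongo = [
    ("isiXhosa", 98.3, 97.3),
    ("KiSwahili", 97.9, 98.1),
    ("Sesotho", 95.1, 86.8),
    ("isiZulu", 93.1, 89.8),
    ("ChiShona", 90.5, 95.3),
    ("Lingala", 89.5, 94.6),
    ("Luganda", 81.7, 91.3),
    ("Kinyarwanda", 78.3, 89.4),
]


def status(delta_pp):
    if abs(delta_pp) < TIE_BAND_PP:
        return "tied"
    return "SOTA" if delta_pp > 0 else "behind"


deltas = {lang: round(ours - sota, 1) for lang, ours, sota in injongo}
counts = {}
for delta in deltas.values():
    counts[status(delta)] = counts.get(status(delta), 0) + 1
average = sum(ours for _, ours, _ in injongo) / len(injongo)
```

With these inputs the counts come out as 3 SOTA, 1 tied, and 4 behind, and the mean accuracy is 90.55%.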

Beyond intent · zero retraining · multiple Bantu languages

Cross-Lingual Transfer (NER and beyond)

The intent results above (13 languages, 9 families in the MASSIVE table) are the headline cross-lingual finding. Below: structural transfer also extends to other tasks, including Named Entity Recognition without per-language fine-tuning.

Technical detail: Zero target-language training. Next-word prediction transfer: Nguni sister languages 87.5%, broader Bantu 71.1%, non-Bantu African 75.8%. Cross-family zero-shot: Yoruba 80.5%, Igbo 77.7%, Hausa 69.3% (Afro-Asiatic), Swahili 69.1%. Morphology-aware script handling unlocks 9 writing systems (Arabic, Hangul, Ge'ez, Devanagari, etc.) with no encoder retraining. Per-language adaptation builds in under 2 seconds.

Named Entity Recognition (MasakhaNER, isiZulu)

Frozen backbone + CRF head. 96.6% token accuracy across people, places, organizations, and dates.

96.6%

Token Accuracy

77.7%

Span F1

78.2%

Precision

77.2%

Recall
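Span F1 is the harmonic mean of precision and recall, so the three numbers above can be cross-checked in a few lines. This is a sketch of the standard formula, not the evaluation script.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


span_f1 = f1_score(78.2, 77.2)  # precision and recall in percent, from above
```

The result rounds to 77.7, matching the reported Span F1. Token accuracy (96.6%) is much higher because most tokens are not part of any entity and are easy to label correctly.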

See it on your data

Most pilots are live in under two weeks via REST API.
