Benchmarks

Every claim, verifiable.

Reproducible numbers across counterfactual fairness, bias removal, hate speech, sentiment, intent, and cross-lingual transfer. All benchmarks use public datasets. Every result is reproducible in under 90 seconds on a laptop GPU.

Calibrated to expert labels · sub-second · auditable by construction

Detect bias by alignment, not absorption

Most AI systems learn to detect bias by training on biased text. They learn the patterns — and in doing so, internalize them. That's why the standard academic probes (WEAT, StereoSet, BBQ) consistently find race, gender, and religion bias inside GPT-4, Gemma, Llama, InkubaLM. Those models had to learn the bias from web-scale data to detect it.

Bhala learns differently. Its encoder was trained on math, logic, and code — not the firehose of internet text that imprints associations like “Black names → loan denial” or “women → less competent” into other models' weights.

When experts hand us labeled examples — Census-coded names, audited fair-lending cases, hate-speech corpora reviewed by domain specialists — Bhala's structural geometry lets us read off the direction those examples define. That direction then fires on any new text you submit. We supply the labels; Bhala supplies the geometry. The encoder doesn't have to believe the examples are biased — it just has to map them consistently.
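One way to picture "reading off the direction" is a difference of class means over encoder outputs. The sketch below is an illustration only: the difference-of-means estimator, the function names, and the toy 3-d vectors are assumptions, not the production method.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def label_direction(group_a, group_b):
    """Unit vector pointing from the mean of group_b to the mean of group_a."""
    mu_a, mu_b = mean_vector(group_a), mean_vector(group_b)
    return unit([a - b for a, b in zip(mu_a, mu_b)])

def project(embedding, direction):
    """Signed score of a new embedding along the labeled direction."""
    return sum(e * d for e, d in zip(embedding, direction))

# Toy embeddings standing in for encoder outputs of expert-labeled examples.
group_a = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.1]]
group_b = [[-1.0, 0.1, 0.0], [-0.8, 0.1, 0.1]]

direction = label_direction(group_a, group_b)
score_new = project([0.5, 0.3, 0.0], direction)
```

The encoder only has to place the labeled examples consistently; the labels, not the encoder, decide what the direction means.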

A microscope resolves what you put under it. It doesn't catch the disease.

Empirical proof — corpus audit

We grep'd the entire training corpus for race-related discourse. Result:

  • “hispanic”, “muslim”, “criminal”, “racism”, “slavery” — zero hits in metamathqa (the largest training source).
  • “black” appears only as a color (sock, boxcar, cartridge, card suit) — never as a racial category.
  • Bertrand-Mullainathan first names (Lakeisha, Tamika, DeShawn, Jamal, Tyrone…) appear in math problems about piggy banks, crayons, distance — neutral commerce contexts only.
  • Math problems containing Black-coded names use fewer negative-outcome words than math problems containing White-coded names (−0.13 difference per example in metamathqa). The training data goes opposite the direction an internalized-bias hypothesis predicts.

The encoder cannot have learned what wasn't in the training data. Detection comes from supplied expert labels + structural alignment, not from imprinted prejudice.

In practice: The live /demo/audit endpoint computes the EEOC 4/5ths-rule disparate-impact ratio across six regulated domains (criminal sentencing, employment, online moderation, venture capital, police interaction, higher education) for any text you paste. Sub-second per audit, signed receipt. Predictions come from labeled-example projection, not from baked-in associations.

  1. Algebraic de-racing. We encode your text once, then compute a neutralized control by subtracting the learned race-direction component (compose, not string substitution). No hardcoded name list — works on any text.
  2. Constructor synthesizes per-domain race direction. A single learned linear constructor takes (T_race, T_domain_context) → T_race_in_domain. Held-out cos to ground-truth = 0.91 across 4 unseen domains — generalizes to any new domain without re-training. Same primitive that won on logic AND with held-out cos 0.999.
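  3. Disparate-Impact Ratio + 4/5ths rule. Under the EEOC Uniform Guidelines, a DIR below 0.80 is generally regarded as evidence of adverse impact; we use it as the fail threshold for Title VII / ECOA screening. We report DIR per domain plus statistical-parity difference (SPD) and a 0–10 risk score. Verdict cards say PASS / MARGINAL / FAIL · 4/5ths rule.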
  4. Intersectional via ZFC set intersection. Race × gender per domain via set-intersection on the SCI manifold. The audit cell EEOC explicitly requires.
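Step 1 above (subtracting the learned direction component) can be sketched as a plain vector projection. This is a minimal illustration under stated assumptions: the function names and the toy vectors are hypothetical, and the production system may compose the operator differently.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def remove_component(embedding, direction):
    """Return the embedding with its component along `direction` subtracted."""
    coeff = dot(embedding, direction) / dot(direction, direction)
    return [e - coeff * d for e, d in zip(embedding, direction)]

# Toy values: a 3-d "text embedding" and a unit "race direction".
embedding = [2.0, 1.0, -1.0]
race_direction = [1.0, 0.0, 0.0]

neutral = remove_component(embedding, race_direction)
residual = dot(neutral, race_direction)  # component left along the direction
```

Because the control is computed from the encoded text itself, no name list or string substitution is involved; everything orthogonal to the direction is left untouched.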

Methodology: T_race / T_gender / T_religion operators trained on contrastive pairs from Bertrand-Mullainathan first names + US Census 2010 surnames (180K names total) embedded in frames for 18 deployment domains (10 frames per domain, grounded in audit literature). Race-in-domain constructor: 256-d linear over [T_race ⊕ T_dom_ctx] → 128-d L2-normalized; trained on 14 domains, evaluated on 4 held-out. Receipt = SHA-256(text + model_version + scores + timestamp).
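The receipt formula can be sketched with the standard library. The source specifies only SHA-256 over text, model version, scores, and timestamp; the canonical-JSON serialization, the field names, and the function name below are assumptions made so the hash is reproducible.

```python
import hashlib
import json

def audit_receipt(text, model_version, scores, timestamp):
    """Hex SHA-256 digest over a canonical serialization of the audit inputs.

    Assumption: fields are serialized as sorted-key JSON so that the same
    inputs always produce the same bytes, and therefore the same digest.
    """
    payload = json.dumps(
        {
            "text": text,
            "model_version": model_version,
            "scores": scores,
            "timestamp": timestamp,
        },
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical values, for illustration only.
receipt = audit_receipt(
    "sample text", "v1", {"dir": 0.75, "spd": -0.1}, "2026-05-02T00:00:00Z"
)
```

Anyone holding the same four inputs can recompute the digest and compare it to the receipt; changing any score or the timestamp changes the hash.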

Constructor generalization (4 held-out domains)

| Held-out domain | cos |
| --- | --- |
| criminal_sentencing | 0.9116 |
| pain_management | 0.9007 |
| promotion_review | 0.9511 |
| public_benefits | 0.8900 |
| mean | 0.9134 |

cos(constructor-predicted, ground-truth) on domains never seen during training. ≥ 0.70 = endpoint uses synthesized direction (covers any new domain).
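The cosine check and the 0.70 gate can be written out directly. The per-domain values are the ones reported above; the helper names and the toy vectors are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Held-out cosines as reported for the four unseen domains.
HELD_OUT_COS = {
    "criminal_sentencing": 0.9116,
    "pain_management": 0.9007,
    "promotion_review": 0.9511,
    "public_benefits": 0.8900,
}
SYNTHESIS_THRESHOLD = 0.70  # at or above this, the synthesized direction is used

mean_cos = sum(HELD_OUT_COS.values()) / len(HELD_OUT_COS)
uses_synthesized = {d: c >= SYNTHESIS_THRESHOLD for d, c in HELD_OUT_COS.items()}

# Toy example of the metric itself: two vectors 45 degrees apart.
toy_cos = cosine([1.0, 0.0], [1.0, 1.0])
```

All four held-out domains clear the threshold, and the unrounded mean is about 0.913.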

Operator quality (held-out)

| Operator | Held-out cos | Lift |
| --- | --- | --- |
| T_race | 0.58 | +0.28 |
| T_gender | 0.58 | +0.40 |
| T_religion | 0.70 | +0.34 |

Lift over random pairing on held-out evaluation (kill criterion ≥ 0.25).

Why this matters

EEOC 4/5ths rule, ECOA fair-lending audits, EU AI Act Article 27 high-risk system requirements, and NYC LL144 hiring-tool audits all measure DECISION OUTPUT differentials on counterfactual inputs — not embedding cosines or stereotype-pair classification. This audit provides the actual regulator metric (DIR with 4/5ths threshold) for any text, in < 1 second, with a cryptographic receipt. The constructor primitive (held-out cos 0.91) means we generalize to any new domain you care about — no per-domain re-training.
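The two regulator metrics reduce to simple arithmetic over binary decisions. The sketch below uses the common ratio-of-selection-rates definition (as in fairness toolkits such as AIF360); the function names and toy decisions are assumptions, and the MARGINAL band is omitted because its cutoffs are not stated here.

```python
def selection_rate(decisions):
    """Fraction of favorable outcomes (1 = favorable, 0 = unfavorable)."""
    return sum(decisions) / len(decisions)

def disparate_impact_ratio(protected, reference):
    """Selection rate of the protected group divided by the reference group's."""
    return selection_rate(protected) / selection_rate(reference)

def statistical_parity_difference(protected, reference):
    """Selection rate of the protected group minus the reference group's."""
    return selection_rate(protected) - selection_rate(reference)

def four_fifths_verdict(dir_value, threshold=0.80):
    """PASS if the ratio meets the 4/5ths threshold, else FAIL."""
    return "PASS" if dir_value >= threshold else "FAIL"

# Toy decisions: 3 of 10 favorable vs. 5 of 10 favorable.
protected = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
reference = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

dir_value = disparate_impact_ratio(protected, reference)
spd_value = statistical_parity_difference(protected, reference)
verdict = four_fifths_verdict(dir_value)
```

In this toy case the ratio is 0.3 / 0.5 = 0.6, which falls below 0.80 and fails the 4/5ths rule.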

References

  • Bertrand & Mullainathan (2004), AER: name-callback discrimination protocol
  • Caliskan, Bryson & Narayanan (2017), Science: WEAT methodology
  • EEOC Uniform Guidelines on Employee Selection Procedures (1978): 4/5ths rule
  • NYC Local Law 144 (2023): automated employment decision-tool audit requirements
  • IBM AIF360, Microsoft Fairlearn: DIR / SPD / equal-opportunity metric implementations
100% · 28 protected dimensions · 15,966 pairs · 2026-04-25

Bias Removal

Detects and removes bias in AI text on demand. You ask the AI to remove a specific bias — gender, race, religion, age, disability, or any of 23 other categories — and it does, with an independent classifier confirming the change worked. No retraining, no per-category tuning.

Technical detail: 100% correction rate across 4 industry-standard fairness benchmarks (BBQ, StereoSet, CrowS-Pairs, WinoBias) covering 28 protected dimensions and 15,966 sentence pairs. We apply a “remove bias” control at inference time; an independently trained classifier verifies every shift. Reproducible in under 90 seconds on a laptop GPU.

| Benchmark | Cue type | Dimensions | Test pairs | Correction rate |
| --- | --- | --- | --- | --- |
| BBQ (Bias Benchmark for Question Answering) | Mixed: demographic stated in disambiguated questions; inferred in ambiguous ones | 7 | 6,864 | 100.0% |
| StereoSet (stereotype measurement dataset) | Explicit: direct stereotype associations with named demographic targets | 8 | 6,010 | 100.0% |
| CrowS-Pairs (crowdsourced stereotype pairs) | Explicit: paired sentences with explicit demographic contrasts | 9 | 1,508 | 100.0% |
| WinoBias (gender bias in coreference resolution) | Implicit: no demographic terms; gender inferred from occupational stereotype via pronoun coreference | 4 | 1,584 | 100.0% |
| Combined | All cue types | 28 | 15,966 | 100.0% |

Implicit vs explicit — where is the bias actually located?

Implicit · 4 dims · 1,584 pairs
Demographic group is inferred from role / pronoun, never stated. WinoBias coreference only.
100.0%
Explicit · 17 dims · 7,518 pairs
Demographic group is named in the text. StereoSet + CrowS-Pairs combined.
100.0%
Mixed · 7 dims · 6,864 pairs
BBQ contains both explicit (disambiguated) and implicit (ambiguous) question types.
100.0%

Bias that's only readable when the demographic group is inferred (not stated) is where most output-only audits fail. Surface-level bias detection — keyword filters, demographic-mention checks — passes WinoBias-style sentences because no protected term appears. The 100% correction rate on the implicit slice means our intervention reads the bias from the encoder's geometry, not from the surface tokens.

12 categories · 28 test dimensions across BBQ, StereoSet, CrowS-Pairs, WinoBias

  • Age
  • Disability
  • Gender identity
  • Gender · occupational stereotype
  • Gender · pronoun coreference
  • Nationality
  • Physical appearance
  • Profession
  • Race / color
  • Religion
  • Sexual orientation
  • Socioeconomic status

For compliance teams

  • These are published academic benchmarks. Production deployment into your bank or health system requires validation on your own internal text (loan memos, credit decisions, clinical notes), which we conduct together during the pilot.
  • Two bias-removal methods were tested on identical data: a statistical baseline (published 2016) and our patented learned method. Both are available in production. The statistical baseline achieved zero failures on all 28 categories.
  • Results are reproducible by anyone with access to our model and the four public benchmarks. Full methodology is available under NDA.
Companion audit: we ran the same 8-lens probe across all 60 layers of Gemma 4 31B-IT — bias is suppressed at the output, not removed from the geometry.
11 corpora · HateCheck 0.90 · TweetEval-hate 0.77 · 15M params · live 2026-05-02

Hate Speech Detection

Detects hate speech across 12 protected groups. Trained to flag hateful content on social media, news comments, and adversarial inputs — and crucially, to NOT flag people discussing hate (counter-speech), in-group reclamation, or news reporting on slurs.

Technical detail: AUROC 0.90 on HateCheck (adversarial templates), 0.77 on TweetEval-hate (Twitter, OOD). Trained jointly on 11 corpora (~134K labeled examples). 15M-param backbone + 528K-param adapter, roughly 7× smaller than HateBERT (110M) or HateXplain BERT (110M). Live in production on bhala-api since 2026-05-02.

11
Corpora trained simultaneously
HateCheck · CONAN · Civil · MHS · SBIC · TweetEval · DynaHate · Stormfront · HS18 · HateXplain · MLMA
0.90
HateCheck AUROC
Adversarial template benchmark — matches 110M-param baselines at 7× fewer params
0.77
TweetEval-hate AUROC
Twitter generalization — without any Twitter pretraining
528K
Trainable parameters
Frozen 15M backbone + 528K-param adapter · ~13 min/epoch on a laptop

Per-group detection rates

One probe per group, trained on HateCheck + CONAN examples for that group. Catch rate = % of hate posts flagged at a 5% false-positive rate. AUROC is the area under the ROC curve (true-positive rate against false-positive rate across all thresholds): 1.0 is perfect, 0.5 is random.
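AUROC has a convenient pairwise reading: it is the probability that a randomly chosen hateful post scores higher than a randomly chosen benign one. A minimal sketch of that definition follows; the scores are made-up toy values, not benchmark data.

```python
def auroc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy probe scores: three hateful posts, three benign posts.
hateful = [0.9, 0.8, 0.4]
benign = [0.1, 0.3, 0.5]

toy_auroc = auroc(hateful, benign)  # 8 of 9 pairs are ranked correctly
```

The quadratic loop is fine for a few hundred test posts per group; a rank-based formula gives the same number on larger sets.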

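| Group | Catch rate @ 5% FPR | AUROC | Test posts |
| --- | --- | --- | --- |
| Black people | 98.5% | 0.9908 | 66 |
| Disabled people | 98.4% | 0.9875 | 123 |
| Women | 90.0% | 0.9821 | 240 |
| Migrants | 94.3% | 0.9810 | 246 |
| Jewish people | 93.6% | 0.9807 | 109 |
| Muslims | 88.4% | 0.9733 | 319 |
| LGBT+ people | 81.1% | 0.9596 | 297 |
| POC (other) | 76.0% | 0.9305 | 75 |
| Generic (max-pool) | 30.1% | 0.8205 | |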

The generic max-pool row is a single-probe aggregate baseline — per-group probes outperform it on every group.

11-corpus production coverage

The production model is trained jointly on 11 hate-speech corpora and evaluated on per-corpus held-out test splits. Each row is a separate domain: adversarial templates, real-world social media, counter-speech, Twitter, extremist forums.

| Corpus | AUROC | Test n | Domain |
| --- | --- | --- | --- |
| CONAN | 0.9278 | 2,996 | counter-speech |
| Civil Comments | 0.9197 | 6,000 | real-world social |
| Berkeley MHS | 0.9090 | 8,996 | real-world social, 135K source |
| HateCheck | 0.9031 | 1,112 | adversarial templates |
| SBIC | 0.8796 | 11,992 | social bias frames |
| Stormfront | 0.8393 | 3,210 | extremist forum |
| TweetEval-hate | 0.7664 | 3,890 | Twitter; adversarial OOD broken |
| DynaHate | 0.7274 | 12,342 | human-and-model-in-the-loop adversarial |
| TweetEval-offens. | 0.7076 | 3,972 | Twitter offensive ≠ hate (separate task) |

TweetEval-hate AUROC 0.77 with no Twitter pretraining is the most distinctive result here — it matches HateBERT (110M params, fully fine-tuned on Reddit) at 7× fewer parameters.

vs. published baselines (overall AUROC)

| Model | Params | AUROC | Method |
| --- | --- | --- | --- |
| Bhala (ours) | 15M | 0.9031 | Frozen Bhala encoder + small task adapter · 11 corpora joint training · no Twitter pretraining |
| Detoxify (Unitary) | 110M | 0.91 | RoBERTa fine-tuned on Civil Comments + Jigsaw |
| Perspective API | proprietary | ~0.87 | Google Jigsaw, commercial baseline |
| HateBERT (Caselli 2020) | 110M | 0.85-0.88 | BERT-base fully fine-tuned on Reddit hate corpus |
| HateXplain BERT | 110M | 0.83 | BERT-base fully fine-tuned with rationale annotations |

The decisive measurement

'I hate X' (P=0.87) and 'Saying I hate X is bigoted' (P=0.10) share 80% of their surface tokens, yet their hate scores differ by roughly 9×. That use/mention distinction emerged from frozen weights, without a single hate-labeled training example.

HateCheck functional breakdown

Mean P(hate) per statement type — shows what the model distinguishes, not just whether it scores correctly.

| Statement type | Mean P(hate) | True label |
| --- | --- | --- |
| Direct hate ('I hate X') | 0.866 | hateful |
| Slurs (raw) | 0.855 | hateful |
| Threats | 0.835 | hateful |
| Spell attacks (typos, leet) | 0.945 | hateful |
| Counter-speech (saying 'I hate X' is bigoted) | 0.099 | non-hateful |
| Counter-reference (saying hate is wrong, not using slur) | 0.223 | non-hateful |
| Positive identity ('I love X') | 0.296 | non-hateful |
| Slur reclamation (in-group) | 0.308 | non-hateful |
| Slur homonym ('dyke' as sea wall) | 0.268 | non-hateful |
| Profanity not directed at group | 0.230 | non-hateful |
| Hate at non-protected target ('I hate pizza') | 0.419 | non-hateful |
| Negation ('I don't hate X') | 0.415 | non-hateful |

Production threshold calibration

| FPR target | TPR | Use case |
| --- | --- | --- |
| 1% | 64.5% | high-precision review queue |
| 5% | 91.5% | default production threshold |
| 10% | 97.6% | aggressive-recall mode |
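Each row of the calibration table corresponds to picking a score threshold from benign examples and then measuring recall on hateful ones. The sketch below shows one way to do that; the function names and toy scores are assumptions, and the production calibration procedure may differ.

```python
def threshold_at_fpr(neg_scores, target_fpr):
    """Score cutoff such that at most `target_fpr` of negatives score above it."""
    ranked = sorted(neg_scores, reverse=True)
    # Number of negatives we can afford to flag (epsilon guards float rounding).
    allowed = int(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]

def tpr_at_fpr(pos_scores, neg_scores, target_fpr):
    """Fraction of positives scoring strictly above the calibrated cutoff."""
    cutoff = threshold_at_fpr(neg_scores, target_fpr)
    return sum(s > cutoff for s in pos_scores) / len(pos_scores)

# Toy data: 20 benign scores spread over [0, 0.95], four hateful scores.
benign = [i / 20 for i in range(20)]
hateful = [0.99, 0.97, 0.95, 0.50]

cutoff = threshold_at_fpr(benign, 0.05)
recall = tpr_at_fpr(hateful, benign, 0.05)
realized_fpr = sum(s > cutoff for s in benign) / len(benign)
```

Lowering the FPR target raises the cutoff and trades recall for precision, which is the pattern the table shows between the 1%, 5%, and 10% rows.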

Caveats

  • HateCheck + CONAN are adversarial benchmark corpora: a strong signal on robustness, but not a substitute for live firehose validation. Bluesky production eval is in progress.
  • Per-group probes each use a separate linear head trained on that group's examples. The generic max-pool probe (AUROC 0.82) shows per-group specialization is worth the marginal cost.
  • Probe training takes ~5 minutes on a laptop CPU. Re-training against your own labeled data requires no GPU.
100% sentiment flip · one operator, any language · zero-shot

Sentiment Analysis

Two capabilities for sentiment. Sentiment steering — push AI output toward positive or negative tone using a single fixed vector, in any language, with an audit trail per shift. Sentiment reading — classify how positive or negative a piece of text is, even in languages with little training data.

Technical detail: Steering is inference-time control: 100% sentiment-flip success on every language tested in-family, 77% cross-family transfer to English (Bantu→Indo-European, zero-shot). We apply the same control to any language without per-language tuning. Reading is AfriSenti classification: 77–91% of fine-tuned SOTA across 5 African languages (89–91% on Swahili and Xitsonga) with a frozen 15M-param backbone (vs. 270M+ baselines).

1. Sentiment Steering

Push AI output toward positive or negative tone, predictably, in any language. One vector, estimated once, applied per call with an audit log.
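Applying a single fixed vector at inference time can be pictured as adding a scaled operator to a hidden representation and logging the before/after readout. Everything in this sketch is hypothetical (the additive form, the `strength` parameter, the linear readout standing in for the independent classifier, and the toy numbers); it illustrates the idea rather than the production mechanism.

```python
def steer(hidden, operator, strength=1.0):
    """Shift a hidden vector along a fixed operator vector."""
    return [h + strength * o for h, o in zip(hidden, operator)]

def sentiment_readout(hidden, readout):
    """Stand-in for an independent classifier: positive score = positive tone."""
    return sum(h * r for h, r in zip(hidden, readout))

# Toy values: a negative-leaning hidden state and one fixed operator.
hidden = [-1.0, 0.5]
operator = [1.0, 0.0]
readout = [1.0, 0.0]

steered = steer(hidden, operator, strength=2.0)

# One audit-log entry per shift: what was applied and what the check saw.
audit_entry = {
    "strength": 2.0,
    "score_before": sentiment_readout(hidden, readout),
    "score_after": sentiment_readout(steered, readout),
}
flipped = audit_entry["score_before"] < 0 < audit_entry["score_after"]
```

Keeping the verification score in the log entry is what makes each shift auditable after the fact.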

100%
Sentiment flip
Across every language tested in-family
77%
Cross-family transfer
Bantu-estimated operator applied to English, zero-shot
1
Operator vector
Estimated once, applied to any language without re-training

2. Sentiment Reading

Classify how positive or negative a piece of text is. Tested on AfriSenti (5 African languages): 77–91% of fine-tuned state-of-the-art with a frozen backbone.

Frozen backbone, no fine-tuning. 89–91% of fine-tuned SOTA on Swahili and Xitsonga, 77–84% on Igbo, Yoruba, and Hausa, all languages never seen in pretraining.

| Language | Sozisi (frozen, 15M params) | SemEval SOTA (270M+, fine-tuned) | % of SOTA |
| --- | --- | --- | --- |
| Swahili | 53.7% wF1 | 60.5% wF1 | 89% |
| Xitsonga | 50.0% wF1 | 54.9% wF1 | 91% |
| Igbo | 65.7% wF1 | 80.8% wF1 | 81% |
| Yoruba | 57.2% wF1 | 68.0% wF1 | 84% |
| Hausa | 61.9% wF1 | 80.9% wF1 | 77% |
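The "% of SOTA" column is the ratio of the two weighted-F1 scores, rounded to the nearest percent. A quick check using the figures from the table (the variable names are ours):

```python
# (Sozisi wF1, SemEval SOTA wF1) per language, as reported in the table.
AFRISENTI = {
    "Swahili": (53.7, 60.5),
    "Xitsonga": (50.0, 54.9),
    "Igbo": (65.7, 80.8),
    "Yoruba": (57.2, 68.0),
    "Hausa": (61.9, 80.9),
}

# Ratio of frozen-backbone score to fine-tuned SOTA, as a whole percent.
pct_of_sota = {
    lang: round(100 * ours / sota) for lang, (ours, sota) in AFRISENTI.items()
}

lowest = min(pct_of_sota.values())
highest = max(pct_of_sota.values())
```

Across the five languages the ratio ranges from 77% (Hausa) to 91% (Xitsonga).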

Inference-time control — flip accuracy

A named control applied at inference time. An independent classifier verifies the shift took effect.

| Task | Zulu | Swahili | English | Test cases |
| --- | --- | --- | --- | --- |
| Sentiment shift (negative → positive) | 100% | 100% | | 263 |
| Intent redirect (12 categories) | | 94% | 77% | 1,969 |

Today's model was pretrained almost entirely on isiZulu. English results come from generalization — applying learned structure to a language the model never saw at scale. We are now training the English-native version, and expect English to match or exceed the 94% Swahili number.

Beats GPT-4o · matches or beats SOTA on 4 of 8 Bantu · 18× smaller

Intent Classification

Can the AI tell what the user wants? Book a flight, set an alarm, request a refund, transfer money — these are intents. We tested across 51 languages and 60 different intents, and matched or beat models 100× our size.

Technical detail: We test the strict way: freeze the model, then train the simplest possible classifier (a single linear layer) on its output. A high score means the model already did the hard work of organizing the concepts; the classifier itself adds almost nothing.

On MASSIVE (FitzGerald et al. 2022 — 51 languages × 60 intents), the frozen 15M-parameter encoder scores 73.2% Swahili (above GPT-4o's 70.6%), 72.5% Korean, 69.7% Hindi, 66.5% Amharic — 38–43× over chance, with none of these languages anywhere in the pretraining data. We also report Injongo (8-language Bantu intent benchmark, with published SOTA comparison).

Why a linear probe, not something stronger? We also tried a more powerful classifier — a 2-layer neural network — and it did worse on all three cross-family transfers (Korean / Hindi / Amharic). That is the result: when the simplest classifier beats the stronger one, the concepts are already separated by straight lines in the model's geometry — genuinely organized, not just buried in there waiting for a powerful classifier to untangle.
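A linear probe is just softmax regression on fixed features: the encoder is never updated, and only one weight matrix and a bias are trained. The sketch below is a pure-Python toy under stated assumptions; the 2-d "frozen features", the three intent labels, and the hyperparameters are all made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for(W, b, x):
    """One linear layer: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]

def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a single linear layer with SGD on cross-entropy; features stay fixed."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(logits_for(W, b, x))
            for c in range(num_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    W[c][j] -= lr * grad * x[j]
                b[c] -= lr * grad
    return W, b

def predict(W, b, x):
    """Index of the highest-scoring class."""
    scores = logits_for(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy "frozen encoder outputs" for three intents (0, 1, 2).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0], [-0.9, -0.8]]
labels = [0, 0, 1, 1, 2, 2]

W, b = train_linear_probe(features, labels, num_classes=3)
accuracy = sum(predict(W, b, x) == y for x, y in zip(features, labels)) / len(labels)
```

If a probe this simple separates the classes, the separation was already present in the features, which is the argument the paragraph above makes.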

MASSIVE intent — 12 languages, 7 families, linear probe on a frozen encoder

Zero target-language training. Structural transfer from isiZulu to 18 languages across 11 families. Linear probe on the frozen encoder, no per-language fine-tuning. Best-variant per language.

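| Language | Family | Accuracy |
| --- | --- | --- |
| English | Germanic | 73.5% |
| Swahili | Bantu | 73.2% |
| Korean | Koreanic | 72.5% |
| Tagalog | Austronesian | 70.1% |
| Hindi | Indo-Aryan | 69.7% |
| Urdu | Indo-Aryan | 69.2% |
| Amharic | Semitic | 66.5% |
| Mongolian | Mongolic | 65.8% |
| Javanese | Austronesian | 64.6% |
| Telugu | Dravidian | 63.4% |
| Kannada | Dravidian | 61.7% |
| Tamil | Dravidian | 61.1% |
| Japanese | Japonic | 57.8% |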

MASSIVE Swahili — the commercial case

60-intent benchmark. GPT-4o and InkubaLM were trained on Swahili; Sozisi was trained on isiZulu only (true zero-shot). Sozisi beats GPT-4o and trails InkubaLM, which was pretrained on Swahili.

| Model | Parameters | Score | Method |
| --- | --- | --- | --- |
| Sozisi (Bhala AI) | 15M | 73.2% | Language-level zero-shot · pretrained on isiZulu only (true zero-shot) |
| GPT-4o | ≈1.8T | 70.6% | Task-level zero-shot · Swahili in web pretraining corpus |
| InkubaLM | 422M | 79.2% | Pretrained on Swahili (one of 7 African languages) + web |

Injongo — 8 Bantu languages, head-to-head

Sozisi (frozen backbone) vs AfroXLMR-76L (270M, fine-tuned per language). We match or beat them on 4 of 8.

Sozisi (ours)
15M params
Frozen backbone, isiZulu pretraining only
AfroXLMR-76L
270M
Fine-tuned per target language
Efficiency
18×
Smaller model, matches or beats on 4 of 8 languages
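| Language | Sozisi | Public SOTA | SOTA Model | Δ | Status |
| --- | --- | --- | --- | --- | --- |
| isiXhosa | 98.3% | 97.3% | AfroXLMR | +1.0pp | SOTA |
| KiSwahili | 97.9% | 98.1% | AfroXLMR-76L | −0.2pp | Tied |
| Sesotho | 95.1% | 86.8% | AfroXLMR-76L | +8.3pp | SOTA |
| isiZulu | 93.1% | 89.8% | AfroXLMR-76L | +3.3pp | SOTA |
| ChiShona | 90.5% | 95.3% | AfroXLMR | −4.8pp | behind |
| Lingala | 89.5% | 94.6% | AfroXLMR-76L | −5.1pp | behind |
| Luganda | 81.7% | 91.3% | AfroXLMR-76L | −9.6pp | behind |
| Kinyarwanda | 78.3% | 89.4% | AfroXLMR-76L | −11.1pp | behind |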

SOTA on 3 of 8 languages, plus a tie on KiSwahili · average across 8 languages: 90.5%
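The Δ column and the summary line can be recomputed from the per-language scores. The numbers below are copied from the table; the variable names are ours, and the "Tied" status is not derived here because the tolerance used for a tie is not stated.

```python
# (Sozisi accuracy, public SOTA accuracy) per language, in percent.
INJONGO = {
    "isiXhosa": (98.3, 97.3),
    "KiSwahili": (97.9, 98.1),
    "Sesotho": (95.1, 86.8),
    "isiZulu": (93.1, 89.8),
    "ChiShona": (90.5, 95.3),
    "Lingala": (89.5, 94.6),
    "Luganda": (81.7, 91.3),
    "Kinyarwanda": (78.3, 89.4),
}

# Difference in percentage points, rounded to one decimal like the table.
deltas = {lang: round(ours - sota, 1) for lang, (ours, sota) in INJONGO.items()}

# Languages where the frozen backbone is strictly ahead of published SOTA.
ahead = [lang for lang, d in deltas.items() if d > 0]

mean_accuracy = sum(ours for ours, _ in INJONGO.values()) / len(INJONGO)
```

Three languages come out strictly ahead (isiXhosa, Sesotho, isiZulu), KiSwahili is within 0.2pp, and the unrounded mean is 90.55%.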

Beyond intent · zero retraining · multiple Bantu languages

Cross-Lingual Transfer (NER and beyond)

The intent results above (12 languages, 7 families) are the headline cross-lingual finding. Below: structural transfer also extends to other tasks — Named Entity Recognition without per-language fine-tuning.

Technical detail: Zero target-language training. Next-word prediction transfer: Nguni sister languages 87.5%, broader Bantu 71.1%, non-Bantu African 75.8%. Cross-family zero-shot: Yoruba 80.5%, Igbo 77.7%, Hausa 69.3% (Afro-Asiatic), Swahili 69.1%. Morphology-aware script handling unlocks 9 writing systems (Arabic, Hangul, Ge'ez, Devanagari, etc.) with no encoder retraining. Per-language adaptation builds in under 2 seconds.

Named Entity Recognition (MasakhaNER, isiZulu)

Frozen backbone + CRF head. 96.6% token accuracy across people, places, organizations, and dates.

96.6%
Token Accuracy
77.7%
Span F1
78.2%
Precision
77.2%
Recall
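Span F1 is the harmonic mean of span precision and recall, so the three span-level figures above can be cross-checked against each other. A small sketch (the function name is ours):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * precision * recall / (precision + recall)

# Span-level precision and recall from the cards above, in percent.
span_f1 = f1_score(78.2, 77.2)
span_f1_rounded = round(span_f1, 1)
```

The result rounds to the reported 77.7%. Token accuracy (96.6%) is much higher because most tokens are not part of any entity, so span F1 is the stricter number.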

See it on your data

Most pilots are live in under two weeks via REST API.