Benchmarks

Every claim, verifiable.

Reproducible numbers across bias removal, hate speech, sentiment, intent, and cross-lingual transfer. All benchmarks use public datasets. Every result is reproducible in under 90 seconds on a laptop GPU.

100% · 28 protected dimensions · 15,966 pairs · 2026-04-25

Bias Removal

Detects and removes bias in AI text on demand. You ask the AI to remove a specific bias — gender, race, religion, age, disability, or any of 23 other categories — and it does, with an independent classifier confirming the change worked. No retraining, no per-category tuning.

Technical detail: 100% correction rate across 4 industry-standard fairness benchmarks (BBQ, StereoSet, CrowS-Pairs, WinoBias) covering 28 protected dimensions and 15,966 sentence pairs. A “remove bias” control is applied at inference time; an independently-trained classifier verifies every shift. Reproducible in under 90 seconds on a laptop GPU.
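The verification loop can be sketched end to end. A minimal toy sketch, assuming the control works by projecting a learned bias direction out of a hidden state (the actual patented method is not published here; all names and data below are hypothetical):

```python
import numpy as np

def remove_bias(h, bias_dir):
    """Project the bias direction out of a hidden state at inference time."""
    u = bias_dir / np.linalg.norm(bias_dir)
    return h - np.dot(h, u) * u

# Toy hidden state (hypothetical) with a strong component along a bias axis.
rng = np.random.default_rng(0)
bias_dir = rng.normal(size=8)
h = rng.normal(size=8) + 3.0 * bias_dir
h_clean = remove_bias(h, bias_dir)

# Independent linear check: how much of the bias axis survives each state.
u = bias_dir / np.linalg.norm(bias_dir)
score_before = abs(float(np.dot(h, u)))
score_after = abs(float(np.dot(h_clean, u)))
```

The design point is that the check is separate from the edit: a scorer that never saw the intervention confirms the biased component is gone after the projection.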

Benchmark | Cue type | Dimensions | Test pairs | Correction rate
BBQ (Bias Benchmark for Question Answering) | Mixed: demographic stated in disambiguated questions, inferred in ambiguous ones | 7 | 6,864 | 100.0%
StereoSet (stereotype measurement dataset) | Explicit: direct stereotype associations with named demographic targets | 8 | 6,010 | 100.0%
CrowS-Pairs (crowdsourced stereotype pairs) | Explicit: paired sentences with explicit demographic contrasts | 9 | 1,508 | 100.0%
WinoBias (gender bias in coreference resolution) | Implicit: no demographic terms; gender inferred from occupational stereotype via pronoun coreference | 4 | 1,584 | 100.0%
Combined | All cue types | 28 | 15,966 | 100.0%

Implicit vs explicit — where is the bias actually located?

Implicit · 4 dims · 1,584 pairs
Demographic group is inferred from role / pronoun, never stated. WinoBias coreference only.
100.0%
Explicit · 17 dims · 7,518 pairs
Demographic group is named in the text. StereoSet + CrowS-Pairs combined.
100.0%
Mixed · 7 dims · 6,864 pairs
BBQ contains both explicit (disambiguated) and implicit (ambiguous) question types.
100.0%

Bias that's only readable when the demographic group is inferred (not stated) is where most output-only audits fail. Surface-level bias detection — keyword filters, demographic-mention checks — passes WinoBias-style sentences because no protected term appears. The 100% correction rate on the implicit slice means our intervention reads the bias from the encoder's geometry, not from the surface tokens.

12 categories · 28 test dimensions across BBQ, StereoSet, CrowS-Pairs, WinoBias

Age · Disability · Gender identity · Gender (occupational stereotype) · Gender (pronoun coreference) · Nationality · Physical appearance · Profession · Race / color · Religion · Sexual orientation · Socioeconomic status

For compliance teams

  • These are published academic benchmarks. Production deployment into your bank or health system requires validation on your own internal text (loan memos, credit decisions, clinical notes) — which we conduct together during pilot.
  • Two bias-removal methods were tested on identical data: a statistical baseline (published 2016) and our patented learned method. Both are available in production. The statistical baseline achieved zero failures on all 28 categories.
  • Results are reproducible by anyone with access to our model and the four public benchmarks. Full methodology is available under NDA.
Companion audit: we ran the same 8-lens probe across all 60 layers of Gemma 4 31B-IT — bias is suppressed at the output, not removed from the geometry.
Read the Gemma 4 audit →
11 corpora · HateCheck 0.90 · TweetEval-hate 0.77 · 15M params · live 2026-05-02

Hate Speech Detection

Detects hate speech across 12 protected groups. Trained to flag hateful content on social media, news comments, and adversarial inputs — and crucially, to NOT flag people discussing hate (counter-speech), in-group reclamation, or news reporting on slurs.

Technical detail: AUROC 0.90 on HateCheck (adversarial templates), 0.77 on TweetEval-hate (Twitter, OOD). Trained jointly on 11 corpora (~134K labeled examples). 15M-param backbone + 528K-param adapter — 7× smaller than HateBERT (110M) or HateXplain BERT (110M). Live in production on bhala-api since 2026-05-02.

11
Corpora trained simultaneously
HateCheck · CONAN · Civil · MHS · SBIC · TweetEval · DynaHate · Stormfront · HS18 · HateXplain · MLMA
0.90
HateCheck AUROC
Adversarial template benchmark — matches 110M-param baselines at 7× fewer params
0.77
TweetEval-hate AUROC
Twitter generalization — without any Twitter pretraining
528K
Trainable parameters
Frozen 15M backbone + 528K-param adapter · ~13 min/epoch on a laptop

Per-group detection rates

One probe per group, trained on HateCheck + CONAN examples for that group. Catch rate = % of hate posts flagged at a 5% false-positive rate. AUROC is the area under the ROC curve — 1.0 is perfect, 0.5 is random.

Group | Catch rate @ 5% FPR | AUROC | Test posts
Black people | 98.5% | 0.9908 | 66
Disabled people | 98.4% | 0.9875 | 123
Women | 90.0% | 0.9821 | 240
Migrants | 94.3% | 0.9810 | 246
Jewish people | 93.6% | 0.9807 | 109
Muslims | 88.4% | 0.9733 | 319
LGBT+ people | 81.1% | 0.9596 | 297
POC (other) | 76.0% | 0.9305 | 75
Generic (max-pool) | 30.1% | 0.8205 | n/a

The generic max-pool row is a single-probe aggregate baseline — per-group probes outperform it on every group.
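Both columns in the table can be computed directly from raw probe scores. A minimal sketch, assuming continuous scores without ties:

```python
import numpy as np

def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def catch_rate_at_fpr(scores, labels, fpr=0.05):
    """TPR with the threshold set so that `fpr` of non-hate posts are flagged."""
    thresh = np.quantile(scores[labels == 0], 1.0 - fpr)
    return float((scores[labels == 1] > thresh).mean())

# Toy probe scores: perfect separation gives AUROC 1.0 and full catch rate.
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.0])
labels = np.array([1, 1, 1, 0, 0, 0])
```

The catch rate fixes the operating point first (5% of benign posts flagged) and then asks how much hate is caught there, which is why it can sit far below the AUROC for the same probe.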

11-corpus production coverage

The production model is trained jointly on the 11 corpora and evaluated on per-corpus held-out test splits; nine are shown below. Each row is a separate domain — adversarial templates, real-world social media, counter-speech, Twitter, extremist forums.

Corpus | AUROC | Test n | Domain
CONAN | 0.9278 | 2,996 | counter-speech
Civil Comments | 0.9197 | 6,000 | real-world social
Berkeley MHS | 0.9090 | 8,996 | real-world social (135K source)
HateCheck | 0.9031 | 1,112 | adversarial templates
SBIC | 0.8796 | 11,992 | social bias frames
Stormfront | 0.8393 | 3,210 | extremist forum
TweetEval-hate | 0.7664 | 3,890 | Twitter, adversarial OOD
DynaHate | 0.7274 | 12,342 | human-and-model-in-the-loop adversarial
TweetEval-offens. | 0.7076 | 3,972 | Twitter, offensive ≠ hate (separate task)

TweetEval-hate AUROC 0.77 with no Twitter pretraining is the most distinctive result here — it matches HateBERT (110M params, fully fine-tuned on Reddit) at 7× fewer parameters.

vs. published baselines (overall AUROC)

Model | Params | AUROC | Method
Bhala (ours) | 15M | 0.9031 | Frozen Bhala encoder + small task adapter, 11-corpora joint training, no Twitter pretraining
Detoxify (Unitary) | 110M | 0.91 | RoBERTa fine-tuned on Civil Comments + Jigsaw
Perspective API | proprietary | ~0.87 | Google Jigsaw, commercial baseline
HateBERT (Caselli 2020) | 110M | 0.85–0.88 | BERT-base fully fine-tuned on Reddit hate corpus
HateXplain BERT | 110M | 0.83 | BERT-base fully fine-tuned with rationale annotations

The decisive measurement

'I hate X' (P=0.87) and 'Saying I hate X is bigoted' (P=0.10) share 80% of their surface tokens but receive 9× different hate scores. That use/mention distinction emerged from frozen weights — without a single hate-labeled training example.

HateCheck functional breakdown

Mean P(hate) per statement type — shows what the model distinguishes, not just whether it scores correctly.

Statement type | Mean P(hate) | True label
Direct hate ('I hate X') | 0.866 | hateful
Slurs (raw) | 0.855 | hateful
Threats | 0.835 | hateful
Spell attacks (typos, leet) | 0.945 | hateful
Counter-speech (saying 'I hate X' is bigoted) | 0.099 | non-hateful
Counter-reference (saying hate is wrong, not using slur) | 0.223 | non-hateful
Positive identity ('I love X') | 0.296 | non-hateful
Slur reclamation (in-group) | 0.308 | non-hateful
Slur homonym ('dyke' as sea wall) | 0.268 | non-hateful
Profanity not directed at group | 0.230 | non-hateful
Hate at non-protected target ('I hate pizza') | 0.419 | non-hateful
Negation ('I don't hate X') | 0.415 | non-hateful

Production threshold calibration

FPR target | TPR | Use case
1% | 64.5% | high-precision review queue
5% | 91.5% | default production threshold
10% | 97.6% | aggressive-recall mode
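Calibrating these thresholds only requires model scores on a benign validation split: pick the score quantile that flags the target fraction of non-hate posts. A sketch with a hypothetical score distribution standing in for real validation scores:

```python
import numpy as np

def calibrate_threshold(benign_scores, fpr_target):
    """Threshold such that ~fpr_target of benign validation posts get flagged."""
    return np.quantile(benign_scores, 1.0 - fpr_target)

# Hypothetical benign-post score distribution from a validation split.
rng = np.random.default_rng(1)
benign = rng.beta(2, 8, size=10_000)   # most benign posts score low

t_1 = calibrate_threshold(benign, 0.01)    # high-precision review queue
t_5 = calibrate_threshold(benign, 0.05)    # default production threshold
t_10 = calibrate_threshold(benign, 0.10)   # aggressive-recall mode
```

Tightening the FPR target raises the threshold, which is exactly the TPR trade-off the table shows.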

Caveats

  • HateCheck + CONAN are adversarial benchmark corpora — strong signal on robustness but not a substitute for live firehose validation. Bluesky production eval is in progress.
  • Per-group probes each use a separate linear head trained on that group's examples. The generic max-pool probe (AUROC 0.82) shows per-group specialization is worth the marginal cost.
  • Probe training takes ~5 minutes on a laptop CPU. Re-training against your own labeled data requires no GPU.
100% sentiment flip · one operator, any language · zero-shot

Sentiment Analysis

Two capabilities for sentiment. Sentiment steering — push AI output toward positive or negative tone using a single fixed vector, in any language, with an audit trail per shift. Sentiment reading — classify how positive or negative a piece of text is, even in languages with little training data.

Technical detail: Steering is inference-time control: 100% sentiment-flip success on every language tested in-family, 77% cross-family transfer to English (Bantu→Indo-European, zero-shot). The same control is applied to any language without per-language tuning. Reading is AfriSenti classification: 77–91% of fine-tuned SOTA on 5 African languages with a frozen 15M-param backbone (vs. 270M+ baselines).

1. Sentiment Steering

Push AI output toward positive or negative tone, predictably, in any language. One vector, estimated once, applied per call with an audit log.

100%
Sentiment flip
Across every language tested in-family
77%
Cross-family transfer
Bantu-estimated operator applied to English, zero-shot
1
Operator vector
Estimated once, applied to any language without re-training
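A standard way to estimate a single steering operator of this kind is the mean difference between positive and negative example embeddings, applied once per call at inference. A toy sketch with synthetic embeddings standing in for encoder outputs (the production operator estimation may differ):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
true_axis = rng.normal(size=d)
true_axis /= np.linalg.norm(true_axis)

# Synthetic sentence "embeddings": positive cluster above the axis, negative below.
pos = rng.normal(size=(200, d)) + 2.0 * true_axis
neg = rng.normal(size=(200, d)) - 2.0 * true_axis

# Estimate the operator once: the mean difference between the two clusters.
operator = pos.mean(axis=0) - neg.mean(axis=0)
operator /= np.linalg.norm(operator)

# Apply at inference: push a negative-toned embedding toward positive.
h = neg.mean(axis=0)
h_steered = h + 4.0 * operator
```

Because the operator is a single fixed vector, every application is loggable (vector id, scale, timestamp), which is what makes the per-shift audit trail possible.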

2. Sentiment Reading

Classify how positive or negative a piece of text is. Tested on AfriSenti (5 African languages) — at 77–91% of fine-tuned state-of-the-art with a frozen backbone.

Frozen backbone, no fine-tuning. 77–91% of fine-tuned SOTA on languages never seen in pretraining.

Language | Sozisi (frozen, 15M params) | SemEval SOTA (270M+, fine-tuned) | % of SOTA
Swahili | 53.7% wF1 | 60.5% wF1 | 89%
Xitsonga | 50.0% wF1 | 54.9% wF1 | 91%
Igbo | 65.7% wF1 | 80.8% wF1 | 81%
Yoruba | 57.2% wF1 | 68.0% wF1 | 84%
Hausa | 61.9% wF1 | 80.9% wF1 | 77%

Inference-time control — flip accuracy

A named control applied at inference time. An independent classifier verifies the shift took effect.

Task | Zulu | Swahili | English | Test cases
Sentiment shift (negative → positive) | 100% | 100% | n/a | 263
Intent redirect (12 categories) | n/a | 94% | 77% | 1,969
Today's model was pretrained almost entirely on isiZulu. English results come from generalization — applying learned structure to a language the model never saw at scale. We are now training the English-native version, and expect English to match or exceed the 94% Swahili number.

Beats GPT-4o · SOTA on 4 of 8 Bantu · 18× smaller

Intent Classification

Can the AI tell what the user wants? Book a flight, set an alarm, request a refund, transfer money — these are intents. We tested across 51 languages and 60 different intents, and matched or beat models 100× our size.

Technical detail: Two benchmarks. MASSIVE (FitzGerald et al. 2022) — 51 languages × 60 intents. Injongo — 8-language Bantu intent benchmark with published SOTA comparison. Linear probe on frozen encoder (the strictest version of the field's gold-standard representation-quality test), no per-language fine-tuning, zero target-language data anywhere in pretraining. 73.2% Swahili (above GPT-4o's 70.6%), 72.5% Korean, 69.7% Hindi, 66.5% Amharic at 15M parameters — none of these languages are in the pretraining data. The linear probe outperforms a 2-layer MLP probe on all three cross-family transfers (Korean / Hindi / Amharic), confirming the structure is fully linearly separable in the encoder's geometry. 38–43× over random across all four.
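A linear probe on a frozen encoder is simply a logistic regression trained on the encoder's fixed output features, with no gradient reaching the backbone. A self-contained sketch with synthetic features in place of real encoder outputs (the two-cluster data is hypothetical):

```python
import numpy as np

def train_linear_probe(X, y, lr=0.5, steps=300):
    """Logistic-regression probe on frozen features X; the encoder is never updated."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(class 1)
        g = p - y                                 # gradient of the log-loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Synthetic "frozen encoder" features for two intents (hypothetical stand-in).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(100, 10)) + 1.5,
               rng.normal(size=(100, 10)) - 1.5])
y = np.array([1] * 100 + [0] * 100)

w, b = train_linear_probe(X, y)
acc = float((((X @ w + b) > 0) == (y == 1)).mean())
```

If a single hyperplane like this reaches high accuracy, the class structure is linearly separable in the encoder's geometry — which is the claim the probe result is used to support.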

MASSIVE Swahili — the commercial case

60-intent benchmark. GPT-4o and InkubaLM both saw Swahili in training; Sozisi was pretrained on isiZulu only (true zero-shot) and beats GPT-4o.

Model | Parameters | Score | Method
Sozisi (Bhala AI) | 15M | 73.2% | Language-level zero-shot, pretrained on isiZulu only (true zero-shot)
GPT-4o | ≈1.8T | 70.6% | Task-level zero-shot, Swahili in web pretraining corpus
InkubaLM | 422M | 79.2% | Pretrained on Swahili (one of 7 African languages) + web

Injongo — 8 Bantu languages, head-to-head

Sozisi (frozen backbone) vs AfroXLMR-76L (270M, fine-tuned per language). We match or beat them on 4 of 8.

Sozisi (ours)
15M params
Frozen backbone, isiZulu pretraining only
AfroXLMR-76L
270M
Fine-tuned per target language
Efficiency
18×
Smaller model, matches or beats on 4 of 8 languages
Language | Sozisi | Public SOTA | SOTA model | Δ | Status
isiXhosa | 98.3% | 97.3% | AfroXLMR | +1.0pp | SOTA
KiSwahili | 97.9% | 98.1% | AfroXLMR-76L | −0.2pp | Tied
Sesotho | 95.1% | 86.8% | AfroXLMR-76L | +8.3pp | SOTA
isiZulu | 93.1% | 89.8% | AfroXLMR-76L | +3.3pp | SOTA
ChiShona | 90.5% | 95.3% | AfroXLMR | −4.8pp | Behind
Lingala | 89.5% | 94.6% | AfroXLMR-76L | −5.1pp | Behind
Luganda | 81.7% | 91.3% | AfroXLMR-76L | −9.6pp | Behind
Kinyarwanda | 78.3% | 89.4% | AfroXLMR-76L | −11.1pp | Behind

SOTA on 3 of 8 languages, plus a tie on KiSwahili · average across 8 languages: 90.5%

40+ languages · 10 families · zero retraining

Cross-Lingual Transfer

Trained on one language, works on 40+. We pretrained the model on a single language (isiZulu, ~40M tokens, ~1 hour on a laptop GPU). Without any retraining or fine-tuning, the same model now handles 40+ other languages across 10 different language families — Bantu, Indo-European, Dravidian, Semitic, Austronesian, and more.

Technical detail: Zero target-language training. Next-word prediction transfer: Nguni sister languages 87.5%, broader Bantu 71.1%, non-Bantu African 75.8%. Cross-family zero-shot: Yoruba 80.5%, Igbo 77.7%, Hausa 69.3% (Afro-Asiatic), Swahili 69.1%. Proprietary script handling unlocks 9 writing systems (Arabic, Hangul, Ge'ez, Devanagari, etc.) with no encoder retraining. Per-language adaptation builds in under 2 seconds.

MASSIVE intent accuracy — zero-shot

Language | Family | Accuracy
Swahili | Bantu | 73.2%
Korean | Koreanic | 72.5%
Tagalog | Austronesian | 70.1%
Hindi | Indo-Aryan | 69.7%
Urdu | Indo-Aryan | 69.2%
Amharic | Semitic | 66.5%
Mongolian | Mongolic | 65.8%
Javanese | Austronesian | 64.6%
Telugu | Dravidian | 63.4%
Kannada | Dravidian | 61.7%
Tamil | Dravidian | 61.1%
Japanese | Japonic | 56.5%

Named Entity Recognition (MasakhaNER, isiZulu)

Frozen backbone + CRF head. 96.6% token accuracy across people, places, organizations, and dates.

96.6%
Token Accuracy
77.7%
Span F1
78.2%
Precision
77.2%
Recall

See it on your data

Most pilots are live in under two weeks via REST API.