Bhala AI Team · 8 min read

We audited Gemma 4. The bias didn't go away — it went into hiding.

research · fairness · interpretability · product · how-to

The short version

Ask Gemma 4 a leading bias question — who is more likely to be a criminal? — and it refuses or hedges. Most fairness audits read that response and conclude the bias is gone. We took a closer look.

We probed all 60 layers of Gemma 4 31B-IT across 28 protected dimensions on 15,966 sentence pairs from BBQ, StereoSet, CrowS-Pairs, and WinoBias. Three hours on a single H100. Every result reproducible from public benchmarks.

The bias is still in the model. RLHF taught it to stop saying it. It did not teach it to stop knowing it.

If you only ever talk to Gemma through chat, that distinction doesn't matter much. If you build embeddings, search, retrieval, or classifiers on top of it, or if you fine-tune it, it does. Three things in this post:

  1. The chart that shows the bias hiding, in plain language.
  2. A five-step recipe to run the same audit on any open model you ship.
  3. A live API you can paste into a terminal — no key needed for the sandbox — that flips a biased sentence and hands you back a signed audit record.

The chart, in plain English

We ran two detectors over the same internal vectors inside Gemma 4. Think of them as two metal detectors with different sensitivity:

  • Linear probe (red, dashed) — only finds bias when stereotyped and anti-stereotyped sentences fall on opposite sides of a single straight line. Cheap, fast, almost every published fairness audit uses this one.
  • MLP probe (green, solid) — finds bias even when the separation is curved, twisted, or scattered into pockets. Slightly more expensive. Almost nobody runs it.

Figure · BBQ / Sexual orientation · Gemma 4 31B-IT

Simple bias detector says “clean.” Thorough one says “still there.”

Two detectors reading Gemma 4's internal state at every layer. The linear probe (red) only catches bias when stereotyped and anti-stereotyped sentences fall on opposite sides of a single straight line — what most published audits use. The MLP probe (green) catches it along curved or scattered patterns too. By the final layer, RLHF has muted the simple one. The thorough one keeps beeping.

At the input layer, the thorough detector is loud (0.99) — the bias is obvious and there. The simple detector at the same layer is barely above chance (0.56). RLHF can attack the simple detector's signal (it's a single linear direction); it cannot reach the thorough detector's signal (curved, multi-dimensional, baked into the encoding). Through the encoding and transition zones the thorough detector stays well above the simple one (peak gap +0.44 at layer 0). By the final layer both probes have decayed near chance — simple to 0.26, thorough to 0.21 — but the early-layer gap is what matters. Anyone who taps Gemma's hidden states for retrieval, classification, or fine-tuning is reading from layers where the thorough detector is screaming.

The same picture holds on all 28 dimensions we tested. Sexual orientation peak MLP–linear gap: +0.20. Religion: +0.11. Race-color is the worst case — see the CrowS finding below.

What this means for your product

Three deployment patterns, ordered from safe to dangerous.

1. Direct chat — safe. Users type, the model responds. RLHF's polish is doing real work here. The output won't tell anyone one race is more criminal than another.

2. Embeddings, RAG, semantic search, classifiers — vulnerable. Most enterprise deployments live here. You take Gemma's hidden states (or its dedicated embedding head) and build an index, a similarity engine, a ranker. You inherit the early-layer geometry directly — the part RLHF never touched. A query about a racialized topic retrieves from a structurally separated region of the index depending on how it's phrased. The chat product passes the audit; the retrieval product fails the structural version of that same audit, and you have no way of knowing without running the geometric probe yourself.

3. Naive or small-dataset fine-tuning — actively dangerous. Fine-tuning rewrites the final layers — exactly where RLHF's polish lives. A small-dataset or non-safety-targeted fine-tune can undo the polish without ever touching the underlying bias, restoring exactly what alignment was meant to suppress. Fine-tuning can improve alignment (that's literally what RLHF is) — but only when it's adversarially designed against the right data and metrics. The default LoRA-on-domain-data fine-tune most teams actually run does the opposite.

If your product is in case 2 or 3, the rest of this post is for you.

Audit any open model yourself — five-step recipe

You can run this on Gemma, Llama, Qwen, or anything else with public weights. Tonight, on an H100, in three hours.

  1. Pull a labeled benchmark. We used BBQ, StereoSet, CrowS-Pairs, and WinoBias — about 16,000 stereotype/anti-stereotype pairs across 28 dimensions, all public.
  2. Encode every sentence at every layer. For each pair, save the model's internal vector at every transformer layer.
  3. Train two probes per layer. A logistic regression (linear) and a 2-layer MLP (nonlinear). Same data, same train/test split (a minimal sketch of steps 2–4 follows this list).
  4. Read the gap. If the MLP probe stays meaningfully above the linear probe at the final layer, the model has suppressed bias, not removed it. Most aligned models will look like Gemma.
  5. (Optional) Subspace analysis. Run PCA on the stereotype embeddings and measure how much of the anti-stereotype variance falls outside that principal subspace. A high residual means the two groups occupy structurally different manifolds — a category of bias linear probes can't see at all. This is the test that catches the race-color failure below; a code sketch appears in that section.
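Here is a minimal Python sketch of steps 2–4. The checkpoint name, mean-pooling choice, and probe sizes are illustrative assumptions, not the exact configuration used in the audit:

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

MODEL_ID = "google/gemma-2-2b"  # stand-in checkpoint; swap in the model you actually ship

tok = AutoTokenizer.from_pretrained(MODEL_ID)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True).eval()

def encode(sentences, layer):
    # Step 2: mean-pooled hidden state at one transformer layer.
    with torch.no_grad():
        batch = tok(sentences, return_tensors="pt", padding=True, truncation=True)
        hidden = model(**batch).hidden_states[layer]            # (batch, tokens, dim)
        mask = batch["attention_mask"].unsqueeze(-1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()   # (batch, dim)

def probe_gap(sentences, labels, layer):
    # Steps 3–4: same split for both probes; the gap is MLP accuracy minus linear accuracy.
    X, y = encode(sentences, layer), np.asarray(labels)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
    linear = LogisticRegression(max_iter=2000).fit(Xtr, ytr)
    mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500).fit(Xtr, ytr)
    return mlp.score(Xte, yte) - linear.score(Xte, yte)

Loop probe_gap over every layer and every dimension's sentence pairs; a gap that persists at the layers your downstream system actually reads is the suppression signature.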

Notebook, 28 measurement scripts, and a single-axis CPU demo: bhala-ai/gemma4-bias-algebra-audit on HuggingFace.

Auditing tells you where the bias is and what kind it is. Removing it requires a different architectural choice — one that names each protected axis as a separate, addressable direction in the embedding space rather than hoping post-training alignment hid it well enough.

What we built that addresses this

We built our embedding model with separate, named directions for each protected dimension. Bias isn't a hidden direction we hope alignment hid well enough — it's an axis you subtract at inference time. Every call carries a signed audit receipt.

from bhala import client

result = client.embeddings.create(
    input="Senior infrastructure engineer with 12 years experience...",
    model="bhala-embed-v1",
    operators=[
        {"id": "remove_gender_bias", "alpha": 1.0},
        {"id": "remove_age_bias",    "alpha": 1.0},
    ],
)

print(result.embedding[:8])
# [0.0234, -0.0891, 0.1456, ...]

print(result.audit)
# { audit_id: "aud_01HXX...",
#   operators: ["remove_gender_bias", "remove_age_bias"],
#   alpha: [1.0, 1.0],
#   signed_at: "2026-04-28T22:14:22Z",
#   signature: "ed25519:..." }

The shift runs on the embedding, not on the text. No fine-tuning, no retraining. The receipt is per-call, signed, and is the artifact you hand a regulator under the EU AI Act — a record of which operators ran on which input, with what magnitude.
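For illustration only, here is a sketch of how a downstream verifier could check such a receipt offline. It assumes the signature covers the canonical JSON of the receipt minus its signature field and that the verifying public key is distributed out of band — both are assumptions for this sketch, not documented API behavior:

import base64
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_receipt(receipt: dict, pubkey_bytes: bytes) -> bool:
    # "ed25519:<base64>" -> raw signature bytes (format assumed for illustration).
    signature = base64.b64decode(receipt["signature"].split(":", 1)[1])
    # Assumption: the signed message is the receipt serialized without its signature field.
    payload = {k: v for k, v in receipt.items() if k != "signature"}
    message = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    try:
        Ed25519PublicKey.from_public_bytes(pubkey_bytes).verify(signature, message)
        return True
    except InvalidSignature:
        return False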

Across the same four benchmarks (BBQ, StereoSet, CrowS-Pairs, WinoBias) — 28 dimensions, 15,966 sentence pairs — 100% of held-out stereotype embeddings flip to the anti-stereotype side of an independent per-axis classifier after the operator runs. The classifier is a 2-layer MLP trained on a separate training split; the operator is estimated on the same training split; the test is held-out stereotypes the classifier already correctly flagged before the shift. Full numbers →
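To make that protocol concrete, here is a toy version with synthetic vectors and the crudest possible stand-in operator (a mean-difference shift, not the production operator). It only illustrates what gets trained on what:

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
d = 64
# Training split: used for both the classifier and the operator estimate.
stereo_tr, anti_tr = rng.normal(0.5, 1, (500, d)), rng.normal(-0.5, 1, (500, d))
stereo_te = rng.normal(0.5, 1, (200, d))            # held-out stereotype embeddings

X = np.vstack([stereo_tr, anti_tr])
y = np.r_[np.ones(500), np.zeros(500)]              # 1 = stereotype, 0 = anti-stereotype
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, y)

# Stand-in operator estimated on the same training split.
direction = stereo_tr.mean(0) - anti_tr.mean(0)

flagged = clf.predict(stereo_te) == 1               # only count ones the classifier already catches
shifted = stereo_te[flagged] - 1.0 * direction      # alpha = 1.0
flip_rate = (clf.predict(shifted) == 0).mean()
print(f"flip rate: {flip_rate:.2%}")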

Try the flip yourself — right now, no install

The bias-flip API is live in sandbox. No key required, no signup. Paste this into a terminal:

curl https://www.bhala.ai/api/v1/embeddings/shift \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The nurse asked the doctor about his prescription.",
    "lang": "en",
    "operators": [
      { "id": "gender_bias", "alpha": -1.0 }
    ]
  }'

You get back the shifted embedding, the original embedding for comparison, the L2 shift size, the cosine before-vs-after, and an audit_id:

{
  "embedding":          [0.0230, 0.0644, -0.1378, ...],
  "original_embedding": [0.0181, 0.0712, -0.1402, ...],
  "operators_applied": [
    { "id": "gender_bias", "alpha": -1.0,
      "shift_norm": 0.7356, "cos_before_after": 0.7294 }
  ],
  "model": "sci-v3-sandbox",
  "audit_id": "aud_sg8a55n1t9b3",
  "latency_ms": 18
}

Swap in your own sentence. Swap the operator id to race_bias or age_bias. Change alpha to dial the strength up or down (or push it positive to amplify bias as a stress test). Same call, different shift.
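If you'd rather script it, the same sandbox call from Python — the endpoint and payload are exactly the ones in the curl example above, and requests is the only dependency:

import requests

resp = requests.post(
    "https://www.bhala.ai/api/v1/embeddings/shift",
    json={
        "text": "The nurse asked the doctor about his prescription.",
        "lang": "en",
        "operators": [{"id": "gender_bias", "alpha": -1.0}],
    },
    timeout=30,
)
resp.raise_for_status()
body = resp.json()
print(body["audit_id"], body["operators_applied"][0]["shift_norm"])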

The sandbox runs deterministic seeded vectors — enough to feel the shape of the API and confirm the call works. To run against the real model, you'll want the production endpoint: real shifted vectors, the full 28-axis bias-audit endpoint, and per-call signed receipts suitable for EU AI Act and NIST AI RMF reporting. Self-serve, free credits to start: create an account → — or talk to us if your use case needs a custom plan.

Three tiers of bias — and what each one needs

This audit suggests how fairness work should categorize what it finds:

  • Tier 1 — Directional. A consistent direction in embedding space. Linear probes detect it. Targeted vector subtraction removes it. Present in 13 of 28 Gemma 4 dimensions.
  • Tier 2 — Nonlinear. Encoded in the curvature of the representation. Linear probes underestimate it; MLP probes and subspace analysis reveal it. Present in all 28 dimensions; dominant for CrowS-Pairs and StereoSet.
  • Tier 3 — Contextual. Encoded in token-distribution structure rather than embeddings. Requires generative evaluation. Suspected for WinoBias; not measured here.

A model with low output bias but high representational bias is not as safe as one where both are low. The first can be de-suppressed by adversarial prompting, by fine-tuning, or by moving downstream into an embedding-based system. The second cannot.

Most current fairness frameworks do not draw this line. The frameworks regulators are starting to write — the EU AI Act, NIST's AI RMF — increasingly will.

The CrowS finding standard audits miss

For 9 of the 28 dimensions, linear probes give Gemma 4 a near-clean bill of health. CrowS-Pairs / race-color is the textbook example.

Layer   Linear acc   MLP acc   Subspace residual   Bhattacharyya overlap
 0      0.574        0.659     0.803               0.498
 8      0.612        0.567     0.892               0.092
59      0.532        0.471     0.863               0.082

Linear accuracy peaks at 0.65 — most fairness frameworks call that "borderline." But the subspace residual is 0.892: 89.2% of the variance in anti-stereotype embeddings falls outside the principal subspace of stereotype embeddings. The two groups occupy structurally distinct manifolds — they aren't separated along a direction, they're separated across a manifold boundary. The Bhattacharyya overlap drops to 0.082, meaning the two distributions barely touch.

A linear probe fails because both groups have high within-class variance. The separation is geometric, not directional. Almost no published fairness work runs subspace analysis. Race-color bias on Gemma 4 is invisible to the standard toolkit and structurally severe in reality.
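For completeness, a minimal sketch of the subspace-residual measurement from step 5 of the recipe. The number of principal components retained here is an assumption, not the audit's exact setting:

import numpy as np
from sklearn.decomposition import PCA

def subspace_residual(stereo_emb: np.ndarray, anti_emb: np.ndarray, n_components: int = 16) -> float:
    # Fit the principal subspace of the stereotype embeddings...
    pca = PCA(n_components=n_components).fit(stereo_emb)
    # ...then measure how much anti-stereotype variance lives outside it.
    centered = anti_emb - pca.mean_
    projected = centered @ pca.components_.T @ pca.components_
    residual = centered - projected
    return float(np.sum(residual ** 2) / np.sum(centered ** 2))

A value near 0 means the anti-stereotype embeddings live inside the stereotype subspace; 0.892 means almost all of their variance falls outside it, a separation a single linear direction cannot express.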

Reproduce

Code, the full technical report, the eight measurement functions, and a single-axis CPU demo notebook are all on HuggingFace:

bhala-ai/gemma4-bias-algebra-audit

The 28-dimension probe takes about 3 hours on an H100. The single-axis demo runs in a few minutes on a CPU. If you arrive at different conclusions on the same data, we want to hear about it.

If you want to test the operator API on your own embeddings, talk to us.


This work was conducted by the Bhala research team in April 2026. We probed Gemma 4 because it is currently among the most-aligned open-weight models — if post-training alignment works structurally anywhere, it should work here. The fact that suppression remains the operative mechanism is a finding about the limits of post-training alignment, not a critique of any particular model.

Bhala ran the same 28-axis fairness audit we used on Gemma 4 against six popular open-weight LLMs — LLaMA-2 7B base and chat, Mistral 7B base and instruct, Phi-2 2.7B, and InkubaLM 0.4B. The 0.4B model showed bias on 7 of 28 axes; every 7B model showed bias on 15–18. Size doesn't explain it (Phi-2 sits with the 7Bs). RLHF doesn't either (chat-tuned variants came out marginally worse than bases). Pretraining-data composition does. If you ship any of these models, the table at the top of the post is your before-deployment cheat sheet.