Sozisi: 15M Parameters, No Compromises
A 15M-parameter model that matches or exceeds GPT-4 on intent classification, entity recognition, and sentiment analysis across 40+ languages — without any additional training per language. 1/10,000th the size. Runs on a $50 phone. Every decision produces a signed audit record.
Model Overview
Most AI learns token statistics. We learn the building blocks underneath.
Languages don’t share every feature — but they draw from the same toolkit: morphology, agreement, argument structure, phonological patterns. Only the combinations differ. Teach a model the statistics of one language’s vocabulary and it knows one language. Teach it the underlying patterns of human language and it transfers to thousands.
Today’s AI methods fail for over 6,000 languages (Stanford HAI, 2024), not because the data is missing but because the training is wrong. We pretrained a 15M-parameter model on one morphologically rich language (isiZulu) and it works zero-shot across 40+ more, because the structural scaffolding is what they share.
Every benchmark uses the same rule: the language under test was not in our training data. We say so explicitly each time, because the usual assumption is that the big multilingual model beat the specialist. Here the specialist is the 422M-parameter competitor that trained on the target language. We are the 15M-parameter model that didn’t.
Five units. Each one small, inspectable, and composable.
Every benchmark comes out of the same stack. Swap, verify, or extend any piece. Together, the five units are how 15M parameters beat trillion-parameter systems on the tasks that matter.
Sozisi Encoder
One model. Structure shared across human languages.
A compact neural encoder trained to capture the structural patterns shared across human languages — morphology, agreement, composition. New languages attach in seconds, not weeks.
- Zero-shot transfer to 17+ languages across 10 families
- Adapts to a new language in <2 seconds
- Script-agnostic: Arabic, Devanagari, Hangul, Cyrillic
- Stable under perturbation (robust by geometry)
Programmable Behavior
Named, composable controls applied at inference
A patented embedding space where semantic dimensions like sentiment, intent, and bias are accessible as named controls on any query or document. Every shift is logged and auditable per call; a sketch of what such a control interface could look like follows the list below.
- 100% sentiment-flip accuracy across every language tested in-family
- 100% intent-redirect accuracy across 4 tested transitions
- Cross-family transfer verified at 77% on English (zero-shot)
- Every shift logged for audit + compliance
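To make the named-controls idea concrete, here is a minimal sketch of what such an interface could look like: learned direction vectors applied additively and logged per call. The names (`CONTROLS`, `apply_control`, `AUDIT_LOG`) and the random toy vectors are hypothetical illustrations, not Sozisi's actual API.

```python
import json
import time

import numpy as np

rng = np.random.default_rng(0)
DIM = 256

# Hypothetical library of named semantic directions, learned offline
# (e.g. from contrastive sentence pairs) and unit-normalized.
CONTROLS = {
    "sentiment.positive": rng.standard_normal(DIM),
    "intent.cancel": rng.standard_normal(DIM),
}
CONTROLS = {name: v / np.linalg.norm(v) for name, v in CONTROLS.items()}

AUDIT_LOG = []

def apply_control(embedding, name, strength):
    """Shift an embedding along a named direction and log the call."""
    AUDIT_LOG.append({"ts": time.time(), "control": name, "strength": strength})
    return embedding + strength * CONTROLS[name]

e = rng.standard_normal(DIM)
e_pos = apply_control(e, "sentiment.positive", 1.5)        # flip sentiment
e_back = apply_control(e_pos, "sentiment.positive", -1.5)  # exact undo
assert np.allclose(e, e_back)
print(json.dumps(AUDIT_LOG, indent=2))                     # per-call audit trail
```

Because each control is a plain vector addition, controls compose by summing and undo exactly by negating the strength, which is what makes per-call auditing meaningful.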
Morpheme-Aware Tokenization
Inspectable tokens, not opaque subwords
Tokens carry linguistic meaning you can read, unlike opaque subword fragments. Downstream units become more sample-efficient and their outputs more explainable; a toy segmentation example follows the list below.
- Human-readable tokens across all supported languages
- Compact vocabulary: ~5K tokens covers 23 languages
- Dramatically more sample-efficient than opaque subword tokenizers
- Enables downstream interpretability
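As an illustration of morpheme-aware tokens versus opaque subwords, here is a toy greedy longest-match segmenter over a five-entry morpheme table. The segmenter and table are stand-ins, not Sozisi's tokenizer; the isiZulu analysis of ngiyakuthanda ("I love you") as ngi- + -ya- + -ku- + -thand- + -a is standard grammar.

```python
# Tiny illustrative morpheme inventory: surface form -> gloss.
MORPHEMES = {
    "ngi": "SUBJ.1SG", "ya": "PRES", "ku": "OBJ.2SG",
    "thand": "love", "a": "FINAL.VOWEL",
}

def segment(word):
    """Greedy longest-match segmentation against the morpheme inventory."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest candidate first
            if word[i:j] in MORPHEMES:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no morpheme match at {word[i:]!r}")
    return tokens

for m in segment("ngiyakuthanda"):
    print(f"{m:6s} -> {MORPHEMES[m]}")
# ngi -> SUBJ.1SG, ya -> PRES, ku -> OBJ.2SG, thand -> love, a -> FINAL.VOWEL
# A byte-pair tokenizer might instead emit unreadable fragments
# like "ngiy", "akut", "handa".
```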
On-Device Runtime
Linear-time architecture built for phones, not GPUs
A sequence model with linear complexity, so 15M parameters run on a smartphone, feature phone, or sensor. Paired with the encoder and tokenizer, it delivers GPT-4-class NLU at a fraction of the cost, fully offline.
- <50ms inference on commodity hardware
- No GPU required for production workloads
- Runs on Android, iOS, and embedded Linux
- 24MB footprint — fits alongside your app
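Why linear complexity matters for phones, in a minimal sketch: a linear-time sequence model does a fixed amount of work per token on a fixed-size state, so compute grows linearly with sequence length and memory stays constant. The toy diagonal linear recurrence below illustrates that shape; it is not Sozisi's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_STATE = 64, 128

A = rng.uniform(0.9, 0.999, D_STATE)               # per-channel state decay
B = 0.1 * rng.standard_normal((D_STATE, D_MODEL))  # input projection
C = 0.1 * rng.standard_normal((D_MODEL, D_STATE))  # output projection

def run(tokens):
    """Process an (n, d_model) sequence with a fixed-size hidden state."""
    h = np.zeros(D_STATE)
    outputs = []
    for x in tokens:             # constant work and memory per token
        h = A * h + B @ x        # state update, no attention over history
        outputs.append(C @ h)
    return np.stack(outputs)

sequence = rng.standard_normal((1000, D_MODEL))
print(run(sequence).shape)       # (1000, 64); doubling n doubles cost, not memory
```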
Self-Healing Inference
Robustness as a core unit, not a bolt-on
A structural correction mechanism that snaps perturbed inputs back to clean representations during inference. The model degrades gracefully on noisy, OOD, or adversarial inputs.
- Graceful degradation on out-of-distribution inputs
- Resilient to typos, code-switching, and transliteration drift
- Stable inference under input perturbation
- No retraining needed per failure mode
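The description above is consistent with a simple mechanism: project an incoming embedding toward its nearest anchor in a bank of known-clean representations. The sketch below implements that assumption; the anchor bank, `self_heal`, and the noise model are all illustrative, not the actual correction step.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_ANCHORS = 64, 512

# Hypothetical bank of known-clean representations (unit vectors).
anchors = rng.standard_normal((N_ANCHORS, DIM))
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

def self_heal(e, alpha=0.5):
    """Pull an embedding part-way toward its nearest clean anchor."""
    e = e / np.linalg.norm(e)
    nearest = anchors[np.argmax(anchors @ e)]   # cosine nearest neighbor
    healed = (1 - alpha) * e + alpha * nearest
    return healed / np.linalg.norm(healed)

clean = anchors[42]
noisy = clean + 0.1 * rng.standard_normal(DIM)  # simulate typo-level noise
print("cosine before:", float(clean @ (noisy / np.linalg.norm(noisy))))
print("cosine after: ", float(clean @ self_heal(noisy)))   # closer to clean
```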
In 1988, two scientists, Fodor and Pylyshyn, argued that AI could never truly understand meaning, only memorize patterns. Here’s the proof they were wrong.
Imagine you mix apple juice and orange juice. Now imagine you can separate them back out — perfectly — and swap the apple juice into a completely different fruit blend you’ve never tried before. That’s what understanding meaning actually requires: the ability to take concepts apart, move pieces around, and put them back together in new combinations.
Most AI models can’t do this. They memorize which sentences tend to be positive or negative. Ask them to move a meaning — take a negative sentence and shift it toward positive while keeping everything else the same — and they fail. They have no “positive” direction to point to. The meaning is locked inside a pattern they can’t manipulate.
Bhala’s embedding space is structured differently. Every sentence sits in a space where named directions actually mean something — “more positive,” “more formal,” “a question instead of a statement.” You can move a sentence along those directions, reverse the move, and combine multiple moves at once. Like algebra, but for meaning.
How it works in practice
- We measure the “positive sentiment” direction from a few hundred sentence pairs in one language (isiZulu) — the difference between how a negative sentence sits in the space versus how a positive one does.
- We apply that same direction to a KiSwahili sentence the model has never seen: “Chakula hiki ni kibaya sana” (“This food is very bad”).
- The shifted sentence now lives where positive KiSwahili sentences live. 100% accuracy across 547 held-out test sentences.
- The same direction also works for English, and for intent (turning “book a flight” into “cancel a reservation”), across 22 languages, without any additional training; a minimal code sketch follows this list.
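Under the assumption of a shared multilingual encoder, the direction arithmetic above reduces to a few lines. `encode()` is a stand-in faked with deterministic pseudo-embeddings so the snippet runs standalone, and the isiZulu pair shown is illustrative rather than the actual training data.

```python
import hashlib

import numpy as np

DIM = 256

def encode(sentence):
    """Stand-in encoder: a deterministic pseudo-embedding per sentence."""
    seed = int(hashlib.sha256(sentence.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(DIM)

# 1. Measure the sentiment direction from isiZulu pairs:
#    mean over pairs of (positive embedding - negative embedding).
zulu_pairs = [
    ("kuhle kakhulu", "kubi kakhulu"),   # "very good" / "very bad"
]
direction = np.mean([encode(p) - encode(n) for p, n in zulu_pairs], axis=0)

# 2. Apply the same direction to an unseen KiSwahili sentence.
negative = encode("Chakula hiki ni kibaya sana")  # "This food is very bad"
shifted = negative + direction

# 3. In the real pipeline, an independent classifier verifies that `shifted`
#    now lands where positive KiSwahili sentences live.
```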
Why this matters beyond benchmarks
- Bias removal becomes a dial, not a retrain. To remove gender bias from a document retrieval system, you apply the “remove gender” direction. No new data. No retraining. One API call.
- It works across languages you never trained on. A direction learned in Zulu transfers to Swahili, Xhosa, and English — because the geometry of meaning is shared, not language-specific.
- You can undo it. Every operation is reversible. Apply positive sentiment; reverse it exactly. No information is destroyed.
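The “dial, not a retrain” and “you can undo it” claims can be stated precisely under one simple assumption: removing bias means scaling down the embedding's component along a learned bias direction, and keeping that component around makes the operation exactly invertible. Everything named below (`gender_dir`, `remove_bias`) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 256

# Hypothetical learned bias direction (unit vector).
gender_dir = rng.standard_normal(DIM)
gender_dir /= np.linalg.norm(gender_dir)

def remove_bias(e, direction, amount=1.0):
    """Dial down the component of e along direction; return it for undo."""
    component = (e @ direction) * direction
    return e - amount * component, amount * component

e = rng.standard_normal(DIM)
debiased, removed = remove_bias(e, gender_dir)   # the dial set to 100%
assert abs(debiased @ gender_dir) < 1e-9         # bias component is gone
assert np.allclose(debiased + removed, e)        # exact undo: nothing destroyed
```

Setting `amount` between 0 and 1 is the dial; storing the removed component is what makes the operation reversible rather than destructive.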
Steerable meaning — proven across three languages
Bhala is the first model that lets you steer a sentence's meaning at inference time. Tell it to remove a specific bias, flip sentiment, or redirect an intent — and an independent classifier (the kind your team would deploy in production) confirms the change took effect. Reduce bias by 77–100% on the categories you care about, without retraining.
Other sentence models — including the ones from the largest US labs — are optimized for similarity search, not for being steered. You can compare two sentences in their space, but you cannot reliably change one. Bhala is built so that named directions in meaning behave like real controls. No other production-ready model offers this today.
| Task | Zulu | Swahili | English | Test cases |
|---|---|---|---|---|
| Sentiment shift (negative → positive) | 100% | 100% | — | 263 |
| Intent redirect (12 categories) | — | 94% | 77% | 1,969 |
| Bias removal (27 categories — gender, race, religion, age, disability, …) | — | — | 100% | 15,966 |
Why English is lower
Today's model was pretrained almost entirely on isiZulu. English results come from generalization — applying learned structure to a language the model never saw at scale. We are now training the English-native version, and expect English to match or exceed the 94% Swahili number.