Composable AI — Every Use Case

Compose intelligence for any use case

Language, translation, edge, sovereign, interpretable. Every use case composes the same five core units. Own every layer.

  • 5 core units — composable building blocks
  • 15M params — any phone, any IoT
  • <50ms — on commodity hardware
  • Sovereign — on-prem, inspectable
Real scenarios

What customers actually deploy this for

Four production use cases — each a different buyer, a different continent, the same underlying stack.

Scenario 1 — Financial services · EU

A European bank needs its CV-screening pipeline to pass an EU AI Act audit before go-live.

The bank's legal team has 60 days to show regulators that candidate ranking does not discriminate on gender or age. The compliance team has no ML budget and no labeled fairness data.

Without Bhala

  • Commission a third-party bias audit (~€40K, 8 weeks)
  • Get a PDF report — no way to act on findings in production
  • Either pull the product or ship it with known risk
  • No signed record to show the regulator

With Bhala

  • CVs and job descriptions go through Bhala's encoder — Bhala is the embedding layer
  • The bias operator removes gender and age signal from candidate embeddings before ranking
  • Every ranking decision carries a signed audit receipt showing which operators ran
  • Hand the regulator a machine-readable log, not a PDF

Scenario 2 — Healthcare · United States

A US health system wants to detect hate speech in patient intake notes — across 14 languages — without sending data to a cloud API.

HIPAA forbids sending patient text to third-party APIs. The patient population speaks Spanish, Haitian Creole, Somali, Vietnamese, and 10 other languages. Staff file incident reports inconsistently. The compliance officer wants automated flagging before a note reaches a clinician.

Without Bhala

  • Off-the-shelf classifiers cover English only — or require cloud calls
  • Building per-language models requires labeled data that doesn't exist
  • On-premise deployment of a 7B+ model requires GPU infrastructure
  • No solution ships in under 6 months

With Bhala

  • 15M-parameter model runs entirely on-premise — no data leaves the network
  • Zero-shot across all 14 languages with a single frozen model
  • 81–98% catch rate per target group at a 5% false-positive rate
  • Fits on a standard clinical workstation CPU — no GPU needed

Scenario 3 — Content moderation · Global platform

A social platform needs intent classification across 40+ languages without training a model per language.

The platform has 200M users across Southeast Asia, the Middle East, and Sub-Saharan Africa. Its current English-only intent classifier routes abuse reports to the wrong queue 60% of the time for non-English posts. Hiring labelers for 40 languages is not viable.

Without Bhala

  • Translate everything to English first — adds latency, loses nuance, costs $0.06/1K chars at scale
  • Or fine-tune 40 separate models — 40× the training cost, 40× the maintenance surface
  • Translate-then-classify misroutes culturally specific abuse that doesn't map to English categories

With Bhala

  • One model, 40+ languages, no translation layer
  • Intent redirect operator re-routes posts at inference time — no retraining per new category
  • 73.2% intent accuracy on Swahili zero-shot, beating GPT-4o (70.6%) at 1/100,000th the size
  • Sub-50ms on commodity hardware — fits inside existing CDN edge nodes

Scenario 4 — Feed middleware · AT Protocol / Fediverse

A social platform wants to let users control their own feed — not just mute words, but dial down toxicity, political content, or outrage bait as a personal preference.

The platform has committed to user sovereignty over algorithmic feeds. Users want sliders, not keyword lists. The trust & safety team wants every moderation action to be auditable and reversible — including user-applied ones.

Without Bhala

  • Keyword filters — gameable in seconds, no semantic understanding
  • Per-user fine-tuned models — not feasible at millions of users
  • Platform-level moderation only — users have no meaningful control
  • No audit log: impossible to explain to a user why a post was suppressed

With Bhala

  • User preferences map to named controls — `less_outrage`, `less_political`, `more_local`
  • Controls applied to post embeddings at query time — one model, every user, no retraining
  • Fully reversible: users can inspect and undo any active control
  • Every suppression carries a signed receipt — AT Protocol compatible, GDPR explainability ready
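The slider-to-control mapping above can be sketched as a composition of embedding shifts. The control names come from the scenario, but the direction vectors, dimensionality, and `apply_controls` helper are toy assumptions for illustration, not Bhala's actual operator layer.

```python
# Sketch: user-preference sliders as composable shifts on a post embedding.
# Direction vectors here are fabricated 3-d toys; in the scenario above they
# would be Bhala's learned operator directions.
import numpy as np

# Each named control is a direction in embedding space.
controls = {
    "less_outrage":   np.array([1.0, 0.0, 0.0]),
    "less_political": np.array([0.0, 1.0, 0.0]),
}

def apply_controls(post_embedding, user_settings):
    """Shift a post embedding by each active control, scaled by the slider value."""
    shifted = post_embedding.astype(float).copy()
    for name, strength in user_settings.items():
        shifted -= strength * controls[name]   # negative shift = "less of" that signal
    return shifted

post = np.array([0.8, 0.5, 0.1])
shifted = apply_controls(post, {"less_outrage": 0.4, "less_political": 0.2})

# Reversible: re-applying the negated settings restores the original vector.
restored = apply_controls(shifted, {"less_outrage": -0.4, "less_political": -0.2})
```

Because each control is a plain vector shift, applying them in any order gives the same result, and undoing is just negation — which is what makes per-user controls cheap at query time.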

Where it fits

Drop it between your embedding and your retrieval.

Drop the operator layer into your existing embedding pipeline — ranking, retrieval, classification, or recommendations. One REST call returns a shifted embedding and an audit receipt. No rewrites, no retraining.
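A minimal sketch of what that single REST call might look like. The endpoint URL, field names, and operator id below are illustrative assumptions, not Bhala's documented API.

```python
# Hypothetical request body for one operator-layer call.
# Endpoint, schema, and operator ids are assumptions for illustration.
import json

def build_shift_request(embedding, operators):
    """Assemble the JSON body for a single shift-and-audit call."""
    return {
        "embedding": list(embedding),  # your existing vector, produced upstream
        "operators": operators,        # named operators to apply, in order
        "audit": True,                 # request a signed receipt with the result
    }

body = build_shift_request(
    embedding=[0.12, -0.48, 0.33],
    operators=[{"id": "debias_gender", "coefficient": 1.0}],
)

# e.g. with requests (URL is a placeholder):
#   resp = requests.post("https://api.example.com/v1/shift", json=body).json()
#   resp["embedding"]  -> shifted vector, same shape as the input
#   resp["receipt"]    -> signed audit record of which operators ran
print(json.dumps(body, indent=2))
```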

Enterprise RAG

Buying now

Every query gets an audited pass before retrieval. Compliance teams get the receipt; hallucinations drop on sensitive prompts.

Examples · Glean · LangChain hosts · vertical AI

Semantic search

Natural fit

Users or regulators apply named dimensions (positive sentiment, verified sources, less advertising) to the query before it hits the index.

Examples · Kagi · Perplexity · legal / medical search

Feed ranking & moderation

AT Protocol

User-owned feed dimensions and moderator-owned toxicity flagging, each with a signed log per action.

Examples · Bluesky · Mastodon · forums · comments

Recommendation

EU DSA ready

Apply `safe_for_kids`, `less_political`, or `more_diverse_authors` as overrides on a user vector. Reversible per request, explainable per result.

Examples · Publishers · streaming · e-commerce

Intent routing

Production

Redirect a user query from one intent region to another. Counterfactual retrieval without retraining, tested at 100% flip accuracy across four transitions.

Examples · Chatbots · Slack · Intercom · Zendesk

Bias remediation

Model-risk

Subtract a bias dimension from any embedding at inference, reversibly. Quantify before-and-after and prove it to an auditor.

Examples · HR tech · lending · underwriting · screening
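The "subtract a bias dimension" operation can be sketched as standard projection removal. The bias direction and candidate vector below are toys; in production the direction would be learned and audited, but the geometry — and the logged delta that makes it reversible — works the same way.

```python
# Sketch: remove the component of an embedding along a bias direction.
# Returning the delta is what lets an audit log record (and undo) the shift.
import numpy as np

def remove_bias(embedding, bias_direction, coefficient=1.0):
    """Subtract `coefficient` times the projection onto `bias_direction`.

    Returns (shifted, delta); re-adding delta restores the original exactly.
    """
    d = bias_direction / np.linalg.norm(bias_direction)
    delta = coefficient * np.dot(embedding, d) * d
    return embedding - delta, delta

gender_dir = np.array([0.6, 0.8, 0.0])   # toy bias direction (unit norm)
cv = np.array([1.0, 2.0, 3.0])           # toy candidate embedding

debiased, delta = remove_bias(cv, gender_dir)

# Quantify before/after for the auditor: component along the bias direction.
before = abs(np.dot(cv, gender_dir))
after = abs(np.dot(debiased, gender_dir))   # drops to ~0 at coefficient 1.0

restored = debiased + delta   # reversible: re-apply the logged delta
```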

Today the operator layer runs on our embedding space. Q3–Q4 extends it to wrap OpenAI, Cohere, and customer-hosted embeddings, so you don't have to replace your foundation model to make it governable.

Flagship

The middleware that makes any embedding auditable.

Every embedding API is opaque. When a retrieval goes wrong, "the model did it" doesn't pass an EU AI Act review.

Drop a thin layer between your embedding call and your retrieval. Apply named actions (sentiment, intent, bias) and get back the shifted vector plus a signed audit record.

Your pipeline (query / user / post) → Bhala operator layer (shifted vector + audit) → Your retrieval (ranker, RAG, classifier)

  • 100% flip accuracy on sentiment and intent (held-out test data)
  • Every shift logged with operator id, parameters, timestamp, and result delta
  • Plugs into any embedding step — ours today, OpenAI / Cohere on the roadmap
  • Reversible per call — apply a negative coefficient to undo or debias
  • Sub-50ms latency on commodity hardware — no GPU required
  • Built for EU AI Act, SOC 2, and model-risk review out of the box
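The shift-plus-receipt contract can be sketched as follows. The field names mirror the list above (operator id, parameters, timestamp, result delta), but the exact schema and signing scheme are assumptions — a SHA-256 hash stands in for a real cryptographic signature.

```python
# Sketch: apply one named shift and emit an audit record for it.
# Schema and "signature" are illustrative stand-ins, not Bhala's format.
import hashlib
import json
import time

def shift_with_receipt(embedding, operator_id, coefficient, direction):
    """Shift an embedding along `direction` and return (vector, receipt)."""
    delta = [coefficient * x for x in direction]
    shifted = [e + d for e, d in zip(embedding, delta)]
    record = {
        "operator_id": operator_id,
        "parameters": {"coefficient": coefficient},
        "timestamp": time.time(),
        "result_delta": delta,
    }
    # Stand-in for a signature: hash of the canonical record.
    record["signature"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return shifted, record

vec, receipt = shift_with_receipt([0.1, 0.2], "sentiment_positive", 0.5, [1.0, 0.0])

# Reversible per call: re-apply with a negative coefficient to undo.
undone, _ = shift_with_receipt(vec, "sentiment_positive", -0.5, [1.0, 0.0])
```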

Target customers: Enterprise RAG platforms, regulated industries, AI-native social platforms, compliance-heavy search

Differentiator

AI you can inspect, audit, and defend in court

Regulators want to know why your model decided what it did. Your LLM gives them vibes and a confidence score.

With Bhala, every inference is a traceable sequence of operations. The full reasoning path is inspectable from input tokens to output decision.

Input (with audit context) → Bhala Inspect (embedding + operator + receipt) → Decision + Trace (auditable, reproducible)

  • Inspectable reasoning trace on every inference
  • Human-readable tokens — not opaque sub-word fragments
  • Deterministic geometry, not stochastic token sampling
  • Compositional audit: replay any step
  • EU AI Act, model risk, and clinical use ready
  • Reproducible results across deployments

Target customers: Financial services, healthcare, regulators, critical infra

Enterprise

Own the full intelligence stack. Weights, data, audit trail.

Your citizen data flows to foreign cloud providers. Every API call crosses a border. Regulators and adversaries both notice.

Run the full stack on your own infrastructure. Adapt to national languages, run on existing hardware, keep every inference auditable. Nothing leaves your borders.

Citizen Data (stays in-country) → Bhala Stack (all 5 core units, on-prem) → Your Infrastructure (sovereign control)

  • On-premise or private-cloud deployment
  • Own the model weights — no vendor escape hatch
  • Adapt to national languages in under 2 seconds
  • No H100 clusters required — runs on existing hardware
  • Full audit trail for compliance reporting
  • Bring-your-own core units — no lock-in

Target customers: Governments, central banks, defense, regulated industries

Available

High-performance intelligence on any device, fully offline

Your users are on feature phones and spotty networks. Cloud AI is not an option.

15M parameters, 24MB on disk (pre-quantization) — smaller than most app updates. Sub-50ms inference, fully offline. Intent, sentiment, and NER on-device. No data leaves the phone.

User (any device) → Bhala Edge (2 composed core units) → Your App (fully offline)

  • 24MB on disk (pre-quantization) — smaller than most app updates, vs 2GB+ for the smallest frontier models
  • <50ms inference on standard mobile hardware
  • Fully offline — no internet required
  • No data leaves the device — privacy by construction
  • Battery-efficient inference
  • Android, iOS, ONNX Runtime, embedded Linux

Target customers: Device OEMs, mobile fintech, health apps, IoT, defense

Core

Translation without huge parallel corpora

Classical MT needs millions of aligned sentence pairs per language. That data doesn't exist for most of the world.

Cross-lingual operators move meaning between languages as geometric transformations. New languages attach with adaptation, not retraining.

Source Text (any supported language) → Cross-Lingual Operator (Bhala embedding shift) → Target Text (any supported language)

  • No massive parallel corpus required
  • English ↔ 23 languages in production
  • New language support in seconds, not months
  • Preserves cultural and grammatical nuance
  • Batch throughput for enterprise workloads
  • Custom brand glossary composition
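The geometric idea — meaning moves between languages as an embedding shift — can be sketched with toy centroids. The vectors, dimensionality, and language centroids below are fabricated for illustration; Bhala's actual cross-lingual operators are learned.

```python
# Sketch: translation as a geometric transformation between language regions.
# Centroids and vectors are 2-d toys, not real language statistics.
import numpy as np

centroids = {
    "en": np.array([0.2, 0.1]),
    "sw": np.array([0.7, 0.4]),   # toy Swahili region centroid
}

def cross_lingual_shift(embedding, src, tgt):
    """Move a meaning vector from the source-language region to the target's."""
    return embedding + (centroids[tgt] - centroids[src])

en_vec = np.array([0.25, 0.05])
sw_vec = cross_lingual_shift(en_vec, "en", "sw")

# The round trip returns the original vector, so the operator is invertible —
# which is why new languages attach by estimating an offset, not retraining.
back = cross_lingual_shift(sw_vec, "sw", "en")
```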

Target customers: Localization teams, customer ops, public sector, NGOs

Core

Native NLU for every language — even the ones nobody else serves

General-purpose LLMs hallucinate on most of the world's languages, charge cloud rates, and still miss. Fine-tuning is a six-month project.

Intent, sentiment, entities, and grammar across 23 languages out of the box. Zero-shot to 17+ more. No fine-tuning, no cloud dependency.

User Input (any supported language) → Bhala NLU (3 composed core units) → Your App (intent, sentiment, entities)

  • Native NLU across 23 Bantu languages + 17 zero-shot
  • Beats GPT-4o on Swahili intent (73.2% vs 70.6%)
  • New SOTA on Zulu, Xhosa, and Sesotho (vs AfroXLMR-76L)
  • Deterministic inference — no hallucinations
  • Sub-50ms latency on commodity hardware
  • Works with Azure OpenAI, AWS, Google Cloud as a layer

Target customers: Fintech, healthtech, global SaaS, content platforms

Integration

Composes with your existing stack

Core units are decoupled by design. Drop them next to the platforms you already use.

Cloud Platforms

  • Azure OpenAI
  • AWS Bedrock
  • Google Cloud AI

Customer Service

  • Zendesk
  • Freshdesk
  • Salesforce Service Cloud

Communication

  • WhatsApp Business
  • USSD gateways
  • SMS platforms

Edge Platforms

  • Android (Kotlin/Java)
  • iOS (Swift)
  • ONNX Runtime

Compose the intelligence you need.

Start with one core unit. License the full stack when you're ready for sovereign deployment.