Humyn Labs Releases BRIDGE Benchmark for Global South AI Voice Testing

By Artūras Malašauskas, May 11, 2026

Humyn Labs has published BRIDGE, an independent benchmark that reveals major accuracy gaps in AI speech recognition across 17+ Indic languages and other Global South markets.

The AI data infrastructure firm Humyn Labs released BRIDGE (Benchmark of Regional & International Data for Global Evaluation), the first independent assessment of commercial speech-recognition tools on real conversational audio from India and the Global South. The benchmark evaluates 15 models across 17+ Indic languages, three Latin American Spanish dialects, Brazilian Portuguese, and Vietnamese.

According to the official BRIDGE report, the findings expose a fundamental problem: several widely deployed tools misheard one in three words on Indian language audio. Most enterprises building on these tools were unaware because the standard industry measure, Word Error Rate (WER), was never designed to catch the failures that define real Indian speech.
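For readers unfamiliar with the measure, WER is conventionally computed as the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below follows that textbook definition; the function name and the sample sentences are illustrative assumptions, and the article does not describe BRIDGE's own text normalization or scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: two of six reference words are misheard.
reference = "send the invoice before friday evening"
hypothesis = "send the invoice for friday morning"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # 33.3%
```

A score of roughly 33% corresponds to the "one in three words" failure rate described above. Note that the score treats every wrong word the same way, whatever kind of mistake produced it.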

Where most benchmarks stop at WER and Character Error Rate (CER), BRIDGE applies a seven-metric stack: WER, CER, Semantic Similarity, Code-Switch F1, Loan Word WER, Phoneme-Informed Error Rate, and Word Information Lost. Each captures a different dimension of failure. Semantic Similarity measures whether the meaning of what was said is preserved, even when exact words differ. Loan Word WER tracks accuracy specifically on English words embedded in Indian language speech.
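Of the seven metrics, Word Information Lost has a compact standard definition: one minus the product of the share of reference words that were recognized and the share of output words that were correct. The sketch below implements that common formula. It is an illustration under assumptions, not BRIDGE's code: the function names are hypothetical, and ties between equally cheap alignments are broken in favor of the one with the most matching words.

```python
def count_hits(ref: list[str], hyp: list[str]) -> int:
    """Number of matching words in a minimum-edit-distance alignment."""
    # Each cell is (edit cost, negative hit count). Taking min() picks
    # the lowest cost first, then the alignment with the most hits.
    prev = [(j, 0) for j in range(len(hyp) + 1)]
    for i, ref_word in enumerate(ref, start=1):
        curr = [(i, 0)]
        for j, hyp_word in enumerate(hyp, start=1):
            if ref_word == hyp_word:
                diagonal = (prev[j - 1][0], prev[j - 1][1] - 1)  # hit
            else:
                diagonal = (prev[j - 1][0] + 1, prev[j - 1][1])  # substitution
            deletion = (prev[j][0] + 1, prev[j][1])
            insertion = (curr[j - 1][0] + 1, curr[j - 1][1])
            curr.append(min(diagonal, deletion, insertion))
        prev = curr
    return -prev[-1][1]


def word_information_lost(reference: str, hypothesis: str) -> float:
    """WIL = 1 - (hits / reference length) * (hits / hypothesis length)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref or not hyp:
        return 1.0  # assumption: an empty transcript counts as total loss
    hits = count_hits(ref, hyp)
    return 1.0 - (hits / len(ref)) * (hits / len(hyp))


# Hypothetical example: four of six words survive in both directions.
reference = "send the invoice before friday evening"
hypothesis = "send the invoice for friday morning"
print(round(word_information_lost(reference, hypothesis), 3))  # 0.556
```

Because the score multiplies two ratios, it penalizes a transcript both for missing reference words and for padding the output with words that were never spoken.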

The most consequential metric for India is Code-Switch F1, which measures how accurately a model handles the natural mixing of Hindi or any Indic language with English mid-sentence. Most AI tools either drop the English words or convert them into transliterated script, breaking the meaning for anyone reading the transcript. This failure is invisible to word error rate.
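The article does not spell out how BRIDGE computes Code-Switch F1, so the following is only one plausible reading, sketched under stated assumptions: a token counts as English when all of its letters are ASCII, and the score is the F1 of English tokens in the model output against English tokens in the reference. The function names and sample sentences are hypothetical.

```python
from collections import Counter


def is_latin_token(token: str) -> bool:
    """Heuristic (assumption): English means every letter is ASCII."""
    letters = [ch for ch in token if ch.isalpha()]
    return bool(letters) and all(ch.isascii() for ch in letters)


def code_switch_f1(reference: str, hypothesis: str) -> float:
    """F1 over English tokens kept in Latin script by the model."""
    ref_en = Counter(t.lower() for t in reference.split() if is_latin_token(t))
    hyp_en = Counter(t.lower() for t in hypothesis.split() if is_latin_token(t))
    if not ref_en and not hyp_en:
        return 1.0  # assumption: no English on either side, nothing to miss
    overlap = sum((ref_en & hyp_en).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_en.values())
    recall = overlap / sum(ref_en.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical Hindi-English sentence: "I want a refund please".
reference = "मुझे refund चाहिए please"
kept = "मुझे refund चाहिए please"           # English words preserved
transliterated = "मुझे रिफंड चाहिए प्लीज़"   # English rewritten in Devanagari

print(code_switch_f1(reference, kept))            # 1.0
print(code_switch_f1(reference, transliterated))  # 0.0
```

In this toy case the transliterated output scores zero on the code-switch measure, while a word-level error count would log it only as two ordinary substitutions, with nothing to mark them as code-switching failures.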

The Code-Switch F1 scores reveal the gap: Deepgram Nova-3 leads at 0.906, Amazon Transcribe scores 0.199, and OpenAI's models fall below 0.4. The results also offer the first direct, independent comparison of global models against Indian providers. Sarvam AI's saaras v3 ranks third overall on word error rate at 20.2%, ahead of Google Gemini, Microsoft Azure, and Amazon Transcribe, a strong result for a model built specifically for Indian languages.

On overall word error rate, ElevenLabs Scribe v2 leads at 10.6%, with a margin over second place wider than the entire spread between second and eleventh. The broader finding, however, is that a single leaderboard number is not a reliable basis for deployment decisions. The model that leads on Spanish does not lead on Vietnamese. The model that leads on code-switching does not lead on word accuracy.

BRIDGE was built on field-collected audio: real, human-verified two-person conversations collected across 22 Indian states, not scripted speech or data scraped from the internet. The full dataset is available on Hugging Face. Humyn Labs has indicated it will open-source the evaluation methodology subject to demand from the research community.

"The models are grading their own work. ASR providers published their own accuracy scores using benchmarks built on English-first, internet-trained datasets, with little independent validation," said Manish Agarwal, Co-founder, Humyn Labs. "Meanwhile, enterprises are making million-dollar deployment decisions on numbers that rarely reflect how their users in Global South actually speak."

"The models aren't the only problem; the metrics are. You cannot evaluate non-English speech with a scoring system designed for English phonology and call it rigorous," said Ishank Gupta, Co-founder, Humyn Labs. "The performance leaderboard for Hindi is not the leaderboard for Tamil, Bengali and Marathi. A single aggregate benchmark score cannot support cross-regional deployment decisions."

Independent reporting from The Hindu Business Line corroborates the timing and scope of the release, noting that the study was published on May 11, 2026.

Enterprises need to evaluate models on the language, dialect, and speech patterns that match their actual users. Speaker overlap is the biggest acoustic stressor, and cross-state speaker pairs are harder than same-state pairs. Gender and age have modest effects. Duration and gap patterns show minimal impact; language and accent dominate (which should come as no surprise to anyone who's tried to explain their regional accent to a call center bot).

The practical cost of this gap matters. When a model drops English words mid-sentence or converts them to transliterated script, the transcript becomes useless for anyone trying to parse meaning. The friction isn't theoretical: it's the difference between a customer service transcript that works and one that requires manual correction.

Whether enterprises actually pay for this level of granularity remains the real question. The benchmark exists. The data is public. Whether vendors adjust their deployment strategies based on it is another matter entirely.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference, no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt
