The Benchmark Setup
We ran a structured evaluation across 10,000 hours of Indian audio: news broadcasts, IVR call recordings, field interview recordings, and conversational WhatsApp voice notes. The dataset was deliberately noisy: real-world audio with background noise, variable microphone quality, regional accent variation, and heavy code-mixing.
Four systems were evaluated side-by-side: Rama STT, OpenAI Whisper large-v3, Google Cloud Speech-to-Text v2, and AWS Transcribe. All were tested on current production APIs as of February 2026 with no vendor-specific fine-tuning allowed.
Results Across the Scheduled Languages
Across all 22 scheduled languages, Rama STT achieved an average Word Error Rate (WER) of 8.3%, compared to Whisper at 14.1%, Google at 11.7%, and AWS at 16.2%. The gap was most dramatic in less-resourced languages: for Odia, Konkani, and Maithili, Rama STT stayed under 12% while competitors ranged from 22% to 38%.
Hindi and Tamil, the two most data-rich Indian languages globally, showed the smallest gaps. The real differentiation was in the 16 languages where English-first models had limited pre-training exposure.
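For readers reproducing these comparisons, the Word Error Rate cited throughout is the standard metric: word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal sketch, not the benchmark's actual scoring harness:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Production evaluations typically also apply text normalization (casing, punctuation, numeral spelling) before scoring, and normalization choices matter a great deal for code-mixed text.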
The Hinglish Problem
Code-mixed Hinglish produced the starkest results. On a 500-hour explicitly code-mixed subset, Whisper's WER climbed to 31.4% and Google's to 24.8%. Rama STT held at 11.2%.
The reason is architectural: Rama STT's tokenizer treats code-mixed content as native input, not two separate language channels. The language model has genuine priors about Hinglish grammatical structures โ not just Hindi and English modeled separately.
Latency and Cost
Real-time factor (RTF) matters enormously for live use cases. Rama STT achieved 0.09× RTF, meaning a 10-second clip processes in under 900 ms. Google and AWS were comparable at 0.11× and 0.13×. Whisper at comparable quality ran at 0.34×, too slow for real-time IVR without specialized hardware.
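RTF is simply processing time divided by audio duration, so values below 1.0 mean faster-than-real-time transcription. A small sketch of the calculation used above:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration. RTF < 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# A 10-second clip processed in 0.9 s yields an RTF of 0.09,
# matching the headline figure quoted in this section.
rtf = real_time_factor(0.9, 10.0)
```

For streaming IVR, note that end-to-end latency also includes network round trips and any endpointing delay, which RTF alone does not capture.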
On cost, Rama STT's engagement-hour pricing translates to roughly 60-70% savings versus Google and AWS at typical Indian enterprise volumes above 50,000 minutes per month.
What This Means for Your Stack
If your use case is Hindi or Tamil transcription of clean audio, the gap is meaningful but not transformative. If your use case involves regional Indian languages, noisy field audio, or code-mixed content, the benchmark data makes a clear case for Rama STT.
The full benchmark dataset and methodology are available on request for enterprises conducting their own evaluations. We believe in transparency: the numbers should speak for themselves on your audio samples.