
Saaras V3 beats Gemini, GPT-4o on Indian speech benchmarks, says Sarvam AI

Sarvam said that its new speech recognition model recorded lower word error rates on IndicVoices and Svarah benchmarks, extending its earlier benchmark results in document and language tasks

Benchmark results shared by Sarvam AI show its Saaras V3 speech recognition model recording lower word error rates than Gemini, GPT-4o and other systems on IndicVoices and Svarah tests
Harsh Shivam New Delhi
Last Updated: Feb 12 2026 | 10:50 AM IST
Indian AI startup Sarvam AI has released a new version of its speech recognition model, Saaras V3, and says it outperforms several widely used global systems, including Google’s Gemini 3 Pro, OpenAI’s GPT-4o Transcribe, Deepgram Nova-3, and ElevenLabs Scribe v2, on benchmarks focused on Indian languages and Indian-accented English.
 
The company’s co-founder Pratyush Kumar shared the results in a post on X, alongside benchmark charts comparing Saaras V3 against competing models on the IndicVoices and Svarah datasets. According to him, Saaras V3 recorded a lower word error rate than the other models across the most widely used Indian languages in the IndicVoices benchmark and also led on the Svarah benchmark, which focuses on Indian-accented English.
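
For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not Sarvam's evaluation code, the function name `word_error_rate` is made up for this example, and it assumes simple whitespace tokenisation with no text normalisation, whereas published benchmarks typically normalise casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.1%}")
```

In this toy case the transcript has one substituted word and one missing word against a six-word reference, so the score is 2/6, or about 33.3 per cent. A lower figure means fewer transcription errors.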
 
On the subset of the 10 most popular languages in the IndicVoices dataset, Sarvam reports that Saaras V3 achieved a word error rate of about 19.3 per cent, compared to higher error rates for Gemini 3 Pro, GPT-4o Transcribe, Deepgram Nova-3, and Scribe v2. The company also said the performance gap widens on the remaining languages in the dataset, which include several lower-resource Indian languages.
Saaras V3 on IndicVoices benchmark (Source: Sarvam)
On the Svarah benchmark, which is built around Indian-accented English speech from speakers across multiple states, Saaras V3 again recorded the lowest word error rate among the compared systems, according to the figures shared by Sarvam. 
Saaras V3 on Svarah benchmark (Source: Sarvam)

Saaras V3: What’s new

Sarvam says Saaras V3 is built on a new architecture and expands support to all 22 scheduled Indian languages, along with English. A key change in this version is native support for real-time, streaming speech recognition, where the model begins producing text while audio is still playing, instead of waiting for the full clip to finish.
 
According to the company’s technical blog, Saaras V3 is trained on more than one million hours of multilingual audio covering different Indian languages, accents, and recording conditions, with a focus on code-mixed and noisy speech. Training involved large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, and additional post-training steps aimed at reducing long-tail errors and improving consistency across languages.
Sarvam says the streaming version of the model is designed to keep accuracy close to the batch mode while reducing latency, making it suitable for use cases such as live captions, voice assistants, call-centre tools, and real-time transcription.
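
The difference between batch and streaming recognition can be pictured with a small Python sketch. It is purely illustrative: `chunk_audio`, `stream_transcribe`, and the stand-in `fake_decoder` are hypothetical names invented here, not part of Sarvam's API, and a real system would decode audio frames with a model rather than look words up in a list.

```python
from typing import Callable, Iterable, Iterator, List


def chunk_audio(samples: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Split a sample buffer into fixed-size chunks, as a live feed would arrive."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def stream_transcribe(
    chunks: Iterable[List[int]],
    decode_chunk: Callable[[List[int]], str],
) -> Iterator[str]:
    """Yield a growing partial transcript after each chunk is decoded."""
    words: List[str] = []
    for chunk in chunks:
        text = decode_chunk(chunk)  # hypothetical per-chunk recogniser
        if text:
            words.append(text)
            yield " ".join(words)


if __name__ == "__main__":
    vocabulary = ["hello", "from", "New", "Delhi"]

    def fake_decoder(chunk: List[int]) -> str:
        # Stand-in for a model: maps the chunk's first sample to a word.
        return vocabulary[chunk[0] // 2]

    samples = list(range(8))  # pretend audio samples
    for partial in stream_transcribe(chunk_audio(samples, 2), fake_decoder):
        print(partial)
```

A batch system would return only the final line after all the audio had been processed. The streaming loop emits each intermediate transcript as soon as its chunk is decoded, which is what makes live captions and voice assistants feel responsive.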

Beyond basic transcription

Sarvam says Saaras V3 is positioned as more than a simple speech-to-text system. The model supports automatic language detection, word-level timestamps, and speaker diarisation, which allows it to separate and label different speakers in a conversation. These features are aimed at use cases such as call analytics, meeting transcripts, media subtitling, and customer support tools, where structure and speaker attribution matter in addition to raw text.
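
To show how word-level timestamps and speaker labels combine into a structured transcript, here is a minimal Python sketch. The `Word` and `Turn` records and the `group_into_turns` helper are assumptions made for illustration and do not reflect the actual response format of Saaras V3. The sketch merges consecutive words from the same speaker into a single timed turn, the kind of structure a call-analytics or meeting-notes tool would consume.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float   # seconds from the start of the audio
    end: float
    speaker: str   # diarisation label, e.g. "S1"


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words by the same speaker into one turn."""
    turns: List[Turn] = []
    for word in words:
        if turns and turns[-1].speaker == word.speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1].end = word.end
            turns[-1].text += " " + word.text
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns


if __name__ == "__main__":
    words = [
        Word("hello", 0.0, 0.4, "S1"),
        Word("there", 0.4, 0.8, "S1"),
        Word("hi", 1.0, 1.2, "S2"),
        Word("thanks", 1.5, 1.9, "S1"),
    ]
    for turn in group_into_turns(words):
        print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```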
 
The company has also exposed different operating modes that trade off latency and accuracy, ranging from a “fast” mode focused on low time-to-first-token to more accuracy-focused settings for applications where transcription quality is the priority.

Sarvam Vision and earlier benchmark claims

The Saaras V3 results follow earlier benchmark claims by Sarvam around its document-focused models. In previous disclosures, the company said its Sarvam Vision model posted higher accuracy scores than several general-purpose systems on tests focused on document OCR, layout understanding, and multi-script Indian documents. Those evaluations covered tasks such as reading order detection, table parsing, and handling complex page layouts, areas where models trained mainly on Western and English-language data often struggle with Indian scripts and formats.
Sarvam has positioned Sarvam Vision as a vision-language system built specifically for documents rather than for general image understanding, combining a core model with separate components for layout and structure analysis. The company has argued that this task-specific approach, along with training on Indian-language and Indian-format data, explains the performance differences seen in those benchmarks. The Saaras V3 results extend that same argument into speech recognition, particularly for Indian languages, code-mixed inputs, and Indian-accented English.

What is Sarvam AI

Sarvam AI is a Bengaluru-based startup focused on building speech, language, and multimodal AI systems for Indian use cases. Instead of training a single general-purpose chatbot, the company has been developing a set of task-specific models aimed at areas such as speech recognition, speech synthesis, translation, and document understanding, where performance depends heavily on how well systems handle local languages, scripts, and formats.
 
Alongside Saaras, its speech recognition line, Sarvam’s portfolio includes Bulbul, a text-to-speech system for Indian languages; Saarika, a speech-to-text model focused on transcription; Mayura, a text translation model; and Sarvam-M, a multilingual reasoning language model. On the vision side, Sarvam Vision is its document understanding model designed for OCR and layout-aware reading of scanned and photographed documents. The company has also built applications such as Samvaad, a voice-based conversational system that runs on top of its speech and language models.
 
It is one of the 12 startups working with the Indian government under the IndiaAI Mission to develop indigenous multilingual and multimodal large language models.


First Published: Feb 12 2026 | 10:50 AM IST
