Business Standard

India's Sarvam AI reportedly beats ChatGPT, Gemini in key benchmark tests

Indian AI startup Sarvam AI reports strong benchmark results in document OCR and Indic language understanding, outperforming several global models on layout and multi-script tests


Sarvam Vision is part of Sarvam AI’s India-focused stack of language and document models

Harsh Shivam New Delhi


Indian AI startup Sarvam AI has reported strong performance on a set of benchmarks focused on document understanding and Indian languages, putting its models ahead of several widely used systems on those specific tests. The results come from evaluations covering optical character recognition (OCR), document layout understanding, and Indic language processing, areas where global models often face accuracy issues with non-Latin scripts and complex page structures.
 
The benchmarks include document OCR and layout understanding tests such as olmOCR-Bench, along with internal and public evaluations covering multi-script Indian documents, tables, and mixed-layout pages. In these tests, Sarvam’s document model posted higher accuracy scores than several general-purpose vision and language models on the same tasks. 
The benchmarks were released ahead of the India AI Impact Summit, which begins on February 16 in New Delhi, where the startup is expected to demonstrate its sovereign AI models in the expo zone.
 

Sarvam AI

Sarvam AI is a Bengaluru-based startup founded in 2023 that builds language and multimodal AI systems focused on Indian use cases. The company works on models for document processing, speech, and language understanding, with training data drawn from Indian languages, scripts, and real-world material such as documents, textbooks, newspapers, and scanned records.
Instead of building a single general-purpose chatbot, Sarvam has focused on task-specific systems for areas like OCR, document parsing, speech synthesis, speech recognition, and translation, where performance depends heavily on how well models handle local languages and formats.

Sarvam Vision: Document-focused AI model

The main model behind the document benchmark results is Sarvam Vision, a vision-language model designed for document understanding rather than basic text extraction. Unlike traditional OCR systems that output plain text, Sarvam Vision is built to interpret layout, reading order, tables, charts, and structured elements in scanned or photographed documents.
 
Sarvam said the model is a three-billion-parameter system trained on a mix of real and synthetic documents, including textbooks, financial records, government documents, magazines, newspapers, and historical material, across multiple Indian languages and English. The system combines a core vision-language model with separate components for layout parsing and reading-order detection, which are used to reconstruct documents in a structured form.
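To make the idea of reading-order detection concrete, here is a minimal sketch of one simple heuristic for multi-column pages. It is not Sarvam's method, which the company has not detailed here. The `Block` type, its bounding-box fields, and the `reading_order` function are hypothetical names introduced only for illustration. The sketch assumes layout parsing has already produced text blocks with pixel coordinates.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical layout block: text plus a bounding box in page pixels."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def reading_order(blocks):
    """Order blocks column by column (left to right), top to bottom within each.

    Assumption: blocks that overlap horizontally belong to the same column.
    This is a toy heuristic, not a description of any production system.
    """
    columns = []
    for block in sorted(blocks, key=lambda b: b.x0):
        if columns and block.x0 < columns[-1]["right"]:
            # Overlaps the current column horizontally: add it there.
            columns[-1]["blocks"].append(block)
            columns[-1]["right"] = max(columns[-1]["right"], block.x1)
        else:
            # No overlap with the current column: start a new one.
            columns.append({"right": block.x1, "blocks": [block]})

    ordered = []
    for column in columns:
        ordered.extend(sorted(column["blocks"], key=lambda b: b.y0))
    return ordered


# Two-column page supplied in scrambled order.
page = [
    Block("C", 120, 0, 220, 100),
    Block("B", 0, 200, 100, 300),
    Block("D", 120, 200, 220, 300),
    Block("A", 0, 0, 100, 100),
]
ordered_texts = [b.text for b in reading_order(page)]
```

A heuristic like this breaks down on pages with full-width headlines, tables, or irregular layouts, which is one reason document models may use a dedicated reading-order component instead of simple sorting.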
 
On olmOCR-Bench, a benchmark designed to test real-world OCR and document understanding performance, Sarvam Vision reported an accuracy score of 84.3 per cent. That is higher than the scores of several general-purpose models evaluated on the same benchmark, including Google’s Gemini 3 Pro (80.2 per cent) and OpenAI’s GPT 5.2 (69.8 per cent), a lead of 4.1 and 14.5 percentage points respectively. Sarvam reported the advantage particularly on pages with complex layouts and non-Latin scripts.
Sarvam Vision on olmOCR benchmark (Source: Sarvam)
The company has also published results on Indic-language document tests covering multiple scripts, where it reports higher word-level accuracy across a wide range of Indian languages compared to other OCR and vision-language systems.
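The article does not say how Sarvam computes word-level accuracy. One common convention, assumed here purely for illustration, is one minus the word error rate: the word-level edit distance between the model output and the reference text, divided by the number of reference words. The function names below are hypothetical.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the output
                curr[j - 1] + 1,     # extra word in the output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Return 1 - WER, clipped at 0. Assumes whitespace-separated words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 1.0 if not hyp_words else 0.0
    errors = word_edit_distance(ref_words, hyp_words)
    return max(0.0, 1.0 - errors / len(ref_words))


# Hindi example: one of five words is misrecognised in the OCR output.
reference = "भारत एक विशाल देश है"
ocr_output = "भारत एक विशल देश है"
score = word_accuracy(reference, ocr_output)  # one error in five words
```

Whitespace tokenisation is a simplification. Published benchmarks may normalise Unicode, punctuation, or script-specific characters before scoring, so scores from this sketch would not be directly comparable to reported figures.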

What the benchmarks indicate

The recent results place Sarvam AI’s models ahead of several widely used systems on document OCR and layout understanding tasks, particularly for Indian scripts and complex page structures. The tests focus on how accurately a model can read text, follow reading order, and interpret structured content such as tables and multi-column layouts, rather than on general conversational ability.
 
This distinction matters because general-purpose AI models are trained to handle a wide range of tasks, while Sarvam’s systems are trained more narrowly on documents and Indian languages. That targeted training shows up in benchmarks that measure OCR accuracy, layout parsing, and script-level recognition.

Sarvam AI’s wider model portfolio

Alongside Sarvam Vision, the company has built a wider set of speech and language models:
 
Bulbul: Text-to-speech model
 
Bulbul is Sarvam’s speech synthesis system that converts text into spoken output. Bulbul V3 is designed to handle multiple Indian languages, accents, and code-mixed speech patterns, and is aimed at use cases such as voice interfaces, assistants, and accessibility tools.
 
Saarika: Speech-to-text model
 
Saarika is a speech recognition model that converts spoken Indian-language audio into text. It supports transcription in around 11 Indian languages and is the base option for speech-to-text tasks.
 
Saaras: Speech-to-text translation model
 
Saaras combines speech recognition with direct translation, transcribing spoken input and outputting translated text, such as Indian language speech into English, in a single step.
 
Mayura: Text translation model
 
Mayura handles translation between languages, trained on conversational and real-world data across Indian languages. It is designed for more colloquial and contextual translation needs.
 
Sarvam-M: Multilingual reasoning language model
 
Sarvam-M is the company’s reasoning and conversational language model. Built as a multilingual text model with hybrid reasoning capabilities, it is tuned for better performance on Indian language benchmarks as well as tasks involving logic, maths and extended context.
 
On the application side, Sarvam runs Samvaad, a voice-based conversational system built on top of its speech and language models. Samvaad is designed to handle spoken interactions in multiple Indian languages and is aimed at use cases such as customer support, information access, and voice-driven services, particularly in settings where text-first interfaces are less practical.
 


First Published: Feb 10 2026 | 5:01 PM IST
