Why multilingual and multimodal AI is central to India's AI 'impact' agenda
As the India AI Impact Summit nears, initiatives like BharatGen, BHASHINI and Adi Vaani highlight why multilingual and multimodal AI is becoming central to how India is building public digital systems
Multilingual and multimodal artificial intelligence is set to be one of India's core agendas at the upcoming AI Impact Summit, which begins on February 16 in New Delhi. Over the past few years, multiple government-backed projects and platforms have been rolled out to build AI systems that work across Indian languages and across formats such as text, speech and documents.
Among these projects are the Adi Vaani platform under the Ministry of Tribal Affairs, the BharatGen programme backed by the Department of Science and Technology and the Ministry of Electronics and Information Technology (MeitY), and the BHASHINI language platform under Digital India. Together, they point to a policy push that treats language and voice as core parts of India’s AI infrastructure rather than as add-ons.
What “multilingual” and “multimodal” AI means
“Multilingual” AI refers to systems that can understand and generate content in more than one language, including Indian languages that are often poorly represented in global datasets.
“Multimodal” AI refers to models that can work across different types of input and output, such as text, speech and images or documents, instead of being limited to only text-based interactions.
For public-facing systems, this matters because many government services and information flows rely on a mix of spoken queries, scanned documents and text forms. A multimodal system can, in principle, take a spoken question in an Indian language, read a document, and return an answer in speech or text. Several of the projects being backed by the government are designed around this idea, rather than around English-first, text-only models.
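As a rough illustration of that idea, the sketch below wires open-source components into the first half of such a pipeline: a spoken Hindi query is transcribed and then translated into English text that a downstream text system could answer. The model choices (Whisper for speech recognition, NLLB for translation) and the overall wiring are assumptions made for demonstration; they do not reflect how BHASHINI, BharatGen or Adi Vaani are built.

```python
# Illustrative speech-in pipeline for an Indian-language query.
# Model choices are assumptions for demonstration only; they are
# not the models used by the government platforms discussed here.
from transformers import pipeline

# Step 1: speech recognition -- spoken Hindi audio to Hindi text.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Step 2: translation -- Hindi text to English, e.g. to query an
# English-language document index or knowledge base.
translate = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="hin_Deva",
    tgt_lang="eng_Latn",
)

def answer_spoken_query(audio_path: str) -> str:
    """Transcribe a spoken query and return an English rendering
    that a downstream text system could act on."""
    hindi_text = asr(audio_path)["text"]
    return translate(hindi_text)[0]["translation_text"]
```

A production system would add the return leg, generating the answer and converting it back to speech in the user's language, but the basic shape of the pipeline (recognise, translate, answer, speak) stays the same.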
Why multilingual and multimodal AI matters for India
For India, the case for multilingual and multimodal AI is largely practical rather than theoretical. Government services, courts, welfare systems and local administrations operate across dozens of languages, and much of the information citizens interact with is not in a single, standardised format. Inputs range from scanned forms and notices to spoken queries at service centres and helplines, making text-only, English-first systems a poor fit for large parts of the population.
India has hundreds of languages and dialects in active use, and a significant share of citizens rely primarily on regional or local languages for day-to-day interactions with the state. This creates a gap in access when digital systems are designed mainly around English or a small set of major languages. Multilingual AI systems are meant to reduce that gap by allowing the same service or interface to work across different languages without requiring separate, manual translations for each one.
The “multimodal” aspect addresses a different constraint. In many government workflows, information is not limited to typed text. It includes scanned documents, images, and spoken inputs. A system that can only process text leaves large parts of this information outside the digital workflow. Multimodal models are intended to handle this mix by combining text, speech and document or image understanding in a single pipeline.
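A minimal sketch of that routing idea, assuming off-the-shelf open-source tools (pytesseract for OCR and a transformers speech model, both assumptions made for illustration): whatever arrives, whether a scanned notice, an audio clip or a text file, is first normalised into plain text so the rest of the pipeline can treat all inputs the same way.

```python
# Illustrative routing of mixed inputs (typed text, scanned documents,
# audio) into a single text pipeline. Library and model choices are
# assumptions for demonstration only.
from pathlib import Path

import pytesseract
from PIL import Image
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def to_text(path: str) -> str:
    """Normalise any supported input file into plain text."""
    suffix = Path(path).suffix.lower()
    if suffix in {".png", ".jpg", ".jpeg", ".tiff"}:
        # Scanned form or notice: run OCR on the image.
        # (lang="hin" assumes Tesseract's Hindi data is installed.)
        return pytesseract.image_to_string(Image.open(path), lang="hin")
    if suffix in {".wav", ".mp3", ".flac"}:
        # Spoken input: transcribe the audio to text.
        return asr(path)["text"]
    # Anything else is treated as a plain text file.
    return Path(path).read_text(encoding="utf-8")
```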
This is also why many of the current government-backed projects are being framed around public services rather than consumer applications.
Platforms such as BHASHINI are being positioned for use in citizen-facing portals and administrative processes, while programmes like BharatGen are being funded to build a broader stack of text, speech and document-vision models for Indian languages.
The underlying policy logic is that without language and modality coverage, large sections of the population remain effectively excluded from digital systems, even if connectivity and devices are available.
India’s sovereign AI push: Platforms, programmes and startups
Much of India’s current work on multilingual and multimodal AI is being deployed through publicly funded platforms and research programmes, alongside a growing set of domestic startups working on language and vision models for Indian use cases.
One example is Adi Vaani, a translation platform for tribal languages launched in beta by the Ministry of Tribal Affairs last year. Developed by a consortium led by IIT Delhi with BITS Pilani, IIIT Hyderabad, IIIT Naya Raipur and several State Tribal Research Institutes, the platform currently supports languages such as Santali, Bhili, Mundari and Gondi, with more under development. According to the ministry, the system is meant to handle both text and speech translation and is being positioned for use in areas such as education, governance communication and documentation of oral traditions.
At a broader level, the government-backed BharatGen programme is aimed at building a full-stack, multilingual and multimodal AI system covering text, speech and document understanding. The project is being led by IIT Bombay with a consortium of other institutions and is supported through the National Mission on Interdisciplinary Cyber-Physical Systems, with Rs 235 crore routed via the Technology Innovation Hub at IIT Bombay. In addition, the programme has received further funding of Rs 1,058 crore under the IndiaAI Mission, taking total government support to over Rs 1,200 crore. BharatGen has already released multiple models, including a text model, speech recognition and text-to-speech systems, and a document-vision model designed to work with Indian-language content and formats.
Alongside this, the Ministry of Electronics and Information Technology is running BHASHINI as a language AI platform for public services. BHASHINI currently supports more than 36 languages in text and over 22 in voice, with hundreds of language models deployed across government websites and applications. The focus here has been on translation, speech recognition and text-to-speech tools that can be integrated into citizen-facing systems rather than standalone consumer products.
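To make the integration pattern concrete, the sketch below shows how a citizen-facing portal might call a hosted translation service over HTTP. The endpoint, payload and field names are hypothetical placeholders invented for illustration; they are not BHASHINI's actual API.

```python
# Hypothetical integration sketch: a portal sending text to a hosted
# translation service. Endpoint and field names are placeholders,
# NOT BHASHINI's real API.
import requests

TRANSLATE_URL = "https://example.gov.in/api/translate"  # placeholder URL

def translate_notice(text: str, source: str = "en", target: str = "hi") -> str:
    """Send a block of portal text for machine translation."""
    response = requests.post(
        TRANSLATE_URL,
        json={"text": text, "source": source, "target": target},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["translated_text"]  # placeholder field name
```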
Outside the government system, several Indian startups are also working on what they describe as “sovereign” or India-focused AI models.
Bengaluru-based Sarvam AI, for instance, has published results on optical character recognition and speech models for Indian languages.
Krutrim, backed by Bhavish Aggarwal, is building a large multilingual language model focused on Indian contexts, while other companies and research groups, including initiatives such as AI4Bharat at IIT Madras, are working on open and commercial language models for Indian languages.
From policy to deployment: Where these systems are being used
So far, most of these multilingual and multimodal AI systems are being positioned first for use inside government workflows and public service delivery rather than as mass consumer products.
Platforms such as BHASHINI are being integrated into government portals and service interfaces, where translation, speech-to-text and text-to-speech tools can be used to make forms, advisories and help desks accessible in multiple languages. The Digital India BHASHINI Division has said the platform is already linked to hundreds of websites and live use cases across departments.
Similarly, BharatGen’s early demonstrations have focused on applications such as voice-based advisory systems for farmers, document question-and-answer tools for government records, and image-to-text or image-to-description systems for small businesses. These are designed to work with Indian languages and with inputs such as scanned documents or images, which are common in government and small-business workflows.
Adi Vaani, meanwhile, is being positioned more narrowly around tribal languages, with the stated aim of enabling translation, documentation and access to government information in languages that are often missing from mainstream digital platforms. At this stage, it remains in beta, with limited language coverage, but it reflects a similar approach of starting with public-sector and community-facing use cases.
The constraints: Data, quality and scale
Despite the scale of funding and institutional backing, these projects face practical challenges. One of the main issues is data. High-quality, labelled data in many Indian and tribal languages remains limited, especially for speech and for specialised domains such as legal or administrative documents. This affects both accuracy and reliability, particularly outside a small set of better-resourced languages.
Another constraint is variation. Indian languages differ widely in script, grammar, pronunciation and regional usage, which makes building a single system that works consistently across regions difficult. Even within the same language, dialect and accent differences can affect performance in speech systems.
There is also the question of scale and cost. Multimodal models that handle text, speech and documents require significantly more computing resources than text-only systems. This makes deployment across large government systems more complex, especially when these tools are expected to work in real time and at population scale.
First Published: Feb 10 2026 | 4:14 PM IST