Specialised SLMs better suited to India's needs than LLMs, says Gnani.ai CTO
In an interview at the India AI Impact Summit, Ananth Nagaraj discusses Gnani.ai's shift from speech systems to voice-to-voice models, sector-specific AI and its plans under the IndiaAI Mission
Ananth Nagaraj, co-founder and CTO of Gnani.ai
Gnani.ai (pronounced: Gyani) has released a 5-billion-parameter voice-to-voice artificial intelligence model as a preview, in the run-up to a 70-billion-parameter multimodal AI model it plans to release soon. Its focus on voice-first AI and sovereign models has put it among the startups building India’s AI stack under the IndiaAI Mission. At the India AI Impact Summit 2026, Ananth Nagaraj, co-founder and CTO of Gnani.ai, spoke to Business Standard’s Khalid Anzar and Harsh Shivam about the company’s journey from early speech recognition systems to its new voice-to-voice models, how government support is shaping its roadmap, and why it believes smaller, domain-focused models will matter as much as large frontier systems. Edited excerpts:
What exactly is Gnani.ai?
Gnani means knowledge. We started the company in 2017, before AI was “hot”, and named it Gnani.ai because it stands for knowledge. We have been building AI since before it became mainstream.
We were among the first in India to build speech-to-text systems for Indian languages. We launched it for Kannada, Telugu, Tamil, Hindi, and others. From speech-to-text, we evolved into speech intelligence, then into automation for contact centres. Today, we are at a stage where we have launched a voice-to-voice model that supports 14 Indian languages.
The journey started in 2017, when frontier models were not available. How did that shape your work?
At that time, research was around models like Bidirectional Encoder Representations from Transformers (BERT) and earlier language models. We were operating in the era of speech recognition, text-to-speech and using models like BERT to understand what was spoken and respond.
This was the time when Alexa, Bixby, and Google Assistant were coming up. Then frontier models arrived with text-first systems, and now those are moving towards multimodal systems. We were already working on voice understanding and processing, and now we have brought frontier-model approaches into our voice systems. We have also built a large language model with around 14 billion parameters, of which a 5-billion-parameter version is now available as a preview release.
Did the rise of frontier models change the company’s trajectory?
Yes, I would say over the last two or three years, the journey has accelerated. Earlier, we had to convince businesses that certain things could be done with technology. Today, they already know it can be done. The question now is how fast it can be deployed.
What was the business earlier, and what is it now?
Earlier, we were also doing voice-bots, but for more constrained use cases. For example, collections for loans using rule-based or NLP-based systems. Today, those same systems can be more efficient and handle tasks closer to what a human agent would do.
Earlier, we were handling more L1-type queries. Now that has moved to L2 and L3. Over time, it can go beyond that.
Was the company always sector-focused?
Not sector-focused, but industry-focused. We looked at call centres as the biggest use case for voice. We worked across inbound customer support, outbound collections, telemarketing and sales.
Some of our early customers included insurance companies, collections firms and automotive companies like Tata Motors. For example, when someone registers interest on the Tata Motors website, a voice bot calls, qualifies the lead, answers questions, checks for trade-in, financing, and can even book a test drive. That brought more structure and accountability to the process.
Around this phase, Samsung also came in as a strategic investor, when it was working on taking Bixby to Indian languages using our speech layer, though that product itself did not scale in the way initially expected.
Today, this has expanded further. People are using these systems for recruitment interviews, full inbound support, and even booking hotel rooms through voice, like in our work with OYO.
Are Tata and Samsung still part of the journey?
Tata Motors is still our customer. Our first customer was TVS Credit, and they are still with us. Bajaj Life Insurance was our first insurance customer and is still with us as well. Our customer churn is under one or two per cent.
Samsung was not a customer, but a strategic investor. They invested when they were working on Bixby for Indian languages. Samsung is still an investor as of today.
When did the government start showing interest in Gnani.ai?
The IndiaAI Mission started around 2024. We were selected as one of the startups under it, in the first cohort. Before that, there was an India AI white paper process in which we were involved. We applied under the voice AI category and went through the selection process.
What kind of support are you getting from the government?
We get access to GPUs — up to around 1,500 GPUs through the IndiaAI Mission. That is critical for training these models.
Are the models being contributed to AI Kosh?
It is a mix. Some models are commercial, and some are open source. The final voice-to-voice model will be open source.
You have launched a voice-to-voice model. How is this different from multimodal?
In India, almost no one is doing multimodal at scale yet. Globally, maybe four or five companies are doing it. Our current model takes audio and text. The reason this matters is that many people in India can speak in their native language but cannot read or write comfortably.
India still has a large base of feature phone users. So how do you give them access to AI? Voice-first systems can play a role there.
Is multimodal in the pipeline?
Yes. Today, we are working with voice and text. Video is the next mode. In our demos, you can see an avatar system where, for example, someone can show an Aadhaar card, the system reads it, does face recognition, fills a form through voice and completes an enrolment. The idea is to fuse text, voice and video into one system over time.
How do you define the current model then?
What we have launched is a voice-to-voice model, but we also have multimodal capabilities in development. The final model will combine text, voice and video. Today, video is handled in a cascaded way, but later all three will be fused.
What is the roadmap in terms of model size?
We have a five-billion-parameter model as a preview. The next one is 14 billion parameters, and then we plan to go to 32 billion, and eventually around 70 billion over the next year.
How do you plan to deploy such large models, especially in low-connectivity areas?
We are building a range of models — from small ones that can run on a phone or laptop to larger ones that need servers. You could have a sub-one-billion-parameter model of a few hundred megabytes for edge use, and scale up to larger models when infrastructure allows.
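As a rough illustration of that footprint claim (the parameter counts and precisions below are illustrative assumptions, not Gnani.ai’s published specifications), the size of a model’s weights is simply the parameter count multiplied by the bits stored per parameter:

```python
# Back-of-the-envelope weight footprint at different precisions.
# Parameter counts and bit-widths here are illustrative assumptions,
# not Gnani.ai's actual model configurations.

def weight_footprint_mb(num_params: int, bits_per_param: int) -> float:
    """Approximate size of the raw weights in megabytes (MB)."""
    return num_params * bits_per_param / 8 / 1e6

for num_params, label in [(800_000_000, "0.8B edge model"),
                          (5_000_000_000, "5B preview model")]:
    for bits in (16, 8, 4):
        print(f"{label} at {bits}-bit: ~{weight_footprint_mb(num_params, bits):,.0f} MB")
```

At 4-bit precision, an 800-million-parameter model’s weights come to roughly 400 MB, consistent with the “few hundred megabytes” figure, though real deployments also need memory for the runtime and the attention cache.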
Are these all sovereign models?
Yes. All are built in India, by Indian engineers, using proprietary data. We have around 14.5 million hours of audio data.
When can we expect the larger models?
The 14-billion-parameter model should be ready in about six months. The 70-billion-parameter one in about a year.
Who are these models being built for?
We are opening access to the enterprises we already work with — around 100 to 200 organisations — and rolling it out based on their use cases. Government deployment depends on procurement processes, so I won’t comment on that.
You mentioned that you have several distilled models depending on where the model is used. Are there also distilled models based on use cases or sectors?
Yes. We already build smaller language models for specific domains. For example, we have SLMs for finance and telecom. There are sub–three-billion-parameter models that can be further distilled and quantised to run at the edge.
If you take a use case like deploying a kiosk in a village without reliable internet, you could have an agriculture-focused model that handles citizen or farmer services, with voice and text support, running on a laptop or even a phone. That can be deployed almost anywhere in the country.
So the idea is to have models that range from something that can sit on a phone to much larger models that need server infrastructure, depending on the use case and constraints.
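To make the edge-deployment idea concrete, here is a minimal sketch of running a quantised small language model fully offline with the llama-cpp-python runtime. This is not Gnani.ai’s stack: the model file name is hypothetical, and this runtime is just one common way to serve a distilled, quantised SLM on a laptop with no internet access.

```python
# Minimal offline inference sketch with llama-cpp-python (pip install llama-cpp-python).
# Assumption: "agri-slm-3b-q4.gguf" is a hypothetical distilled, 4-bit-quantised
# domain model stored locally; illustrative only, not Gnani.ai's actual stack.
from llama_cpp import Llama

llm = Llama(
    model_path="./agri-slm-3b-q4.gguf",  # local quantised weights, no network needed
    n_ctx=2048,                          # context window for a short citizen-service dialogue
)

prompt = "Farmer query: When should I irrigate my wheat crop after sowing?\nAnswer:"
result = llm(prompt, max_tokens=128, stop=["Farmer query:"])
print(result["choices"][0]["text"].strip())
```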
Global models already do many of these things. How do you see Gnani.ai’s position?
Global models show these things can be done well in English. But when you come to Indian languages, performance drops. We are building models for Indian languages and Indian use cases. There is also an opportunity to serve other countries with similar needs.
Not every use case needs a trillion-parameter model. Many industry and sovereign applications can be solved with 20 billion, 50 billion, or even smaller models. Larger models increase infrastructure and inference costs. We believe the world will move towards more specialised, smaller, high-performance models rather than relying only on one massive foundation model.
Are there any consumer-facing products today?
We are not building a ChatGPT-style consumer app. But we do have tools for text-to-speech, voice cloning and translation. For example, you can upload audio and get it in different Indian languages. That is available through our Inya platform.
First Published: Feb 18 2026 | 11:10 AM IST