ElevenLabs, a voice AI (artificial intelligence) platform backed by Sequoia Capital and Andreessen Horowitz (a16z), is among the earliest firms to have crossed the uncanny valley of synthetic speech and made AI-generated voice sound natural and human. The company, which raised $500 million at a valuation of $11 billion, counts India as one of its strategic markets. Mati Staniszewski, cofounder and chief executive officer of the company, visiting India for the AI Impact Summit, tells Aashish Aryan and Shivani Shinde that India will perhaps be the first country in the region to see voice AI at wider scale, making voice the default for all digital experiences. Edited excerpts:
How do you solve the voice AI challenge for a country like India, where the sheer volume and diversity of data required to get diction, tone, and voice right are so complex?
Across markets, we’ve seen that two factors determine whether a voice agent works: latency (how fast it responds) and the quality of the voice. Voice quality varies by region, language, and use case. From a modelling standpoint, our first step was to build a foundational architecture that is abstracted enough to scale across languages. We’ve rolled out support for 11 Indian languages and are working toward covering 22 over the coming months. That expansion has been a key driver of the enterprise adoption we’ve seen in the last six months, because companies can now deploy across regions rather than being limited to one or two languages.
The second layer is dialect and voice variation. It’s not enough to support a language; you need different accents, age groups, genders, tonal styles, and use-case-specific voices. To address that, we created a “voice marketplace” where people can create their own voices. We authenticate each voice and, as it starts getting used, the creator earns compensation. Today, we have around 1,500 professional-grade voices across styles available in India, and we have paid out $1 million to this community.
What has been the operational impact of your voice AI models? Are they working alongside humans, or in some cases replacing them?
The largest adoption over the past six months has been at the L1 layer of customer care — first-line, high-volume interactions.
Companies such as Meesho, Cars24, TVS Motor, and IDFC Bank are automating inbound and outbound voice workflows where speed and consistency matter most.
What’s emerging now — and this is the more strategic shift — is the move from reactive to proactive voice engagement. Instead of voice being limited to “call us when there’s a problem”, it’s becoming embedded across the customer journey. Imagine logging into an ecommerce platform and interacting with a voice concierge that helps you navigate, recommends products based on your history, and guides you through checkout.
That’s a very different paradigm. The shift from reactive to proactive engagement across the entire customer journey will be one of the biggest changes.
The pattern we see is that AI handles structured, repeatable L1 workflows, while humans step in for complex, nuanced, or exception-based cases.
Where do you see voice AI over the next three years?
When we started the company four years ago, the vision was very much about transforming how people interact with technology, and we think that couldn’t be more real now. I think India is going to be one of the first regions where we see voice AI at wider scale.
What this means is that all digital experiences will have voice as a default: whether you call into customer care or open a website, you will have a voice agent or voice concierge available to help you.
Technology has historically required humans to adapt to it. Voice flips that equation. It allows technology to adapt to human communication patterns. As you think about India, with its 22 languages, people will finally be able to access intelligence through voice as that new interface, and that’s what we are excited about.
With most leading LLM companies becoming multimodal — offering image, video, and voice — how does a company like ElevenLabs, which is audio-first, position itself against trillion-dollar players building everything under one roof?
We made a deliberate choice early on: to specialise deeply in voice. From day one, we built foundational voice models (text-to-speech, speech-to-text, and conversational orchestration) and then layered a full platform on top of that.
We’ve moved from model to platform to application. That vertical integration is important. Voice isn’t a feature for us; it’s the core competency.
On benchmarks, our text-to-speech and conversational models consistently outperform broader multimodal systems because we optimise specifically for audio realism, latency, and linguistic depth.