It supports over 20 controllable speaking styles, including natural patterns such as hesitation, excitement, and warmth. Users can insert more than 20 emotion tags, such as laugh, sigh, whisper, anger, and giggle, directly into the text, and the model changes its speech patterns accordingly. These tags can be composed, allowing the tone to switch mid-sentence for a natural, emotional speaking style. Equally important, Maya1 does this without discernible lag, scanning text and speaking with less than 100 ms (milliseconds) of latency. This makes its delivery hard to distinguish from human speech, since the model reads text at roughly the same pace as an educated human reader.
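To illustrate the idea of composable inline tags, here is a minimal sketch of how such a tagged prompt might be assembled before being sent to the model. The tag set, the prompt format, and the compose_prompt helper are assumptions made for illustration, not Maya1's documented interface.

```python
import re

# Illustrative subset of emotion tags; the real model is reported to support over 20.
SUPPORTED_TAGS = {"laugh", "sigh", "whisper", "anger", "giggle", "excited"}


def compose_prompt(voice_description: str, text: str) -> str:
    """Combine a voice description with text containing inline emotion tags."""
    # Reject any tag outside the (hypothetical) supported set before synthesis.
    for tag in re.findall(r"<(\w+)>", text):
        if tag not in SUPPORTED_TAGS:
            raise ValueError(f"Unsupported emotion tag: <{tag}>")
    return f'<description="{voice_description}"> {text}'


prompt = compose_prompt(
    "Female voice, warm and conversational",
    "I can't believe you did that <laugh> but honestly, <whisper> don't tell anyone.",
)
print(prompt)
```

Because the tags sit inside the text itself, a single utterance can shift tone mid-sentence, which is what allows the composed, emotion-rich delivery described above.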
This versatility makes it suitable for a wide range of use cases. For podcasts, audiobooks, and video content, Maya1 can narrate long-form material with an emotional range, using different voices for different personalities. It can do the same for video-game characters that need emotional delivery. It can also power AI voice assistants for accessibility, aiding users with visual impairments, and for customer service, where its low latency allows responsive interaction. Technically, Maya1 is a three-billion-parameter, decoder-only transformer fine-tuned from a Llama base. It is available under the Apache 2.0 licence and is free to download, tweak, and deploy for commercial use. Given its moderate hardware requirements, it can run locally on a device with a single graphics-processing unit (GPU), with no cloud dependency, allowing it to be deployed easily in rural and low-bandwidth settings.
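Because the model is an openly licensed, Llama-style decoder, a local, single-GPU setup can in principle be as simple as loading it with the Hugging Face transformers library. The repository id, prompt format, and generation settings below are assumptions for illustration, and converting the generated audio tokens back into a waveform requires the model's accompanying neural codec, which is omitted here.

```python
# Hedged sketch of a local, single-GPU setup; repo id and prompt format are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "maya-research/maya1"  # assumed Hugging Face repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # ~6 GB of weights for 3B parameters in bf16
    device_map="cuda",           # everything stays on one local GPU, no cloud calls
)

prompt = '<description="Calm male narrator"> Welcome back <laugh> to the show.'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# A speech decoder of this kind emits discrete audio-codec tokens rather than text;
# a separate codec step (not shown) would turn them into a waveform.
audio_tokens = model.generate(**inputs, max_new_tokens=512, do_sample=True)
```

At three billion parameters in bfloat16, the weights occupy roughly 6 GB of memory, which is why a single consumer GPU is enough and no cloud dependency arises.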
Maya Research, a startup, developed its TTS system at minimal cost, using only free cloud credits from Amazon Web Services and Google Cloud. There was no venture-capital backing and no data centre. The team spent eight months collecting speech in rural India, paying people to record real conversations. Right now, the model works only in English, but Maya Research is building what it claims will be an Indic speech dataset 10-15 times larger than what exists online. The upgraded version is targeted for release by June next year. Maya Research is, therefore, betting that AI will in future be spoken, not typed, and that by doing this in India, the voice layer will be hosted locally and the models built on domestic accents and everyday sounds. This could be a big boost for India, where, despite the plethora of rich local languages, research into “speech AI” is sparse and public datasets like Bhashini are limited in scope. Maya1 demonstrates that innovative voice-AI models can be developed cheaply while still achieving high production quality and emotion-rich, real-time delivery. It could inspire many new projects.