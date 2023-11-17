Microsoft has recently announced a new text-to-speech feature with vision capabilities that lets users create talking avatar videos from text input. The feature can also be used to build interactive bots trained on human images.

The text-to-speech avatar system lets customers generate synthetic videos of a photorealistic 2D avatar speaking. The neural text-to-speech avatar model is trained with deep neural networks on human video recording samples, and the avatar's voice is provided by a text-to-speech voice model.

The text-to-speech avatar helps users create more engaging digital interactions and build conversational agents, chatbots, virtual assistants and more.

Microsoft says the feature is designed to protect the rights of individuals and society, foster transparent human-computer interaction, and counteract the proliferation of harmful deepfakes and misleading content.

Why did Microsoft build a text-to-speech avatar?

According to Microsoft, the text-to-speech avatar addresses two needs:

Traditional video production generally demands considerable time and budget for setting up a shooting environment, filming, editing and so on. The avatar reduces dependence on these traditional methods and helps users create videos more efficiently. It can also help users produce training videos, customer testimonials, product introductions and more from text input.

With the release of Azure OpenAI Service and neural text-to-speech, interactive conversation is much more natural than before. The avatar helps create more engaging digital interactions, and users can also build conversational agents, virtual assistants, chatbots and more with it.

According to the official website, the content generation workflow has three components: a text analyser, a TTS audio synthesiser, and a TTS avatar video synthesiser.

The company offers two separate text-to-speech avatar features at this time. One is a prebuilt text-to-speech avatar and the other is a custom text-to-speech avatar.

According to the company website, “Microsoft offers prebuilt text-to-speech avatars as out of box products on Azure for its subscribers. These avatars can speak different languages and voices based on the text input. Customers can select an avatar from a variety of options and use it to create video content or interactive applications with real-time avatar responses.”

Video content creation through text-to-speech avatar

Start with a talking script for your avatar, written either as plain text or in Speech Synthesis Markup Language (SSML). SSML lets you tune the avatar's voice, including the pronunciation and expression of terms such as brand names, and add gestures such as waving or pointing to an item.
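
As an illustration of the script step, the sketch below wraps plain text in a minimal SSML document and uses the standard SSML sub element to give a brand name a spoken alias. The helper name build_ssml, the voice name and the brand name are assumptions chosen for the example, not values from Microsoft's announcement, and the avatar-specific gesture markup is not shown.

```python
from xml.sax.saxutils import escape, quoteattr

# W3C namespace for Speech Synthesis Markup Language documents.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US", aliases=None):
    """Wrap plain text in a minimal SSML document (hypothetical helper).

    `aliases` maps a written term, such as a brand name, to the form that
    should be spoken; each occurrence is wrapped in <sub alias="...">.
    The default voice name is only an example value.
    """
    # Escape &, < and > so the text is safe inside XML.
    body = escape(text)
    for term, spoken in (aliases or {}).items():
        sub = f"<sub alias={quoteattr(spoken)}>{escape(term)}</sub>"
        body = body.replace(escape(term), sub)
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    print(build_ssml("Welcome to Contoso.", aliases={"Contoso": "con-TOH-so"}))
```

The simple string replacement is enough for a short script; a script with overlapping terms would need a proper XML builder instead.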

Once your talking script is ready, you can use the Azure TTS 3.1 API to synthesise your video. Alongside the SSML input, you can specify the avatar character, its style, and the format of the output video.
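
The inputs named above can be gathered into a single request description before anything is sent to the service. The sketch below is only that: the class AvatarVideoRequest, its field names and the placeholder values are hypothetical and do not reproduce the actual Azure TTS 3.1 request schema, which should be taken from Microsoft's API reference.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AvatarVideoRequest:
    """Hypothetical container for the inputs the article lists.

    Field names are illustrative and are NOT the real API field names.
    """
    script: str            # the talking script, as plain text or SSML
    script_type: str       # "PlainText" or "SSML"
    avatar_character: str  # which avatar to render
    avatar_style: str      # the avatar's style
    video_format: str      # desired output video format

    def __post_init__(self):
        # Reject obviously incomplete requests before they are serialised.
        if self.script_type not in ("PlainText", "SSML"):
            raise ValueError("script_type must be 'PlainText' or 'SSML'")
        for name in ("script", "avatar_character", "avatar_style", "video_format"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")

    def to_json(self) -> str:
        """Serialise the request description as indented JSON."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    request = AvatarVideoRequest(
        script="Welcome to our product tour.",
        script_type="PlainText",
        avatar_character="example-character",  # placeholder value
        avatar_style="example-style",          # placeholder value
        video_format="mp4",                    # placeholder value
    )
    print(request.to_json())
```

Validating the inputs up front means a missing character or format is caught locally rather than after a synthesis job has been submitted.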

If you want, you can also add content such as images, videos with text, animations and illustrations to the final video.

Finally, combine all your assets, including the avatar video, the content and optional background music, to compose a rich video experience.