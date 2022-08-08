



Earlier this summer, at the re:MARS conference (an Amazon-sponsored event focused on machine learning, automation, robotics, and space), Rohit Prasad, principal scientist and vice president of Alexa AI, tried to wow the audience with a paranormal parlor trick: talking to the dead. “AI can’t take away the pain of that loss, but it can definitely make their memories last longer,” he said, before showing a short video of a cute boy asking Alexa to have his grandmother read to him.

The woman’s voice that reads a few sentences from a book sounds grandmotherly enough. Ars Technica called the demo “morbid.” But Prasad’s revelation of how the “trick” was done was truly breathtaking: scientists at Amazon were able to summon Grandma’s voice from just a one-minute sample of her speech. And they can easily do the same with almost any voice, a prospect that can be evocative, terrifying, or a combination of both.

The fear of “deepfake” voices that can fool humans and voice recognition technology is not unfounded. Yet consumers seem to want technology that sounds more human (although Google’s voice assistant, which mimicked the “hmms” and other tics of human speech, was criticized for being too realistic).

This is driving a wave of innovation and investment in AI-powered text-to-speech (TTS) technology. A Google Scholar search turns up more than 20,000 research papers on text-to-speech synthesis published since 2021. The global text-to-speech market is projected to reach $7 billion by 2028, up from about $2.3 billion in 2020, according to Emergen Research.

Today, TTS is most widely used in digital assistants and chatbots. But new voice applications in games, media, and personal communications are easy to imagine: custom voices for virtual personas, text messages read aloud, narration by absent (or deceased) actors, and more. The metaverse is also changing the way we interact with technology.

Frank Chang, a founding partner at Flying Fish, an AI-focused venture fund in Seattle, said: “Everyone thinks voice recognition is hot, but ultimately, if you’re talking to something, don’t you want it to speak to you? To the extent that it can be personalized, with your voice or the voice of the person you want to hear, all the more so.” Providing accessibility for people with visual impairments, limited mobility, and cognitive disabilities is another factor driving the development of voice technology, especially for e-learning.

Whether or not you like the idea of “Grandma Alexa,” the demo highlights how quickly AI has transformed text-to-speech, and it suggests that convincingly fake human voices may be much closer than we think.

The original Alexa voice, released with the Echo device in November 2014, is believed to have been based on the voice of Boulder-based voice-over artist Nina Rolle (something neither Amazon nor Rolle has ever confirmed), and it relied on technology developed by Ivona, a Polish text-to-speech company that Amazon acquired in 2013. In 2017 VentureBeat wrote:

Early versions of Alexa used a form of “concatenative” text-to-speech. It works by compiling a large library of audio fragments recorded from a single speaker, which can be recombined to produce complete words and sounds. Imagine a ransom note, in which letters are cut out and pasted together to form new sentences. This approach produces realistic-sounding, intelligible audio, but it requires hours of recorded voice data and a lot of fine-tuning. Its reliance on a recorded sound library also makes the voice difficult to modify. Another technique, known as parametric TTS, does not use recorded speech. Instead, it starts with statistical models of individual speech sounds, which are assembled into sequences of words or sentences and processed by a speech synthesizer called a vocoder. (Google’s “standard” Text-to-Speech voices use a variation of this technique.) It gives more control over the voice output, but results in a muffled, robotic sound. You wouldn’t want it to read you a bedtime story.
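The ransom-note idea can be made concrete with a toy sketch in Python. Everything in it is an assumption made for illustration: the fragment library, the fake sample values, and the function names `split_into_units` and `synthesize` are hypothetical and do not describe how Alexa or Ivona actually worked. It only shows the general shape of the approach: cover the text with the longest recorded fragments available, then paste them end to end.

```python
# Toy illustration of concatenative synthesis. Everything here is hypothetical:
# real systems store recorded audio units, not short lists of made-up numbers.
from typing import Dict, List

# Hypothetical "recorded" fragments: each unit maps to a few fake audio samples.
UNIT_LIBRARY: Dict[str, List[float]] = {
    "hel": [0.1, 0.3],
    "lo": [0.2, -0.1],
    "wor": [0.4, 0.0],
    "ld": [-0.2, -0.3],
    " ": [0.0],  # short silence between words
}


def split_into_units(text: str, library: Dict[str, List[float]]) -> List[str]:
    """Greedily cover the text with the longest fragments the library contains."""
    units: List[str] = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):  # try the longest candidate first
            if text[i:end] in library:
                units.append(text[i:end])
                i = end
                break
        else:
            # No recorded fragment covers this position: the library is the limit.
            raise KeyError(f"no recorded unit for {text[i:]!r}")
    return units


def synthesize(text: str, library: Dict[str, List[float]] = UNIT_LIBRARY) -> List[float]:
    """Paste the stored fragments end to end, ransom-note style."""
    samples: List[float] = []
    for unit in split_into_units(text, library):
        samples.extend(library[unit])
    return samples
```

Calling `synthesize("hello world")` stitches five fragments together, while `synthesize("help")` fails because no fragment covers the final letter. That failure is the toy version of the limitation described above: the voice can only say what its recorded library can be recombined into.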

Amazon, Google, Microsoft, Baidu, and other major text-to-speech companies have adopted some form of “neural TTS” (NTTS) in recent years to create new, more expressive and natural-sounding voices. NTTS systems model speech waveforms from scratch, using deep neural networks trained on human speech to dynamically transform text input into fluid speech. Neural systems can learn not only pronunciation but also patterns of rhythm, accent, and intonation, which linguists call “prosody.” They can also learn new speaking styles and switch speaker “identities” with relative ease.

Google Cloud’s Text-to-Speech API now offers developers more than 100 neural voices in languages ranging from Arabic to Vietnamese (plus regional dialects), in addition to “standard voices” that use the older parametric TTS. Microsoft’s Azure gives developers access to more than 330 neural voices across more than 110 languages and dialects, with a range of speaking styles (newscast, customer service, shouting, whispering, angry, excited, cheerful, sad, frightened, and so on). Azure neural voices have also been adopted by companies such as AT&T, Duolingo, and Progressive. (In March, Microsoft completed the acquisition of Nuance, a leader in conversational AI and a partner in building Apple’s Siri, whose Vocalizer service provides more than 120 neural chatbot voices in over 50 languages.) Amazon’s Polly text-to-speech API supports 30 neural voices in 20 languages and dialects.

The technology underlying the Grandma’s Voice demo was developed by scientists at Amazon’s text-to-speech lab in Gdansk, Poland. In a research article, the developers describe a novel approach to replicating new voices from very limited samples. Essentially, they divide the task into two parts. First, the system converts the text into “generic” speech using a model trained on 10 hours of speech from another speaker. Then a “voice filter,” trained on a one-minute sample of the target speaker’s voice, applies a new speaker identity, changing the characteristics of the generic voice so that it sounds like the target speaker. Very few training samples are required to create a new voice.
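One rough way to picture that two-part split is the Python sketch below. It is a loose illustration under stated assumptions, not Amazon’s method: the `Frame` record, the `generic_tts` function, the `VoiceFilter` class, and the made-up timing rule are all hypothetical stand-ins for what are, in reality, trained neural models operating on audio.

```python
# Hypothetical two-stage pipeline: generic speech first, then a voice filter.
# Labels and numbers stand in for real audio; no actual model is involved.
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Frame:
    word: str
    duration_ms: int  # timing assigned by the generic model
    timbre: str       # stand-in label for how the voice sounds


def generic_tts(text: str) -> List[Frame]:
    """Stage 1 (hypothetical): a model trained on hours of one speaker's audio."""
    # Invented rule: longer words are simply held longer.
    return [Frame(word=w, duration_ms=80 * len(w), timbre="generic") for w in text.split()]


@dataclass(frozen=True)
class VoiceFilter:
    """Stage 2 (hypothetical): a small model fitted to a short target-voice sample."""
    target_timbre: str

    def apply(self, frames: List[Frame]) -> List[Frame]:
        # Swap in the target speaker's sound; leave everything else as produced.
        return [replace(f, timbre=self.target_timbre) for f in frames]
```

For example, `VoiceFilter("grandma").apply(generic_tts("once upon a time"))` returns the same four words with the same timings, each relabeled with the target voice. The design point of the sketch is that the expensive first stage is built once, and only the small second stage has to be refit for each new speaker.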

Rather than building a new text-to-speech model for each new voice, this modular approach turns the process of creating a new speaker identity into the computationally simpler task of changing one voice into another. In objective and subjective measurements, the quality of synthesized speech produced by this method was comparable to that of a model trained on 30 times more data. However, the method cannot perfectly imitate the way a particular person speaks. In an email to Fast Company, Alexa researchers explain that the voice filter changes only the timbre, or fundamental resonance, of the speaking voice. The prosody of the voice (its rhythm and intonation) comes from the generic voice model. In other words, it would sound like Grandma reading aloud, but without her distinctive way of stretching out certain words and taking long pauses between others.

Amazon has not disclosed when the new voice-cloning feature will be available to developers or the general public. A spokesperson wrote in an email: “We are working to improve on the basic science we demonstrated at re:MARS and are looking for use cases that will delight our customers, with the necessary guardrails to avoid potential misuse.”

You can imagine Amazon offering something like Reading Sidekick (an Alexa feature that lets kids take turns reading with Alexa) with the ability to customize it with a loved one’s voice. And it’s easy to see how the Grandma’s Voice demo could herald an expanding cast of more adaptable celebrity voices for virtual assistants. Alexa’s current celebrity voices (Shaquille O’Neal, Melissa McCarthy, and Samuel L. Jackson) required about 60 hours of studio recording and are limited in what they can do: they can answer questions about the weather, tell jokes, tell stories, and respond to certain specific questions, but they default to the standard Alexa voice for requests outside the system’s comfort zone.

Google Assistant’s “celebrity voice cameos” by John Legend and Issa Rae (introduced in 2018 and 2019, but not currently supported) similarly combined prerecorded audio with a number of impromptu responses synthesized using WaveNet technology. The ability to develop more robust celebrity voices that can read out any text input after a short recording session could be a game changer, and could even boost stagnating smart speaker sales. (U.S. smart speaker shipments fell nearly 30% last year compared with 2020, according to research firm Omdia, including a nearly 51% drop in Amazon Alexa smart speaker shipments.)

As big tech companies continue to invest in text-to-speech, one thing is certain: it’s getting harder and harder to tell whether the voice you’re listening to comes from a human or from a human-made algorithm.

