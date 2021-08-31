



Posted by Ye Jia, Software Engineer, and Julie Cattiau, Product Manager, Google Research

On June 2, 2021, Major League Baseball in the United States celebrated Lou Gehrig Day, commemorating both 1925, when Lou Gehrig became the Yankees’ first baseman, and 1941, when he died of amyotrophic lateral sclerosis (ALS, also known as Lou Gehrig’s disease) at the age of 37. ALS is a progressive neurodegenerative disease that affects motor neurons, which connect the brain to the muscles throughout the body and govern muscle control and voluntary movement. When voluntary muscle control is affected, people can lose the ability to speak, eat, move, and breathe.

In honor of Lou Gehrig, former NFL player and ALS advocate Steve Gleason, who lost his ability to speak to ALS, recited Gehrig’s famous “luckiest man” speech at the June 2 event using a recreation of his voice generated by a machine learning (ML) model. Gleason’s voice recreation was developed in collaboration with Google’s Project Euphonia, which aims to enable people whose speaking ability has deteriorated due to ALS to communicate better in their own voices.

Steve Gleason, who lost his voice to ALS, worked with Google’s Project Euphonia to make a speech in his own voice in honor of Lou Gehrig. Part of Gleason’s speech was broadcast at stadiums nationwide four times on June 2, 2021.

Today, we describe PnG NAT, the model adopted by Project Euphonia to recreate Steve Gleason’s voice. PnG NAT is a new text-to-speech synthesis (TTS) model that merges two state-of-the-art technologies, PnG BERT and Non-Attentive Tacotron (NAT), into a single model. It demonstrates significantly better quality and fluency than previous technologies and represents a promising approach that can be extended to a wider range of users.

Recreating a Voice

Non-Attentive Tacotron (NAT) is the successor to Tacotron 2, a sequence-to-sequence neural TTS model proposed in 2017. Tacotron 2 used an attention module to connect the input text sequence and the output speech spectrogram frame sequence, so that the model knows which part of the text to attend to when generating each time step of the synthesized speech spectrogram. Tacotron 2 was the first TTS model to synthesize speech that sounds as natural as a person speaking. However, extensive experimentation has shown that the inherent flexibility of the attention mechanism leaves the model prone to occasional robustness issues such as babbling, repeating, or skipping parts of the text.

NAT improves on Tacotron 2 by replacing the attention module with a duration-based upsampler. This upsampler predicts the duration of each input phoneme and upsamples the encoded phoneme representation so that the output length corresponds to the length of the predicted speech spectrogram. This change resolves the robustness issues and also improves the naturalness of the synthesized speech. It further allows precise control of the duration of each phoneme in the input text while maintaining highly natural synthesis quality. Since recordings of people with ALS often exhibit disfluent speech, this phoneme-by-phoneme duration control is key to achieving fluency in the recreated voice.
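The upsampling step described above can be sketched in a few lines. This is only a simplified illustration, not the NAT implementation: the real upsampler is a learned component, and plain repetition of each phoneme vector is swapped in here to show how predicted durations determine the output length. The function name, the toy vectors, and the frame counts are all hypothetical.

```python
from typing import List, Sequence


def upsample_by_duration(
    encoded: Sequence[Sequence[float]],
    durations_frames: Sequence[int],
) -> List[List[float]]:
    """Repeat each encoded phoneme vector for its predicted number of frames.

    Simplifying assumption: plain repetition stands in for the learned
    upsampler, and durations are already rounded to whole frames.
    """
    if len(encoded) != len(durations_frames):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for vector, n_frames in zip(encoded, durations_frames):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # One copy of the phoneme vector per output spectrogram frame.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Hypothetical toy input: three phonemes with 2-dimensional encodings.
encoded = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # predicted frames per phoneme (made-up values)
frames = upsample_by_duration(encoded, durations)
print(len(frames))  # 6, the sum of the durations
```

Because the output length is fixed by the durations rather than by attention, every phoneme is visited exactly once and in order, which is why skipping and repetition cannot occur in this scheme.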

Non-Attentive Tacotron (NAT) model.

While NAT addresses the robustness issues and enables precise duration control in neural TTS, we build on it to further improve the natural language understanding of the TTS input. For this, we apply PnG BERT, which uses an approach similar to BERT but is designed specifically for TTS. It is pre-trained with self-supervision on both the phoneme and grapheme representations of the same content from a large text corpus, and is then used as the encoder of the TTS model. This greatly improves the prosody and pronunciation of the synthesized speech, especially in difficult cases.

For example, consider the following audio synthesized from a regular NAT model that takes only phonemes as input.

By comparison, the audio synthesized from PnG NAT with the same input text contains an additional pause that makes the meaning clearer.

The input text for both models is: “To cancel the payment, press one; or to continue, two.” Note the different pause lengths before the final “two” in the two versions. The word “two” in the version output by the regular NAT model can be confused with “too”. Since “too” and “two” are pronounced identically (and therefore have the same phoneme representation), the regular NAT model cannot tell which is appropriate and assumes it is the word that more commonly follows a comma, “too”. In contrast, the PnG NAT model takes graphemes in addition to phonemes as input, so it can tell the difference more easily and places a more appropriate pause.
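The ambiguity can be made concrete with a toy example. The sketch below uses a made-up three-word lexicon and a hypothetical input layout (phonemes, a separator token, then graphemes, with whole words standing in for grapheme tokens); it only illustrates why a phoneme-only input cannot separate “too” from “two”, and is not the actual PnG BERT input format.

```python
from typing import Dict, List

# Hypothetical toy pronunciation lexicon; "too" and "two" are homophones.
TOY_LEXICON: Dict[str, List[str]] = {
    "or": ["AO", "R"],
    "too": ["T", "UW"],
    "two": ["T", "UW"],
}


def phoneme_only_input(words: List[str]) -> List[str]:
    """Input as seen by a model that takes only phonemes."""
    tokens: List[str] = []
    for word in words:
        tokens.extend(TOY_LEXICON[word])
    return tokens


def phoneme_plus_grapheme_input(words: List[str]) -> List[str]:
    """Phonemes followed by graphemes (assumed layout, for illustration)."""
    return phoneme_only_input(words) + ["[SEP]"] + list(words)


with_too = ["or", "too"]
with_two = ["or", "two"]

# Phoneme-only inputs are identical, so the model cannot tell them apart.
print(phoneme_only_input(with_too) == phoneme_only_input(with_two))  # True

# Adding graphemes makes the two inputs distinguishable.
print(
    phoneme_plus_grapheme_input(with_too)
    == phoneme_plus_grapheme_input(with_two)
)  # False
```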

The PnG NAT model integrates the pre-trained PnG BERT model into the NAT model as its encoder. The hidden representations output by the encoder are used by NAT to predict the duration of each phoneme and are then upsampled to match the length of the audio spectrogram, as outlined above. In the final step, a non-attentive decoder converts the upsampled hidden representations into an audio spectrogram, which is finally converted into an audio waveform by a neural vocoder.

PnG BERT and the pre-training objectives. Yellow boxes represent phonemes and pink boxes represent graphemes.

PnG NAT: PnG BERT replaces the original encoder in the NAT model. The random masking used for masked language model (MLM) pre-training is removed.

To recreate Steve Gleason’s voice, we first trained a PnG NAT model on recordings from 31 professional speakers, and then fine-tuned it on 30 minutes of Gleason’s recordings. Because these latter recordings were made after he was diagnosed with ALS, they show signs of slurring. The fine-tuned model was able to synthesize speech that closely resembles these recordings. However, because the symptoms of ALS were already present in Gleason’s speech, the synthesized speech exhibited similar disfluency.

To mitigate this, we took advantage of NAT’s phoneme duration control as well as the model trained on professional speakers. We first predicted the duration of each phoneme for both a professional speaker and Gleason, and then used the geometric mean of the two durations for each phoneme to guide the NAT output. As a result, the model can speak in Gleason’s voice, but more fluently than in the original recordings.
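The duration blending described above amounts to a per-phoneme geometric mean. A minimal sketch follows, assuming the two models have already produced one duration per phoneme for the same text; the function name and the duration values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from typing import List, Sequence


def blend_durations(
    professional: Sequence[float],
    target: Sequence[float],
) -> List[float]:
    """Per-phoneme geometric mean of two predicted duration sequences."""
    if len(professional) != len(target):
        raise ValueError("both sequences must cover the same phonemes")
    if any(d < 0 for d in list(professional) + list(target)):
        raise ValueError("durations must be non-negative")
    return [math.sqrt(p * t) for p, t in zip(professional, target)]


# Made-up durations in seconds for three phonemes of the same sentence.
professional = [0.08, 0.10, 0.05]  # hypothetical professional-speaker prediction
gleason = [0.32, 0.10, 0.20]       # hypothetical fine-tuned-model prediction

blended = blend_durations(professional, gleason)
for p, g, b in zip(professional, gleason, blended):
    print(f"professional={p:.2f}s  fine-tuned={g:.2f}s  blended={b:.2f}s")
```

For the first phoneme, the blended value is sqrt(0.08 × 0.32) = 0.16 s, which lies between the two predictions but closer to the shorter one than an arithmetic mean (0.20 s) would be. The blended durations would then be passed to the upsampler in place of the fine-tuned model’s own predictions.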

This is the full version of Lou Gehrig’s speech synthesized with Gleason’s voice.

Beyond recreating voices for people with ALS, PnG NAT also powers voices for a variety of customers through Google Cloud Custom Voice.

Project Euphonia

Millions of people around the world have neurological conditions that can affect speech, such as ALS, cerebral palsy, and Down syndrome. They can be hard to understand, which can make face-to-face communication difficult. Voice-activated technology can be frustrating too, because it doesn’t always work reliably. Project Euphonia is a Google Research initiative focused on helping people with impaired speech be better understood. The team is researching ways to improve speech recognition for individuals with speech impairments (see the recent blog post and the segment from the TODAY show), as well as customized text-to-speech technology (see the Age of AI documentary featuring former NFL player Tim Shaw).

Acknowledgments

Many people from the Google Research, Google Cloud and Consumer Apps, and Google Accessibility teams contributed to this project and event, including Michael Brenner, Bob MacDonald, Heiga Zen, Yu Zhang, Jonathan Shen, Isaac Elias, Yonghui Wu, Anne Keck, Danielle Notaro, Kevin Hogan, Zack Kaplan, KR Liu, Kyndra Price, and Zoe Ortiz.

