Date: 2023-03-08



An overview of our approach. Training is divided into three stages. (i) In the first stage, we train the Conformer backbone on a large unlabeled speech dataset, optimizing for the BEST-RQ objective. (ii) We continue training this speech representation learning model while optimizing for multiple objectives: the BEST-RQ objective on unlabeled speech; the modality matching, supervised ASR and duration modeling losses on paired speech and transcript data; and the text reconstruction objective with an RNN-T decoder on unlabeled text. (iii) In the third stage, we fine-tune this pre-trained encoder on an ASR or AST task. Credit: arXiv (2023). DOI: 10.48550/arxiv.2303.01037

In November, Google announced that it was embarking on an initiative to develop a machine learning model that can recognize and translate the 1,000 most spoken languages in the world. Over the past few months, the company has been working toward that goal and has published a blog entry written by members of the team working on the project. The team has also published a paper describing the Universal Speech Model (USM) on the arXiv preprint server.

The updates Google provides are part of a more overarching goal: to create a language translator, built on automatic speech recognition (ASR), that can translate any language in the world on demand. To that end, the company has chosen to temporarily limit (to 1,000) the number of languages it intends to support, because so few people speak the less common languages. Such rare languages lack datasets for training.

As part of the announcement, Google outlined the first steps toward that goal, describing USM as a family of speech models trained on 12 million hours of recorded speech across more than 300 languages. The company's USM is already being used to translate closed captions on YouTube. The announcement also outlines the generic model behind each member of the family.

Google explains that the models are created using a training "pipeline" that draws on three different types of datasets: unpaired audio, unpaired text, and paired ASR data. The team also notes that it uses a Conformer model to handle the 2 billion parameters the project is expected to require, and that training proceeds in three main steps: unsupervised pre-training, multi-objective supervised pre-training, and supervised ASR training. In the end, two types of models are produced: a pre-trained model and an ASR model.
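For readers who find it easier to follow as a data structure, the three training steps and three dataset types described above can be laid out in a few lines of Python. This is only an illustrative sketch: the `Stage` class, the `PIPELINE` tuple and the `stages_using` helper are hypothetical names invented here, and the assignment of datasets and objectives to each stage is a reading of the figure caption, not code from Google.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage: its name, the data it reads, and what it optimizes."""
    name: str
    datasets: Tuple[str, ...]
    objectives: Tuple[str, ...]


# Hypothetical summary of the three stages, based on the figure caption.
PIPELINE = (
    Stage(
        name="unsupervised pre-training",
        datasets=("unpaired audio",),
        objectives=("BEST-RQ",),
    ),
    Stage(
        name="multi-objective supervised pre-training",
        datasets=("unpaired audio", "unpaired text", "paired ASR data"),
        objectives=(
            "BEST-RQ",
            "modality matching",
            "supervised ASR",
            "duration modeling",
            "text reconstruction",
        ),
    ),
    Stage(
        # Assumption: only the ASR case of fine-tuning is modeled here.
        name="supervised ASR training",
        datasets=("paired ASR data",),
        objectives=("ASR",),
    ),
)


def stages_using(dataset: str) -> List[str]:
    """Return the names of the stages that read the given dataset type."""
    return [stage.name for stage in PIPELINE if dataset in stage.datasets]


if __name__ == "__main__":
    for number, stage in enumerate(PIPELINE, start=1):
        print(f"Stage {number}: {stage.name}")
        print(f"  data:       {', '.join(stage.datasets)}")
        print(f"  objectives: {', '.join(stage.objectives)}")
```

Calling `stages_using("unpaired text")` on this sketch returns only the second stage, which reflects the point that text-only data enters the pipeline solely during multi-objective pre-training.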

Google further claims that in its current state, USM performs as well as or better than Whisper, a general-purpose speech recognition model created by OpenAI. In addition to using USM for YouTube, Google is expected to combine the model with other AI applications, including augmented reality devices.

Further information: Yu Zhang et al., Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages, arXiv (2023). DOI: 10.48550/arxiv.2303.01037

Journal information: arXiv

