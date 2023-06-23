



Posted by Zalán Borsos, Research Software Engineer, and Marco Tagliasacchi, Senior Staff Research Scientist, Google Research

Recent advances in generative AI have unlocked the possibility of creating new content in several domains, including text, vision, and audio. These models often rely on first converting the raw data into a compressed format as a sequence of tokens. For audio, neural audio codecs (such as SoundStream and EnCodec) can efficiently compress waveforms into a compact representation, which can be inverted to reconstruct an approximation of the original audio signal. Such a representation consists of a sequence of discrete audio tokens that capture the local properties of sounds (e.g., phonemes) and their temporal structure (e.g., prosody). Representing audio as a sequence of discrete tokens means that audio generation can be performed with Transformer-based sequence-to-sequence models. This has enabled speech continuation (e.g., AudioLM), text-to-speech (e.g., SPEAR-TTS), and general audio and music generation (e.g., AudioGen and MusicLM). Many generative audio models, including AudioLM, rely on autoregressive decoding, which produces tokens one by one. While this method achieves high acoustic quality, inference (i.e., computing an output) can be slow, especially when decoding long sequences.

To address this problem, in "SoundStorm: Efficient Parallel Audio Generation" we propose a new method for efficient, high-quality audio generation. SoundStorm addresses the problem of generating long audio token sequences by relying on two novel elements: 1) an architecture adapted to the specific nature of audio tokens as produced by the SoundStream neural codec, and 2) a decoding scheme inspired by MaskGIT, a recently proposed method for image generation, tailored to operate on audio tokens. Compared to AudioLM’s autoregressive decoding approach, SoundStorm generates tokens in parallel, which reduces inference time for long sequences by a factor of 100, and it produces audio of the same quality with more consistent voice and acoustic conditions. Furthermore, by combining SoundStorm with the text-to-semantic modeling stage of SPEAR-TTS, it is possible to synthesize high-quality, natural-sounding dialogues while controlling the spoken content (via transcripts), the speakers’ voices (via short voice prompts), and the speaker turns (via transcript annotations), as shown in the example below.

Input: Text (transcript used to drive the audio generation)

Something really interesting happened this morning. | Oh, what? | Well, I woke up as usual. | Hmm | I went downstairs to eat breakfast. | Yes | I started eating. Then 10 minutes later I realized it was midnight. | No, that’s very interesting! I didn’t sleep well last night. | Oh my god. What happened? | I don’t know. I couldn’t sleep and tossed and turned all night long. | That’s a pity. Maybe try going to bed early tonight or reading a book. | Yes, thank you for your suggestion. I hope you are right. | No problem. I hope you have a good night’s sleep.

Input: Voice prompt

Output: Voice prompt + generated voice

SoundStorm design

Our previous work on AudioLM showed that audio generation can be decomposed into two steps: 1) semantic modeling, which generates semantic tokens from either previous semantic tokens or a conditioning signal (e.g., a transcript, as in SPEAR-TTS), and 2) acoustic modeling, which generates acoustic tokens from semantic tokens. SoundStorm specifically addresses this second, acoustic modeling step, replacing slow autoregressive decoding with fast parallel decoding.

SoundStorm relies on a bidirectional attention-based Conformer, a model architecture that combines a Transformer with convolutions to capture both the local and the global structure of a sequence of tokens. Specifically, the model is trained to predict the audio tokens produced by SoundStream given a sequence of semantic tokens produced by AudioLM as input. When doing this, it is important to take into account that SoundStream represents audio using a method known as residual vector quantization (RVQ), with up to Q tokens at each time step t, as illustrated in the architecture figure below. The key intuition is that the quality of the reconstructed audio progressively improves as the number of tokens generated at each time step goes from 1 to Q.
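To make the RVQ intuition concrete, here is a toy sketch in plain Python. It is not SoundStream's quantizer: the 2-D frame, the grid-shaped codebooks, and the names `make_codebook`, `rvq_encode`, and `rvq_decode` are all assumptions made for illustration. Each level quantizes the residual left over by the previous levels, so decoding with more levels gives a smaller reconstruction error.

```python
import itertools
import math


def make_codebook(step):
    """Toy codebook (an assumption): a 5x5 grid of 2-D vectors with spacing `step`."""
    grid = [k * step for k in range(-2, 3)]
    return [list(v) for v in itertools.product(grid, grid)]


def nearest(codebook, x):
    """Index of the codeword closest to x in Euclidean distance."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], x))


def rvq_encode(x, codebooks):
    """Return one token per level; each level quantizes the remaining residual."""
    tokens, residual = [], list(x)
    for cb in codebooks:
        idx = nearest(cb, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the codewords selected by `tokens` (possibly only the first few levels)."""
    out = [0.0, 0.0]
    for idx, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out


Q = 4  # number of RVQ levels in this toy example
codebooks = [make_codebook(1.0 / 4 ** q) for q in range(Q)]  # finer grid per level
frame = [0.83, -1.37]  # stand-in for one encoded audio frame
tokens = rvq_encode(frame, codebooks)

# Reconstruction error when decoding with the first 1, 2, ..., Q tokens.
errors = [math.dist(frame, rvq_decode(tokens[:q], codebooks)) for q in range(1, Q + 1)]
print(tokens)
print([round(e, 4) for e in errors])
```

The printed errors shrink from level to level, which mirrors the statement that quality improves as the number of tokens per time step goes from 1 to Q.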

During inference, given the semantic tokens as the input conditioning signal, SoundStorm starts with all audio tokens masked and fills in the masked tokens over multiple iterations, starting with the coarse tokens at RVQ level q = 1 and proceeding level by level with finer tokens until it reaches level q = Q.
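The level-by-level filling procedure can be sketched as follows. This is a minimal illustration, not the released model: `toy_model` is a random stand-in for the Conformer, and the `MASK` sentinel, the cosine unmasking schedule, and the rule of committing the most confident predictions first are assumptions borrowed from MaskGIT-style decoding rather than details given in this post.

```python
import math
import random

MASK = -1  # sentinel for a token that has not been generated yet (assumption)


def toy_model(semantic, tokens, level, rng, vocab=1024):
    """Random stand-in for the Conformer: one (token, confidence) guess per time step.

    A real model would condition on `semantic` and on the filled entries of `tokens`.
    """
    return [(rng.randrange(vocab), rng.random()) for _ in semantic]


def decode(semantic, num_levels, iters_per_level, seed=0):
    """Fill tokens[t][q] level by level, predicting all time steps in parallel."""
    rng = random.Random(seed)
    T = len(semantic)
    tokens = [[MASK] * num_levels for _ in range(T)]
    passes = 0
    for q in range(num_levels):
        n_iters = iters_per_level[q]
        for i in range(n_iters):
            guesses = toy_model(semantic, tokens, q, rng)  # one parallel forward pass
            passes += 1
            masked = [t for t in range(T) if tokens[t][q] == MASK]
            # Cosine schedule: how many level-q positions stay masked after this pass.
            frac = math.cos(math.pi / 2 * (i + 1) / n_iters)
            n_stay_masked = int(T * frac) if i + 1 < n_iters else 0
            n_commit = len(masked) - n_stay_masked
            # Commit the most confident guesses; the rest are re-predicted next pass.
            masked.sort(key=lambda t: guesses[t][1], reverse=True)
            for t in masked[:n_commit]:
                tokens[t][q] = guesses[t][0]
    return tokens, passes


semantic = list(range(50))  # 50 time steps of placeholder semantic tokens
tokens, passes = decode(semantic, num_levels=4, iters_per_level=[8, 1, 1, 1])
print(passes)
```

In this sketch the number of forward passes depends only on the iteration counts per level, not on the sequence length, which is what makes the scheme attractive for long outputs.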

SoundStorm has two key aspects that enable fast generation: 1) tokens are predicted in parallel during a single iteration within an RVQ level, and 2) the model architecture is designed so that its complexity is only mildly affected by the number of levels Q. To support this inference scheme, training uses a carefully designed masking scheme that mimics the iterative process used during inference.
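The post does not spell out the training mask, so the following is only one plausible sketch of a scheme that mimics level-by-level inference. The function name `mask_for_training`, the `MASK` sentinel, and the cosine-distributed masking ratio are assumptions. For a randomly chosen level q, a random subset of the level-q tokens is hidden, every finer level is hidden entirely, and coarser levels stay visible, which matches what the model sees at some point during decoding.

```python
import math
import random

MASK = -1  # sentinel for a hidden token (assumption)


def mask_for_training(tokens, rng):
    """Build one masked training input from ground-truth tokens[t][q].

    Returns (masked_tokens, level, target_positions), where the targets are the
    time steps at `level` that the model would be asked to predict.
    """
    T, Q = len(tokens), len(tokens[0])
    q = rng.randrange(Q)  # RVQ level trained in this example
    ratio = math.cos(rng.uniform(0.0, math.pi / 2))  # masking ratio in [0, 1]
    masked = [row[:] for row in tokens]
    targets = []
    for t in range(T):
        if rng.random() < ratio:
            masked[t][q] = MASK  # hidden at the current level: a prediction target
            targets.append(t)
        for finer in range(q + 1, Q):
            masked[t][finer] = MASK  # finer levels are not yet generated at inference
    return masked, q, targets


rng = random.Random(0)
truth = [[t * 10 + q for q in range(4)] for t in range(20)]  # placeholder tokens
masked, level, targets = mask_for_training(truth, rng)
print(level, len(targets))
```

A training loss would then be computed only on the target positions at the sampled level, so the model learns the same conditional prediction it has to make at each decoding iteration.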

SoundStorm model architecture. T denotes the number of time steps and Q the number of RVQ levels used by SoundStream. The semantic tokens used as conditioning are time-aligned with the SoundStream frames.

Measuring SoundStorm performance

We demonstrate that SoundStorm matches the quality of AudioLM’s acoustic generator while replacing both AudioLM’s stage two (coarse acoustic model) and stage three (fine acoustic model). Moreover, SoundStorm generates audio 100 times faster than AudioLM’s hierarchical autoregressive acoustic generator (upper half of the figure below), with matching quality and improved consistency in terms of speaker identity and acoustic conditions (lower half of the figure below).
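A rough way to see where the speed-up comes from is to count sequential model calls. The numbers below (30 seconds of audio, 50 frames per second, 12 RVQ levels, 16 iterations at the first level and one at each finer level) are illustrative assumptions, not figures from this post, and the sketch assumes an autoregressive decoder that emits one token per step.

```python
def autoregressive_steps(num_frames, num_levels):
    """Sequential steps if every token is generated one after another."""
    return num_frames * num_levels


def parallel_passes(iters_per_level):
    """Forward passes if each iteration predicts a whole RVQ level in parallel."""
    return sum(iters_per_level)


seconds, frames_per_second, num_levels = 30, 50, 12  # illustrative values
num_frames = seconds * frames_per_second
iters_per_level = [16] + [1] * (num_levels - 1)  # hypothetical iteration budget

ar_steps = autoregressive_steps(num_frames, num_levels)
par_passes = parallel_passes(iters_per_level)
print(ar_steps, par_passes)  # 18000 27
```

The ratio of these counts is not the wall-clock speed-up, because each parallel pass processes the whole sequence and is therefore more expensive than a single autoregressive step. It only shows why the number of sequential calls stops growing with the length of the audio.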

Runtimes of SoundStream decoding, SoundStorm, and the different stages of AudioLM on a TPU-v4.

Acoustic consistency between the prompt and the generated audio. The shaded areas represent the interquartile range.

Safety and risk mitigation

We recognize that the audio samples generated by the model may be influenced by unfair biases present in the training data, for example in terms of the accents and voice characteristics represented. In the samples we generated, we demonstrate that speaker characteristics can be reliably and responsibly controlled through prompting, with the goal of avoiding unfair biases. A thorough analysis of the training data and its limitations is an area of future work, in line with our Responsible AI Principles.

Additionally, the ability to mimic a voice can have numerous malicious applications, such as bypassing biometric identification and using the model for impersonation. It is therefore crucial to put safeguards in place against potential misuse. To this end, we have verified that the audio generated by SoundStorm remains detectable by a dedicated classifier, using the same classifier as described in our original AudioLM paper. Hence, as a component of a larger system, we believe SoundStorm is unlikely to introduce risks beyond those discussed in our earlier papers on AudioLM and SPEAR-TTS. At the same time, relaxing the memory and computational requirements of AudioLM would make research in the field of audio generation accessible to a wider community. In the future, we plan to explore other approaches for detecting synthesized speech, for example with the help of audio watermarking, so that any potential product use of this technology strictly follows our Responsible AI Principles.

Conclusion

We have introduced SoundStorm, a model that efficiently synthesizes high-quality audio from discrete conditioning tokens. Compared to AudioLM’s acoustic generator, SoundStorm is two orders of magnitude faster and achieves higher temporal consistency when generating long audio samples. By combining a text-to-semantic token model similar to SPEAR-TTS with SoundStorm, we can scale text-to-speech synthesis to longer contexts and generate natural dialogues with multiple speaker turns, controlling both the speakers’ voices and the generated content. SoundStorm is not limited to speech generation. MusicLM, for example, uses SoundStorm to efficiently synthesize longer outputs (as seen at I/O).

Acknowledgments

The work described here was authored by Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour and Marco Tagliasacchi. We are grateful for all the discussions and feedback on this work that we received from our colleagues at Google.

Source: https://ai.googleblog.com/2023/06/soundstorm-efficient-parallel-audio.html
