



Google recently introduced a new model, SoundStorm, in the paper “SoundStorm: Efficient Parallel Audio Generation.” It offers a new approach to efficient, high-quality audio generation.

Read the full paper here.

SoundStorm tackles the problem of generating long audio token sequences with two new components:

An architecture tailored to the audio tokens produced by the SoundStream neural codec, and a decoding scheme inspired by MaskGIT, a recently introduced image generation method, that is adapted to work on audio tokens.

Compared to AudioLM’s autoregressive decoding approach, SoundStorm generates tokens in parallel, reducing inference time for long sequences by a factor of 100. In addition, SoundStorm improves the consistency of the voice and acoustic conditions while preserving audio quality.

Furthermore, the paper demonstrates that combining SoundStorm with the text-to-semantic modeling stage of SPEAR-TTS makes it possible to synthesize high-quality, natural-sounding dialogues. You can control what is said through transcripts, the speakers’ voices through short voice prompts, and the speaker turns through transcript annotations. The examples provided demonstrate SoundStorm’s capabilities and its integration with SPEAR-TTS to produce compelling dialogue.

What’s under the hood

In previous work on AudioLM, researchers demonstrated a two-step process for generating audio. The first step, semantic modeling, generates semantic tokens from previous semantic tokens or from a conditioning signal. The second step, acoustic modeling, generates acoustic tokens from the semantic tokens.

With SoundStorm, the researchers specifically addressed the acoustic modeling step, aiming to replace slow autoregressive decoding with faster parallel decoding.

SoundStorm uses a Conformer with bidirectional attention, a model architecture that combines convolutions with a Transformer so that it can capture both the local and the global structure of a token sequence. The model takes the semantic tokens produced by AudioLM as input and is trained to predict the acoustic tokens produced by SoundStream. SoundStream uses Residual Vector Quantization (RVQ), in which up to Q tokens represent the audio at each time step. The reconstructed audio quality improves progressively as the number of tokens per step grows from 1 to Q.
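To make the RVQ idea concrete, here is a minimal sketch in plain Python. It is not SoundStream itself: the codebooks are random toy values rather than learned ones, and the function names (make_codebooks, rvq_encode, rvq_decode, reconstruction_errors) are made up for illustration. Each level quantizes the residual left by the previous levels, so decoding with more levels brings the reconstruction closer to the input.

```python
import random


def make_codebooks(num_levels, size, dim, seed=0):
    """Build toy codebooks (random, not learned) whose entries shrink per level."""
    rng = random.Random(seed)
    codebooks = []
    for level in range(num_levels):
        scale = 0.5 ** level
        entries = [[rng.uniform(-scale, scale) for _ in range(dim)]
                   for _ in range(size - 1)]
        # Assumption for the toy: a zero entry, so a level can never add error.
        entries.append([0.0] * dim)
        codebooks.append(entries)
    return codebooks


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Return one token per level; each level codes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the chosen entries of the first len(tokens) levels."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def reconstruction_errors(vec, codebooks):
    """Squared error when decoding with the first 1, 2, ..., Q levels."""
    tokens = rvq_encode(vec, codebooks)
    errors = []
    for q in range(1, len(codebooks) + 1):
        approx = rvq_decode(tokens[:q], codebooks)
        errors.append(sum((a - v) ** 2 for a, v in zip(approx, vec)))
    return errors


codebooks = make_codebooks(num_levels=4, size=16, dim=3)
print(reconstruction_errors([0.3, -0.2, 0.7], codebooks))
```

The printed list has one error value per number of levels used, and it never increases from left to right, which mirrors the statement that quality improves as more of the Q tokens are used.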

During inference, SoundStorm starts with all acoustic tokens masked out and fills in the masked tokens over multiple iterations. It begins with the coarse tokens at RVQ level q = 1 and continues with finer tokens level by level until it reaches level q = Q. This approach enables fast audio generation.
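The level-by-level filling can be sketched as the loop below. This is only an illustration under several assumptions: toy_predict is a hypothetical stand-in that returns random tokens and confidences instead of calling a trained model, conditioning on semantic tokens is omitted, and the rule for deciding which positions to commit in each iteration (keep the most confident proposals, following a cosine schedule) is borrowed from MaskGIT-style decoding rather than taken from this article.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decoded yet


def toy_predict(tokens, level, rng, vocab_size=1024):
    """Hypothetical stand-in for the trained model.

    Proposes a (token, confidence) pair for every time step of `level`
    in one call, playing the role of a single parallel forward pass.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens[level]]


def parallel_decode(num_steps, num_levels, iters_per_level, seed=0):
    """Fill a num_levels x num_steps token grid, coarse level first."""
    rng = random.Random(seed)
    tokens = [[MASK] * num_steps for _ in range(num_levels)]
    passes = 0
    for level in range(num_levels):
        n_iter = iters_per_level[level]
        for it in range(1, n_iter + 1):
            proposals = toy_predict(tokens, level, rng)
            passes += 1
            masked = [t for t in range(num_steps) if tokens[level][t] == MASK]
            # Assumed cosine schedule: how many positions stay masked afterwards.
            if it == n_iter:
                still_masked = 0
            else:
                still_masked = int(num_steps * math.cos(math.pi / 2 * it / n_iter))
            n_fill = max(0, len(masked) - still_masked)
            # Commit the most confident proposals; the rest are retried later.
            masked.sort(key=lambda t: proposals[t][1], reverse=True)
            for t in masked[:n_fill]:
                tokens[level][t] = proposals[t][0]
    return tokens, passes


tokens, passes = parallel_decode(num_steps=50, num_levels=4,
                                 iters_per_level=[8, 1, 1, 1])
print(passes)  # 11 forward passes in this toy configuration
```

The number of model calls depends only on the iteration counts per level, not on the number of time steps, which is where the speed advantage over token-by-token decoding comes from.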

Two key aspects of SoundStorm contribute to its fast generation. First, within each iteration at a given RVQ level, tokens are predicted in parallel. Second, the model architecture is designed so that the computational complexity is only mildly affected by the number of levels Q. To support this inference scheme, a carefully designed masking scheme is used during training to mimic the iterative process used at inference.
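A rough count of sequential model calls shows why parallel prediction matters. The figures used here (a 30-second clip, 50 token frames per second, Q = 12 levels, 16 iterations at the first level and one at each later level) are illustrative assumptions, not values stated in this article, and a count of model calls is not a measured speedup because the cost of one call differs between the two approaches.

```python
def autoregressive_steps(seconds, frames_per_second, num_levels):
    """Sequential steps when every token is generated one by one."""
    return seconds * frames_per_second * num_levels


def parallel_passes(iters_per_level):
    """Forward passes when each iteration fills a whole level in parallel."""
    return sum(iters_per_level)


ar = autoregressive_steps(seconds=30, frames_per_second=50, num_levels=12)
par = parallel_passes([16] + [1] * 11)
print(ar, par)  # 18000 27
```

The autoregressive count grows with both the clip length and Q, while the parallel count grows only with the number of iterations chosen per level.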

Compared to AudioLM, SoundStorm is two orders of magnitude faster. It also provides better consistency over time when generating long audio samples. Combining SoundStorm with a text-to-semantic token model similar to SPEAR-TTS extends text-to-speech synthesis to longer contexts.

In addition, this combination can generate natural dialogues with multiple speakers taking turns, giving you control over both the speakers’ voices and the generated content. Note that SoundStorm is not limited to speech synthesis. For example, MusicLM uses SoundStorm to synthesize longer musical outputs efficiently.

Why is this important?

SoundStorm addresses the slow inference of autoregressive decoding when generating long sequences of audio tokens. Autoregressive decoding delivers high acoustic quality, but because tokens are generated one by one, inference is computationally expensive, especially for long sequences. SoundStorm meets this challenge with an architecture adapted to audio tokens and a decoding scheme inspired by MaskGIT, which together enable parallel generation of tokens. This allows SoundStorm to significantly reduce inference time and make audio generation more efficient without compromising the quality or consistency of the generated audio.

Many generative audio models, including AudioLM, rely on autoregressive decoding, so they face the same trade-off: high acoustic quality, but slow generation on long sequences.

