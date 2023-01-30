



Google researchers have unveiled an AI model that converts text into music. It can generate pieces up to five minutes long.

The team published a paper summarizing their work and findings, introducing MusicLM to the world along with many audio examples that closely match their text prompts.

The researchers claim that their model outperforms previous systems in both audio quality and adherence to text descriptions.

One example is a 30-second snippet of a song generated from an input caption like this:

The main soundtrack of an arcade game. It has a catchy electric guitar riff and is fast-paced and upbeat. The music is repetitive and memorable, but contains unexpected sounds such as cymbal crashes and drum rolls. A fusion of reggaeton and electronic dance music, featuring an otherworldly and spacey sound. The music is designed to evoke the experience of getting lost in space, danceable yet mysterious and awe-inspiring. Backed by pads, sub-basslines and soft drums. The song is full of synth sounds that create a calm and adventurous atmosphere. It could be played at a festival between two songs as a build-up.

Using AI to generate music is nothing new, but until now no published tool could generate decent music from simple text prompts. According to the team behind MusicLM, that has changed.

In their paper, the researchers describe various challenges facing AI music generation. First, there is a shortage of paired audio-text data. This is unlike text-to-image machine learning, where huge datasets have contributed greatly to recent progress, they say.

For example, Stable Diffusion and OpenAI’s DALL-E have sparked both a surge of public interest in this area and ready-to-use applications.

Another challenge in AI music generation is that music is structured along the time dimension. In other words, a music track unfolds over a period of time. It is therefore much harder to capture the intent of a music track with a basic text caption than it is to caption a still image.
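To make the time dimension concrete, here is a rough back-of-the-envelope sketch of how much raw audio a single short caption has to describe. The 24 kHz mono sample rate is an illustrative assumption and does not come from this article, and the function name is hypothetical.

```python
def audio_samples(duration_s: float, sample_rate_hz: int = 24_000) -> int:
    """Count the raw mono samples in a clip.

    The default sample rate is an assumption made for illustration only.
    """
    return int(duration_s * sample_rate_hz)


# A 30-second snippet versus a full five-minute piece.
snippet = audio_samples(30)
full_piece = audio_samples(5 * 60)

print(f"30-second snippet: {snippet:,} samples")
print(f"5-minute piece:    {full_piece:,} samples")
```

Under that assumption, one caption of a few sentences has to describe hundreds of thousands of ordered values, and the count grows in step with the length of the track.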

MusicLM is the first step in overcoming these challenges, the team says.

It is a hierarchical sequence-to-sequence model for music generation that generates sequences at several levels of a song, including its overall structure, melody, and individual sounds.
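The idea of generating a song level by level can be illustrated with a toy sketch. This is not MusicLM's implementation: seeded random number generators stand in for trained models, and every function name, vocabulary size, and parameter here is a hypothetical choice made for illustration. The first stage produces a short coarse sequence (standing in for overall structure), and the second stage expands each coarse token into several fine tokens that depend on it (standing in for individual sounds).

```python
import random


def generate_coarse(prompt: str, n_steps: int, vocab: int = 8) -> list[int]:
    """Stage 1 (toy): one coarse token per time step, seeded by the prompt."""
    rng = random.Random(prompt)  # a stand-in for a text-conditioned model
    return [rng.randrange(vocab) for _ in range(n_steps)]


def generate_fine(coarse: list[int], per_step: int = 4, vocab: int = 16) -> list[int]:
    """Stage 2 (toy): expand each coarse token into several fine tokens."""
    rng = random.Random(0)
    fine: list[int] = []
    for token in coarse:
        # Conditioning: each coarse token selects its own band of fine tokens.
        base = token * vocab
        fine.extend(base + rng.randrange(vocab) for _ in range(per_step))
    return fine


def generate(prompt: str, n_steps: int = 6) -> tuple[list[int], list[int]]:
    """Run both stages in order: coarse structure first, then fine detail."""
    coarse = generate_coarse(prompt, n_steps)
    return coarse, generate_fine(coarse)


coarse_tokens, fine_tokens = generate("upbeat arcade game soundtrack")
print("coarse:", coarse_tokens)
print("fine:  ", fine_tokens)
```

The point of the split is that the short coarse sequence can fix the long-range shape of the piece, while the longer fine sequence only has to fill in local detail consistent with it.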

To learn how to do this, the model is trained on a large dataset of unlabeled music and a music captions dataset of over 5,500 examples prepared by musicians. The captions dataset has been released publicly to support future research.

The model also accepts audio input such as whistling or humming, which can help convey the melody of a song. The melody is then rendered in the style described in the text prompt.

Although the model has not yet been released, its creators have acknowledged the risk that creative content could be misappropriated if the generated songs are not sufficiently different from the source material the model learned from.

