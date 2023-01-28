



Researchers at Google have created an AI that can generate minutes-long pieces of music from text prompts, much as systems like DALL-E generate images from written prompts, and that can even translate a whistled or hummed melody into other instruments (via TechCrunch). The model is called MusicLM, and although you can’t try it yourself, the company has uploaded a number of samples created with it.

The examples are impressive. There are 30-second snippets that sound like real songs, created from paragraph-length descriptions that prescribe a genre, a mood, and even specific instruments, as well as five-minute pieces generated from one or two words like “melodic techno.” Probably my favorite is the “story mode” demo, in which the model is basically given a script to morph between prompts. For example, the prompt:

Electronic songs played in video games (0:00-0:15)

Meditation song by the river (0:15-0:30)

Fire (0:30-0:45)

Fireworks (0:45-0:60)

The result is audio that you can listen to on the demo site.

It might not be for everyone, but I could totally see this being composed by a human (I listened to it dozens of times on a loop while writing this article). The demo site also provides examples of what the model produces when asked to generate 10-second clips of instruments such as a cello or maracas (the latter is an example where the system does a relatively poor job), clips of certain genres, music suitable for a prison escape, and even what a novice piano player sounds like compared with an advanced one. It also includes interpretations of phrases such as “futuristic club” and “accordion death metal.”

MusicLM can also simulate human vocals. It seems to get the tone and overall sound of voices right, but there is a quality to them that is definitely off. The best way I can describe it is that they sound grainy or staticky. That quality isn’t quite as clear in the example above, but I think this one shows it pretty well.

By the way, that’s the result of a request to make music that would play at a gym. You may also have noticed that the lyrics are nonsense, but in a way you might not necessarily catch if you’re not paying attention, a bit like hearing someone sing in Simlish or listening to a song that’s supposed to sound like English but isn’t.

I won’t pretend to know how Google achieved these results, but if you’re the type of person who can understand the technical details, there’s a published research paper outlining the approach.

Diagram illustrating the “hierarchical sequence-to-sequence modeling task” that the researchers use along with AudioLM, another Google project. Chart: Google

AI-generated music has a long history dating back decades. There are systems that have been credited with composing pop songs, copying Bach better than a human could in the ’90s, and accompanying live performances. One recent version uses the AI image generation engine Stable Diffusion to turn text prompts into spectrograms, which are then turned into music. According to the paper, MusicLM outperforms other systems in terms of its quality and adherence to captions, as well as its ability to take in audio and copy the melody.

That last part is probably one of the coolest demos the researchers have put out. On this site, you can play the input audio of someone humming or whistling a tune and hear how the model reproduces it as an electronic synth lead, string quartet, guitar solo, etc. From the examples I’ve heard, it does the task very well.

As with other forays into this kind of AI, Google is being much more cautious with MusicLM than some of its peers have been with similar technology. The paper concludes that the researchers have no plans to release the model at this time, citing the risks of potential misappropriation of creative content (read: plagiarism) and potential cultural appropriation or misrepresentation.

It’s always possible that this technology will show up in one of Google’s fun music experiments at some point, but for now the only people who can make use of the research are others building music AI systems. Google says it is publicly releasing a dataset containing about 5,500 music-text pairs, which could be useful for training and evaluating other music AIs.

