Google has developed an AI model, MusicLM, that can compose music from text inputs, much as ChatGPT can turn a text prompt into a story and DALL-E can generate images from written prompts.
The AI model can quickly turn a user’s written words into music lasting several minutes, or render a hummed melody in other instruments. The company has published its findings on GitHub, along with a number of samples created with the aid of the model. Alongside these samples, Google released MusicCaps, a dataset of 5.5k music-text pairs with rich text descriptions written by human experts.
From the paper: “We introduce MusicLM, a model generating high-fidelity music from text descriptions such as ‘a calming violin melody backed by a distorted guitar riff’. MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes.”
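The real system relies on large learned models at every stage (a joint music-text embedding, a semantic stage, and an acoustic stage feeding a neural audio codec), but the hierarchical, coarse-to-fine shape of such a pipeline can be sketched in a few lines. The toy below is illustrative only: the token rates are placeholders I chose for the example, and random tokens stand in for real model outputs; only the 24 kHz output rate comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

SEMANTIC_RATE = 25    # toy rate: coarse tokens per second (assumption)
ACOUSTIC_RATE = 75    # toy rate: fine tokens per second (assumption)
SAMPLE_RATE = 24_000  # MusicLM generates audio at 24 kHz

def semantic_stage(text_embedding, seconds):
    # Stage 1: coarse, long-horizon tokens conditioned on the text.
    # A real model would run an autoregressive transformer here.
    return rng.integers(0, 1024, size=seconds * SEMANTIC_RATE)

def acoustic_stage(semantic_tokens, seconds):
    # Stage 2: finer-grained acoustic tokens conditioned on the
    # coarse semantic tokens, capturing timbre and detail.
    return rng.integers(0, 1024, size=seconds * ACOUSTIC_RATE)

def decode_to_audio(acoustic_tokens, seconds):
    # A neural audio codec decoder would reconstruct the waveform;
    # noise stands in for it here.
    return rng.standard_normal(seconds * SAMPLE_RATE)

def generate(text, seconds=10):
    emb = np.zeros(128)  # stand-in for a learned text embedding
    sem = semantic_stage(emb, seconds)
    aco = acoustic_stage(sem, seconds)
    return decode_to_audio(aco, seconds)
```

The point of the hierarchy is that each stage works at a higher temporal resolution than the one before it, which is what lets the model stay coherent over minutes while still producing detailed 24 kHz audio.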
The Google team demonstrates that the system can build on existing melodies, whether hummed, sung, whistled, or played on an instrument. In addition, MusicLM can turn a series of written descriptions into a musical “story” or narrative up to several minutes long, well suited to something like a film score.
The demo site also features examples of the model’s output when asked to generate 10-second clips of instruments like the cello or maracas, eight-second clips of a certain genre, music that would fit a prison escape, and even what a beginner piano player would sound like compared to an advanced player. The model interprets more abstract prompts as well, turning phrases like “futuristic club” and “accordion death metal” into music.
MusicLM can even mimic human singing, and while it sounds fairly accurate in terms of pitch and volume, there’s still something off about it: the vocals have a grainy or staticky quality.
The paper compares MusicLM against competing text-to-music systems such as Riffusion, which uses the AI image-generation engine Stable Diffusion to turn text prompts into spectrograms that are then converted into audio. Based on listener studies and on its ability to take in audio and follow a melody, the paper claims that MusicLM outperforms these systems.
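Spectrogram-based systems face a practical problem: a generated spectrogram image encodes only magnitudes, so the phase has to be reconstructed before anything can be played. The classic technique for this is Griffin-Lim, which alternates between the time and frequency domains while re-imposing the known magnitudes. Below is a minimal numpy-only sketch; the FFT size, hop length, and iteration count are illustrative choices, not parameters from any particular system.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform with a Hann window."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(spec, n_fft=512, hop=128):
    """Inverse STFT via windowed overlap-add."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(spec):
        s = i * hop
        out[s:s + n_fft] += np.fft.irfft(frame, n_fft) * win
        norm[s:s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=32, n_fft=512, hop=128):
    """Iteratively estimate phase for a magnitude-only spectrogram."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Back to audio with the current phase guess...
        audio = istft(magnitude * phase, n_fft, hop)
        # ...then keep only the phase of the re-analyzed signal.
        phase = np.exp(1j * np.angle(stft(audio, n_fft, hop)))
    return istft(magnitude * phase, n_fft, hop)
```

Each iteration snaps the signal back to the target magnitudes while letting the phase settle toward something self-consistent, which is why even a handful of iterations recovers a recognizable waveform from a magnitude spectrogram.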
The final bit is arguably the most impressive demonstration the researchers have made. Here, you can listen to the input audio, in which a person hums or whistles a melody, and then hear how the model renders that melody in different musical styles, such as an electronic synth lead, a string quartet, or a guitar solo. Listening to the examples, I can attest that it performs admirably.