On Thursday, Google researchers unveiled a new generative AI model called MusicLM that can create 24 kHz musical audio from text descriptions such as “a soothing violin melody backed by a distorted guitar riff.” It can also transform a hummed melody into a different musical style and output music for several minutes at a time.
MusicLM was trained on what Google calls “a large dataset of unlabeled music,” along with captions from MusicCaps, a new dataset composed of 5,521 music-text pairs. MusicCaps takes its text descriptions from human experts and its matching audio clips from Google’s AudioSet, a collection of over 2 million labeled 10-second sound clips pulled from YouTube videos.
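To make that pairing concrete, here is a minimal sketch in Python of what a single MusicCaps-style record might look like: an expert-written caption joined to a 10-second AudioSet excerpt identified by its source YouTube video. The field names are illustrative assumptions, not the dataset’s actual schema.

    from dataclasses import dataclass

    @dataclass
    class MusicCapsExample:
        # Illustrative fields only; not the dataset's actual schema.
        youtube_id: str       # source video the AudioSet clip came from
        start_seconds: float  # start of the 10-second excerpt
        caption: str          # free-text description written by a human expert

    example = MusicCapsExample(
        youtube_id="VIDEO_ID",  # placeholder, not a real ID
        start_seconds=30.0,
        caption="Slow tempo, bass-and-drums-led reggae song.",
    )
    print(example.caption)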
Generally speaking, MusicLM works in two main parts: first, it takes a sequence of audio tokens (pieces of sound) and maps them to semantic tokens (words that represent meaning) in captions during training. The second part takes user captions and/or input audio and generates acoustic tokens (pieces of sound that make up the resulting song output). The system relies on an earlier AI model called AudioLM (introduced by Google in September), along with other components such as SoundStream and MuLan.
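As a rough illustration of that two-stage flow, here is a toy Python sketch. Every function below is a hypothetical stand-in that produces fake tokens and a sine-wave “decode”; the real system uses trained models (MuLan for text conditioning, SoundStream for acoustic tokens), so this mirrors only the shape of the pipeline, not its behavior.

    import numpy as np

    SAMPLE_RATE = 24_000  # MusicLM outputs 24 kHz audio

    def semantic_tokens_from_text(caption: str, n: int = 100) -> np.ndarray:
        """Stage 1 stand-in: map a caption to semantic tokens.
        (In MusicLM this conditioning involves MuLan; here it is faked.)"""
        rng = np.random.default_rng(abs(hash(caption)) % (2**32))
        return rng.integers(0, 1024, size=n)

    def acoustic_tokens_from_semantic(semantic: np.ndarray) -> np.ndarray:
        """Stage 2 stand-in: expand semantic tokens into acoustic tokens,
        the role SoundStream codec tokens play in the real system."""
        rng = np.random.default_rng(int(semantic.sum()))
        return rng.integers(0, 1024, size=semantic.size * 4)

    def decode_to_waveform(acoustic: np.ndarray) -> np.ndarray:
        """Decoder stand-in: each token becomes a short sine burst."""
        samples_per_token = 240  # arbitrary toy value, ~10 ms per token
        t = np.arange(samples_per_token) / SAMPLE_RATE
        bursts = [np.sin(2 * np.pi * (110 + 2 * tok) * t) for tok in acoustic]
        return np.concatenate(bursts)

    caption = "A soothing violin melody backed by a distorted guitar riff."
    semantic = semantic_tokens_from_text(caption)
    audio = decode_to_waveform(acoustic_tokens_from_semantic(semantic))
    print(f"Generated {audio.size / SAMPLE_RATE:.1f} s of 24 kHz audio")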
Google claims MusicLM outperforms previous AI music generators in both audio quality and adherence to text descriptions. On the MusicLM demonstration page, Google provides many examples of the model in action, creating audio from “rich captions” that describe the feel of the music, and even vocals (which so far are gibberish). Here is one example of a rich caption they offer:
Slow tempo, bass-and-drums-led reggae song. Sustained electric guitar. High-pitched bongos with ringing tones. Vocals are relaxed with a laid-back feel, very expressive.
Google also shows off MusicLM’s “long generation” (creating five-minute music clips from a simple prompt), “story mode” (which takes a sequence of text prompts and turns it into a morphing series of musical tunes), “text and melody conditioning” (which takes humming or whistling audio input and changes it to match the style laid out in a prompt), and generating music that matches the mood of image captions.

Further down the example page, Google demonstrates MusicLM generating music featuring specific instruments (e.g., flute, cello, guitar), different genres, various levels of musician experience, locations (“escape from prison,” “gym”), time periods (“a club in the 1950s”), and more.
AI-generated music isn’t a new idea by any stretch, but AI music-generation methods of past decades often produced musical notation that was then played by hand or through a synthesizer, whereas MusicLM generates the raw audio frequencies of the music itself. Also, in December we covered Riffusion, a hobbyist AI project that can similarly create music from text descriptions, though not at high fidelity. Google references Riffusion in its MusicLM academic paper, saying that MusicLM surpasses it in quality.
In the MusicLM paper, the authors outline the potential impacts of MusicLM, including the “potential misappropriation of creative content” (i.e., copyright issues), potential biases against cultures underrepresented in the training data, and potential cultural appropriation issues. As a result, Google stresses the need for further work on tackling these risks, and it is withholding the code: “We have no plans to release models at this point.”
Google’s researchers are already looking ahead toward future improvements: “Future work may focus on lyrics generation, along with improvements to text conditioning and vocal quality. Another aspect is the modeling of higher-level song structures such as introductions, verses, and choruses. Modeling the music at a higher sample rate is an additional goal.”
It’s probably not an exaggeration to suggest that AI researchers will keep improving music-generation technology until anyone can create studio-quality music in any style just by describing it. Stay tuned for further developments.