AI Music Generation: MusicGen

Researchers at Meta AI recently released a paper, “Simple and Controllable Music Generation”, along with the accompanying MusicGen model. They highlight that it “is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models”. In practice, this means music generation can now be completed in fewer steps, and it keeps getting more efficient as progress is made across different types of models.
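To make “token interleaving patterns” a little more concrete, here is a minimal sketch of the “delay” pattern, one of the interleaving patterns the paper describes. Audio is tokenized into several parallel codebook streams, and each stream is shifted by one more step than the previous one so a single transformer can predict all of them in one pass. The function names, the PAD placeholder, and the toy token values are my own assumptions for illustration, not the authors' implementation.

```python
from typing import List, Optional

PAD = None  # placeholder for positions where a codebook has no token yet


def delay_interleave(codes: List[List[int]]) -> List[List[Optional[int]]]:
    """Shift codebook k right by k steps.

    codes[k][t] is the token of codebook k at audio frame t.
    Returns steps, where steps[s][k] is what the model handles at
    decoding step s for codebook k (PAD where nothing is available).
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    steps = []
    # T frames and K codebooks need T + K - 1 decoding steps,
    # instead of T * K if every token were flattened into one sequence.
    for s in range(num_frames + num_codebooks - 1):
        row = []
        for k in range(num_codebooks):
            t = s - k  # frame that codebook k contributes at step s
            row.append(codes[k][t] if 0 <= t < num_frames else PAD)
        steps.append(row)
    return steps


def undo_delay(steps: List[List[Optional[int]]], num_frames: int) -> List[List[int]]:
    """Invert delay_interleave to recover the per-codebook streams."""
    num_codebooks = len(steps[0])
    return [[steps[t + k][k] for t in range(num_frames)]
            for k in range(num_codebooks)]


# Toy example: 2 codebooks, 3 frames, made-up token ids.
toy_codes = [[1, 2, 3], [4, 5, 6]]
toy_steps = delay_interleave(toy_codes)
```

With the toy input, the second codebook starts one step late, so the three frames take four decoding steps rather than the six a fully flattened sequence would need.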

I expect AI to reach every industry at an increasingly rapid pace as more research becomes available and new models build on earlier ones. MusicGen was trained on about 20K hours of licensed music, and the results are impressive.

Here are some generations I thought sounded nice. As more models trained on massive datasets reach the public, we will see more community efforts and community-built models as well, just as we did with AI art.

Medium Model

I used the less capable medium model (1.5B parameters, approx. 3.7 GB) to show that you can get reasonable results even on relatively modest hardware. Here is some lofi generated with the medium model.
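For anyone who wants to try this themselves, here is a rough sketch of generating a clip with the medium model, assuming the audiocraft package is installed. The checkpoint name "facebook/musicgen-medium" is the one used by recent audiocraft releases (early releases used just "medium"). The function name, output file stem, and prompt text are hypothetical and not the settings I used for the sample above.

```python
def generate_clip(prompt: str,
                  duration_s: float = 8.0,
                  checkpoint: str = "facebook/musicgen-medium",
                  out_stem: str = "lofi_sample") -> str:
    """Generate one clip from a text prompt and save it as a WAV file.

    Assumes the audiocraft package is installed; the first call
    downloads the model weights, which takes a while.
    """
    # Imported here so that merely defining the function does not
    # require audiocraft (or torch) to be available.
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    model = MusicGen.get_pretrained(checkpoint)
    model.set_generation_params(duration=duration_s)  # clip length in seconds

    # generate() takes a list of descriptions and returns a tensor
    # shaped [batch, channels, samples].
    wav = model.generate([prompt])

    # audio_write adds the ".wav" extension to the stem itself.
    audio_write(out_stem, wav[0].cpu(), model.sample_rate, strategy="loudness")
    return out_stem + ".wav"


# Example call (hypothetical prompt; needs audiocraft and the weights):
# generate_clip("chill lofi hip hop beat with soft piano")
```

Swapping the checkpoint argument for another model size should be the only change needed to compare outputs.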

Large Model

A step up is the large model (3.3B parameters, approx. 6.5 GB). It produces slightly better-sounding results.

What is that melody?

There is also a ‘Melody’ model, a 1.5B-parameter version that can additionally follow a reference melody.

MusicGen has a few limitations, most notably the lack of vocals:

  • The model is not able to generate realistic vocals.
  • The model has been trained with English descriptions and will not perform as well in other languages.
  • The model does not perform equally well for all music styles and cultures.
  • The model sometimes generates end of songs, collapsing to silence.

However, future models and efforts will remedy these points. Given how fast machine learning is advancing, it is only a matter of time before a model trained to generate vocals is released.

Published 2023-06-10 18:36:40