Hacker News
MusicLM: Generating music from text (arxiv.org)
291 points by georgehill on Jan 27, 2023 | 107 comments




I love that there's a whole section for accordion examples. I think the accordion rap has a bad word in it. Accordion techno works surprisingly well.

They really need to bump up to 48 kHz so all the music doesn't sound like it's being played over a telephone. A factor of two in the cost shouldn't be prohibitive. So much of the audio generation stuff I've seen has fatal flaws like this baked into the dataset and/or training process that ensure the output can't sound good even in theory. It's kinda frustrating.

It's also frustrating that nobody AFAIK has trained a music model on an actually large dataset. We're training large language models on a significant fraction of the text of the whole internet! Where are the audio models trained on a significant fraction of all recorded music? This one was trained (in part) on a dataset of 280k hours of music, if I read the paper correctly. I don't think that comes anywhere close to being a significant fraction of all recorded music.
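For a sense of scale, here is a back-of-the-envelope sketch of what 280k hours amounts to. The 24 kHz rate is an assumption taken from elsewhere in this thread, and no fraction of "all recorded music" is computed because nobody here gives a figure for that total.

```python
# Back-of-the-envelope arithmetic for the 280k-hour figure quoted above.
HOURS = 280_000
SAMPLE_RATE_HZ = 24_000  # assumption: the 24 kHz rate mentioned elsewhere in the thread


def hours_to_years(hours: float) -> float:
    """Continuous playback time in years, using 365.25-day years."""
    return hours / 24 / 365.25


def total_samples(hours: int, sample_rate_hz: int) -> int:
    """Number of mono audio samples contained in `hours` of audio."""
    return hours * 3600 * sample_rate_hz


years = hours_to_years(HOURS)
samples = total_samples(HOURS, SAMPLE_RATE_HZ)
print(f"{years:.1f} years of continuous audio")
print(f"{samples:.3e} mono samples at {SAMPLE_RATE_HZ} Hz")
```

So the dataset is roughly three decades of nonstop listening, which is large for one person but says nothing by itself about how it compares to everything ever recorded.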


This is the best music that incorporates AI that I have heard by a long shot - process is here:

https://ooo.ghostbows.ooo/about/

Music is on all the usual streaming platforms, the album is called Shadow Planet, band is The Cotton Modules. Also avail on their website:

https://ooo.ghostbows.ooo/


The "problem" is copyright: the music industry has way more power than the people whose copyrights were infringed by Stable Diffusion et al. And probably no artist who makes a living from their music will voluntarily give that music to AI.


They should be able to get a significant amount of royalty free music. Granted it will generally be of a lower quality and/or be older but it should still be more than significant.


Pirates will gladly provide that content to the AI. Look what's happening with deepfakes.


are there upscalers in the wavenet world like there are in convnets?


Epic soundtrack using orchestral instruments is somewhat interesting as electronic music but to me everything else is pretty bad.

There are a million industrial techno and IDM tracks with repetitive, hypnotic rhythms that absolutely no one listens to anymore. Most of the founders of the whole genre have moved on from lack of interest.

We have been able to do really good generative music in Reaktor for the last 20 years. Much better than anything on this page but no one cares.

People in general want to hear the same slight variation in music over and over as background noise.

The people that are saying how great all this is will be bored with it in 2 weeks or less. Pure technological kitsch.


Let's be clear, this is a tech demo, the application is not going to be prompting infinite streams of low quality music. The application of this stuff is to enable creative tools (got a beat and some vocals, give me a bassline and backing chorus), real time computer jam partners, to provide "good enough" music for indie games/movies.

These models are still in their early phases - If AI was combustion engines, we wouldn't even be to railroads yet. Despite that, they're still fun to play with and useful when used correctly.


There are tons of relatively cheap music libraries with human-composed and human-produced music that is vastly higher quality than MusicLM's output - they are used for documentaries/commercials/low-budget projects and are loaded with metadata that make it easy for directors and editors to quickly find what they are looking for. Low-quality, weird-sounding AI-generated music with wonky rhythms, an inconsistent beat, bizarre digital artifacts and a generally unpleasant lo-fi sound that probably has to be patiently generated and listened to over and over again to find a usable result is highly unlikely to beat out already very cheap, much higher-quality and easily searchable human-written music for even the lowest budget projects.


Yeah, as far as electronic music goes, maybe people don't realize just how good (and automated) the current tools are. And have been for years.


The first version of Dall-E was mediocre. Look where we are now.


It's still pretty mediocre if you're being objective. The true hit rate is very low. Granted, it can produce some amazing things from time to time.


Holy hell those are insanely good. Best generated music I've heard so far.

I was born too soon to explore the stars, born too late to be a 17th century pirate... but born just in time to explore these incredible neural networks come to life.


The vocals (on second page of table) are really interesting. If you walked into a club, you might not notice they are nonsense, but it's just good enough BS to be convincing.


I just popped in to mention that part. It kinda feels like when Dalle generates text - like from a distance you recognize it, but it is utterly gibberish. It's interesting because there are certain parts of machine generated outputs we zero in on - for image generation it's typically eyes and, if it has it, text. It will be interesting to watch what evolves as tell-tale sign for computer generated things.


Eyes or rather hands?



Totally fails on "swing" and "relaxing jazz".


This should replace the OP link


dope, can it queue up weird al? I am curious how generative would work at coming up with parody


Note that there's a second thread on the front page right now that links to the examples:

https://news.ycombinator.com/item?id=34541693

Perhaps they could be merged?


I’m writing my second video game (old school arcade style side-shooter) and hope I can use this to generate background chip tune music!


May I recommend learning how to use tracker software? It is very fun to play with!


Can you recommend a tracker software package to use?


Sorry for the late reply, I would recommend Famitracker for beginners as it is the most user-friendly out there. For a modern tracker try Renoise, it is basically your everyday DAW in tracker form.


what's a tracker software?


It's software used to create music. Typically you build sections that you loop and intermingle. It's kind of like writing small sections of sheet music for multiple instruments at a time, but unlike many MIDI composers, the notes aren't shown as notes. It's basically what came after manually programming all the chiptunes as part of the game's code. Instead, trackers use general parsing of a format that describes the audio in terms of pitch, volume, dropoff, echo, etc. of samples (at least samples were usually used with later trackers anyway).
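To make the "sections that you loop and intermingle" idea concrete, here is a minimal sketch of how a tracker song could be modelled. The `Cell` type, the pattern data, and the order list are all hypothetical and do not follow any particular tracker's file format.

```python
# Toy tracker model: patterns are lists of rows, a song is an order list of patterns.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cell:
    note: Optional[int]  # MIDI-style note number; None means an empty row
    sample: int = 0      # index of the sample (instrument) to trigger
    volume: int = 64     # 0..64, the volume range used by classic MOD files


def note_to_hz(note: int) -> float:
    """Equal-temperament pitch, with note 69 = A4 = 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)


# Two short single-channel patterns (made-up data).
patterns: Dict[int, List[Cell]] = {
    0: [Cell(57), Cell(None), Cell(60), Cell(None)],
    1: [Cell(64), Cell(62), Cell(60), Cell(None)],
}
order = [0, 0, 1, 0]  # patterns are reused and interleaved via the order list


def flatten(order: List[int]) -> List[Cell]:
    """Expand the order list into the row sequence a player would step through."""
    return [cell for idx in order for cell in patterns[idx]]


rows = flatten(order)
for cell in rows:
    if cell.note is not None:
        print(f"sample {cell.sample} at {note_to_hz(cell.note):.1f} Hz, vol {cell.volume}")
```

A real tracker has several channels per row plus effect columns, but the core idea is the same: small reusable grids plus a playlist of which grid to play next.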


A 'tracker' is a usually sample-based piece of music-making software. AFAIK they were originally developed by demoscene groups (concurrently with game studios that used the code privately, whereas the demogroups would release binaries for others to use for music-making with few limitations). These 'sample mixing engines' were developed in order to add music to their productions (demos and games), and it became its own whole world. Fully agree with sibling comments, module trackers can be huge fun! Near-zero barriers to entry, and there's still some great stuff being made in this format too!


https://youtu.be/roBkg-iPrbw

I think this showed up on my recommendation. I didn't know anything about tracker music, but this video explains it chronologically and very well.


yes, fits perfectly for those kinds of games


We have an AI platform that finds music from video, images, or game screen recordings, and supports musicians at the same time. Check it out: https://avmapping.co/


I hope you also consider getting some music from an actual producer!


Why is there this need for humans to be involved at all stages of production?

Do we still need humans to knit and handwash our clothes? To live in a caboose or lighthouse? To operate our elevators? To deliver the milk?

Why should anyone have to learn to draw in the future when machines promise to do a better job? There are so many better uses for our time, and too many things to do for our short lives to handle.

I want to compose music, despite no practice or formal training. I want to make an entire movie by myself with no other humans involved. Tech that enables these things will be empowering.


It may be empowering, but it is not yet clear that the result will be appealing. Not trying to be an AI-skeptic, but I find art to still be mostly about communication; when I listen to some music I enjoy I feel like I have some shared experience with the author which they are able to communicate via music. I have yet to see this effect in AI-produced stuff, but even if the effect would be fully imitated, it is still not clear that it would have the same appeal.

An analogy: chess AIs are clearly superior to human chess players by any reasonable measure. I still enjoy playing with other people (even online, even anonymously) infinitely more than playing with an AI, no matter how well calibrated/tuned to my level.


Yes, nothing communicates the lived human experience like chip tune music playing in the background of a video game.

I hear a lot of romantic notions yet I don't see a lot of people flocking to solo acoustic singer-songwriters!


These AIs will only ever be as good as what gets fed into them. It follows that without something going in, they will output nothing. That source material will always be human in nature, at least until these neural networks get large enough for emergent consciousness to exist, at which point we might see actual creativity from a machine.

An AI without new inputs to consume will not evolve creatively, and you will get bored of its output quite quickly, I think.


This might be something people arrive at at different ages, but at the end of the day other humans are ultimately why you’re doing what you do.

Yes, today ML can write you some run of the mill music and design assets, tell you how to develop the game, it can even play your game and write a review on it, but what’s the fun and/or point in that?


Just keep in mind, that movie will come out the same time when possibly 6 billion others will release theirs.


Sorry, it's not clear: do you want to compose music yourself, or do you want someone else to do it for you?


This AI did, and it shares profits: https://avmapping.co/


Why?


I want to do the reverse: pass in some music and have it describe what I'm hearing


https://samplab.com is doing something like that.


Yes! And not just notes or chords or time signatures, but real analysis of how the piece works.


I believe parts of this model could be adapted to accomplish this. The "mulan" network generates embeddings from music and text so that they "match" each other, i.e. audio segments and text captions that go well together would be mapped to the same embedding. At run-time, they use embeddings from the text to generate the music. So conceivably the opposite could be done, using the "mulan" embeddings from audio only to generate text, by replacing the "soundstream" decoder with a text LLM. That said, there's probably simpler methods that could generate decent results, and I'd be surprised if there isn't already some out there. It's worth noting that along with this work they released a dataset of music with captions that could also help training new models for this task.
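As a rough illustration of the shared-embedding idea, here is a sketch of one of the "simpler methods": caption retrieval by cosine similarity rather than swapping in a text LLM as a decoder. The three-dimensional vectors and captions are made up, and real embeddings would come from trained audio and text encoders that are not shown here.

```python
# Retrieval sketch: pick the caption whose embedding is closest to an audio embedding.
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical text-side embeddings (stand-ins for a text encoder's output).
caption_embeddings: Dict[str, List[float]] = {
    "calm solo piano": [0.9, 0.1, 0.0],
    "fast techno with accordion": [0.1, 0.9, 0.2],
    "epic orchestral soundtrack": [0.0, 0.2, 0.9],
}

# Hypothetical audio-side embedding (stand-in for an audio encoder's output).
audio_embedding = [0.2, 0.8, 0.3]


def best_caption(audio: List[float], captions: Dict[str, List[float]]) -> str:
    """Return the caption whose embedding has the highest cosine similarity."""
    return max(captions, key=lambda text: cosine(audio, captions[text]))


print(best_caption(audio_embedding, caption_embeddings))
```

Retrieval can only return captions that already exist in the candidate set, which is why a generative decoder would be needed for free-form descriptions.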


MusCaps: Generating Captions for Music Audio: https://arxiv.org/abs/2104.11984


Working on the same dream: https://parture.org


I think you’re missing the point. Music is an experience, not a checkbox or a textbook that needs to be summarized.


Awesome, this is probably the best one to date even compared to the recent Riffusion, Jukebox and the older MIDI generating MuseNet.

Especially the conditioning on humming and whistling examples is cool, but too bad they use very common melodies for that, so it's an easier job for the model and harder for us to judge how well it would work on less common melodies.


> especially the conditioning on humming and whistling ... but too bad they use very common melodies

100%. Now if you add the detailed "Painting Caption Conditioning" to the mix, you can create a melody in the style of… an image, which, in a way, is kind of a controlled artificial synaesthesia [1].

[1] https://en.wikipedia.org/wiki/Synesthesia


As a musician I don’t see these tools as competing with what I do but enabling me to do so much more. It takes a lot of time and money to create a professional recording for my songs and drum machines and synths just don’t work for Americana. These tools offer the possibility of backing tracks that sound like Willie Nelson’s band from the 70s but at a fraction of the time and effort.

I can’t wait until they get to the point where they’re more composable or auto-accompany given an acoustic guitar and vocal input.


AI music is the next thing coming, wait for copyright lawsuits to fall like bombs.

I find it fascinating that MidJourney can make a 3D model of my face from a low quality image, rotate it in space, apply it on someone else's body, add coherent shadows and backgrounds, with a very credible result, and yet an AI cannot generate a decent song, which is 1-dimensional and has probably much less internal modelling to care about?

One reason I can think of is that eyes are "integrators" and ears are "derivators". That is, the human ear is very sensitive to small differences, whereas vision cares more about the ensemble? I don't know, but I think that AI music will come one day. It may not be as great as human music, but it will suffice for, say, putting a music background for your startup cheap marketing ad.


Possibly just anecdotal evidence, but it seems like the pool of good music composers is much much smaller than the pool of good visual artists. Maybe that is an indicator of how "hard" each field is.


Think that depends heavily on what you consider to be a "composer" and what you consider to be "good" in both art and music.

Writing somewhat novel music to a formula arguably has a much lower skill bar than producing good representative art (especially if you allow sequencers as composition and disallow basic digital retouching of photos as visual art), but producing something that genuinely stands out may be harder


> which is 1-dimensional and has probably much less internal modelling to care about

There is an ocean between “I like that sound” and a final, produced piece of recorded music. Much of that ocean being ineffable.

Trying to reduce it to a set of parameters around the final waveform and you’ve missed the entire point.

> but it will suffice for, say, putting a music background for your startup cheap marketing ad.

We need less of that—not more. It’s like a climate disaster of the soul.


The ineffable will inevitably be effed by AI.


Eff it indeed.

The only thing it will really F is that bridge that art built to bring people in closer touch with themselves.

Thinking we can shortcut self-discovery, individuation and essential human connection with an algorithm is insane hubris. Icarus’ wings are melting.


In my opinion, good music is multidimensional.

Just like a 3D model is really only 2D on a screen, music can encompass so much more than what is heard on the surface level with no imagination.


It will be as good as the music it's trained on, no better and probably no worse.


I don't really understand why this approach is pushed for music. You can overpaint an image, but you can't do that with a song. Cutting an image to reintroduce coherence is easy too. For a song you need midi, or another symbolic representation. That was the approach of pop2piano (unfortunately it is limited to covers, not generating from scratch). And even if a song generated this way is OK, listening to half an hour full of AI mistakes is really tiring. With a symbolic representation you could at least fix the mistakes if there is one good output.


I understand what you're saying, although it could be argued that at least for some types of image tasks one would prefer something like an SVG output with layers to make it easier to edit.

For music, I think it's partly an academic question of "can we do it" rather than trying to maximize immediate practical usefulness. There's already quite a bit of work on symbolic music generation (mostly MIDI), a lot of it quite competent, especially in more constrained domains like NES chiptunes or classical piano, so a full text-to-audio pipeline probably seemed a more interesting research problem.

And for a lot of use cases, where people might truly not care too much about tweaking the output to their liking, the generated audio might be good enough; the examples were pretty plausible to my ear, if somewhat lo-fi sounding (probably because it's operating at 24kHz, compared to the more standard 44-48kHz).
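The lo-fi impression follows from the Nyquist limit: a sampled signal can only represent frequencies up to half its sample rate. A small sketch of the arithmetic is below. The cost ratio only counts samples per second; whether compute really scales linearly with sample count depends on the model, which is an assumption here.

```python
# Highest representable frequency and relative sample count for common sample rates.
BASELINE_HZ = 24_000  # the rate mentioned in the comment above


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a signal sampled at `sample_rate_hz` can represent."""
    return sample_rate_hz / 2


def sample_ratio(sample_rate_hz: int, baseline_hz: int = BASELINE_HZ) -> float:
    """How many times more samples per second than the baseline rate."""
    return sample_rate_hz / baseline_hz


for rate in (24_000, 44_100, 48_000):
    print(
        f"{rate} Hz: content up to {nyquist_hz(rate):.0f} Hz, "
        f"{sample_ratio(rate):.2f}x the samples of {BASELINE_HZ} Hz"
    )
```

At 24 kHz everything above 12 kHz is gone, which removes much of the "air" in cymbals and sibilants, while human hearing is usually quoted as extending to roughly 20 kHz.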

In the future a more hybrid approach probably makes sense for at least some applications, where MIDI is generated along with some way of specifying the timbre for each instrument (hopefully something better than general MIDI, though even that would be fun; not sure if it's been done). I'm sure that in the near future we'll see a lot more work in the DAW and plugin space to have these kinds of things built-in, but in a way that they can be edited by the user.


awesome-sheet-music lists a number of sheet music archives https://github.com/ad-si/awesome-sheet-music

Other libraries of (Royalty Free, Public Domain) sheet music:

- https://musopen.org/

Explainable artificial intelligence: https://en.wikipedia.org/wiki/Explainable_artificial_intelli...

FWIU, Current LLMs can't yet do explainable AI well enough to satisfy the optional Attribution clause of e.g. Creative Commons licenses?

"Sufficiently Transformative" is the current general copyright burden according to precedent; Transformative use and fair use: https://en.wikipedia.org/wiki/Transformative_use


I would offer a counterpoint: ChatGPT uses text, which is analogous to the MIDI of spoken language. Now, what's to prevent a speech-to-text and text-to-speech AI slapped on both ends to make it audibly conversational? Or train the franken-model end to end as one monolithic model in hopes that it might make other interesting connections.

That being said I have seen very little in the AI transcription space (Listening to music, and outputting legible sheet music while identifying instruments) especially when it comes to multi-instrument music. I've also seen very little in the midi generation space aside from what you mentioned.

It would be nice to see midi to midi AI remixes, analogous to how Chat GPT can (sort of) turn wikipedia into rap lyrics.


Oh, there's already all sorts of work mashing together GPT text with text-to-speech and speech-to-text. There were some viral videos featuring "Steve Jobs" talking about AI and one being interviewed by "Joe Rogan". I also saw somewhere here on HN a website featuring an "infinite" conversation between "Zizek" and "Herzog", in a similar vein.

Though something end-to-end would probably make more sense; text is a lossy way to encode speech, as it doesn't capture a lot of nuance that goes into spoken words. Interestingly, the same could be said of sheet music and even MIDI in the music realm.

As for transcription for multi-instrument music, there's been a lot of work on that front, and a lot of progress that's already finding its way into commercial products. The latest I've seen in this domain is https://samplab.com/, which has some compelling videos of not only transcribing multi-instrument audio but editing it on a note-by-note basis.


The difference is that for images almost everything is done with raster graphics, vector is the niche exception. And for most uses where it makes sense (e.g. a logo) you can easily convert a raster to SVG automatically generally without requiring any manual tweaking.

If the argument is that "maximize immediate practical usefulness" is not something interesting, I would say it makes sense to say "I don't understand why this approach is pushed". Anyway.


If it could output one track per instrument (aka stems) that would probably be the most useful thing.


You can absolutely "overpaint" sound, as sound can indeed be represented as an image and be extended in frequency or in time. In fact this is exactly what Riffusion is doing.
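For readers wondering how sound becomes an image, here is a minimal sketch of a magnitude spectrogram built with a naive DFT from the standard library. It only shows the audio-to-image step, not Riffusion's actual pipeline, and the sample rate, frame size, and test tone are arbitrary choices for illustration.

```python
# Turn a 1-D signal into a 2-D "image": rows are time frames, columns are frequency bins.
import cmath
import math
from typing import List

SAMPLE_RATE = 8000  # Hz, arbitrary for this sketch
FRAME = 64          # samples per frame, so each bin spans 8000 / 64 = 125 Hz


def dft_magnitudes(frame: List[float]) -> List[float]:
    """Magnitudes of the non-negative-frequency DFT bins of one frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal: List[float], frame_size: int = FRAME) -> List[List[float]]:
    """Split the signal into non-overlapping frames and take the DFT of each."""
    starts = range(0, len(signal) - frame_size + 1, frame_size)
    return [dft_magnitudes(signal[i:i + frame_size]) for i in starts]


# A 1000 Hz test tone, four frames long.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(4 * FRAME)]
image = spectrogram(tone)

# The brightest "pixel" in each row marks the dominant frequency of that frame.
peak_bins = [row.index(max(row)) for row in image]
peak_hz = [k * SAMPLE_RATE / FRAME for k in peak_bins]
print(peak_hz)
```

Once audio is in this form, image-style operations such as extending the picture to the right (more time) or upward (more frequency) become possible, though turning an edited magnitude image back into audio also requires recovering phase.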


You sure _can_, but that's not how a music producer works. I meant it as how you could actually use the output in a professional workflow. For images it's definitely easy since it integrates in a meaningful way into the pipeline. And in many cases it improves over photobashing.

In music, 'fixing' problems in sound is way more costly, so a producer would say that it's not something you can work with in any reasonable manner.


Oooh, using LLMs for "inpainting" audio to remove pops from a vocal take! Or to fix a late snare hit! Of course these tasks can be done right now they just take minutes instead of what could be seconds.


“For a song you need midi”

Pretty sure the Beatles never handed George Martin any midi files. What’s the symbolic representation that captures the tone of every bend in a Hendrix solo? Did Daft Punk go back and grab the raw master stems of the old vinyl recordings they used to assemble their tracks?

Music producers have been astonishingly creative given inputs in a vast range of formats. Sheet music and midi are one tool, but ultimately it’s about combining sounds in the mix isn’t it?


Why do you strawman my argument?

The issue with combining sounds in the mix is the issues add up in a super-linear way. Because of that you could even say that modern music producers tend to care too much about perfect sounds.

The Beatles did not have the same production requirements modern music has now that there is high-fidelity equipment. The music industry as a whole has changed a lot. Modern music production has multiple passes just to clean up one "perfect vocal take", like de-essing, breath removal, etc. This is insanely more advanced than what was the standard in the 60s. I'm not saying it's better either.

>What’s the symbolic representation that captures the tone of every bend in a Hendrix solo?

That's not how music works. You refer to the score, and the precise way you bend your tones is up to your interpretation. Even if you wanted to copy one-to-one Hendrix, you'd copy the style, not one performance exactly, so capturing every little bend does not really matter.

Which could be an AI project: Hendrixify a guitar track.


Pretty sure the only score Jimi ever referred to was scoring some weed.

My point is that performances are as valid an input to music production as anything else.

And that these generative snippets are potentially a source of ‘performances’ that producers will tap in creative and interesting ways.


Yet another cherry picked attempt at music gen with a "demo" page that only contains the outputs that happen to not sound like incoherent noise. There's a reason ChatGPT and Midjourney are so popular, but no music-gen tools have even come close: you can actually create stuff with them that is useful and/or enjoyable. Good music gen is much harder, and the reasons for this (still unclear) are pretty important to the future of AI, imo.


That's quite a harsh critique that doesn't seem warranted. There are quite a lot of different examples with varying quality. Granted, it's likely that these are to some extent cherry picked, but it's unlikely that these are rare and that most output sounds like incoherent noise, as you seem to insinuate.


Their current demos seem worse than riffusion results I get on average. Music gen is hard because music is inherently composed of many many different instruments, each with unique sound and function. Simply training end-to-end will almost always end up badly.


I'm sorry, I like riffusion as much as the next guy, but in what world is any riffusion example better than the literal first example on this demo page?


Might be personal preference, but I don't think those examples are good at all. The only thing I am impressed with is the story mode and conditioning. I typically use riffusion to generate swing/electropop music, could be I'm too biased.


> Music gen is hard because music is inherently composed of many many different instruments, each with unique sound and function. Simply training end-to-end will almost always end up badly.

Yes! That's what I figured when I started my project to create an AI assistant for melodies (http://melodies.ai/). I'm quite sure that splitting it up is the way to go, and that the first AI hit song will not be created as a full song but by combining instruments/vocals.


FYI, they don't actually train end-to-end, they have three separate models that are each trained via self-supervised learning independently (I think, I haven't read the paper very carefully).


Makes one wonder if there are models that rely on loops.


Well... not all releases have to be toys for your average twitter user to play with. Most of them are just academic stuff to show team progress.


I'm not saying it about this one, but some are akin to Nikola rolling their truck down a hill. Like the Google phone assistant one that would call and make appointments for you and stuff. Very likely cherrypicked at the time. Or Microsoft's Milo, which was almost entirely staged.


They said they're not going to allow people to use it based on fear of plagiarism accusations / music industry lawsuits. Ugh, typical.

I want someone to train it on public domain music. Kind of like the YouTube Audio Library but I assume that's not exactly the right license for this. But with sufficient effort someone could make a lot of recordings of public domain music for this purpose and build something that the RIAA thugs couldn't actually touch.


I love the current rate of progress in generative music ml research. Riffusion model ( https://www.riffusion.com/about ) was trending here just a month ago. Pity Google never shares the actual models, would love to hear artists go wild with it.

btw, some of you might find interesting, diffusion-based / generative concept album: https://wielu.bandcamp.com/album/vectorstep-ep


Any plans on releasing the actual audio data and not just a csv with links to YouTube IDs?

Also maybe a silly question, but what are the legal ramifications of downloading these YouTube videos and training on them yourself? Google must have some rights, but what about people outside of Google?


I wonder if this technology will eventually revolutionize music the same way synthesizers did. Or at least lead to music and effects/filters that are simply not possible with current DAWs and plugins.

Custom generation of samples from text alone seems revolutionary.


Google: kings of releasing papers but never shipping anything


Interesting concept! I wonder how the hurdles around copyright will be solved.


Where can I get the MusicCaps dataset? I can't seem to find it



Yet Another ClosedAI research project with no intention of releasing..

what a joke


I hate how good this is.


Weights when though. :-(


For this specific paper, probably never. From the final section:

"""We acknowledge the risk of potential misappropriation of creative content associated to the use-case. In accordance with responsible model development practices, we conducted a thorough study of memorization, adapting and extending a methodology used in the context of text-based LLMs, focusing on the semantic modeling stage. We found that only a tiny fraction of examples was memorized exactly, while for 1% of the examples we could identify an approximate match. We strongly emphasize the need for more future work in tackling these risks associated to music generation — we have no plans to release models at this point."""


Could someone translate this into layman's terms?


It actually copied some stuff. Not much, but it did, and that's bad.


Knowing Google, they will somehow never release it by invoking a "think of the children", like every single one of their ML papers.


Can't wait to generate infinite music.


Does anyone know how to reach out to the team?


It would be cool to have a music service that is trained on your music playlist and just generates music you might like.

Also, is this technically a Vocaloid? :)


I'm absolutely flabbergasted by the rapid progression of music-related machine learning models. I've always seen music as the "final frontier". With images you can have tiny errors and noise that isn't that noticeable by the human eye, but with music everything has to be impeccable and if a note is slightly off, you instantly hear it. Machine learning models that will be able to create outstanding music, imo, will mark the end-game of creative AI.

Anyone want to share their experiments with MusicLM, feel free to join community-fan discord and subreddit:

https://reddit.com/r/MusicLM/

https://discord.gg/pjVcsyfCJR

BTW, I really hope this gets integrated into software like Ableton, similarly to how image generators are slowly being integrated into Adobe Creative Suite.


> Anyone want to share their experiments with MusicLM, feel free to join

How can we experiment with MusicLM?

The paper’s last sentence says, “We have no plans to release models at this point.”


Makes you wonder if they even read one sentence of the paper before creating a Discord & Reddit community on throw-away accounts, just to be first and make sure that they'll control a new AI model's community :)

But, I guess, one way to make money currently.


> if a note is slightly off, you instantly hear it

I don't know about this. I know professional musicians and they say that they make little mistakes all the time, but part of being a professional is being able to pretend those mistakes didn't exist because the audience doesn't know what it's supposed to sound like.


> but part of being a professional is being able to pretend those mistakes didn't exist because the audience doesn't know what it's supposed to sound like.

Very true, however it's something musicians have to specifically train for, because to a trained ear they can become painfully obvious and it's tempting for the musician to (at least) cringe in a way that the audience will detect. Musicians in the audience, though, can often tell when you flub a note, regardless of how you play it off. That's really the bar we are looking to meet.


The fact that mistakes in music are so obvious comes down to the fact that underneath music is a lot of structure and pattern. That there can be a ‘wrong’ note implies that there is redundancy in the signal, and predicting the right next sound is precisely the kind of thing these sorts of generative AIs excel at.



