Hi there! One of the authors here (Prem). Thanks for reading, and for mentioning Spleeter! Spleeter is awesome. The idea behind this book is to give people the tools and code to train their own versions of Spleeter.
I was playing with Spleeter a bit earlier in the year for drum practice and it was great. You can do drum covers (or whatever instrument) without the other (typically famous and amazing) musician stepping all over your brilliant amateur playing!
Nice, I just tried Demucs on my go-to track (Fleetwood Mac's "Dreams", mainly because it happened to be on my computer and I figured I could hear it multiple times without going crazy, but it also seems to be a reasonably good test, since some of the effects make separating the voice non-trivial) and the end result was very impressive. Definitely much cleaner sounding (sharper, with fewer FFT artefacts) than Spleeter, especially on the drums and vocals.
I also tried it on a dance track (BT - Mercury and Solace) and the results were much less impressive. I guess this maybe reflects what it was trained on?
There's also Spleeter Web [1], a self-hostable version of a Spleeter web app. A fun little project I've been working on! I'm thinking of supporting additional models such as Demucs [2], which seems to perform better than Spleeter.
I always wondered why music files don't have separate channels for each instrument or voice. All the sounds are mixed into one or two channels. Why? Wouldn't it be better to do this as late as possible?
Watching waveforms that actually match what I'm hearing is so pleasant...
One is the problem of distributing the data: until recently, when bandwidth became inexpensive enough, it was prohibitive to send all the tracks. 30+ tracks is not uncommon (though not all tracks are active for the entire song). 74 minutes of music fit on a standard CD; if a 20-track song shipped with all its tracks, a disc would hold under 4 minutes of actual music (74/20 ≈ 3.7).
Second, most songs have a great deal of post-production: effects are applied to dry tracks, and often groups of tracks are mixed down to a "bus" track, which is then mixed into the next level of the mixing tree. E.g., there might be six mics on a drum kit, and the engineer will work hard to shape the sound mic by mic, changing EQ, reverb, panning, etc., to get the desired effect. Then that final drum bus is mixed into the next level. Often these mixing and effects parameters are dynamic, changing throughout the song.
There is no standard for these effects, nor for the mixing parameters, so to achieve the final result would require, essentially, shipping the workflow for a specific DAW application.
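In pseudocode terms, the mix is a tree of sums with per-node processing. A toy sketch of that structure (numpy assumed; the tracks and effects here are made-up stand-ins, not any DAW's actual API):

    import numpy as np

    def bus(tracks, effects):
        """Sum the inputs, then run the bus's effect chain over the sum."""
        mixed = np.sum(tracks, axis=0)
        for fx in effects:
            mixed = fx(mixed)
        return mixed

    # Made-up stand-ins for recorded tracks and effects.
    rng = np.random.default_rng(0)
    kick, snare, overheads, bass, vocals = rng.standard_normal((5, 1000)) * 0.3
    saturate = lambda x: np.tanh(x)        # crude non-linear "glue"
    gain = lambda g: (lambda x: g * x)     # simple level control

    drum_bus = bus([kick, snare, overheads], [saturate, gain(0.8)])
    master = bus([drum_bus, bass, vocals], [saturate])

Undoing that tree from the final stereo file is what source separation is trying to approximate.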
Depending on how granular someone wants to get, you can break a session for any given pop record into 60+ channels. Drum kits in particular are among the most complicated instruments and can have anywhere from 3 to 16 channels dedicated to them. It's not uncommon to use three microphones to capture different aspects of the kick drum alone: 1) inside the kick by the beater, 2) outside the kick by the resonant head, and 3) a mic that captures the subby "chest" frequencies that you only feel in your body.
Additionally, certain groups of instruments have interdependent relationships and react to one another. For example, when the kick drum hits, the bass might "duck down" out of the way (temporarily) so that they aren't occupying the same sonic space. This dynamic relationship between instruments helps to make a song groove more and have greater impact. Any given song could have numerous dynamic relationships like this, set up by the audio engineer. Some of the magic or "vibe" of a given recording can go away when you selectively mute or disable certain instrument groups.
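That kick/bass ducking is usually done with a sidechain compressor. A minimal sketch of the idea (numpy assumed; mono stems of equal length at the same sample rate; all parameter values are made up for illustration):

    import numpy as np

    def duck_bass(bass, kick, sr, threshold=0.1, ratio=4.0,
                  attack_ms=5.0, release_ms=80.0):
        """Turn the bass down whenever the kick is loud (crude sidechain compressor)."""
        # One-pole envelope follower on the kick, i.e. the sidechain signal.
        atk = np.exp(-1.0 / (sr * attack_ms / 1000.0))
        rel = np.exp(-1.0 / (sr * release_ms / 1000.0))
        env = np.empty_like(kick)
        level = 0.0
        for i, x in enumerate(np.abs(kick)):
            c = atk if x > level else rel
            level = c * level + (1.0 - c) * x
            env[i] = level
        # Above the threshold, scale the bass toward the compressed level.
        gain = np.ones_like(env)
        over = env > threshold
        gain[over] = (threshold + (env[over] - threshold) / ratio) / env[over]
        return bass * gain

If a player re-mixed raw stems without reproducing this kind of keyed processing, that interaction would be lost.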
Sometimes the only option you have to preserve that exact balance between the instruments is to print the stereo bus. Printing the instruments individually can kill the vibe.
What I read is: if we had a sufficiently advanced format which could embed both the recording and... "sound shaders" (effectively), we could totally do that. And I'd love to see that.
We have that already: it's the native save format of the DAWs. But rather than diving into the specifics of audio, let me reach for an analogy that's closer to software developers.
Bouncing a full track to stereo is a lot like compiling software into bytecode/assembly. You get something usable and distributable to end users that you can reverse only partially. Part of that is intended. Just as with proprietary software, the exact combination of synthesizers, instruments, plugins and settings are closely guarded secrets of musicians, mixing and mastering studios.
And just like in software, some of those "shaders", a.k.a. plugins, are actually proprietary themselves and can cost several hundred dollars each, so redistributing the "source code" can be not just difficult but actually illegal or impossible.
As you see, the incentives generally aren't aligned towards this form of distribution.
Just like in software though, I think there's a set of people interested in actually distributing their source tracks, made with open source software — and IIRC, that's exactly what's happening here as the base of these source separation efforts!
Yeah, DAWs are pretty close, but I was thinking of distribution itself. Basically a PDF, but for music. You don't necessarily need to ship all the original bits to generate the optimised postscript - just the postscript which will display the same thing everywhere.
> Yeah, DAWs are pretty close, but I was thinking of distribution itself. Basically a PDF, but for music. You don't necessarily need to ship all the original bits to generate the optimised postscript - just the postscript which will display the same thing everywhere.
I think this could be done, probably more easily than doing the same for video production. An audio signal is as simple as a list of values...
Basically, we would just have every "channel"/"track" from a digital audio workstation rendered into a file. For very similar tracks only the differences need to be stored. Then we would also need one "master" channel which stores mastering information.
Now such a format would pose a couple of challenges. First of all, size. Lossless audio in FLAC takes quite a bit of space already; an entire album is usually around 100 MB. This would use even more space.
The next problem is that there would have to be a "standard" across digital audio workstations... how would tracks be classified? For example, Ableton Live allows one to group tracks and also to group the groups themselves. Effects can be added to groups, but people also use this feature just for organisational reasons. Other software might do this differently...
Then the next thing, which is probably the biggest challenge: the "player" would need to add the individual signals back together during playback, and doing this in real time has a performance cost.
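A very rough sketch of what such a player would do (numpy and the soundfile library assumed; the file names and per-track gains are made up, with the gains standing in for the "master" channel's mix information):

    import numpy as np
    import soundfile as sf

    # Hypothetical stems, each with a gain taken from the "master" channel.
    stems = {"drums.flac": 0.9, "bass.flac": 0.8, "vocals.flac": 1.0}

    mix = None
    for path, gain in stems.items():
        audio, sr = sf.read(path)  # assumes equal lengths and sample rates
        mix = audio * gain if mix is None else mix + audio * gain

    # Keep the sum from clipping, then export (or stream to the sound card).
    mix /= max(1.0, np.abs(mix).max())
    sf.write("mix.flac", mix, sr)

Summing a few dozen tracks is cheap; the real cost comes from running each track's effect chain in real time.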
All in all, I have been dreaming for years about managing songs and audio projects with something like git. It would open up a whole new world for collaboration and I think theoretically it would be possible.
That is one way to do it. I wonder if you couldn't apply the effects to the tracks themselves.
Take tracks A and B. If the final audio track is composited as λ(αA+βB), perhaps you could save λ(αA) and λ(βB) as separate tracks in the file to be composited as λ(αA)+λ(βB). That assumes λ is linear, but you can likely cheat a bit to make it work on non-linear functions as well.
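A quick numeric check of that linearity condition (plain numpy; the tanh here is a stand-in for any non-linear effect such as saturation or compression):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B = rng.standard_normal(1000), rng.standard_normal(1000)
    alpha, beta = 0.7, 0.3

    gain = lambda x: 0.5 * x           # linear effect
    clip = lambda x: np.tanh(2.0 * x)  # non-linear effect (saturation)

    # Linear: processing the bus equals summing the processed tracks.
    print(np.allclose(gain(alpha * A + beta * B),
                      gain(alpha * A) + gain(beta * B)))  # True
    # Non-linear: the two orders diverge, so pre-baking changes the sound.
    print(np.allclose(clip(alpha * A + beta * B),
                      clip(alpha * A) + clip(beta * B)))  # False

For linear, time-invariant effects like EQ and gain the factoring works exactly; for compressors, saturation, and anything level-dependent it doesn't, which is where the cheating would have to come in.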
But once you start going down that route, you need to assume that the player will mix those perfectly, and looking at history, I wouldn't take that for granted. Some shader compilers are buggy; imagine what would happen on low-power hardware. I guess it can only be done reliably if you control the player as well.
I wonder how long adoption would take. The hardware and software progress around sound seems to be way slower than around images. Which is understandable, as thanks to the lower data rates, stuff has been "good enough" for a long time.
By "music files" I assume you mean the records that you can buy off the shelf in a record store.
1) Bandwidth. The file size would grow roughly in proportion to the number of separated tracks (minus whatever the silent stretches compress away).
2) Probably low demand. Beyond music nerds and audio engineers, I don't think too many people would be interested in hearing the separated files.
3) It undermines artistic intent: Giving people the ability to remix a song that is outside of the artist's control would probably run counter to their vision of what the song should be. Artists and engineers spend a LOT of time perfecting every detail, and anything that works against that intent may be unwanted.
4) IP concerns. It takes control of the music away from the creator and the business interests once alternate versions exist. It's common for record companies to release "special editions" with the instrumental or acapella versions included as a bonus. By retaining these, labels can have extra revenue opportunities in their pocket for later.
Native Instruments tried to establish a standardized music format a few years back called "Stems". It's a file format that lets the listener play the entire song or any combination of its four constituent stems (drums, bass, instruments, vocals). It was designed with DJs in mind, and for people who want to make their music easier to remix. As cool and as welcome as the idea is, adoption was low and there didn't seem to be a large enough demand for such a format.
As others will no doubt mention, you can procure the multi-track recordings for a lot of famous song sessions if you know where to look online. Audio engineers also trade these amongst themselves once they leak, because they allow a detailed look into the craft of a finished recording, much like leaked source code for a piece of software.
> 3) It undermines artistic intent: Giving people the ability to remix a song that is outside of the artist's control would probably run counter to their vision of what the song should be.
I'm conflicted about that point. On one hand, I understand they want to control the end product. On the other, the two "albums" I love just for their sound are DMB's Lillywhite Sessions and some barely mixed Deep Purple garage recordings. I suspect it's mainly due to the lack of compression, since I also prefer the older raw records to modern remasters, but there could be something more I can't consciously identify.
Secret Chiefs 3 come to mind. They recently started selling ‘Geek Pack’ downloads for some of their songs. It’s a folder containing the hi-res stems for a given song. It also includes detailed notes on the song’s production. For a fan like me, it’s a very welcome look into their process.
A much more common approach is for artists to release an instrumental or vocal acapella of a song. It’s not uncommon for hip hop artists to release entire instrumental versions of their albums.
One effect of this is that other rappers are encouraged to take the opportunity to rap their own verses over someone else’s instrumental. This occurs a lot in the tradition of rap mixtapes, which are halfway between albums of original music and riding on someone else’s beat.
Worth noting that the mix is also typically not the final stage of production, or at least of the signal flow. At minimum, there’s almost always some sort of dynamic processing, such as a multiband compressor and/or limiter, applied to the full mix, which makes recreating the final product from the stems much less straightforward.
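A toy illustration, using hard clipping as the crudest possible "limiter" (numpy assumed; real mix-bus chains are far gentler, but the same reasoning applies):

    import numpy as np

    rng = np.random.default_rng(1)
    stems = rng.standard_normal((4, 1000)) * 0.5  # four toy stems

    # The "mastered" release: the summed mix with bus dynamics applied.
    master = np.clip(stems.sum(axis=0), -1.0, 1.0)

    # The stems no longer sum to the release; the residual is the bus processing.
    residual = master - stems.sum(axis=0)
    print(np.abs(residual).max() > 0)  # True if the "limiter" ever engaged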
Agree with these points; oddly, I love the instrumental of a song more than the song with vocals. And although Deezer is good, it's not good enough. Back to searching through sounds without vocals.
Because the end product will have only left and right channels, and 'mixing and mastering' is a huge part of the creative process, not something that can be done by the 'player'. And of course 'IP': though some artists do release content in this format in the hope of remixes, many do not want that out there.
David Byrne’s book “How Music Works” has an interesting bit about how the common contemporary way of recording (tracking different parts separately) can feel very unnatural for bands that are used to playing live, in an actual acoustic space. Just as listeners take the music in through two audio channels, so do the performers, which influences their perception of how the music sounds, which influences how they perform, and so on. As others discuss in this thread, there is an exacting art involved in applying effects to recreate a sense of acoustic space from the dry tracks of individual musical performances. Which is a long way of saying: it is unclear why the consumer would want those dry individual tracks, when the performers themselves want the effects applied.
I like Kristin Hersh's description of studio recording:
"What we did was, we took the song apart, played each piece separately, then stuck the pieces back together. In my opinion, it didn't work. We'd torn the song's limbs off, sewed them back on and then asked it to walk around the room. Of course, there's no song left anymore; we killed it."
Same here. I'm guessing there are two reasons. One, legacy: that in the early days of digital sound, composing tracks in real-time on end-user machines was too computationally expensive. Two, IP: having separate channels for every instrument or voice makes it easier to play with and remix the music, which is something IP holders don't want.
I agree with these sibling comments about why this isn't common.
But I also really love to see separated tracks as well! My favorite version of this is the stem player from Lawrence: https://stems.lawrencetheband.com/ - I wish more artists would do this.
Most people in this thread talk about effects. That’s just part of the story.
Another part is the level/gain of each instrument during each part of the piece. Remixing the levels can completely change a song. You can get a small taste of this by experimenting with extreme EQ levels on a parametric EQ.
Take one of your favorite songs and change the EQ levels to emphasize different sections.
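If you want to try this programmatically, a parametric EQ band is just a biquad filter. A sketch using the standard RBJ audio-EQ-cookbook peaking filter (numpy/scipy assumed; the frequency and boost values are arbitrary examples):

    import numpy as np
    from scipy.signal import lfilter

    def peaking_eq(x, sr, f0, gain_db, q=1.0):
        """Boost or cut a band around f0 (RBJ cookbook peaking filter)."""
        A = 10.0 ** (gain_db / 40.0)
        w0 = 2.0 * np.pi * f0 / sr
        alpha = np.sin(w0) / (2.0 * q)
        b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
        a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
        return lfilter(b / a[0], a / a[0], x)

    # E.g. an extreme +12 dB boost around 3 kHz to drag the vocals forward.
    audio = np.random.default_rng(0).standard_normal(44100)  # stand-in signal
    boosted = peaking_eq(audio, 44100, 3000.0, 12.0)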
Sometimes an original release vs. a “2019 remaster” (or similar) has such differences, and you can hear parts of a piece differently than before, almost like hearing new layers that were always there but not emphasized.
They exist, and if you know where to look you will find many of them, almost always not legal. The storage requirements are vast and you will need some pretty specialized hardware to do playback. The biggest reason why stuff is downmixed and flattened is because of our historical distribution media: radio, tape, records, in theory you could distribute the masters but most people would have no use for them.
Some of the youtubers that talk about and analyze music seem to be able to access multi-track recordings of various well-known songs, I was always wondering where they got them from.
Most of them are ripped from Guitar Hero/Rock Band, because the games needed to mute individual instruments when you were not playing the correct notes. (With 10+ games and lots of DLC, there are actually a lot of songs included.)
There was also a service called Jammit, marketed to musicians, where you could buy multitracks from their partnered artists. There are also artists sharing them in deluxe editions of albums; for example, I think Dream Theater did it for Black Clouds and Silver Linings?
There's also the possibility of just knowing someone in the industry that could have access to them, maybe? I know Rick Beato who does song analyses on YouTube is a producer, so he could get some this way.
There was a great channel on YouTube dedicated to sharing these isolated tracks called digital split, unfortunately it got banned from YouTube twice, so I don't think it's coming back...
It wouldn't surprise me if popular channels got access to the raw tracks from the studio (for whatever the normal license terms are to use the full copyrighted song in your video - a share of the advertising).
Popular is key - you need to get popular doing analysis of work nobody has heard of first. Once you are popular on YouTube everyone wants you to advertise for them.
They can make their way into the public's hands in a couple of different ways:
1) The multi-track recordings are deliberately released by an artist who wants to encourage others to remix their music. Sometimes it's just the instrumental. Sometimes it's only the acapella vocals, and other times they release the entire multi-track, or a simplified version of it.
2) There are certain audio tricks that can be applied to separate certain components from a recording, like isolating the L or R channels. If you have an instrumental recording, you can use phase cancellation to nullify the instrumental against the full recording and end up with just the lead vocals (see the sketch after this list). This technique has its limitations and isn't always perfect.
3) Box sets for popular albums sometimes deliberately include isolated tracks. The Beach Boys' 'Pet Sounds' box set is noteworthy in this way.
4) Audio engineers and producers with access to a famous song session or tape reel will copy and leak the entire multi-track recording. This is usually illegal. Nonetheless, these are highly sought after by audio engineers and producers, who trade them amongst themselves for the insights they offer into the recording process.
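A minimal sketch of the phase-cancellation trick from 2) (numpy and soundfile assumed, with made-up file names; in practice the two files must be sample-aligned and come from the same master, or the subtraction leaves heavy artefacts):

    import numpy as np
    import soundfile as sf

    full, sr = sf.read("full_mix.wav")
    inst, _ = sf.read("instrumental.wav")

    # Trim to a common length, then subtract: whatever the two versions
    # share cancels out, leaving (approximately) the vocals.
    n = min(len(full), len(inst))
    vocals = full[:n] - inst[:n]

    sf.write("vocals_estimate.wav", vocals, sr)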
Rick Beato’s channel is great for this; he’s got multi-tracks for lots of music from the ’60s/’70s and really breaks them down.
Here’s one episode where he breaks down “Ramble On” by Led Zeppelin... he says it himself, but you can just feel how much he loves the song as he goes through it:
https://github.com/etzinis/sudo_rm_rf - SUccessive DOwnsampling and Resampling of Multi-Resolution Features, which enables a more efficient way of separating sources from mixtures
Wow, great guide... I just skimmed it. iZotope RX, which is mentioned in the appendix, is amazing, and I use it to separate noise from vocals for my podcast.
One of the authors here (Justin). Thanks for the feedback!
The goal of this tutorial is to show how you can use open-source tools to create your own RX :) For example, you can train it to remix/separate a specific set of instruments you care about. Cheers!
Deezer does not offer an official hosted service, but there are unofficial services like Acapella Extractor [4], Melody ML [5], and Moises [6].
[1] https://news.ycombinator.com/item?id=21431071
[2] https://news.ycombinator.com/item?id=23228539
[3] http://archives.ismir.net/ismir2019/latebreaking/000036.pdf
[4] https://www.acapella-extractor.com/
[5] https://melody.ml/
[6] https://moises.ai/