Million Song Dataset (millionsongdataset.com)
196 points by commons-tragedy on Oct 29, 2019 | 39 comments



I did my master's thesis (2017) using this dataset. I trained a neural network to predict musical features from the raw audio of the songs.

Unfortunately the 7digital ids were out-of-date, so in order to get access to the audio (30 second clips) I had to email another researcher who'd recently published work using the audio data and politely ask them to rsync me the audio XD


For anyone else who was curious about the song selection process:

<quote>

How did you choose the million tracks?

Choosing a million songs is surprisingly challenging. We followed these steps:

1. Getting the most 'familiar' artists according to The Echo Nest, then downloading as many songs as possible from each of them
2. Getting the 200 top terms from The Echo Nest, then using each term as a descriptor to find 100 artists, then downloading as many of their songs as possible
3. Getting the songs and artists from the CAL500 dataset
4. Getting 'extreme' songs from The Echo Nest search params, e.g. songs with highest energy, lowest energy, tempo, song hotttnesss, ...
5. A random walk along the similar artists links starting from the 100 most familiar artists

The number of songs was approximately 8950 after step 1), step 3) added around 15000 songs, and we add approx. 500000 songs before starting step 5. For more technical details, see "dataset creation" in the "code" tab.

</quote>[1]

What I really wanted to know was if it was a worldwide-music dataset or a more narrowly focused one. My guess based on the above is that it's mostly English-language music, mostly American - can someone who's worked with the data confirm/deny that?

[1] http://millionsongdataset.com/faq/#how-did-you-choose-millio...
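
Step 5 of the quoted FAQ (a random walk along similar-artist links) could look roughly like the sketch below. The graph, the artist names, and the function `random_walk_artists` are all made up for illustration. The real crawl went through The Echo Nest and is described under "dataset creation" on the site, so treat this only as a picture of the idea.

```python
import random

def random_walk_artists(similar, seeds, steps, rng):
    """Walk the similar-artist graph and return every artist visited.

    similar: dict mapping artist -> list of similar artists (hypothetical data)
    seeds:   list of starting artists (the FAQ starts from the 100 most familiar)
    """
    visited = set(seeds)
    current = rng.choice(seeds)
    for _ in range(steps):
        neighbours = similar.get(current, [])
        if not neighbours:
            # Dead end: restart from a random seed artist.
            current = rng.choice(seeds)
            continue
        current = rng.choice(neighbours)
        visited.add(current)
    return visited

# Toy graph; all names are placeholders.
similar = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": [],
}
found = random_walk_artists(similar, seeds=["A"], steps=50, rng=random.Random(0))
print(sorted(found))
```

A walk like this only reaches artists connected to the seeds, which is one reason the answer to "worldwide or mostly American?" depends heavily on where the walk starts.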


I think this dataset is probably, unfortunately, most famous for an incredibly flawed but very headlineable attempt, which you've almost certainly seen somewhere, by a group of researchers from AI and related fields (none of whom had musical qualifications) to "objectively" determine if music has gotten worse over the decades. As usual, they arrived at the conclusion that it did by computing vague numbers ("timbral diversity" and "harmonic complexity") for each song, and then showing a scary graph of those numbers going down over time.

Not sure what my point is exactly. I guess it's just another reminder of how easy it is to arrive at any conclusion you want using complex algorithms on big data.


For virtually any kind of art, newer artworks will on average be worse than _surviving_ older artworks: the older artworks have undergone a selection process that weeded out the less popular ones.

This is IMO a more fundamental problem with any comparisons across long time intervals.
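
To make the selection effect concrete, here is a small simulation. It is a sketch under two loud assumptions that are mine, not from any study: every work gets a "quality" score from the same distribution regardless of era, and only the top 10% of old works survive.

```python
import random
import statistics

rng = random.Random(42)

# Assumption: old and new works draw "quality" from the same distribution.
new_works = [rng.gauss(0, 1) for _ in range(10_000)]
old_works = [rng.gauss(0, 1) for _ in range(10_000)]

# Assumption: only the top 10% of old works are still remembered today.
cutoff = sorted(old_works)[int(0.9 * len(old_works))]
surviving_old = [q for q in old_works if q >= cutoff]

print("mean quality, all new works:      ", round(statistics.mean(new_works), 2))
print("mean quality, all old works:      ", round(statistics.mean(old_works), 2))
print("mean quality, surviving old works:", round(statistics.mean(surviving_old), 2))
```

Even though both eras are identical by construction, the surviving old works score far higher on average than the full set of new works.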


Selection bias is also apparent in ranking sites for shows where seasons/sequels are ranked individually.

For example in anime, Gintama appears 8 times within the top 50: https://myanimelist.net/topanime.php

It's not because it's a popular show. It's just that people who didn't like the first few episodes have already stopped watching! And it's polarizing enough that the only people who stick around for so many seasons are the ones who really love it. So it will get rated a 10 even for mediocre content.


True for a naive approach, but I'd think looking at the chart toppers for, say, each week would give decently comparable data.


I actually did something similar! I wrote a small blog post about it here https://www.popnalysis.com/blog/lyrics-over-time/

But I came to the same conclusion as the op comment. Music can't really be judged by any one metric! But that doesn't mean that you don't gain insight into how music has evolved!

Also, I released the entire lyrics dataset that I scraped (about 500k) for free.


Yes, assuming you can find all the old chart toppers (and that this term has a consistent meaning over the whole time interval you're looking at). If, understandably, you have some gaps in old chart toppers, you need to assume something about them (probably that they were worse than all surviving ones, which helps if you're doing percentile-based comparisons).


I wonder if digital recording and processing has made music cleaner leading to less harmonic garbage.

Also... What if the poetry is better, and someone is singing a cappella? How do you capture that?


Yeah, this is really the core issue, even within the music.

As a thought experiment, let's say someone invented a brilliant revolutionary new complex rhythm. Everyone went wild using it for a year. Then, once everyone knew it, artists started to mix it up by leaving out parts of it, leaving it implied, relying on listeners' familiarity with the rhythm to make things work.

If you now tried to naively measure the amount of rhythmic complexity per song by counting percussion hits or similar, you'd see complexity take a nosedive. You'd also see people who missed out on the year complain about how bland the new rhythms are. But the songs actually got more complex. It's just that the complexity is only apparent to people familiar with the hypothetical revolutionary rhythm.

At the same time, people familiar with the new music will look back at the old music and be incredibly bored. They're used to finding enjoyment in the complexity of the implied rhythm, but there's just nothing there, it's all painfully spelled out and predictable.


[flagged]


No, I think the poster is bothered by how absurd and unscientific such a pursuit is. It proposes that perhaps the people who made this dataset don't actually know anything about music, and are just nerds. This behavior is a common form of malpractice in composing data sets such as these.


"Worse" might be subjective but if they analyzed the data and found decreasing complexity of harmony and song structure, then that's not what I would call "unscientific".

For the past 30 years music has definitely become more "bland" (for lack of a better word). The data backs this up. Whether you like that or not is subjective.


If all you listen to is the radio, sure. However in the broader sense of music being created, this is patently false. There are more genres and subgenres of music than ever before, experimentation and synthesis techniques weaving together ever more intricate and sophisticated timbres.

Movies are a better example of a medium getting bland, but there is a material reason for that: production costs versus return on investment lead to less risky films being made.

Music doesn't suffer from that same problem; in fact, 'good' music can be made for nothing except the composer's time.


I am talking about popular music since the vast majority of people do only listen to the radio.

Of course there's a lot of underground and independent music being made in many genres. It just doesn't have much of an audience.


This is such nonsense it's just incredible that people with such seemingly strong conversational strengths in music could say something so perversely false.

If you are looking at the kind of music that makes it onto Top 40 radio stations, then this is undoubtedly true. But that's not what is meant when you say

> For the past 30 years music has definitely become more "bland"

When you suggest that, you appear naive and ignorant. Because it means you don't know about Tyshawn Sorey [0], Mary Halvorson [1], Anthony Braxton [2], John Zorn [3], etc.

Pop music, a tiny sliver of all the important music being composed and performed today, has gotten progressively shittier. Fine, I'll grant you that. But who cares? That's not what we're talking about when examining the vast complexities being propagated and explored by today's musical masters. We are in a golden age of harmonic and compositional innovation.

[0] https://www.youtube.com/watch?v=Rqw-HE9szrs
[1] https://www.youtube.com/watch?v=PEfK0ZfbLRA (featuring Nels Cline, who is actually much more important for his participation in the NYC avant-garde than for his involvement with Wilco)
[2] https://www.youtube.com/watch?v=HNJfU0AJK2c
[3] https://www.youtube.com/watch?v=RYis3q0vv7g


I don't even think it's necessary to point out specific examples, though you're welcome to do so. All it really takes is showing the kind of numbers music databases like discogs (for both generalist and specialist consumers alike) are posting - millions of records added PER YEAR [1]. This alone shows "a million contemporary popular music tracks" is not representative of modern music's "state" whatsoever.

[1] https://blog.discogs.com/en/discogs-year-end-report-2018/


> Pop music, a tiny sliver of all the important music being composed and performed today, has gotten progressively shittier. Fine, I'll grant you that. But who cares?

I guess if by "pop music" you mean everything that gets played on the radio, gets aggressively promoted by the recording industry, and has the widest audience I'd argue that many people should care. It's great that there are awesome musicians who are putting out amazing music in their basements or at local shows but if 99.99% of the population will never hear it does it matter that what they will hear is watered down crap? I'd argue that it does. Even your own examples show only a few thousand views. Compare that to what a currently popular song (https://www.youtube.com/watch?v=zlJDTxahav0) has seen in views over just one week. The song isn't even bad, but I can see how it could be described as "simple".


I wouldn't say pop music has gotten worse. Maybe popular music, but not pop as a genre. There's a lot of great pop out there.


Thanks for mentioning Wilco, because I wouldn't just stop at harmonic and compositional innovation. Songwriters and big-tent performers are as good as they ever were as well. I had the pleasure of seeing them Oct 8 here in Toronto and it was fantastic. I'd also mention Willie Nelson/Avett Brothers/Matt Mays and Tame Impala all this summer. I'd dreamed of making it down to Red Rock, CO to see Jason Isbell last month, but it didn't materialize.

There's no shortage of worthy and soulful music. And it's more accessible than ever.

Maybe there's just more music in general, and that is easily mistaken as a consensus rather than just all-around growth.


I was pointing out that Wilco is a poor example of what I am talking about.


I'm sorry, where did you say that Wilco was a poor example of anything?

It sounds like you didn't read my comment in full.


There has always been underground music and there always will be (thankfully). I'm talking about mainstream music, which most definitely has become simplified in the past 30 years.


It's tempting to think one can come up with objective measures of musical complexity, but the problem is the number of dimensions musical complexity exists in:

Many people who listen to classical (in the more general sense of the word that encompasses baroque and romantic eras as well) dislike folk music as being too simple. They are listening to harmonic complexity - the changing of chords over time. Irish music usually has two or three chords repeated in a predictable pattern. But it has micro complexity - subtleties of emphasis, timing, melodic variation, grace notes, etc. - that make it, to my mind, far more complex than it is given credit for. These intricacies are particularly difficult to measure objectively.

Other kinds of music, like EDM, rap, etc. have rhythmic complexity. Fans of this music often think classical music is boring because classical tends to be rhythmically predictable.

Within classical music, there is Classical (Mozart era), which values balanced form, strictly following structures like AABA, with the complexity hiding within the endings and transitions between repeated sections.

This is very different from baroque (Bach, etc.), which often has rambling forms, and lots of micro complexity - dazzling changes of chords and keys, very intricate harmonies.

I won't even go into modern "serious" music, which stretches the definition of music to its breaking point.

The point being, there are as many forms of complexity possible as there are variables you can control in creating a piece of music.

I suspect this actually brushes up on fundamental limits of measuring complexity such as the Halting Problem in a currently only vaguely defined sense.

Edit:

To give a more direct listing of variables an algorithm might want to attempt to measure complexity in, off the top of my head, we have:

Chords, melody, polyphony, timbre, phrase structure, rhythm, rubato, dynamics, movement/song structure, whole piece/album structure, rhyme scheme, assonance, alliteration, onomatopoeia, vocabulary size, word repetition, references to other musical or verbal phrases in other works...

I mean, people get PhDs in analyzing this stuff.

Edit again:

Here's one really specific example that highlights both the fun and difficulty of doing this stuff algorithmically: there's a piece Bach wrote where one person reads the page right side up, and the other reads it upside down (starting from the other end, all notes inverted), and it works out. Noticing patterns like that would take a highly sophisticated algorithm akin to what is used to analyze genetic sequences.
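
The easy half of that example, producing the part the second reader sees, is a one-liner. This is a toy sketch: the melody is made up (not the actual piece), notes are encoded as staff positions, and durations, clefs and accidentals are ignored.

```python
def upside_down(staff_positions):
    """Part seen by a reader across the table: order reversed, positions mirrored.

    staff_positions: list of ints, steps above (+) or below (-) the middle staff line.
    """
    return [-p for p in reversed(staff_positions)]

def is_upside_down_pair(part_a, part_b):
    """True if part_b is exactly part_a read from the other side of the page."""
    return part_b == upside_down(part_a)

melody = [0, 2, 4, 2, -1, -3]      # made-up melody
other_part = upside_down(melody)
print(other_part)                   # [3, 1, -2, -4, -2, 0]
print(is_upside_down_pair(melody, other_part))
```

The hard half is the one the comment points at: finding such a relationship inside a large score without being told where to look, and checking that the two parts still sound good together.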


Very well put!


I'm actually kind of going to agree with you.

First of all, I don't like the term "bland", because it implies a judgement. I think a better word is "sparse".

But now, the question, and the thing the researchers got fatally wrong, is: does sparseness really mean less complexity? When you phrase it like that, it sounds obviously silly, to me at least. You wouldn't judge, say, Picasso for not using enough colors, and complain about his work being so bland. To continue on that path, Picasso only works because you know what things actually look like. Instead of having to painfully spell out every last facial hair, he relies on you knowing what a human looks like, and uses the space gained to express his intent more clearly. In comparison, older art might seem boring and predictable. I already know what a face looks like, so why waste space showing me?

Modern music works on a similar principle. The sparseness of modern music is enabled by listeners' familiarity with other music, which lets composers merely imply their intent and leave our brains to fill in the rest, making space to innovate in other areas of the music.


Very interesting. I probably should have clarified that I was talking about popular music and not niche genres. I actually think the effect is often less about sparseness bringing the composer's vision to light and more to do with the trend toward group songwriting.

For instance Beyonce had 72 songwriters [1] listed on her Lemonade album. Compare that with Madonna [2] (and other contemporaries) who often write their own music or co-write a song with one other person.

1. https://www.thedailybeast.com/does-beyonce-write-her-own-mus...

2. https://www.quora.com/How-much-of-her-music-does-Madonna-act...


There is nothing that makes music inherently "good" or "bad". Music is an artistic expression, and different values can be obtained from it on a person to person basis. Just because music is complex does not mean it is guaranteed to be popular.

That is exactly what the researchers were doing.

They obviously do not understand music, its values, or its influence on society.


And good or bad is not related to popular.


I was so excited, until I realized the list hadn't been updated in 8 years... :|


Probably still good if you want to practice with machine learning on music.


I geolocated the artist list: https://geocode.xyz/874101666267029,share?export=GeoCluster

Next up, geoparse all song lyrics and compare those locations to the artists'.


That's cool, perhaps also link it from a centralized location like Kaggle? https://www.kaggle.com/datasets When I look for datasets, I first go to a place like that.


Found this dataset on Kaggle: https://www.kaggle.com/c/msdchallenge/data


Used this dataset in a university project to try to predict genres from a number of features.

https://medium.com/modeling-music


In case this is not clear: the dataset does not include the raw audio of the songs, just extracted features.


Does this data set have a collection of chords in text? I'd love something like it.


The dataset is a bunch of SQLite files, so shouldn't be too tricky to interrogate.

You can get a subset (static.echonest.com/millionsongsubset_full.tar.gz), which is 1.8 GB compressed. The full dataset is 280 GB and, AFAICT, does not contain the full audio.

There's a script on GitHub from like 8 years ago that apparently can get you the audio (but I would be super-impressed if that actually still works).

You can see a description of one song here: http://millionsongdataset.com/pages/example-track-descriptio...
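
Taking the comment at its word that the files are SQLite, a first look could go like the sketch below. It runs against an in-memory database so it is self-contained. The `songs` table, its columns, and the rows are placeholders I invented, so check what step 1 prints for the real files before writing any queries.

```python
import sqlite3

# Stand-in for one of the dataset's files; swap in the real path.
conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, created here only so the sketch runs.
conn.execute(
    "CREATE TABLE songs (track_id TEXT, title TEXT, artist_name TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [("TR0001", "Song A", "Artist X", 1987),
     ("TR0002", "Song B", "Artist Y", 2005),
     ("TR0003", "Song C", "Artist X", 1991)],
)

# Step 1: discover which tables the file actually contains.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)

# Step 2: inspect a table's columns before writing real queries.
columns = [row[1] for row in conn.execute("PRAGMA table_info(songs)")]
print(columns)

# Step 3: an example aggregate, songs per artist.
per_artist = conn.execute(
    "SELECT artist_name, COUNT(*) FROM songs "
    "GROUP BY artist_name ORDER BY artist_name"
).fetchall()
print(per_artist)
conn.close()
```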


It looks like they have data about many of the individual notes in the song. I wonder if it could be possible to turn that data back into some sort of horrible midi version.
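
A rough sketch of how that could work, assuming each segment comes with a start time and a 12-value pitch-class (chroma) vector. Those inputs are my assumption about the layout, and the function name and sample values are invented. Since chroma has no octave information, every note lands in one guessed octave, which is probably where the "horrible" comes from.

```python
def segments_to_notes(starts, chroma, end_time, base_note=60):
    """Turn per-segment chroma vectors into (midi_note, start, duration) tuples.

    starts:    segment start times in seconds
    chroma:    one 12-value list per segment (strength of C, C#, ..., B)
    end_time:  end of the last segment in seconds
    base_note: MIDI number used for pitch class 0 (60 = middle C); the octave
               is a guess, since chroma carries no octave information.
    """
    notes = []
    for i, (start, vector) in enumerate(zip(starts, chroma)):
        # Keep only the strongest pitch class of the segment.
        pitch_class = max(range(12), key=lambda k: vector[k])
        stop = starts[i + 1] if i + 1 < len(starts) else end_time
        notes.append((base_note + pitch_class, start, stop - start))
    return notes

def one_hot(pitch_class):
    """Helper for the demo: a chroma vector with a single dominant pitch class."""
    return [1.0 if k == pitch_class else 0.1 for k in range(12)]

# Made-up data: three segments outlining C, E, G.
notes = segments_to_notes(
    starts=[0.0, 0.5, 1.0],
    chroma=[one_hot(0), one_hot(4), one_hot(7)],
    end_time=2.0,
)
print(notes)   # [(60, 0.0, 0.5), (64, 0.5, 0.5), (67, 1.0, 1.0)]
```

Writing the tuples out as an actual MIDI file would need a MIDI library, which is left out here.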


i just got rick rolled by technical documentation



