Agreed. Actually, I want to know how long it takes to run on, say... an i7; I've gotten pretty poor performance out of those Amazon small instances when doing a lot of CPU-oriented tasks.
Let's pessimistically assume you have to read all information associated with each file from disk. I suspect this sort of query is IO-bound, since processing is minimal, and assuming we can mostly avoid fragmentation on disk, we could feasibly read data at 100MB/s with a consumer-level magnetic drive. The article says the data set is 300GB, so we could plausibly get this done in under an hour.
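A quick back-of-envelope check of that estimate (the 100MB/s sustained sequential read rate is an assumption, not a benchmark):

    # rough IO-bound estimate: time to stream the whole 300GB dataset
    dataset_bytes = 300 * 10**9   # ~300GB
    read_rate = 100 * 10**6       # assumed 100MB/s sequential read
    seconds = dataset_bytes / read_rate
    print(seconds / 60)           # ~50 minutes, so "under an hour" checks out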
So... MapReduce? Kinda figured that when you had six zeros after your first digit.
This looks like a fun project, but I can't help but feel like Hadoop experience reports are a little late to the party at this point. Is there anyone out there who doesn't immediately think MapReduce when they see numbers at scale like this? If anything, the tool is overused, not neglected.
It's a well-written tutorial on using MapReduce to process a specific large music dataset. Now someone can jump right in and start writing their own processing code without worrying about all the boring details.
Maybe the headline is a bit misleading, but I find the post interesting and valuable.
MapReduce wasn't my first thought, but that has more to do w/ my inexperience than anything else.
The cost, though... that was shocking! It comes out to about $42/month to host the data and about $2 per analysis run. That's very affordable if you're producing results someone might want to pay for.
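For what it's worth, that $42/month figure is consistent with simple object-storage pricing for a 300GB dataset (the per-GB rate below is an assumption for illustration, not a quoted price):

    # sanity check on the $42/month hosting figure
    dataset_gb = 300
    assumed_rate_per_gb_month = 0.14   # illustrative S3-style storage rate
    print(dataset_gb * assumed_rate_per_gb_month)   # ~$42/month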
Can anyone clarify whether song data is dense? If it is dense, I'm not sure MapReduce is the right paradigm to use, mainly because you will eventually get to a situation where transfer time overwhelms compute time.
The EN song data is dense in the sense that there are far more "columns" than rows in almost any bulk analysis -- the average song unpacks to ~2000 segments, each with ~30 coefficients, plus global features.
However, in Paul's case here he's really just using MR as a quick way to run a parallel computation on many machines. There's no reduce step; it's just taking a single average from each individual song, not correlating anything or using any inter-song statistics.
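Those per-song numbers are at least in the same ballpark as the 300GB figure quoted above, as a back-of-envelope (assuming 8-byte floats, purely illustrative):

    # per-song payload implied by ~2000 segments x ~30 coefficients
    segments, coeffs, bytes_per_value = 2000, 30, 8   # bytes_per_value is an assumption
    per_song = segments * coeffs * bytes_per_value
    print(per_song / 1e6)           # ~0.48 MB per song
    print(per_song * 10**6 / 1e9)   # ~480 GB over a million songs, same order as 300GB on disk

And "MR with no reduce step" in practice just means a map-only job. A minimal sketch of what such a mapper could look like with Hadoop Streaming -- the input layout and field names here are made up for illustration, not Paul's actual code:

    #!/usr/bin/env python
    # hypothetical map-only streaming mapper: emit one average per song
    # assumes each input line looks like "track_id<TAB>comma-separated segment values"
    import sys

    for line in sys.stdin:
        track_id, _, values = line.rstrip("\n").partition("\t")
        nums = [float(v) for v in values.split(",") if v]
        if nums:
            # no reducer needed: each song's result is independent
            print("%s\t%f" % (track_id, sum(nums) / len(nums)))

Run it with the reduce count set to zero (mapred.reduce.tasks=0 on the Hadoop versions of that era) and the per-song averages go straight to the output.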
I saw that, but how do you know you need 100 instances? Is there a way to estimate the optimal number, or at least to set boundaries such as a max price or max processing time?
Do you get back info such as how long each instance took to do its job?
Since it scales roughly linearly, you can just estimate it yourself.
And since you pay per minute on EC2, you can either choose to have 10 instances compute for a minute or 1 instance for 10 minutes. Either way, you pay for 10 minutes of processing time.
P.S. Simplified, but probably in a reasonable ballpark.
It's actually rounded up to complete hours, so 10 instances for a minute is 10x the price of 1 instance for 10 minutes.
For fast iterative development this isn't an issue - you can re-use the job flow (the instances you spun up), so you can launch the job several times over the course of that hour.
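To make the hourly-rounding difference concrete, a toy calculation (the per-hour rate is an assumed small-instance price, just for illustration):

    # toy EC2 cost comparison under hour-granularity billing
    import math

    rate_per_hour = 0.085   # assumed small-instance rate, illustration only

    def cost(instances, minutes):
        # each instance is billed for ceil(minutes / 60) full hours
        return instances * math.ceil(minutes / 60.0) * rate_per_hour

    print(cost(10, 1))    # 10 instance-hours -> $0.85
    print(cost(1, 10))    # 1 instance-hour  -> $0.085, i.e. 10x cheaper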
There may be some info returned, but I didn't dig into it too much. Once I figured out I could run through the whole set in less than an hour for less than 10 bucks, I had all the info I needed.