Agreed. Actually, I want to know how long it takes to run on, say... an i7; I've gotten pretty poor performance out of those Amazon small instances when doing a lot of CPU-oriented tasks.
Let's pessimistically assume you have to read all information associated with each file from disk. I suspect this sort of query is IO-bound, since processing is minimal, and assuming we can mostly avoid fragmentation on disk, we could feasibly read data at 100MB/s with a consumer-level magnetic drive. The article says the data set is 300GB, so we could plausibly get this done in under an hour.
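A quick back-of-envelope check of that estimate (the 100MB/s sustained sequential read rate is an assumption, not a benchmark):

    # rough IO-bound estimate: time to stream the whole 300GB dataset
    dataset_bytes = 300 * 10**9   # ~300GB
    read_rate = 100 * 10**6       # assumed 100MB/s sequential read
    seconds = dataset_bytes / read_rate
    print(seconds / 60)           # ~50 minutes, so "under an hour" checks out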
So... MapReduce? Kinda figured that when you had six zeros after your first digit.
This looks like a fun project, but I can't help but feel like Hadoop experience reports are a little late to the party at this point. Is there anyone out there who doesn't immediately think MapReduce when they see numbers at scale like this? If anything, the tool is overused, not neglected.
It's a well-written tutorial on using MapReduce to process a specific large music dataset. Now someone can jump right in and start writing their own processing code without worrying about all the boring details.
Maybe the headline is a bit misleading, but I find the post interesting and valuable.
MapReduce wasn't my first thought, but that has more to do w/ my inexperience than anything else.
The cost, though... that was shocking! It comes out to about $42/month to host the data and about $2 per analysis run. That's very affordable if you're producing results someone might want to pay for.
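For what it's worth, that $42/month figure is consistent with simple object-storage pricing for a 300GB dataset (the per-GB rate below is an assumption for illustration, not a quoted price):

    # sanity check on the $42/month hosting figure
    dataset_gb = 300
    assumed_rate_per_gb_month = 0.14   # illustrative S3-style storage rate
    print(dataset_gb * assumed_rate_per_gb_month)   # ~$42/month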
Can anyone clarify whether song data is dense? If it is dense, I'm not sure MapReduce is the right paradigm to use, mainly because you will eventually get to a situation where transfer time overwhelms compute time.
The EN song data is dense in the sense that there are far more "columns" than rows in almost any bulk analysis -- the average song unpacks to ~2000 segments, each with ~30 coefficients, plus global features.
However, in Paul's case here he's really just using MR as a quick way to run a parallel computation on many machines. There's no reduce step; it's just taking a single average from each individual song, not correlating anything or using any inter-song statistics.
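Those per-song numbers are at least in the same ballpark as the 300GB figure quoted above, as a back-of-envelope (assuming 8-byte floats, purely illustrative):

    # per-song payload implied by ~2000 segments x ~30 coefficients
    segments, coeffs, bytes_per_value = 2000, 30, 8   # bytes_per_value is an assumption
    per_song = segments * coeffs * bytes_per_value
    print(per_song / 1e6)           # ~0.48 MB per song
    print(per_song * 10**6 / 1e9)   # ~480 GB over a million songs, same order as 300GB on disk

And "MR with no reduce step" in practice just means a map-only job. A minimal sketch of what such a mapper could look like with Hadoop Streaming -- the input layout and field names here are made up for illustration, not Paul's actual code:

    #!/usr/bin/env python
    # hypothetical map-only streaming mapper: emit one average per song
    # assumes each input line looks like "track_id<TAB>comma-separated segment values"
    import sys

    for line in sys.stdin:
        track_id, _, values = line.rstrip("\n").partition("\t")
        nums = [float(v) for v in values.split(",") if v]
        if nums:
            # no reducer needed: each song's result is independent
            print("%s\t%f" % (track_id, sum(nums) / len(nums)))

Run it with the reduce count set to zero (mapred.reduce.tasks=0 on the Hadoop versions of that era) and the per-song averages go straight to the output.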
I saw that, but how do you know you need 100 instances? Is there a way to estimate the optimal number, or at least to set boundaries such as a max price or max processing time?
Do you get back info such as how long each instance took to do its job?
Since it scales roughly linearly, you can just estimate it yourself.
And since you pay per minute on EC2, you can either choose to have 10 instances compute for a minute or 1 instance for 10 minutes. Either way, you pay for 10 minutes of processing time.
P.S. Simplified, but probably in a reasonable ballpark.
It's actually rounded up to complete hours, so 10 instances for a minute is 10x the price of 1 instance for 10 minutes.
For fast iterative development this isn't an issue - you can re-use the job flow (the instances you spun up), so you can launch the job several times over the course of that hour.
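To make the hourly-rounding difference concrete, a toy calculation (the per-hour rate is an assumed small-instance price, just for illustration):

    # toy EC2 cost comparison under hour-granularity billing
    import math

    rate_per_hour = 0.085   # assumed small-instance rate, illustration only

    def cost(instances, minutes):
        # each instance is billed for ceil(minutes / 60) full hours
        return instances * math.ceil(minutes / 60.0) * rate_per_hour

    print(cost(10, 1))    # 10 instance-hours -> $0.85
    print(cost(1, 10))    # 1 instance-hour  -> $0.085, i.e. 10x cheaper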
There may be some info returned, but I didn't dig into it too much. Once I figured out I could run through the whole set in less than an hour for less than 10 bucks, I had all the info I needed.