I applaud the effort, but is this really "big data"? The largest data sets they seem to test are ~150GB; that would fit comfortably on my MacBook Pro several times over. Many of the systems being tested are designed to scale efficiently once the data pushes past 5TB, so I am dubious about the median response time results: things that work well for small datasets (where small is defined as < 1TB) easily fall apart when you scale them up a little bit more.
These technologies aren't useful just for storing data. Yes, you can do that on your MacBook. They are useful when you need answers quickly or need to perform complex analysis. For example, Spark (the engine behind Shark) lets you run things like logistic regression at scale in just a few seconds. As far as I know, you can't load 150GB of data into memory in R on your MacBook and then run a logistic regression a few seconds later.
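To give a sense of what I mean, here's the classic iterative logistic regression pattern in PySpark. It's a minimal sketch, not production code: the file path, schema, and iteration count are all made up. The point is that the data set gets cached in memory once, so every pass after the first takes seconds instead of a full re-read from disk.

    # Minimal sketch (hypothetical path and schema): iterative logistic
    # regression over an RDD that is cached in memory across iterations.
    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="lr-sketch")

    # Assume whitespace-delimited lines: a +1/-1 label followed by features.
    def parse(line):
        nums = [float(x) for x in line.split()]
        return nums[0], np.array(nums[1:])

    points = sc.textFile("hdfs:///data/points.txt").map(parse).cache()

    D = len(points.first()[1])   # number of features
    w = np.zeros(D)              # weight vector

    for _ in range(10):          # each iteration is one fast in-memory pass
        grad = points.map(
            lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
        ).reduce(lambda a, b: a + b)
        w -= grad

    print(w)

And the same cached RDD can be reused for the next analysis without another load, which is the whole "answers in seconds" point.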
Yeah, I was thinking that the first two data sets are sized around what I would see on my piddly little SQL Server. 100m rows is starting to get into the ballpark.
Big Data is less about the size and more about how you want to deal with the data. Straightforward structured queries are the purview of an RDBMS, whereas complex analytics is what "Big Data" frameworks like those evaluated here are for.
The author had a terrible brain cramp in the sentence "Redshift uses columnar compression which allows it to bypass a field which is not used in the query."
That totally confuses columnar compression with columnar I/O, an error I've been railing against for several years, e.g. in http://www.dbms2.com/2011/02/06/columnar-compression-databas... (I.e., ever since Oracle tried to popularize the confusion.) But this is a particularly bad instance.
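To make the distinction concrete, here's a toy illustration with made-up data (not any real engine's storage format): columnar I/O is what lets a query bypass columns it doesn't reference, while columnar compression is about encoding the values inside a single column more compactly. Two different things.

    # Toy column store, purely illustrative.
    table = {
        "pagerank":    [10, 10, 10, 12, 12, 15],
        "pageurl":     ["a", "b", "c", "d", "e", "f"],
        "avgduration": [3, 3, 3, 3, 7, 7],
    }

    # Columnar I/O: a query that only touches pagerank never reads the
    # bytes of pageurl or avgduration at all.
    def scan(columns):
        return {c: table[c] for c in columns}

    result = scan(["pagerank"])        # other columns are bypassed entirely

    # Columnar compression: within a single column, runs of repeated values
    # can be stored as (value, count) pairs. Fewer bytes, but it says nothing
    # about whether unused columns get read.
    def rle_encode(values):
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    print(rle_encode(table["avgduration"]))   # [[3, 4], [7, 2]]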
I also don't see a schema that is tuned for Redshift.
There is no description of a sort key or a distribution key (let alone which compression encodings were used).
And at least for the first query you could use UNLOAD instead of SELECT. Depending on how you are managing the data coming out of Redshift, it might be a reasonable solution, and it doesn't force all your data through the leader node, constrained by whatever client driver you are using to read the results.
Instead of trying to SELECT this much data out of Redshift, select it into another table or (again) use UNLOAD. Something like the sketch below.
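Roughly what I mean, as a sketch only: the table, columns, bucket, and credentials are all placeholders, not the benchmark's actual schema.

    # Hypothetical names throughout; the point is the DDL and UNLOAD shape.
    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="benchmark",
                            user="admin", password="...")
    cur = conn.cursor()

    # A tuned Redshift schema declares a distribution key, a sort key, and
    # per-column compression encodings, none of which are described.
    cur.execute("""
        CREATE TABLE rankings (
            pageurl     VARCHAR(300) ENCODE lzo,
            pagerank    INT          ENCODE delta,
            avgduration INT          ENCODE delta
        )
        DISTKEY (pageurl)
        SORTKEY (pagerank);
    """)

    # UNLOAD writes query results to S3 from the compute nodes in parallel,
    # instead of funneling every row through the leader node and the client
    # driver the way a big SELECT does.
    cur.execute("""
        UNLOAD ('SELECT pageurl, pagerank FROM rankings WHERE pagerank > 1000')
        TO 's3://my-bucket/rankings_out/'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        GZIP;
    """)

    conn.commit()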
Mesos is the foundation of the stack, and Spark started out as a research project because they needed something to run on Mesos. But you can also run Hadoop on Mesos, and you can run Spark and Hadoop on the same Mesos cluster. Twitter runs almost everything on Mesos and works directly with AMPLab on the project.
Spark Streaming replaces the need for Storm and handles failures/stragglers better. As Nathan Marz, the creator of Storm, said: "Spark is interesting because it extends MapReduce with a new primitive that allows Pregel to be built on top of it. So Spark is both Hadoop and Pregel" (http://nathanmarz.com/blog/thrift-graphs-strong-flexible-sch...).
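For the curious, the programming model is just the batch one applied to small time slices. A minimal sketch (host, port, and batch size are arbitrary):

    # Minimal Spark Streaming sketch: the same RDD-style operations, run on
    # 2-second micro-batches of a live socket stream.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-sketch")
    ssc = StreamingContext(sc, batchDuration=2)

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()            # emit each batch's counts

    ssc.start()
    ssc.awaitTermination()

Because each micro-batch is a deterministic RDD computation, lost partitions can simply be recomputed, which is where the better failure/straggler handling comes from.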
The creators, Matei Zaharia, Ion Stoica et al., raised a substantial amount. With Tachyon (an in-memory file system that supports lineage and is thus fairly robust) and more recently MLBase, one should look beyond Spark's excellent performance and consider the overall package and versatility it provides.
I think many people somewhat pointlessly get caught up in arguments about what constitutes big data and what doesn't. As someone who's used Spark substantially for machine learning as well as other complex types of processing beyond your standard joins, filters, etc., I find Spark incredibly useful even for a few GB of data because it allows one to iterate rapidly.
With MLBase and all, I think Spark will really have an impact because your average engineer will be able to run some standard ML algorithms out of the box at scale. That is huge. That's what matters with data (big or not) -- it's the insights you can gain.
Edited: typos. Also, some more on MLBase: http://www.mlbase.org, which will be released tomorrow if I am not mistaken.
Redshift is $0.43/TB-hour; HANA on AWS looks like it's around $59/TB-hour. You get a lot for the money (the HANA software, in-memory, tons more CPU), but your workload had better really need it at that price difference!
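Back-of-the-envelope, taking those two figures at face value: 59 / 0.43 ≈ 137, so HANA is roughly two orders of magnitude more expensive per TB-hour before you even start comparing performance.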