How I came to love big data (37signals.com)
71 points by ejpastorino on Nov 9, 2012 | 18 comments



Would love to see if indexes and a sane schema were used for the RDBMS case. I've built extremely large reporting databases (Dimensional Modeling techniques from Kimball) that perform exceedingly well for very ad hoc queries. If your query patterns are even somewhat predictable and occur frequently, it's far better to have a properly structured and indexed database than to use the "let's analyze every single data element on every single query!" approach that is implicit with Hadoop and MR.

Not to mention the massive cost-savings from using the right technology with a small footprint versus using a brute-force approach and a large cluster of machines.


"exceedingly well for very adhoc queries" vs "If your query patterns are even somewhat predictable and occur frequently"

Aren't those ... largely opposite? To say "very adhoc" i would anticipate that as largely meaning "not predictable". Also...can you perhaps quantify "very large"? How many terabytes? I've done a decent amount of work with the "new school" OLAP approaches (hadoop / mapreduce, etc.) and found them to work quite well especially in certain cases such as time series (think weblog analysis) where sequential scanning is a simplistic approach.


In a dimensional modeling approach you need to identify ahead of time which elements/attributes users will query on. This still supports ad hoc queries -- "Show me minutely clicks & revenue from 11am-1pm on Mondays in 2011, except holidays, for publishers (A, B) against advertisers (A, B, C)". As long as you define the grain of your data, adding new attributes for dimensions is very easy and flexible.
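
For a concrete sense of what that looks like, here's a minimal sketch against a hypothetical star schema (SQLite in memory; every table and column name here is made up for illustration, not a real reporting database):

    # Hypothetical star schema: one fact table plus small dimension tables.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, minute TEXT,
                               hour INTEGER, weekday TEXT, is_holiday INTEGER);
        CREATE TABLE dim_publisher (publisher_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_advertiser (advertiser_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_clicks (time_id INTEGER, publisher_id INTEGER,
                                  advertiser_id INTEGER, clicks INTEGER, revenue REAL);
    """)

    # "Minutely clicks & revenue, 11am-1pm, Mondays, excluding holidays,
    #  publishers (A, B) against advertisers (A, B, C)" becomes joins plus
    # filters on pre-modeled dimension attributes -- no scan of raw logs.
    query = """
        SELECT t.minute, SUM(f.clicks), SUM(f.revenue)
        FROM fact_clicks f
        JOIN dim_time t       ON t.time_id = f.time_id
        JOIN dim_publisher p  ON p.publisher_id = f.publisher_id
        JOIN dim_advertiser a ON a.advertiser_id = f.advertiser_id
        WHERE t.hour BETWEEN 11 AND 12
          AND t.weekday = 'Monday'
          AND t.is_holiday = 0
          AND p.name IN ('A', 'B')
          AND a.name IN ('A', 'B', 'C')
        GROUP BY t.minute
    """
    for row in con.execute(query):
        print(row)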

I guess I'm comparing it against better-suited-for-brute-force approaches where someone is analyzing a log file for really arbitrary things that tend to be one-time needs. "Show me hits to this particular resource from IP addresses which match this pattern, where the user-agent contains Safari, the response time is larger than 300ms, and the response size is less than 100KB!" While you could fit this data easily into a DM, you'd need to plan ahead for that sort of querying. If it's an infrequent need, it makes more sense to process it sequentially (even if it's across 10,000 machines in a Hadoop cluster).
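
The brute-force version of that one-off question is basically just a filter over the raw log. A rough sketch (the tab-separated log format, field names, and file path are assumptions, not anyone's real logs):

    import re

    def matches(line):
        # assumed fields: ip, path, user_agent, response ms, response bytes
        ip, path, user_agent, resp_ms, resp_bytes = line.rstrip("\n").split("\t")
        return (path == "/some/resource"
                and re.match(r"10\.1\.", ip)
                and "Safari" in user_agent
                and int(resp_ms) > 300
                and int(resp_bytes) < 100 * 1024)

    # Sequential scan over the whole file -- the same shape a map task takes,
    # just on one machine instead of a cluster.
    with open("access.log") as f:
        hits = [line for line in f if matches(line)]
    print(len(hits))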

When I left the company, our Greenplum cluster (so a bit of both worlds: an RDBMS cluster that automatically parallelizes queries across multiple nodes and aggregates the results) was around 500TB. This approach was scaled up from a single MySQL instance, though, which was seeing around 5 million new rows per day for one particular business channel.

I'm not suggesting that "new school" approaches don't work, nor that they can't be fast. What I am suggesting is this: MR is a very naive approach that is only "fast" because it executes the problem in parallel across many nodes. If you have datasets which are going to be queried often in similar ways, one should take advantage of the past ~30 years of innovation in RDBMSs instead of masking the difficulties by throwing a lot of CPU (and therefore money) at the problem and solving it in the most inefficient manner possible. It pains me to see people coming up with overly complex "solutions" to basic OLAP needs on Hadoop-based or even NoSQL platforms instead of simply using the right tool for the job.

That said, there are one-off cases where it doesn't make sense to build out a schema, an ETL pipeline, and a managed database for a very niche or one-time need; that's where the real value of Hadoop/MR comes into play.


Naive question: What does analyzing big data sets get you that sampling doesn't?


It's not a naive question, it's an important question. Sampling is almost always the first thing I think of when I have a "big" dataset in front of me, and it requires some careful thought.

It really depends on what you want to do. If you want to compute a statistic over a large amount of data or build a predictive model, some sort of sampling may create virtually identical results with orders of magnitude less time/memory. In modeling, sampling away certain classes may actually be essential for algorithmic stability.

Sometimes sampling is wrong. If you have to produce statistics over your entire dataset (say, number of transactions aggregated by user), sampling may not be worth the loss of precision, and it won't really save you much time either. In general, you never want to sample away data in a way that leaves you with low confidence in the statistic you're calculating.
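
A toy illustration of that trade-off (entirely made-up data): a global mean survives a 1% sample almost unchanged, while exact per-user counts do not, since the sample only sees ~1% of each user's activity and light users vanish from it entirely.

    import random
    from collections import Counter

    random.seed(0)
    # assumed data: (user id, value) pairs, e.g. response times per event
    events = [("u%d" % random.randrange(10_000), random.gauss(100, 15))
              for _ in range(1_000_000)]
    sample = random.sample(events, 10_000)  # a 1% sample

    # A global statistic is recovered well from the sample...
    full_mean = sum(v for _, v in events) / len(events)
    sample_mean = sum(v for _, v in sample) / len(sample)
    print(full_mean, sample_mean)

    # ...but exact per-user counts are not.
    full_counts = Counter(u for u, _ in events)
    sample_counts = Counter(u for u, _ in sample)
    print(full_counts.most_common(3), sample_counts.most_common(3))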


Sampling lowers your confidence resolution, period. When you're testing hypotheses the biggest constraint can be that the effect you're looking for is too small to be within the resolution that your confidence intervals give you. Improving this resolution, even by a little bit, can be worth a lot.
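
Rough numbers, assuming a simple normal-approximation interval for a conversion rate (the 5% baseline here is made up): the half-width of the interval shrinks only with the square root of n, which is exactly why a small effect can demand the full data set rather than a sample.

    import math

    p = 0.05  # assumed baseline conversion rate
    for n in (10_000, 100_000, 1_000_000, 10_000_000):
        half_width = 1.96 * math.sqrt(p * (1 - p) / n)
        print(f"n={n:>10,}  95% CI half-width ~ {half_width:.5f}")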


Because you stated something absolutely, I feel the need to round off the edge. Sampling can increase your confidence resolution if it allows you to integrate signals from more data sources together using a larger model that is infeasible without sampling.


I think the difference is in the aims. With traditional statistics, you're trying to estimate some quantity of interest in the population, while with "big data", you're typically trying to make predictions for individual users. While this can be done with traditional statistics (in fact, the predict method in R does exactly that) it becomes easier to match participants on what books they might like if you have data for what books everyone in your population likes rather than just a sample.

Now, whether or not the inferential premises of statistics hold up on website data (and population data) that typically is neither random nor representative, that's another story.


Headaches? That's glib, and it can certainly be worthwhile, but it's amazing how fast things get ugly when the data cannot be stored on one machine and held in memory. I'm still sad that Incanter didn't really turn out to be the tool I hoped it would be.


It's still a young project. I agree that it's currently suboptimal, but the source is on GitHub (along with enfer, which looked really nice but appears to have been abandoned).

I definitely think Clojure has a future in big data - the REPL, immutability, and the strong links to the Java ecosystem make it an obvious choice for developing interactive (insofar as that's possible) big data applications.

Also, it's fun to code, which should never be discounted as a reason for a language to succeed.


The tails.


You also often need to scan the whole data set in order to sample properly.

Suppose you are interested in user behavior. You want sample(group_by(user_ident, all_user_events)), not group_by(user_ident, sample(all_user_events)). This involves running a group_by on the full data set.
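
A quick Python sketch of that ordering (all names and data here are illustrative); note that the correct version still has to touch every event once to build the per-user groups:

    import random
    from collections import defaultdict

    def group_by_user(all_user_events):
        # full pass over every event to assemble complete per-user histories
        by_user = defaultdict(list)
        for user_ident, event in all_user_events:
            by_user[user_ident].append(event)
        return by_user

    def sample_users(by_user, k):
        keep = random.sample(sorted(by_user), k)
        return {u: by_user[u] for u in keep}

    # sample(group_by(...)): complete histories for a random subset of users.
    # group_by(sample(...)) would instead give truncated histories for everyone.
    all_user_events = [("u%d" % random.randrange(1000), "click") for _ in range(100_000)]
    sampled = sample_users(group_by_user(all_user_events), k=100)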


Or counting the number of events for each user as they come in.
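
i.e. something as simple as a running counter updated as each event arrives, so no second pass over the full data set is needed (sketch, names made up):

    from collections import Counter

    event_counts = Counter()

    def on_event(user_ident):
        # increment the per-user count at ingest time
        event_counts[user_ident] += 1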


In addition to what others have said, when you want to analyze multiple datasets the biggest problem is storing all of that on a single machine, or even on a machine with network-attached storage. Eventually you start hitting I/O limits, which is what big data tries to solve by moving the processing closer to the data storage (in theory).

In short, not all problems are big data problems.


Samples can miss small pockets of behavior that are out of the norm but very interesting. As a business, people using your product or service in unexpected ways can lead to great insights, new markets, etc. DRM-hating readers of military science fiction might be a small niche, but it's one that the publisher Baen owns.


Geek cred


I'm confused by the assertion that Hive was "much slower than using MySQL with the same dataset." The author makes this claim, and then provides a table that shows Hive performing ~50% better than MySQL on a variety of datasets (none of which go beyond single-digit GB, so they don't really flex Hadoop's muscles).

Regardless, Impala sounds like it could be pretty sweet!


"These aren’t scientific benchmarks by any means (nothing’s been especially tuned or optimized)..."

I had to smile when I read that. Working with data, optimization or a redesign can sometimes yield significant performance gains. (Especially when reworking some of my colleagues' queries or code...)



