Big Data Debate: HBase (informationweek.com)
42 points by jeromatron on Aug 6, 2013 | 54 comments



Oh please.

On the one side we have a commercially competing entity arguing against HBase, and on the other we have an individual "defending" HBase who has never contributed a single line of code to open source HBase and is promoting their own closed-source solution.

HBase is a "Sparse, Consistent, Distributed, Multidimensional, Sorted map" and at that it is pretty good.

Other stores are either not consistent or not sorted by default (which means you cannot do range scans). Some can be configured to do that, but then suddenly all those nice claims made by the commercial entities backing them just vanish.
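
To make the "sorted map" point concrete, here's a minimal sketch of a range scan using the HBase Java client of that era; the table name and row-key layout are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table "events", row keys shaped like "<user>#<timestamp>".
            HTable table = new HTable(conf, "events");
            // Rows are stored sorted by key, so one user's rows are contiguous
            // and can be read with a single start/stop-row scan.
            Scan scan = new Scan(Bytes.toBytes("user42#"), Bytes.toBytes("user42~"));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }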

If you do not need consistency and range scans, then do not use HBase. If you do need them, HBase will be an excellent fit.

Some of the most heavyweight entities on this planet have committers on HBase (Facebook, Intel, Salesforce, to some extent Twitter, etc.), and it has commercial backers such as Cloudera and Hortonworks.

Disclaimer: HBase committer here.


Agreed. Out of the box, none of the other storage systems actually let you stop worrying about range scans entirely. While HBase automatically repartitions your data to balance disk usage, in most other systems you'll end up manually partitioning your data in different ways and then aggregating queries across partitions.

This is quite literally the main reason many companies use HBase. The same companies often find good uses for Cassandra as well, but for a different kind of workload.


Linkbait aside, there are some reasonable points being made here, specifically "failover means downtime" about HBase. We run into this pretty much every day in one of our customer-facing applications or APIs, and it's quite frustrating to have to explain that there's very little you can do to prevent it.

I haven't looked too closely at MapR yet, but "instant recovery, seamless sharding and high availability" are impressive claims. Given the cost, though, it's still decidedly different from HBase in my mind.


Which version of HBase are you running? If a node or two go down in our cluster, HBase is completely unaffected. Recoveries can take up to a few minutes; in the old days they took hours.

I know that MTTR (mean time to recovery) is being worked on jointly by several large companies to get it down to seconds.


"Minutes" is not "unaffected" if your site is now down.


Sorry, I should have clarified. When a node goes down, there isn't any sort of minutes-long interruption. When there's a full-scale outage and you need to perform an actual recovery, that can take minutes.


Also, if you use the Thrift service as an interface to HBase (we do), you can run it on several machines in the cluster and set them up in a failover architecture: as long as the cluster itself is healthy, Thrift should work from any machine you have it running on.
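
A rough sketch of that client-side failover, assuming the Thrift gateway on its default port 9090 and made-up host names:

    import org.apache.hadoop.hbase.thrift.generated.Hbase;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransportException;

    public class ThriftFailover {
        // Hypothetical gateway hosts; any healthy one will do.
        static final String[] HOSTS = {"thrift-1.example.com", "thrift-2.example.com"};

        static Hbase.Client connect() throws TTransportException {
            TTransportException last = null;
            for (String host : HOSTS) {
                try {
                    TSocket socket = new TSocket(host, 9090);
                    socket.open();
                    return new Hbase.Client(new TBinaryProtocol(socket));
                } catch (TTransportException e) {
                    last = e;  // this gateway is down; try the next one
                }
            }
            throw last;
        }
    }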


We use HBase pretty extensively and I have mixed feelings about it. On one hand, it's very clunky. Though the documentation is pretty good these days, setting up and managing an HBase cluster in production and at scale involves tackling many moving parts. Setting up Cassandra is almost a joke in comparison.

HBase-Hadoop integration is great, but Cassandra has caught up significantly on that front. If you want strong consistency, there is no other (non-SQL) solution besides HBase that's really battle-tested. If you're okay with eventual consistency, you should take a hard look at Cassandra.


I don't understand why people keep thinking that Cassandra is only eventually consistent.

It can be set to any consistency level you like for both reads and writes: http://www.datastax.com/docs/1.1/dml/data_consistency
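
For instance, with the DataStax Java driver (1.x-era API) the level can be chosen per query; the keyspace and table here are hypothetical:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class TunableConsistency {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo_ks");
            // QUORUM reads plus QUORUM writes overlap, giving strongly
            // consistent results for this query.
            SimpleStatement read = new SimpleStatement("SELECT * FROM users WHERE id = 42");
            read.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(read);
            cluster.shutdown();  // driver 1.x API
        }
    }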


Even lightweight transactions in 2.0 for linearizability: http://www.datastax.com/dev/blog/lightweight-transactions-in...


I'm aware of consistency levels. But in general, HBase performs much better out of the box compared to Cassandra tweaked to be more consistent. In other words, if I wanted consistency badly and at scale, I would go with HBase.


It doesn't even have transactions; I'm not sure how 'consistency' is even part of the conversation.

EDIT: Just saying, failures and partitions are extremely difficult to deal with without atomic operations.


Here we're discussing the Consistency in CAP, not the one in ACID. Here's a good introduction to the concepts involved: http://www.allthingsdistributed.com/2008/12/eventually_consi...


It's no less applicable. During a partition (you know, the 'P' in CAP) it's a nightmare to identify data that's been dropped AND recover from it without strong transaction guarantees. Transactions are the heart of consistency, and I wouldn't consider a piece of software to be CAP-consistent (as opposed to available) without transactions.


And then you get slow writes and/or reads as multiple nodes need to be contacted, as well as temporary outages when a quorum cannot be formed.


That's a pretty silly objection. With the common three replicas, I have to lose two Cassandra nodes to be unable to achieve quorum. I have to lose just one regionserver to be unable to read from HBase.


The "silly" label is not productive.

When we have R+W>N, and assuming N is 3, there is either a hit on both read and write (W=R=2), a heavy hit on write (W=3, R=1), or a heavy hit on read with a chance of temporary outages (W=1, R=3).
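
A toy illustration of the quorum arithmetic, enumerating the three configurations above:

    public class QuorumMath {
        // Read and write quorums are guaranteed to overlap iff R + W > N.
        static boolean consistent(int r, int w, int n) {
            return r + w > n;
        }

        public static void main(String[] args) {
            int n = 3;
            int[][] configs = {{2, 2}, {3, 1}, {1, 3}};  // {R, W} pairs from above
            for (int[] c : configs) {
                System.out.printf("R=%d W=%d N=%d -> overlapping quorums: %b%n",
                        c[0], c[1], n, consistent(c[0], c[1], n));
            }
        }
    }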

In HBase you write to exactly one server and read from exactly one server, which also provides for cheap atomic operations.


Heavy HBase user here. My two cents, FWIW.

I totally identify with both sides of this article, but if we had to do this again, we would probably go for HBase again. It's quite a pain to manage unless you have someone on your team with a PhD in HBase, but:

1. The main HBase committers are also contributors to the Hadoop project, so it's on a forward trajectory with good velocity, since it's fairly coupled to the Hadoop ecosystem. Cloudera is a big contributor, as are Facebook and Salesforce.

2. Facebook, Adobe and other Big™ companies use it in production and at scale. (Granted, they have armies of smart people to maintain HBase.)

3. For all the pain it is, it's reasonably documented and has a growing community that fills in the gaps. I could be totally wrong about this, but Cassandra doesn't seem to have the same level of community that HBase does.


/author of the "con" position here

I can see why you might come to those conclusions -- a lot of people with their heads down in Hadoop just don't realize that there's a world outside HDFS. Not saying that in a mean way; that's just the way it is. If you're involved in that ecosystem, there's enough to keep up with without researching what others are doing.

1. True enough, but HBase tends to be an afterthought for the data-warehouse-focused Hadoop players. You see this manifest in a bunch of ways; as just one example, when I evaluated HBase at Rackspace four years ago, they were exploring options for secondary indexes. They're still exploring.

2. Yes, you really do have to have expert-level knowledge of the internals to deploy it successfully. And there are a LOT of those internals, by design (HMaster, ZK, regionservers, and close ties to the HDFS infrastructure). Cassandra is much simpler, much easier to deploy and troubleshoot. When Cassandra gets evaluated against HBase, it tends to win [1], but often HBase is the "default" choice because of "hey, we already have HDFS" thinking. This is changing as awareness of alternatives grows.

3. I think that's the tunnel vision I mentioned. By any metric I can think of -- conference attendance (over 1100 at the Cassandra Summit in June), IRC activity, StackOverflow participation, jobs advertised -- the Cassandra community is larger and more active than HBase's. Here's a list of some of the users: [2].

Give it a look, I'll be happy to answer questions!

[1] http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf [2] http://planetcassandra.com/Company/ViewCompany?IndustryId=-1


Thanks for the reply!

1. True, the fact that we were deploying Hadoop was a big reason to go with HBase. Despite the challenges, it was a case of "better the devil we know than the one we don't."

2. Agreed, I played with Cassandra, but again, point #1 carried the day.

3. That's great! I was at HBaseCon 2013 and that only had 750-850 people. I concede this point. ;) Sorry I didn't explore this further, that was my bad.

[EDIT] One thing we like is that HBase re-partitions data really fast since data is in HDFS. Not sure how well Cassandra holds up there.


At the risk of hijacking what seems like a linkbait article, I'll ask folks here a question:

Does anyone have good experiences/recommendations for storing a "reasonable" amount of unstructured logs/pcap files? By "reasonable", I mean not petabytes (maybe a couple of terabytes, over time).

I ask because I keep thinking something like HBase is overkill (although one of my alternate solutions is to run an internal OpenStack Swift cluster, which seems like pretty much the same amount of hardware/engineering).

If I could, I'd just send it to the cloud (to the cloud!), but I need it to be local and internally controlled (however, the developer niceness of S3 or Azure blob storage is what got me thinking about just making my own Swift object store).


Disclaimer: I work at Cloudera as a Tools Developer

What do you mean by unstructured? Do you mean the data has yet to be parsed into a format which could be logically grouped into columns? Or do you mean that it's deeply nested?

Since log data doesn't really change, it might be overkill to use something like HBase (or any database for that matter). On the tools team at Cloudera, we've found that writing the data into HDFS and using Impala to analyze it works pretty well.

We typically analyze chunks of log data and then ingest them into HDFS (due to our use case), but if you're looking to ingest data in "real time", you'll want to use something like Apache Flume.

With the data separated into partitions, we're able to run queries that analyze GBs of data in under a second (on 15 nodes). This is log data (log4j) that has been parsed into columns and loaded into a columnar storage format (RCFile, soon to be Parquet).
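
As a rough sketch of that setup, here's the plain HDFS Java API writing a log file into a made-up date-partitioned directory that a query engine could then prune:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.OutputStream;

    public class LogToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // One directory per day, so engines like Impala or Hive can
            // prune partitions instead of scanning everything.
            Path out = new Path("/logs/dt=2013-08-06/app.log");
            OutputStream os = fs.create(out);
            try {
                os.write("2013-08-06 12:00:00 INFO example log line\n".getBytes("UTF-8"));
            } finally {
                os.close();
            }
        }
    }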

Let me know if you have any questions, glad to help.


Thanks, I've been testing out a bunch of hare-brained schemes and it didn't even occur to me to just use HDFS directly (and seeing you and karterk both suggest it helps).

Basically, at a high level, the system I'm working on aggregates and processes security information (it's a SIEM, if that product category means anything to you). At the point the logs get ingested, the server determines whether they're "actionable" (determined by rules I load into Redis), in which case it parses them and stores them in a Postgres event table; or "not individually actionable, but may cause an action in conjunction with some other log", which I want to just store somewhere for batch processing.

I don't really need to tokenize those logs, since by the time I care about them I'm just going to be searching through them. So they're "unstructured" in the sense that there are about 15 different collection points, each with its own format (many just an ugly facsimile of syslog with some JSON in the middle).

So, I think your suggestion will work out very well.

Thanks again.


No problem, glad I could help. Your use case sounds pretty interesting, HDFS should fit the bill for sure.

You should take a look at Parquet (http://parquet.io/) for storing your data. It's an open source columnar format that was designed for Hadoop; it even supports nesting (https://github.com/Parquet/parquet-mr/wiki/The-striping-and-... <-- really interesting). Also, it already works with a lot of the Hadoop ecosystem components (MR, Hive, Pig, Cascading, Impala, ...), so your data doesn't have to move once it's in HDFS.

Good luck!


Store it in HDFS as sequence files. HBase (as the article mentions) is a beast to manage, with many moving parts. But HDFS is just rock solid and very stable. If you want to analyze logs, you can do so in batch mode (I recommend using Scalding or Pig). If you want them to be more "searchable", then take a fraction of them using an MR job and index it in Solr or ElasticSearch.
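
A minimal sketch of that, using the Hadoop 1.x-era SequenceFile writer; the path and contents are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class LogsToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/logs/2013-08-06.seq");
            // Key: a timestamp or offset; value: the raw, unparsed log line.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, LongWritable.class, Text.class);
            try {
                writer.append(new LongWritable(System.currentTimeMillis()),
                        new Text("<raw syslog-ish line goes here>"));
            } finally {
                writer.close();
            }
        }
    }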


NoSQL is my least favorite buzzword of the decade. It means absolutely nothing beyond being countercultural to... SQL culture. Many NoSQL "databases" (and I use that term in the loosest sense possible) even have SQL engines.


How to make a simple NoSQL database:

(1) Install MySQL.

(2) Create a database called "nosql".

(3) In the "nosql" database, create a single InnoDB table called "data" with 2 columns: a VARBINARY(255) primary key "mykey" and a BLOB column called "myval" (MySQL can't index an unbounded BLOB key directly).

(4) Write a thin wrapper over your favorite MySQL client library implementing "get" and "set" operations (translated to SELECT and INSERT ... ON DUPLICATE KEY UPDATE statements; a minimal sketch follows the list).

(5) Enjoy your ludicrous speed.
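
A minimal JDBC sketch of step 4, assuming the "data" table from step 3 (connection details and error handling elided):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class NoSqlOverMySql {
        private final Connection conn;

        public NoSqlOverMySql(String url, String user, String pass) throws SQLException {
            conn = DriverManager.getConnection(url, user, pass);
        }

        public byte[] get(byte[] key) throws SQLException {
            PreparedStatement ps =
                    conn.prepareStatement("SELECT myval FROM data WHERE mykey = ?");
            ps.setBytes(1, key);
            ResultSet rs = ps.executeQuery();
            return rs.next() ? rs.getBytes(1) : null;
        }

        public void set(byte[] key, byte[] val) throws SQLException {
            // MySQL's upsert: insert the row, or update it if the key already exists.
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO data (mykey, myval) VALUES (?, ?) "
                    + "ON DUPLICATE KEY UPDATE myval = VALUES(myval)");
            ps.setBytes(1, key);
            ps.setBytes(2, val);
            ps.executeUpdate();
        }
    }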


If you're comparing this to your Riak, Redis, Tokyo Cabinet, ... database, then most likely.

With HBase, one of its bread-and-butter operations is the start/stop-row scan (otherwise known as a range scan). The closest equivalent I know of in MySQL is using windowing functions, and even then I don't think that's an appropriate comparison.

I hear you though; most people equate NoSQL to some sort of key-value store and that's it.


How are multi-row transactions these days? That was the largest problem when I looked last... A data store does not a database make.


I don't know about HBase, but for Cassandra it should be available in the next minor update.

Single row transactions have just recently been added: http://www.datastax.com/dev/blog/lightweight-transactions-in...


FWIW, Cassandra 1.2 is the latest production/stable version.


You are correct, for about another week. :)


so, 1.2 will become unstable? ;)


I don't think that's something they are trying to solve right now. Although HBase guarantees write consistency, I think http://research.google.com/pubs/pub36726.html is the closest paper on how to go about it, but I could be wrong.

Like you mentioned, a lot of people use HBase as a data-store, it's incredibly good at that.


> Like you mentioned, a lot of people use HBase as a data-store, it's incredibly good at that.

This is exactly what confuses me about NoSQL. It's not really comparable to relational databases on the CA[P] spectrum.


HBase doesn't claim to be some NoSQL database or an equivalent to a relational database; anyone who thinks otherwise hasn't actually read up on HBase.

Just look at HBase's website, which describes it exactly as:

> Apache HBase is the Hadoop database, a distributed, scalable, big data store.


I mean, I understand this, but plenty of people identify it as NoSQL. I'm sure the people on HBase are intelligent enough to understand it's a meaningless phrase that would make them look silly.

EDIT: By which I mean, "NoSQL" doesn't even make sense as an approach by name. ACID relational databases and high availability data stores both have their places so evangelizing on either side is just silly.

Though I would like to point people toward http://research.google.com/pubs/pub38125.html.


True, but there's no way to avoid people interpreting NoSQL as a relational database alternative.

To your point regarding Google's F1, try looking at Impala (https://github.com/cloudera/impala)...


It's quite interesting to compare the two approaches, thanks for the tip. I mentioned it not because of the SQL but because of its unusual construction: it's a full-blown relational database built on top of an (admittedly ACID) NoSQL key-value store (one that actually does have a SQL engine), where you don't have to worry about sharding. I think Impala also has a great approach, but I'd say it's far more similar to Dremel in that it's structurally still key-value. This, again, could be good or bad: probably easier to develop with, but harder for the query planner to plan without the hints provided by a table-and-index-based system (i.e., possible, that would basically be F1, but you'd have to do it by hand).


Agreed, and some of these databases don't even embrace it. For example, HBase's main website makes no mention of being a "NoSQL" database (http://hbase.apache.org/).

Of course, given the right circumstances, you could totally use SQL to talk to your data which lives in HBase, but that doesn't make it a SQL database.

SQL is a query language; a database is usually made up of many different components, one of them usually being a query engine. Look at Impala (https://github.com/cloudera/impala), which decouples itself completely from the storage engine layer; it can query CSV data, but that doesn't make it a SQL database.


I prefer the (backronym?) "Not Only SQL", as it better describes what NoSQL can help with: going beyond using only a normalized relational data store.


I laugh at MongoDB, but every time I've seen a shootout between HBase and Cassandra, Cassandra has won.


Could you please elaborate? It totally depends on the use case...


(1) I've seen more than one project failure involving MongoDB.

(2) Every time I've done a performance/features shootout of MongoDB vs. other projects, MongoDB doesn't just lose, it comes back with two black eyes.


MongoDB is actually great to use with Hadoop as well. Laugh away...


Interesting that neither person has anything to do with Apache HBase. Both are in a position to belittle it.


some people even describe hbase as nosql, because its not sql right? unless you use cloudera impala, which is sql, but impala is supposed to kill sql too. in the end no one wins.


> impala is supposed to kill sql too

I'm not sure what you mean by this; Impala is in no way trying to replace "SQL". Impala is a general-purpose distributed query engine that currently translates SQL into a query plan.


I was commenting in the theme of TC-style enterprise tech journalism, but thank you for clarifying.


:) sometimes I don't know what to believe on the internet



NoSQL, what's that? Is that like Big Data?


NoSQL is only a stepping stone to NoQuery, it's going to be awesome


Does it matter?



