NoSQL vs. RDBMS: Let the flames begin (stu.mp)
145 points by sstrudeau on March 26, 2010 | 89 comments



I completely do not understand this sort of thinking. It's like saying, "All you people driving motor vehicles: trains are better". It's a meaningless comparison. NoSQL systems are really great for certain things. RDBMS systems are really great for certain things. What I would really like to see is all the effort spent writing these kinds of articles going instead into well-thought-out pieces that discuss very specific cases where NoSQL backends provide an advantage, and why. Use cases that are not a blog, Digg, or Twitter clone would be most helpful. Some of us have to work outside the Bay Area for companies like banks, insurance companies, hospitals, etc... Can I use Cassandra as a data warehouse for electronic medical records? Hell if I know, without actually having to learn it and implement it to see if it works.


If you look carefully, you'll find that a whole heck of a lot of those banks, insurance companies, and so forth are already using something that falls into the NoSQL realm -- Lotus Notes/Domino. The use case for a document-based, schemaless database is already well understood, if not always by the developers using it (or by the developers who grew up on a strict diet of relational databases).

It boils down to this: do you need to address data in ways that would not be well-served by structured, tabular storage? Relational databases are (or should be) extremely efficient at accessing data that can be arranged neatly into tables and read using the equivalent of pointer math. They suck, though, when the data is stored outside of the table (as large text fields, blobs, and so forth would be). The more heterogeneous and variable the data, the worse performance gets. A schemaless database gives up efficiency with tabular data -- often precluding efficient runtime joins and, consequently, multiple orthogonal data access paths (data normalization) -- in exchange for more efficient generalized access.

As a thought experiment, imagine a data quagmire where, in order to make the data fit a SQL table, every cell in every row in the database would need to be varchar(17767), might contain multiple character-delimited values (or not -- each record can be unique), and every row can have an arbitrary number of cells. That's what schemaless data can, and often does, look like -- and something like Notes or CouchDB can work with that comfortably and with an efficiency that cannot be matched by a relational representation of the same data.
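To make the thought experiment concrete, here's a sketch of two records (field names invented for the example) that a document store persists as-is, but that would force exactly those varchar contortions if you insisted on one table:

    # Two records from the same "collection" -- hypothetical fields, just to
    # show the heterogeneity described above (plain Python dicts as documents).
    doc_a = {
        "_id": "cust-1001",
        "name": "Acme Corp",
        "contacts": ["jo@acme.example", "ops@acme.example"],  # multi-valued
        "sla_tier": "gold",
    }
    doc_b = {
        "_id": "cust-1002",
        "name": "Bob's Bait Shop",
        "fax": "+1-555-0100",          # a field doc_a doesn't have
        "notes": "seasonal customer",  # arbitrary extra cells per record
    }

A relational table would need a column for the union of every field either record might ever have (or the delimited-varchar hack); the schemaless store just saves what each record actually is.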


Bioinformatics is another application area where document-oriented databases have been used long before it was fashionable. For example, much of the data at the NCBI is stored in blobs as ASN.1 (which predated JSON by about two decades...)

One thing I don't yet understand is the big revolutionary deal about the NoSQL systems vs. storing one's hierarchical documents in blobs in an SQL database or BDB. At the end of the day, I'm not sure that there is any magic - B-trees are B-trees, joins are joins, and disk seeks are slow.


Upvoted for using the term "data quagmire." Many examples instantly shot through my head and I began laughing hysterically.


Here's the issue with this entire topic:

1. It's not just which can do reads/writes better--it's not apples to apples, it's apples to oranges. You're right.

2. If you look at the type of reads Cassandra can do and the type of reads MySQL can do, MySQL blows Cassandra out of the water in terms of flexibility (at a cost, obviously).

MySQL simply has the best flexibility. Cassandra has a ways to go.

3. In Cassandra you essentially have to hack your data model to fit the data structure to make things work, and if you decide one day that you read things differently, it's not always easy -- you have to massively denormalize (but hey, disks are cheap, as they say).
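To illustrate point 3 with a toy sketch (plain Python dicts standing in for column families -- this is the shape of the idea, not Cassandra's actual API):

    # "Denormalize per read path": the same write fans out to one structure
    # per query you need to serve. Toy model only, not a real Cassandra API.
    stories_by_user = {}   # read path 1: "stories submitted by user X"
    stories_by_topic = {}  # read path 2: "stories tagged with topic Y"

    def submit_story(user, topic, story_id, title):
        # One logical write, stored twice -- disks are cheap, as they say.
        stories_by_user.setdefault(user, []).append((story_id, title))
        stories_by_topic.setdefault(topic, []).append((story_id, title))

    submit_story("kevin", "tech", "s1", "NoSQL vs. RDBMS")
    print(stories_by_user["kevin"])   # cheap read, no join
    print(stories_by_topic["tech"])   # same data, stored again

Every new kind of query means another structure like these, plus a backfill of the existing data -- which is exactly why changing your mind about reads later is painful.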

In a nutshell, use the best tool for the job. Cassandra happens to just fit lots of use cases so it's worth looking at. I don't think it would be best for a company to port everything over, MySQL is still very good at many things and the flexibility in reads is worth the cost.

You shouldn't wait for someone to have to tell you it's going to work for your use case. It really matters what queries you end up asking your data store; Cassandra really shines for simplistic ones and can be hacked for complex ones. You need to just dive into the docs, and http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-mode... is a great resource.

FYI: I love what Cassandra does and I think it's the best of the NoSQL options out there.


I love what Cassandra does and I think it's the best of the NoSQL options out there.

I'd say that the differences between types of NoSQL databases preclude any attempt to choose a winner. It makes no more sense than trying to pick the better of NoSQL and RDBMS.

I think this illustrates much of the problem with the NoSQL moniker - it seems like an attempt to group a bunch of completely different data storage systems by arbitrarily excluding a single one.


"it's apples to oranges"

Exactly, and I didn't feel the OP was saying anything but that. What I thought he was trying to get across was that there are situations where NoSQL is the no-brainer superior choice, and by the sounds of it this was a response to seasoned SQL DBAs criticizing the NoSQL option.

I work at your typical large corporation writing typical applications on top of SQL databases all the time. SQL databases are incredibly flexible and perform excellently, and I'd need a VERY good reason not to use one on a project. For me, 49 times out of 50 they are more than sufficient.

But, at least in my industry, there are cases where it's just the wrong technology for certain special requirements. I was AMAZED how poorly Oracle performed in one particular use case (real-time aggregation of a relatively large hierarchical database), despite MANY hours devoted to indexing and tuning, with every table pinned in memory. So our solution was to write a kind of in-memory database, which outperformed Oracle by a factor of 1000 or more. Of course this is not surprising.

What is surprising, and I think a large motivation for the OP, is how ignorant some of even the most skilled SQL DBAs are of what's happening in the real world. About a year ago I was talking with one of the most senior database people at our company, the topic of high-performance databases came up, and I made some reference to how, if you are dealing with huge performance demands, of course you wouldn't use something like Oracle. (I think maybe we were actually talking about Google.) And he made some reference that Google uses Oracle, and I said, basically, no they don't, and then he says he knows they do because ~he knows someone who works there as an Oracle DBA~. Of course, Google almost undoubtedly runs lots of Oracle internally, but he was under the impression that Google is run on Oracle. And this is a guy who is smart (always relative, of course); he really knows what he's talking about when it comes to Oracle. The thing is, though, there is this whole other world that he is entirely oblivious to.

And then there are just your standard DBA "professionals", who are just covering their asses. I personally consider myself largely uninformed when it comes to SQL Server and Oracle, but it's rare that I come across an xSQL DBA who knows more about the internal system tables than I do, or how to script actions (SQL Server DBAs mostly), or how to write triggers, or do anything even remotely advanced. In the large corporations I work in, I'd estimate at least 50% of DBAs don't know anything about system tables at all beyond what they learned in their certification course. They've likely never used them in their work. A lot of them don't even know they exist. Their profession is largely a mystery to them.

tl;dr: SQL DBAs aren't very well informed about what's happening outside their bubble.


> I was AMAZED how poorly Oracle performed in one particular use case (real-time aggregation of a relatively large hierarchical database)

Out of interest, what kind of hierarchical database was this? Something like this?

http://explainextended.com/2009/09/28/adjacency-list-vs-nest...


Except, of course, TFA is not that, but instead, "Train riders, we're not complete idiots for driving cars."


So my thinking on this is that the way these NoSQL systems will make their way into healthcare is through the great backdoor of analytics. Virtually all EHR/EMR systems available today allow various mechanisms for data retrieval based on primary indices. Unfortunately, that avenue does not lend itself to secondary data reuse, which will become more and more valuable as institutions realize their data is valuable not only for academic research but for operational efficiencies, re. money in the kitty.

Even more unfortunate still is that virtually all these installed systems do not have the performance capacity or advanced search-ability to adequately mine this growing hoard of data. Administrators, under capex constraints, do not allocate resources for secondary systems which would duplicate data for mining purposes while alleviating strain on the principal production system. This no-money-in-the-budget problem will lead in-house programmers to build these research systems on top of open source NoSQL solutions. There, the technology will prove itself.

Additionally, "NoSQL" comes in different flavors. Generally, all of them forgo the Consistency in CAP for Availability and Partition tolerance, which is fine for many use cases - just not primary medical data acquisition use cases. As the field matures programmers and system designers will learn how to make this work better to the point where one day NoSQL systems may be used as the primary data repository for medical data. However, that day has not come. For instance, Riak allows you to tweak knobs in order to favor of certain aspects of CAP theorem at different times while in production (specifically the w and dw parameters, http://blog.basho.com/2010/03/19/schema-design-in-riak---int...). But having just started working with Riak in the last month or two I would still only use it as an analytics tool exposing my medical record data to m/r jobs at this point. And before jbellis smacks me, I think Cassandra is awesome and I'm looking forward to spending some time with it but I'm still not putting my med app data in Casssandra just yet as a primary data store.
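For the curious, those knobs are per-request in Riak's HTTP interface. A minimal sketch, assuming a local node on the default port; the bucket/key names are invented, and Python's requests library is used purely for illustration:

    import requests  # third-party HTTP client, just for illustration

    BASE = "http://127.0.0.1:8098/riak"  # assumes a local Riak node, default port

    # Write, insisting that 2 replicas acknowledge (w=2) and 2 commit durably
    # to disk (dw=2) before the call returns: slower, stronger guarantees.
    requests.put(
        BASE + "/patients/p-42",
        params={"w": 2, "dw": 2},
        data='{"mrn": "42", "bmi": 31.4}',
        headers={"Content-Type": "application/json"},
    )

    # Read back, content with a single replica's answer (r=1): faster, but
    # possibly stale under partition.
    resp = requests.get(BASE + "/patients/p-42", params={"r": 1})
    print(resp.json())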

/Disclaimer. I work for a major University Medical Center and write business web applications in this area./


Secondary data reuse/analytics requires "MapReduce"-style functionality more than it requires "NoSQL" stuff.


As a developer of a small, in-house specialized data warehouse for medical research purposes (built with 100% open source), I would be curious to hear more thoughts on this. The flexibility angle is certainly a huge win. My issue is that I can't quite get my head around how one runs ad-hoc queries like "give me everyone with a bmi >x and age between a and b". In the key/value view of the world, that doesn't quite fit.


Take a look at the revolutionary illuminate Correlation database - ultimate flexibility for "ad-hoc" queries

The Correlation database is a NoSQL database management system (DBMS) that is data model independent and designed to efficiently handle unplanned, ad hoc queries in an analytical system environment. Unlike relational database management systems or column-oriented databases, a correlation database uses a value-based storage (VBS) architecture in which each unique data value is stored only once and an auto-generated indexing system maintains the context for all values. Queries are performed using natural language instead of SQL.

Learn more at: www.datainnovationsgroup.com


How do NoSQL databases allow you to do a better job at "secondary data reuse" than a traditional RDBMS? Is it just a matter of (1) Denormalization (2) performance gains from not needing to worry about transactions?


The main issue from my perspective - beyond performance - is the flexibility of the NoSQL model, specifically the document-store concept. The main analytics player currently is the data warehouse, which has served very well over the years but comes with a certain rigidity, mainly in programming and cost. Of course, not to detract from their success: OLAP data warehousing is very mature with a rich toolbase.

The flexibility of the NoSQL model, coupled with the exposure of your data to the m/r paradigm, is the big win from my vantage. Almost every NoSQL solution will expose your data via bindings in virtually every programming language, allowing almost any programmer to leverage NoSQL, whereas right now you kinda have to have experience with data warehousing. Cost-wise, what you would spend on licensing can be invested in hardware, and more specifically in talent, instead.

Secondary data reuse as a concept is highly unstructured. Sure, your primary systems capture data in specific formats, but your analysis can take you in all kinds of directions. Being able to slice and dice without having to pre-define everything will be huge.
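As a concrete answer to the grandparent's "bmi > x and age between a and b" question: in the m/r view the query doesn't have to fit a key -- you ship a predicate to the data. A toy sketch (hypothetical documents, a bare Python function standing in for whatever m/r API the store exposes):

    # Hypothetical patient documents; in a real store the map function runs
    # next to the data instead of in one process like this.
    patients = [
        {"id": "p1", "bmi": 32.5, "age": 54},
        {"id": "p2", "bmi": 24.1, "age": 61},
        {"id": "p3", "bmi": 36.0, "age": 47},
    ]

    def map_fn(doc, bmi_min, age_lo, age_hi):
        # Emit the doc id only if it matches the ad-hoc predicate.
        if doc["bmi"] > bmi_min and age_lo <= doc["age"] <= age_hi:
            yield doc["id"]

    # "Everyone with bmi > 30 and age between 40 and 60"
    hits = [h for doc in patients for h in map_fn(doc, 30, 40, 60)]
    print(hits)  # ['p1', 'p3']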


I'd really like to see a set of use cases for each type of system too. I recently compared NoSQL systems in terms of the CAP theorem and the underlying data model (discussion here http://news.ycombinator.com/item?id=1190772). I haven't seen enough of these systems in production, but it would be great to see examples of each combination of CA, AP, or CP systems with relational, key-value, column-oriented, document-oriented, and graph (not featured in my post) data models.


yeah no but. if you need to move a lot of shit, trains are better.

flame on? maybe you don't understand because your startup doesn't have a lot of traffic?

you even admit that you don't really understand how the nosql dbs even work!


Can I use Cassandra as a data warehouse for electronic medical records?

I don't think people building those systems are supposed to talk about them...


Actually, you are allowed to talk about your setup; you're just not allowed to release your actual data. But that's true for a lot of systems.


That depends on the business. Some businesses consider things like that proprietary information. Some go so far as to consider just about every detail of their system including hardware and even OS sensitive.

Others are less secretive, but still want to protect the details of their schema. Properly and carefully developing a schema for a major system can be a big project that reveals a lot about your business and can give an upstart competitor an advantage.


The point in question is whether HIPAA and other health privacy standards allow a provider to describe their architecture. They do (see: http://en.wikipedia.org/wiki/Health_Insurance_Portability_an...).

Whether a business chooses to talk about their setup depends on a variety of factors, including their business model. However, keeping such things private is often security through obscurity (http://en.wikipedia.org/wiki/Security_through_obscurity).


I think we had different understandings of the grandparent:

I don't think people building those systems are supposed to talk about them...

In a former job I worked on systems similar to what was described and my former employer would be very unhappy with me if I revealed so much as their hardware setup much less schema. I am not supposed to talk about such things, and there is an NDA that says so....

Whether it is legal for an authorized person from the company to discuss those matters is a separate question.

Also, I must point out that this is not security through obscurity. Security through obscurity cannot be relied on, I agree. But in this case, it is a matter of preventing your competitors from knowing what you are doing.

You know your competitors can develop the same thing you did in time, but you want to make sure they have to spend that time rather than being "inspired by" reading over your source code or even stealing it entirely. In some competitive environments, even just knowing what your competitors are or are not capable of at that moment can be a huge advantage.


I wasn't referring exclusively to HIPAA. I also assumed that medical software companies would tend to be the type that don't like their employees blogging. It's not even just security, but general trade secret stuff.


Why would these kinds of details be better kept secret?


They wouldn't, but oftentimes certain regulations may require it.


But in this case, the regulations don't require it.


[deleted]


Or PHP vs. Every Other Language Out There. The NoSQL crowd is mainly saying that RDBMS isn't for everyone, and that other storage methods need a bit of time in the limelight.


In each of your other examples, you are listing two things that are direct (or almost direct) competitors. It makes sense to compare them and then pick one over the other.

NoSQL and RDBMS are further apart. There are cases where it makes sense to use both for different parts of your system. Sure, you can force one to do the job of the other, but that shows you don't know when to pick one over the other (or were otherwise forced to do that...), not that they really target the same use cases.


I don't want to enter into the details of the post, but it is impossible for me to avoid reacting to this: "Do you honestly think that the PhDs at Google, Amazon, Twitter, Digg, and Facebook created Cassandra, BigTable, Dynamo, etc. when they could have just used a RDBMS instead?"

It is really impossible to argue something based on the fact that people who are supposed to be very smart are doing it. The only way to support arguments is by showing facts...

Shameless plug, as I don't want to post a message just to say this, but isn't HN too slow lately? I'm at the point where I visit the site less often than I used to, as I don't want to experience the delay.


I think in some cases this kind of appeal to authority can be valid.

Facebook has absolutely insane sparse matrices to handle. They handle enormous volumes of traffic querying very specific (read: not cacheable between users) datasets. Moreover, they've already invested mind-boggling amounts of capital into their stack. Same goes for Amazon with Dynamo. These people operate on scales that startups like us can't even comprehend, and they've found it worthwhile to write their own datastores for those scenarios. Moreover, their use of those databases has apparently contributed to their success. That, to me, is strongly suggestive evidence.

That and HA/fault-tolerance is a no-brainer; Cassandra's scaling characteristics rock the socks off of any SQL DB I've used. The consistency tradeoff is well worth it for some use cases.


Absolutely. Facebook. Google. All great examples of the need for different solutions. But I'm not sure about Digg. It seems like a very straightforward implementation would work for them. But given the small amount of information they've provided about their setup, it doesn't sound like they've ever gone for one.

Compare them to StackOverflow, which, on recent evidence, has about 10% of the traffic of Digg. They're running a very straightforward RDBMS configuration on rather pedestrian hardware. If Digg has a 50-node cluster (for example), StackOverflow should require at least a 5-node cluster.


StackOverflow runs on a 3.33 GHz quad-core x 2, 48 GB RAM, 6-drive RAID 10

http://blog.stackoverflow.com/2009/12/stack-overflow-rack-gl...

So if Digg is 10x the size, they would need (assuming perfect scaling) a 3.33 GHz quad-core x 20, 480 GB RAM, 60-drive RAID 10

Oh, but you can't get anything more than 8 sockets in x86, and Windows only runs on x86 now. So assuming you switch away from Windows, you'll need a Sun or IBM for that.

Sun's kit is only dual-core and the processors aren't as fast (either per GHz or in clock rate), so here is the 64-processor model you'd need, and it's already got 64 disks: "For a 64-processor gorilla, 2.4 GHz SPARCVI dual-core chips, 6 MB of on-chip L2 cache, 128 GB of memory and a 64 x 73 GB SAS drive raise the price tag to $10,100,320."

http://www.serverwatch.com/hreviews/article.php/3688771/Serv...

And not forgetting you'll probably want to upgrade the 128GB of memory it comes with, and you'll really need two of these (the other for failover)


"Windows only runs on x86 now"?

PlentyOfFish.com is, according to these posts: http://highscalability.com/plentyoffish-architecture http://www.codinghorror.com/blog/2009/06/scaling-up-vs-scali... ...running: 512 GB of RAM, 32 CPU’s, SQLServer 2008 and Windows 2008

$10 Million? Try ~$100,000. (Granted, the article you linked was from 2007).

My company spends >$10k on fricking meetings to discuss whether they should spend $20k on a server (as well as the other technical details they are unqualified to be discussing). Of course, anyone who actually knows anything about it is not welcome at these meetings. :)


The 32 CPU quote [originally from POF's blog and repeated w/o correction elsewhere] is incorrect; the HP server in question has 8 quad-core _CPUs_, for a total of 32 _cores_.

So still a factor of 2.5 away from our hypothetical 80 core requirement, and not a refutation of GP's claim that x86 maxes out at 8 sockets.


That doesn't make any sense. You're assuming that they have 10 times the content, need 10 times the processing power, and that they derive no advantage from caching (going from 48 GB RAM to 480 GB RAM).

That is not how capacity planning works at all.


The scenario is 10x the traffic, not 10x the content. And given that Digg is an update-heavy site for all the voting (which affects how pages are rendered for your friends) assuming that caching is a magic wand here seems bogus.


I don't think the the 50-node cluster mentioned in the article is for Digg -- it's for http://simplegeo.com. :)


Yeah, I'm a little surprised that Digg has moved away for performance reasons. Maybe their data model is fundamentally more complex than StackOverflow? Or maybe SO has a better caching layer in front of the service?


Or maybe Digg's model is flawed. I don't know if it is or not, but from everything I'd read it was far from optimal. I'd love to see more about it, though. Now, relationship-graph traversal is an issue for normalized relational systems, but in these cases things could be split pretty cleanly between articles and recommendations.

One big problem I see in these comparisons is when a NoSQL person claims that their box is processing 5000 req/sec: what does that mean? Are they denormalizing so much that it's equivalent to 500-1000 req/sec on an RDBMS?

Another thing: when Digg was starting their type of site was very novel. There wasn't much out there that approached the scale and growth they experienced. I'm sure that StackOverflow has been designed with scaling in mind.


I'm not that familiar with Digg (it's the same as Reddit though, right?), but two major things occur to me:

1) I'm much less likely to vote an answer/question up/down on SO than I am at Reddit. On SO, if I'm not asking or answering, I'm rarely causing any writes to the data store. On Reddit, I vote on most of what I look at. I could see this having a huge impact.

2) Obviously Reddit can do some caching, but I think SO can cache much larger pieces of data. As far as I know, everyone who goes to the main page of SO sees the exact same list of questions. On Reddit, each subreddit's top items can be cached, but they are mixed differently for each user.


I'm not familiar with the Digg data model, but the SO data model is not particularly complicated, and any nested interactions between them, if there even is such a thing, are, I'm sure, done in non-real-time batch processing.

Facebook, on the other hand, is incredibly complex because of all the interactions between users, not to mention the data is stored, if I recall correctly, geographically disparately throughout the world. I don't have a link, but the shit that happens behind the scenes when you log on to your Facebook account is wild.


Facebook's primary data storage is MySQL supported by memcached. Cassandra is only used for messaging (inbox search?). This is at least from what I have read -- it could be outdated, though.


Facebook is especially pertinent considering how Friendster died from scaling problems at a tiny fraction of the traffic and feature set that Facebook has.


It is really impossible to argue something based on the fact that people who are supposed to be very smart are doing it.

There's been a strong undercurrent of posts which basically consist of ad hominem "non-SQL databases are only for idiots who can't figure out how to manage an RDBMS properly". Pointing out that there are people who very definitely are not idiots and who can manage an RDBMS quite effectively, but who feel non-SQL databases are still appropriate for their use cases is, so far as I'm concerned, an acceptable rebuttal to that.


Obviously it's not a water-tight proof, but it's generally safe to assume a correlation between smart people and intelligent implementations.


unless said smart people were in a committee :-)


A person is smart. People are dumb.


>Shameless plug, as I don't want to post a message just to say this, but isn't HN too slow lately? I'm at the point where I visit the site less often than I used to, as I don't want to experience the delay.

I've noticed that loading my comment and submission history is really slow, but loading everything else is as snappy as it's ever been.

I have a feeling that the front-page and comments get heavily cached, while our comment and submission history does not.


They should probably go NoSQL :)


There's no SQL behind Hacker News, the persistence code is custom and straight-to-files.


In the true ThoughtWorks way. Sorry, couldn't resist. :)


It is really impossible to argue something based on the fact that people who are supposed to be very smart are doing it. The only way to support arguments is by showing facts...

I read that as the obligatory appeal to authority that seems to impress some people. The rest of the post however was extremely interesting and likely as fact-filled as it gets when it comes to these SQL vs. NoSQL arguments.


"Let’s say you have 10 monster DB servers and 1 DBA; you’re looking at about $500,000 in database costs."

I wonder what he thinks is a "monster db server", and considering he included the DBA in the price, is this the price per year or what?

Having recently set up a dual E5620 with 48GB of RAM, 8 SSD drives (160GB each), and a 3ware controller for just shy of $10K USD, I guess my understanding of "monster" is quite different. For $13K USD the same server would have 96GB of RAM.


The numbers in the article are strangely inflated. In addition, it's written as if the need for a DBA disappears once you've simply changed your data storage software. Somebody still has to know how it works and manage it.

If you don't need a 50 node cluster because your RDBMS is pulling down big numbers, then you don't multiply the cost of the RDBMS solution by 50 either.

The numbers posted here are pretty reasonable. 37Signals spending $7,500 on disks isn't outrageous. That's less than the cost of a single developer integrating a different solution over a few months. How long has Digg been working on this transition and how many employees did it require? They've probably spent a fortune. Just not on hardware.


It is arguable that the need for a DBA does disappear with the change in systems. The base approach of the relational model is that there is a professional DBA who makes sure that the mission-critical data is available in a format that multiple applications and multiple departments can use directly (the type of data is known but the ordering isn't set). Thus DBA is a "hat" that most programmers can't "wear", especially since a large portion of programmers don't understand the relational model.

On the other hand, NoSQL and object databases allow a programmer to just stuff data into a data store without worrying about a cohesive data model. If we consider this as mission-critical data that multiple departments of a large organization would want to see in multiple forms, then we can find many ways that the approach of "just save this array of values" produces serious problems.

But there are many applications where these problems don't appear. Digg seems like it could get away with doing NoSQL. A health-record site seems like it could not, since it ultimately is going to want ACID-and-beyond in its data model.


Good comment, but I wouldn't quite say that using a non-relational database frees you from having to think about a cohesive data model. It might be more accurate to say that for some data models non-relational stores are a more natural fit, which frees you from having to think about how to force your model into the wrong container.


It's hard to find the right adjective for what's distinct about the relational model.

The relational model is a fantastic model of data independent of application. It can even be a great model for an application using the same data in different ways.

But this approach clearly has a cost. In a way, there's the question: is this an application with a company built around it, or a company with an application built around it? Digg and Google are applications with companies built around them. Here the RDBMS model doesn't make sense.


I believe that he is also including the cost of the per-processor SQL licenses in that $500k figure.


Could be, though he speaks about MySQL and PostgreSQL as well, and in other places about MS SQL, mostly hammering the license cost, which is obviously huge.

One more thing I'd add is I have no clue who these upset DBAs are and who is thinking Stump & Co. are dumb. Everyone making these sql/no-sql blog posts seems like they're starting a war with made-up enemies.


It is a direct response to the article discussed previously in this YC thread:

http://news.ycombinator.com/item?id=1216833


Thanks for the link, I totally missed it; that's a great article.


NoSQL vs RDBMS is really a proxy war for Denormalized vs Normalized storage.

You can take a system like Cassandra and treat data in a very normalized way, which would reduce performance. You can take a system like MySQL and completely denormalize your data, which would increase performance.

Any test where one set of data is normalized and one isn't is not a fair test.

Also, denormalization can be a big deal. Unless you have some sophisticated code managing it for you, you're buying performance at the cost of data-storage-management complexity. Now you have to manage many instances of data X. But there is a benefit in that you avoid crazy joins.

I think both have concepts to learn from each other. For example, in order to use a NoSQL option effectively, you end up implementing your own concept of indexes, something very easily done for you in RDBMSes.
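A toy sketch of that last point (plain Python dicts standing in for a key-value store, not any particular product's API):

    # Primary data: key -> record.
    users = {"u1": {"name": "alice", "city": "Portland"}}

    # Hand-rolled secondary index: city -> set of user keys. An RDBMS would
    # maintain this for you via CREATE INDEX; here it's your code's job.
    users_by_city = {"Portland": {"u1"}}

    def set_city(user_id, new_city):
        old_city = users[user_id]["city"]
        users[user_id]["city"] = new_city
        users_by_city[old_city].discard(user_id)  # forget this step and
        users_by_city.setdefault(new_city, set()).add(user_id)  # the index lies

    set_city("u1", "Austin")
    print(users_by_city)  # {'Portland': set(), 'Austin': {'u1'}}

Every write path has to remember to update the index, and nothing keeps the two structures transactionally consistent for you.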


It's very funny how basically most people agree about this subject, that both systems have their strong and weak points (for example, I don't think I've seen many articles saying that Facebook/Amazon should have kept their whole systems running on an RDBMS), but still the endless queue of blog posts goes on. Isn't this a case of violent agreement, as per http://c2.com/cgi/wiki?ViolentAgreement ?


I don't know about that; the fact that the highest-rated comment on this page is a rant written by someone who apparently didn't RTFA makes me think this is just another holy war in early stages.


I guess this is history repeating itself - before RDBMSes there were hierarchical DBs and other such technologies. There was a reason why the RDBMS won that battle, and probably the topmost reason was the simplicity with which you can define relationships between data, store the data (with its relationships), and access it easily. We are all used to using SQL, and even with all the ORMs in the world, it is still probably a very simple (yet powerful) language.

If you look at HTTP, the cornerstone of the web, one of the reasons it has been so popular is that it is simple yet powerful, and it can be thought of as having the same simplicity as the RDBMS (everything revolves around some really basic operations: read, write, update, delete <-> GET, POST, PUT, DELETE). Historically, the technologies that hide the complexity of what they do behind a simple interface invariably win. That was one of the main reasons for the success of the RDBMS, and it remains true.

And that was precisely why everyone started using RDBMSes for blog-type applications, even though it is not the best use of an RDBMS. Come to think of it, why would you want to store a lot of text (the content of each blog post) against each blog id in an RDBMS? But people did, because it was so simple to do and that is what they were used to. So the use of RDBMSes for these types of applications was debatable to begin with. However, if you look at the transaction management required by some financial applications, for example, I doubt how far the NoSQL solutions will go in satisfying the requirements.

With regards to scalability, I will not get into the cost factor, because there are different ways of calculating the actual "cost" of something: if you are Google, the cost of having a few PhDs write your own filesystem and DB (or NoSQL store) is not much compared to the benefits you are going to get out of it, but if you are not Google, it is a different story. So if we leave the cost factor aside for a moment, the list of performance options available in some of the high-end RDBMS technologies, for example Oracle, is quite broad: active/active clustering (RAC), different index types (b-tree, bitmap, clustered, index-organized), a multitude of partitioning types (range-based, hash-based, combinations of these, list-based, etc.) - the list goes on. The same is true of Sybase or SQL Server (except for active/active clustering). So I am sure the performance issues can be handled in these RDBMS technologies (without just throwing hardware at them).


I've never read anything by Joe Stump before but I just bookmarked this article so I can peruse more of his stuff later.

I liked the way he personalized his argument to his own deployment situation rather than making generalizations. I also liked to hear about his experience with Cassandra (5 minutes to clone a hot node and have it balanced and in production).



NoSQL/NoREL is a tradeoff of features for performance. If you don't need certain features, can make the trade, and need the extra performance, then it can make sense for some people. I don't think it applies to everyone. They made the decision that was best for them, and congrats on that.

Also, I can say rotational disks may not provide the economics that make RDBMSes seem attractive -- but FusionIO cards have really changed that. And I didn't just read the datasheet and get a nerd boner: I watched the queries from 8 beefy physical database boxes (that were getting hammered) combined onto one physical box that was identical in all ways except that it had a FusionIO card. It handled 8x the number of queries with ease and could have taken a lot more punishment. Yes, the cards are expensive, but in the scheme of getting rid of 7 servers it was actually saving significant amounts of money.


I just wish someone would offer a decent hosted NoSQL platform that I can start using already. https://cloudant.com/ is invite only. So is http://hosting.couch.io/ apparently. https://app.mongohq.com/signup is overpriced, considering it's a cloud service and only 2GB. SimpleDB isn't bad but it has tons of limitations (can't even sort by numbers).

This is why people still use MySQL even for projects that aren't suitable for RDBMS. I use hosted MySQL at dreamhost and don't have to bother with anything except my app and data. It just works and is free with the web hosting package. Is there anything out there that comes close? I don't mind $1/month for 1GB of data. $25 for 2GB is not worth it.


There aren't that many because there's not a big market for it. You ideally want the DB in close proximity (latency-wise) to your web server or whatever uses it directly.

Why not host it yourself? Deploying a server like MongoDB is trivial.


For people deploying to cloud services, in-cloud latency is all the same no matter who the app/database provider is.

For instance, Heroku and MongoHQ are both using Amazon Web Services, so it wouldn't increase the latency to swap a MongoHQ database instance for a Heroku one.


Precisely. Even if my latency is 0.1s-0.2s, that is still not a big deal if I write my app well.


Well, that's not entirely true. If you're running your own data source through EC2, you can make sure it's in the same availability zone (datacenter), which makes latency very small indeed.


Isn't that SimpleDB?


and appengine


This response to Forbes's article doesn't address the central premise:

"Shocked by the incredibly poor database performance described on the Digg technology blog, baffled that they cast it as demonstrative of performance issues with RDBMS’ in general, I was motivated to create a simile of their database problem."

The central question here isn't so much the maximum performance you can get out of an RDBMS, or how it compares to a NoSQL solution, but how Digg is getting such terrible performance out of their RDBMS design! The numbers just don't add up.

This article is just a bunch of straw men that avoid the main issue. And arguing that $7,500 is too much for a serious web SaaS vendor to spend is just comical.


Didn't he?

"Has anyone ran benchmarks with MySQL or PostgreSQL in an environment that sees 35,000 requests a second? IO contention becomes a huge issue when your stack needs to serve that many requests simultaneously."

My answer to this point is that IO contention can be vastly reduced in MySQL (and probably handled even better in Postgres, I bet) with some tweaking of settings and lots of memory. Memory is pretty cheap these days, so stuffing a server full of RAM is really not a bad option.
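For instance (a my.cnf fragment with illustrative numbers only -- real sizing depends on the workload and the MySQL version), the single biggest lever for InnoDB on a RAM-heavy box is the buffer pool:

    # my.cnf fragment -- illustrative values for a ~48GB box, not a tuning guide
    [mysqld]
    innodb_buffer_pool_size = 36G        # keep the hot working set in memory
    innodb_log_file_size = 512M          # bigger logs smooth out write bursts
    innodb_flush_log_at_trx_commit = 2   # relax per-commit fsync to cut IO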


It's just a spectrum. At one end, where all we do is write, we use log files. At the other end, where all we do is read, we cache static content at network hubs. There's a lot in between. An RDBMS is one tradeoff; 'NoSQL' represents another. They simply imply different types of read/write tradeoffs.


At Digg we had probably a hundred or so tables, each table had varying indexes (a char here, an integer there, a date+time here)

This may be part of the problem, actually. 100 tables to serve posts with attached comments? Um.


There's a bit more going on there. See http://www.codinghorror.com/blog/2009/07/code-its-trivial.ht... for a related principle.


True, but I can't imagine more than a couple of those tables are accessed at the high frequency rate that requires performance optimization.


Actually, my thinking was 100 tables... so what! That's not a large number. Any software of average complexity will have 100 tables, easy.


Just because some bloggers are obsessed with arguing whether NoSQL or RDBMSs are "better" doesn't mean we need to post every article to HN. Why don't we agree to just stop?


Perhaps because plenty of people are interested in the subject. If you don't like it, don't read it.


Furthermore, this is an operational expense as opposed to a capital expense, which is a bit nicer on the books.

Something about this seems broken. Why would it be inherently "nicer" to spend money on a service as you use it than on a product that you get to keep?


Because we're talking about a business. Capital expenses must be written off as depreciation over time, while operational expenses can be written off immediately. For tax reasons this can be a big deal: as a rough illustration (simplified numbers; actual depreciation schedules vary), a $50k server depreciated straight-line over five years yields only a $10k deduction per year, while $50k of EC2 spend can be deducted in the year it's paid.

However, you can buy servers through a leasing company to get this benefit; you don't have to use EC2.


Another example of this insanity: http://www.joelonsoftware.com/items/2007/04/13.html

(Cube farms aren't necessarily cheaper than nice permanent construction, but you have to amortise tax deductions on the latter over 39 years).


<3

NoSQL is also faster to develop/prototype with, since you only need to understand JSON dictionaries.

So your shit ships faster, and scales cheaper.


use whatever makes your product useful!



