The thrill of a new technology: CouchDB (tempe.st)
27 points by intinig on March 30, 2009 | 47 comments



No, graph databases will rule the world. Couch only supports trees, which is annoying if you want to have relationships between "documents".

Also, Couch's implementation of map/reduce arbitrarily limits the kinds of queries you can run.

If you actually want to ditch your relational database, take a look at things like KiokuDB, Elephant, AllegroCache, and so on. They may not have exciting web 2.0 screencasts, but the technology is much better. (I am biased towards KiokuDB, since I helped write it, but it has been very easy to use KiokuDB instead of a relational database; there has been significantly less code in our apps, tests have been much easier to write, and I don't think we've lost any runtime speed either. I really need to write a long blog post about this, but I haven't had time.)

It is good that the world is gradually working its way up to object/graph databases, though. Last week it was "OMG KEY VALUE STORES SOLVE EVERY PROBLEM", this week it is "DOCUMENT DATABASES WILL RULE THE WORLD", so hopefully the blogosphere is only a few weeks away from enlightenment ;)


Well, just to state the obvious, no particular database will "rule the world".

They are different tools for different tasks.

RDBMSs will not go away in data warehousing for a while. Key/value stores are useful for many webapps where they match the access pattern. Graph DBs have their own areas of application as well.

There is not "one tool to rule them all". Albeit it would be an interesting project to wrap up all those engines under a common API (SQL?) and have the server choose the optimal one either on user demand or even by magically analyzing the workload.

I'm saying that because the real ugliness that many of us are facing is that we need not one but several of the aforementioned tools for our particular app. For parts of the data we like the guarantees and integrity of an RDBMS; for other parts we need the scalability of a key/value store.


Well, just to state the obvious, no particular database will "rule the world".

I don't think this either, I was just parodying the overly-editorialized title.


Hierarchies and graphs are both fundamentally navigational data models, which force a coupling of physical and logical concerns, hampering flexibility significantly.

The network model recognizes the flexibility issue with hierarchies; however, I'm not sure how it can guarantee the same scalability benefits. As a result, it has neither the flexibility of the relational model nor the performance characteristics of the hierarchical model.

This is probably why it isn't ruling the world.


This is probably why it isn't ruling the world.

Technical merit and popularity are only occasionally related in the computer world, so I won't comment on this.

I will elaborate further on the use case of graphs versus a relational database. Basically, I see relational databases as especially useful for cases where you have a big pile of data, and have no idea what it really means. You do queries to learn what you have. ("Aha, in March, people from Illinois buy more of product 23894735 than people from California.") The relational model is good for this, since it doesn't build any preconceived notions into your data.

However, most applications don't need anything like this. They have a well-defined data-model, and rarely run "queries" (except to work around the fact that that's the only interface to their data). In this case, the relational model is basically being used as a dumb key/value store supplemented with joins. This results in a ton of code in the application to translate in-memory structures to something that can be stored in the database.

If you use a graph database, you can just store your in-memory structures directly, and get them back later. (Most good object databases give you other features, like the ability to index your data so you can still run searches efficiently. Kioku and Elephant do this, anyway.)

So really, you should use the right tool for the job. Want to persist and search in-memory structures? Use an object database. Have a big pile of data you need to make sense of? Use a relational database.

You are allowed to use more than one, they are tools, not religions.

(FWIW, I see relational databases as filling a very specialized role, and I see object databases as the general thing you should use when you want to persist some state. The rest of the world seems to have gotten this backwards, which is somewhat depressing. I think it's because people think databases are magical, as a result of not understanding how they actually work.)


I sort of see where you're coming from in terms of the impedance mismatch between RDBMSs and typical application code. Certainly we can avoid some coding headaches by just dumping things into a data store that is optimized for what we want to do with that data right now. But then you say that most applications have a well-defined data-model and rarely run "queries". This is where I think you've gone terribly, terribly wrong.

The value of a relational database is that it most agnostically represents the reality of what the data means. It's not about "big piles of data" or "making sense of your data"; relational databases are about making your data as expressive as possible. You're selling this idea that most applications only use data in a few predefined ways. I have to say that sounds like a complete pipe dream. Requirements change all the time. Reporting needs often aren't even conceived of until you have hundreds of megabytes of data. Let's not even get into multiple applications using the same database.

In every business I've ever been involved with, the data is always more valuable than the code, and it always outlives the code. Too much of the hype around these alternative database technologies is throwing the baby out with the bathwater. The idea that "most" applications don't need structured data just strikes me as incredibly naive and short-sighted. Far more applications need structured data than need to scale.


Well, let's think about this in more detail.

Let's say you have two types of data, customers and orders. Customers have many orders, an order belongs to a customer. This is easy enough with a typical relational database. You have a customers table and an orders table. You can join the tables and ask questions like "how many customers spent more than $300 last year?"

Now let's consider the graph/object database equivalent. You have two classes, Order and Customer. A Customer has a set of Orders, and the Order has a customer. (Cycles are fine, this is a graph, not a tree.) Creating an order works with some method like $customer->new_order_for('some pants'). You store this in the database, and the graph structure is stored and indexed. (Usually, object databases index on class name, but you can always specify other conditions. This makes it basically equivalent to the relational database.) Note that this structure works very well; the in-memory relationship is the same as the in-storage relationship. You can also write the same query as with the relational database. Get all customers, find their orders from this year, and sum the totals. (Instead of writing SQL, you would just write a script here. You can index things like the order year to speed up the query, as well. Otherwise, it's O(n), but so is the relational database without an index. There is no magic after all.)
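A minimal sketch of that shape in Perl may help. The Customer/Order classes, the totals, and the 'hash' in-memory backend are all made up for illustration; the connect/store/lookup calls follow KiokuDB's general style but are an assumption, not code from a real application.

    # Illustrative only -- storage calls are assumed KiokuDB-style.
    package Order;
    use Moose;
    has item     => (is => 'ro', isa => 'Str');
    has total    => (is => 'ro', isa => 'Num');
    has customer => (is => 'ro', isa => 'Customer', weak_ref => 1);

    package Customer;
    use Moose;
    has name   => (is => 'ro', isa => 'Str');
    has orders => (is => 'rw', isa => 'ArrayRef[Order]', default => sub { [] });

    sub new_order_for {
        my ($self, $item, $total) = @_;
        my $order = Order->new(item => $item, total => $total, customer => $self);
        push @{ $self->orders }, $order;    # the cycle is fine; this is a graph
        return $order;
    }

    package main;
    use KiokuDB;

    my $dir   = KiokuDB->connect('hash');   # assumed in-memory backend for the example
    my $scope = $dir->new_scope;

    my $alice = Customer->new(name => 'Alice');
    $alice->new_order_for('some pants', 350);
    my $id = $dir->store($alice);            # the whole object graph is persisted

    # The "query" is just a script: fetch, traverse, sum.
    my $back  = $dir->lookup($id);
    my $spent = 0;
    $spent += $_->total for @{ $back->orders };
    print "$spent\n";                        # 350

The point is that the in-memory relationship and the stored relationship are the same structure, and the yearly-spend question becomes a traversal over it (optionally backed by an index).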

Anyway, there is no lack of flexibility with the graph database. If you want to query your data, you can. It's just less convenient, since you have to write a program to do it, instead of letting your database management engine do it. (This is actually not true in general; AllegroGraph has a query engine based on Prolog.)

Back when I did data warehousing, we had to move all our data from the web app servers to a warehousing server in a specialized schema so that some GUI software could manipulate the data. Even though we used a relational database, we had to convert anyway. Using an object database would have made the app code simpler, and the warehousing code equally complex. So I think that would be a gain, not a loss.


The flexibility of the relational model is built on its bare simplicity: values + sets + logic. If you propose adding to this and giving up the benefits provided by these simplifications (simplicity of reasoning, ability to change the physical implementation without affecting the logical one [those not-magic indexes], declarative constraints, declarative queries), you need to have a really good reason.

The network model, if it means anything, lets you add pointers to the mix and changes how you reason about your data from sets to graphs. This makes it harder to declare constraints, check integrity, update sets of data, access your data in different ways, and reason about your queries. An imperative script is in no way as safe a program as a declarative query.

The clearest benefit of the network model, in my view, is that it lets you use your object code fairly seamlessly. Ok.

The problem is that this is only a good trade-off for programs where you care primarily not about your data but about your code. I'd argue that isn't true for the majority of programs that have databases at all. The issues that kill your system years down the line and give you nightmares are not code issues; they are data issues. And the code you write to fix those problems... will end up re-implementing all those annoying "heavy" bits of RDBMSs that coders seem to hate.


And the code you write to fix those problems... will end up re-implementing all those annoying "heavy" bits of RDBMSs that coders seem to hate.

Or you won't. I have many important KiokuDB applications in production. And they are not toy Web 2.0 things; they are important sites that perform important data analyses. I don't think any of the logic I had to write to support them was particularly difficult, and it was fewer lines of code than I would need to define my ORM classes.

The flexibility of the relational model is built on its bare simplicity: values + sets + logic.

Simplicity is good. However, software is complex, and complexity has to live somewhere. Look at git, for example. The underlying model is simple and beautiful: blobs, trees, and commits. Wonderful. But, to make that beautiful model into a revision control system, thousands of lines of code had to be written. So while simplicity makes the bottom part of the system simpler, it doesn't do much for the overall simplicity of the entire system.

I feel the same way about object databases. They are more complex than a relational database (unless you count things like replication and embedded scripting and ... as relational database features, which I don't), but they help me decrease the complexity of the code I see.

As an example, when I use an RDBMS, I have to maintain:

SQL schema + upgrades

ORM classes

Logic classes (for abstractions over multiple tables or data stores)

Random scripts to query the database and give me munged reports

With an object database, I only have to write the logic classes and the random scripts. The random scripts are slightly more complicated, but not significantly more. The main app is much simpler, and much easier to test. I also don't have to hack my data model into the relational model. (Ever represent a tree in a relational database? It's a hack.)
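To make the tree point concrete, here is a hedged sketch in plain Perl (the nodes table, column names, and shape of the rows are hypothetical): the relational route hands you flat parent_id rows that your code must stitch back into a tree, while an object store can persist and return the nested structure itself.

    # Rows as they might come back from something like
    # SELECT id, parent_id, name FROM nodes (hypothetical schema).
    my @rows = (
        { id => 1, parent_id => undef, name => 'root'  },
        { id => 2, parent_id => 1,     name => 'left'  },
        { id => 3, parent_id => 1,     name => 'right' },
        { id => 4, parent_id => 2,     name => 'leaf'  },
    );

    # Reassembling the tree by hand -- this is the "hack".
    my %node = map { ( $_->{id} => { %$_, children => [] } ) } @rows;
    my $root;
    for my $n (values %node) {
        if (defined $n->{parent_id}) {
            push @{ $node{ $n->{parent_id} }{children} }, $n;
        }
        else {
            $root = $n;
        }
    }

    # With an object/graph store, the nested structure in $root is what you
    # store and what you get back; there is no flatten/reassemble step.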


> However, software is complex, and complexity has to live somewhere.

Exactly. This argument comes up in a lot of topics. For example, that pure functional programming is easier to reason about. True, but as soon as you have to simulate state it's more complex than just having state. This is also the reason that pure functional programming isn't automatically parallelizable. Data dependencies don't magically go away.

Same thing if you are simulating object database features in a RDBMS.


For example, that pure functional programming is easier to reason about. True, but as soon as you have to simulate state it's more complex than just having state.

I actually disagree here. It's true that the traditional state model of "everything gets everything" makes writing code very easy. Everything can see and modify everything else, so you never have to worry about manually passing state around.

This makes the code easy to write, but very difficult to debug. A function can do something different every time you call it with the same arguments. That is hard to reason about.

There are actually a few layers involved: the programming language, the data types, and the app code. In a non-functional language, you largely ignore the data types and let your app code interact with the language directly. ("var foo = 42; ...; foo++")

In a language like Haskell, it's important to think about your data types. Your state abstraction will live there, and then your app code will not have to worry about the details of passing state around. So in the end, your app code looks the same in either paradigm.

An advantage of this approach is an increase in generality. If you just have some variables hanging around and some "stuff" that uses them, that's all you can ever have. If you interact with state via well-defined interfaces like Monad or Applicative Functor, you can build abstractions that work on all types like this. This has worked very well for Haskell. (An example is the do syntax for monadic computation. Although you have functional purity, your program looks imperative. This, IMO, is the best of both worlds. There is some complexity under the hood, but you get code that's easy to read, write, and reason about.)


I agree with you here 100%. Maybe this is a good pathway to explaining why relational databases are a good idea, in much the same way that a programming strategy built on a backbone of pure functions is. This purity dividend is exactly why the relational model is such a good strategy for database management.

In the same way Haskell lets you separate your pure functional code from your monadic code, giving you referentially transparent reasoning where you can have it and letting you construct imperative machines where you must, the RDBMS approach applies total logic programming to database management. It's frustrating that your code cannot all live in one language as it can in Haskell, but the separation of pure from impure is exactly where the value comes from.

The RDBMS focus on values and total computations (you know queries will halt, which makes them even easier to reason about than non-total functions) allows you to isolate simple logic programs from the rest of your impure, non-total code.

Navigational databases don't give the same benefits because they are not value based nor are they total. They are the imperative-model of the database world, and though they are easy to write, they will end up difficult to debug, maintain, and keep coherent.


I don't think this is what I meant.


I agreed with your overall point, but I was expanding on part of what you were saying:

You seem to appreciate how Haskell allows you to separate your pure code from your non-referentially transparent monadic code. This gives you reasoning where you can have it, and you use monads where you can't.

I'm just saying that the same argument applies to databases: using a relational database allows you to separate your total code from your non-total code. You use the value-based relational strategy where you can have it (necessary and sufficient for database management), and you use functional (or imperative) programming where you can't.


Ok so let's clarify this:

(SQL schema & upgrades + ORM classes + Logic classes, random scripts) - (logic classes + random scripts) = SQL Schema & upgrades + ORM classes

The question then becomes how do you handle the trade-off between being able to declare constraints (that evil schema) and the necessity to "map" how you access your data to your programming language.

If you "program" your constraints in your host language, you will need to either recreate the declarative system provided by a RDBMs or else create a system that will be fragile. Sure your code protects you against inserting bad data right? Well what about updating? What about when you delete? What if your constraint references another value in your database? What if that one changes? Integrity is best declared, not guarded by your program.

If your network model implements declarative constraints, then how do you define them? Those "random scripts" again? Suddenly all those performance benefits of navigation disappear when you're updating that data and checking those constraints.

I've seen many important production systems that used the network data model, and as a warning: it usually ends badly. Be it COBOL, or whatever fresh reinvention of the navigational database wheel.


Integrity is best declared, not guarded by your program.

How do programs that don't use a database stay internally consistent?


With much testing and reinventing of the wheel.


I can definitely relate to the ugliness of ORM, and the elegance of object stores in an OOP environment. That's all fine and good, you can chop off the entire bottom of the stack, I get it.

My problem is data modeling with objects. I build my object architecture based on what I want to do with the data, but I've never been happy with it as a direct data model. It changes too quickly. Do I put things in an array or a hash? Well, it depends. Persisting these specific structures might be expeditious for the app at hand, but if I need to access the data in some other context then I could see complexity spiraling out of control very quickly.

A relational schema adds layers, yes, but having low-level structured data adds tremendous value. There's an impedance mismatch to be sure, but I'd rather do those sit-ups than have no structured data at all.


Edit: One other thing, representing polymorphic data in a relational database is a nightmare.

Edit2: Wait, how did this end up in a separate comment? I am confused :)


I've looked at AllegroCache before, and really liked it. Can you briefly discuss how a graph database of the type you describe (AllegroCache or KiokuDB) differs from the object-oriented databases which have not exactly taken the world by storm?

Edited: I also just Googled around a bit and found something called Neo4j. Do you know anything about it?


Can you briefly discuss how a graph database of the type you describe (AllegroCache or KiokuDB) differs from the object-oriented databases which have not exactly taken the world by storm?

Basically, object databases and graph databases are conceptually the same. I have only used graph databases to store object graphs. Instances are nodes, and "has a" relationships are edges.

I don't know why this hasn't caught on, as it's made my life significantly easier (by making my software simpler and easier to test). My guess is that there hasn't been a good implementation in a popular language. Perl and Lisp are widespread, but nobody is going to start using either just because there is a nice object database for them.

(Neo4j looks nice, and I don't know why it's not more popular. Perhaps it needs a good object-storage frontend, and this is hard to write since Java's MOP is pretty limited. Perl and Lisp both have excellent MOPs (Moose and CLOS), which makes it easy to write good object databases. With Java, you will have to build your own MOP, which is hard. Same goes for C++ and C#, which are the other "popular" languages.)
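As a rough illustration of why the MOP matters here (Moose's meta API is real; the Point class and dump_fields helper are made up for the example): generic storage code can walk any object's attributes through the MOP, with no per-class serialization code.

    package Point;
    use Moose;
    has x => (is => 'ro', isa => 'Num');
    has y => (is => 'ro', isa => 'Num');

    package main;

    # Generic: knows nothing about Point, only the MOP.
    sub dump_fields {
        my ($obj) = @_;
        return {
            map { ( $_->name => $_->get_value($obj) ) }
                $obj->meta->get_all_attributes
        };
    }

    my $p = Point->new(x => 1, y => 2);
    my $fields = dump_fields($p);   # { x => 1, y => 2 }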


I agree, in principle, that the most dominant, important technology will be graph databases. The reason I think so is my experience working with technologies related to the semantic web.

However, there are reasons one might use a distributed document store to implement a graph store - it's the index that matters. MySQL is not the same thing as InnoDB/Sphinx. Graph DB product X could be built on map/reduce + CouchDB + Lucene (+ some graph index).

The only thing I'm sure of, is that if anything stands out above the rest, Oracle and IBM (and maybe MS) will try to buy it.


With high expectations I clicked a heading entitled "Document based DBs will rule the world" but found someone lifecasting "hello world" on CouchDB.


Total clickbait title. To be fair, the article doesn't make this claim anywhere. Flagged with extreme prejudice.


I have run into serious issues with the document-based approach. Imagine a domain where you have several different types of documents. You still end up doing joins. In the example the author gives, doing a query to get all of the unique tags can be slow. Or, if they are dates, trying to get a max(date) can be tricky. The biggest problem, though, is document size. For a complex domain, there are big trade-offs between keeping the entire graph in one document, which is slow to read, and creating a variety of document types, which brings you back to joining.

A hybrid approach can give you a little of both. For most of the kind of applications I build, I use relational databases, but I dump a denormalized view of the data into Lucene for searching and such and use a caching layer for most reads. This way I get some document type performance on many operations, but still have the normalization and simple transactions of a relational model.


A document based DB is not like a flat file. They do index for fast retrieval.

It is just that the indexes are added based on what queries are made. You can be explicit about which indexes you want maintained, but it is often good enough to just perform a query and have the db figure out how to index to make that query faster.

It's pretty much dynamic typing vs. static typing for databases (plus document entries instead of row/column entries).


Hm, flat files will rule the world? No more messy SQL, just iterate over the file's entries with a for loop - even Java programmers can understand that.

Sorry, but I am not convinced yet.


You may not be convinced yet, nor am I, but your strawman argument makes you look... well, a little silly.


That is the way I understood it. What do you mean by "it creates an index for each view"? I just would like to avoid reinventing the wheel, and relational databases abstract away some of the problems of dealing with large data sets.

Edit: OK, read the CouchDB introduction now. Did not know that the "views" are server side. I have a past as a Java developer, and I have actually come across the "select all, then compute in Java" pattern several times in the real world.


You should phrase your comments in the form of a question rather than passing judgement on things you haven't looked into.


I used to do that but then got negative comments from HN. Anyway, the title of the article was a bit provocative, so I hoped answering in the same spirit would not be taken as offensive. It certainly is inspiring to look into CouchDB.


Yeah, the HN headline was stupid. Sadly stupid titles usually get a lot of up votes. By stupid I mean, intended to incite a knee-jerk reaction.


Same seems true of comments. I think I see Tichy's point; if you ask a question, you are often ignored or mocked.

If, however, you make a (willfully) ignorant assertion, people jump at the chance to tell you just how wrong you are.

Not that I want to advocate this approach, but I understand why someone would use it if, in the end, it generates more useful information.

(I think this was discussed elsewhere on HN, with "troll, don't inquire" as a disappointingly effective means of getting information.)


I wasn't trying to troll, I merely answered in the same tone as the submission title. Let's not blow this out of proportion.


I didn't mean to suggest you were trolling, just that some folks have found that adopting certain troll-like characteristics is (for better or worse) an effective means of getting informative replies.


I am sorry for the headline, but you're right about stupid headlines getting reactions.


Tichy, give it a spin. The ease of execution (and the sheer speed, even over giant amounts of data) is impressive.


What do you mean by impressive "ease of execution"?

How terse the code is? How easy the code is to read? How likely the code is to be correct? How flexible the data model is to new requirements? How simply you can reason about what the code does?


The code is easy to read, and the data model is as flexible as you want it to be, since it's schemaless.

Reasoning about what the code does is simple once you shift your paradigm from relational to document.


Do your research. "document-based" DBs already "ruled the world" right until relational databases were invented, which quickly obsoleted them.

The only advantage of a disk-backed hash table is ease of scalability; this is why they're useful for, hm... the top 0.005% of web sites, who handle thousands of updates per second.


They are also useful for rapid app development or prototyping. I've used CouchDB this way a few times and it works great.


I simply don't believe that this model scales "over giant amounts of data", at least in any sort of real-world usage scenario. Can you back up this claim with benchmarks?


Reading CouchDB documentation now and can't stop reading. It sounds like a lot of fun. Haven't understood yet how to handle the performance problems, but would like to use it just for the fun of it.


I can imagine some benefits, but as for speed of execution I am really doubtful. Your example suggests a "select all" followed by some analysis done in code. I can't imagine this scales well.


It creates an index for each view essentially. It's not anything like a "select all".


We need basic guides for the graph/document database engines, along the lines of "this code in SQL translates to that code."


I work only with relational DBs... but this is great tech!



