Really, the #1 reason to use MongoDB (if you're me, anyway) is to save development time associated with making your relational schema start small and change as your new app progresses. I feel a smug sense of joy every time I add a field somewhere, or delete another, or create some kind of nested document. It's taken me a while to really understand how many compromises I used to make because changing schemas is a pain in the ass.
Simplified queries, though, are a knock against mongo. Joins are great and I would like to do joins on my Mongo documents, but I end up having to replicate a lot of that in code. Sure it's nice that a document can be more complex and you don't spend a lot of time moving things into tables that are really part of the same record. It's nice because it's not forced, though, not because keeping data in different tables is always the wrong way to do things.
I use PostgreSQL and change my schema constantly, adding/removing columns, changing data types, switching around foreign key constraints, all within safely guarded transactions that I can roll back if I realize that I'm doing something wrong. (And yes: adding/removing/renaming columns is "instant": it doesn't actually do the work of rewriting existing rows on disk.)
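For anyone who hasn't tried it, a minimal sketch of that workflow (the table and column names are invented for illustration):

BEGIN;
ALTER TABLE users ADD COLUMN nickname text;
ALTER TABLE users RENAME COLUMN score TO rating;
ALTER TABLE users DROP COLUMN legacy_flags;
-- changed your mind? nothing above has been committed yet:
ROLLBACK;
-- (re-run the block and end with COMMIT; once it looks right)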
Frankly, I also use MongoDB, and I'm terrified of screwing much with the schema, because then I either a) have to make certain I have anal documentation about what fields are in use on what subsets of objects, keeping code around to make certain to detect and interpret old kinds of data, or b) use "simplified queries" (really, write a bunch of manual code as if I didn't have a query model at all) in order to find and update these old objects, non-atomically and with no transaction safety.
Seriously: the only reason I've so far heard for why having dynamic type verification in my database server is valuable is when you have /so much data/ that it is now fundamentally infeasible to make changes to it with any centralized transaction control--specifically, Google's "it's always the interim somewhere" scenario--not because it is somehow more convenient to do so when you only have tens of millions of rows.
Three warnings when using mongodb. None of these are enough to say not to use it, but they're things you need to watch out for:
1. Don't run javascript on a production db node. db.eval locks the node it's running on until it finishes, so the performance of that node will go down the tubes. Mapreduce is less bad in this regard because it does yield, but it does so too infrequently. If you want to use mongo's built-in javascript interpreter for anything other than development and administration, set up a slave to run your scripts on.
2. Don't use 1.6.1. If you're using 1.6.1 right now, upgrade to 1.6.2. 1.6.1 has a nasty crashing bug that had my mongo node going down about once a day and not coming up without running --repair.
3. Evaluate how much data loss costs you. Mongo stages writes in memory, and so if the db crashes hard it's likely that there will be some data that hasn't made it to disk yet. If you're building a social network, the cost of some potential data loss is probably much less than the savings in hardware, admin costs, development costs, etc. But if you're a payment processor or a gambling site, stick with postgres.
> If you're building a social network, the cost of some potential data loss is probably much less than the savings in hardware, admin costs, development costs, etc.
This also depends on what is being stored. I would be unhappy if Facebook lost any of my data, and my understanding is that they use safe storage mechanisms (ones where the commit goes all the way to disk before returning) for everything except transient views like the news feed and search. Also, I don't think it's clear that MongoDB has significantly improved either admin or development costs over its safe competition, so we probably only need to look at performance wins.
#3: You can force flush any write that needs to go straight to disk. Insert and update commands (at least in JS and Ruby) take an argument that lets you specify "safe_mode: true".
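If I understand the drivers right, the safe/fsync flags boil down to a getLastError call issued right after the write; in the mongo shell the same idea looks roughly like this (collection and field names are made up):

db.payments.insert({ user_id: 42, amount_cents: 1999 });
// block until the write has been flushed to disk, not just accepted in memory
db.runCommand({ getLastError: 1, fsync: true });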
re: potential data loss: I typically run one master and one read-only slave on another server. Replication of writes from the master to the read-only slave is very fast. Sure, there is a window for data loss, but for crucial data use a relational database. Also, it is great to be able to really load down the read-only slave for analytics, etc.
4) For web apps at least, it's nice to get data back in JSON format that you can give to the browser immediately without having to build your own JSON object each time. MongoDB saves you development time at every level. Add to that the fact that it is easy to scale horizontally.
I'm using mongodb right now and the synergy between jquery - node.js - mongodb is simply amazing.
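A rough sketch of what that ends up looking like, assuming the node-mongodb-native driver of that era (the database, collection, and field names are all invented):

var http = require('http');
var mongo = require('mongodb');

var db = new mongo.Db('app', new mongo.Server('localhost', 27017, {}), {});
db.open(function(err, db) {
  db.collection('events', function(err, events) {
    http.createServer(function(req, res) {
      // the documents come back as plain objects, so they can be handed
      // to the browser as-is, with no ORM-to-JSON translation step
      events.find({ published: true }).toArray(function(err, docs) {
        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify(docs));
      });
    }).listen(8000);
  });
});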
I like the idea of Mongo and have read about it but my relational mind can not be unwrapped. Maybe I am trying to complicate things too much. In that example if you have multiple people attending one event and you want to update the event name, do you have to go through all the people and all their events and update each one? Seems like that would be a lot of work, unless there is a way to query for that type of thing? And if there is a query, is it efficient? Seems like behind the scenes mongo would just be iterating through the users, but I'm sure there is more to it than that.
In this case, the events in the user document are events that the user is HOSTING.
There would be another object in our data model to represent the people that got invites to the event. These 'recipients' could be another collection or a list embedded inside each event object.
So then similarly if I wanted to see all the events that I was invited to would I have to loop through all users, all their events, and all of their invitees to see if I was one?
You're going to a depth that is testing the limits of my example :) You wouldn't have to do that. You can have deep indexes, and query on those attributes. So if events had invites embedded, and invites all had email addresses for their recipients:
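you could do something along these lines (the field names are just a guess at what the example would use):

db.users.ensureIndex({ "events.invites.email": 1 })
db.users.find({ "events.invites.email": "you@example.com" })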
and that would return all the users that had invited you to an event.
Now, like I said you're testing my example. When these kinds of requirements are taken into account, you'd probably want to have a separate collection for events. Then you could do:
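something like the following (again, the names are illustrative):

db.events.find({ "invites.email": "you@example.com" })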
The reason you don't keep files in your database is that file systems are much better at handling files. Faster, more efficient, basically all the reasons that a single-purpose layer tends to be faster than a general-purpose layer.
Databases are much better at handling discrete data than file systems - that's what they are built for. Sure, I could keep my data in a bunch of little files, but that doesn't work as well.
(MS SQL has a feature where you "store" the file in the database, but the db writes the file to the filesystem, and just maintains a pointer to the actual file - not a bad hybrid)
I don't know how well GridFS stacks up (it is on my todo list), although I do like the idea of replication and sharding being built in. My gut (which has been wrong before) says that it is good for websites, not so good for general storage.
I use MongoDB for the same reason as mrkurt: prototyping new schemas is a breeze. I still find myself reaching for the old RDBMS toolbox as things move along, grow, and stabilize. Sometimes, a JOIN _is_ the right tool for the job.
To provide some balance: the reasons that many people /do/ keep files in their database are 1) to extend transactions around the storage of those files, and 2) to allow a single backup or replication infrastructure (such as write-ahead log shipping) to handle all forms of data that are being managed.
What about batch processing a large number of small files? Say 10 million image files of 500KB each. A typical file system will need a seek for each small file.
I wonder if GridFS stores small files in blocks to allow efficient batch retrieval for processing.
GridFS is just a standard convention of how to map files to key-value stores like MongoDB -- you can implement GridFS over MongoDB in just a few lines of Ruby code. GridFS breaks files into fixed-size chunks, and uses a single MongoDB document per chunk. It's not exactly rocket science.
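To illustrate, here is a toy version of the convention in the mongo shell. The fs.files / fs.chunks collections and the files_id / n / data fields come from the GridFS spec; the stand-in string is obviously not a real file:

var data = "...imagine the file's bytes here...";
var chunkSize = 256 * 1024;                      // the default GridFS chunk size
var fileId = ObjectId();
db.fs.files.insert({ _id: fileId, filename: "photo.jpg",
                     length: data.length, chunkSize: chunkSize });
for (var n = 0; n * chunkSize < data.length; n++) {
  db.fs.chunks.insert({ files_id: fileId, n: n,
                        data: data.substr(n * chunkSize, chunkSize) });
}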
The author of the blog post touts it as a _feature_ of MongoDB, but it's more accurate to say that it's an artifact of MongoDB's 4MB document size limit -- you simply cannot store large files in MongoDB without breaking them up. Sure, by splitting files into chunks you can parallelize loading them, but that's about the only advantage.
Among the key-value NoSQL databases, Cassandra and Riak are much better at storing large chunks of data -- neither has a specific limit on the size of objects. I have used both successfully to store assets such as JPEGs, and they are both extremely fast both on reads and on writes.
Neither is built for that purpose, and will load an entire object into memory instead of streaming it, so if you have lots of concurrent queries you will simply run out of memory at some point -- 10 clients each loading a 10MB image at the same time will have the database peak at 100MB at that moment.
Actually, Riak uses dangerously large amounts of memory when just saving a number of large files. I don't know if that's because of Erlang's garbage collector lagging behind, or what; I would be worried about swapping or running out of memory when running it in a production system.
You actually list one of the advantages of GridFS right there in your post: streaming. If you are serving up a 700MB video, you don't want to have to load the whole thing into memory or push the whole thing to the app server before you can start streaming. Since we break the files into chunks, you can start sending data as soon as the first chunk (256k by default) is loaded, and only need to have a little bit in ram at any given moment. (Although obviously the more you have in ram, the faster you will be able to serve files)
GridFS is simple (and probably could be implemented with most DBs) but it was designed to have some nice properties. Notably, you don't have to load the whole file into memory: you can stream it back to the client. You can also easily get random access to the file, not just sequential.
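To make the streaming point concrete, a sketch in mongo shell terms; sendToClient is a hypothetical stand-in for whatever writes to your response stream:

var fileId = db.fs.files.findOne({ filename: "movie.mp4" })._id;
db.fs.chunks.find({ files_id: fileId }).sort({ n: 1 }).forEach(function(chunk) {
  sendToClient(chunk.data);   // hypothetical helper; only one ~256k chunk is in memory at a time
});
// random access is just a narrower chunk query, e.g. { files_id: fileId, n: { $gte: 40 } }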
"The reason you don't keep files in your database is that file systems are much better at handling files. Faster, more efficient, basically all the reasons that a single-purpose layer tends to be faster than a general-purpose layer."
To my (limited, I admit) knowledge of databases every one of these reasons was a reason to use CouchDB as well.
When the author got to the real arguments, he kept comparing MongoDB to SQL databases and the jab at CouchDB (and the other non-relational databases) seemed without merit to me.
I'm sure there are good reasons to use mongo over couch but I don't think they're the ones listed here.
MongoDB is elegant and remarkably powerful. Every developer should run through the excellent getting started tutorial they have for it, as it really is eye opening.
However I have to respectfully disagree on the "Simple queries" bit. The SQL example given is kind of terrible, but how about:
SELECT * FROM users WHERE id IN (SELECT user_id FROM events WHERE published_at IS NOT NULL)
or
SELECT * FROM users WHERE EXISTS (SELECT 1 FROM events WHERE published_at IS NOT NULL AND user_id = users.id)
(Never use group by as a surrogate for IN/EXISTS. It forces the server to do a lot of unnecessary work)
Is that really unintuitive? Perhaps it's just acclimation, but I find those incredibly easy to grok, with the MongoDB example being a variant of the same thing.
I'm no SQL guru, but won't that subselect break down when users and events grow to the millions?
I probably should have thought out the example a little more. We don't ever actually write that kind of query against our production database @Punchbowl. We have a data warehouse pull out high level stats every night, and we query that.
WRT aggregations, you're right -- they do require a bit of acclimation. Once you write a few, though, you're good to go.
I have a natural aversion to "SELECT *" (it is almost always a bad usage), so while any decent database system will parse it to yield the same query plan, I still use SELECT 1.
> I'm no SQL guru, but won't that subselect break down when users and events grow to the millions?
Databases like PostgreSQL are excellent at performing joins -- which is all this subselect really is, namely joining two relations -- even when the datasets are quite large.
But this particular MongoDB query comparison is pretty worthless, since it's simply giving an example of denormalization, a concept which is equally applicable to relational databases -- the main difference being that with MongoDB, you hardly have a choice in the matter, since joins don't exist.
Don't get me wrong, I love MongoDB, but there are much better reasons to use MongoDB, such as the fact that every document is a flexible data structure, not a strict collection of columns. You can add keys and values as you choose, and store them as arrays or sub-documents depending on the encapsulation you need, etc.
So generally you will have an easier time working with data and being impulsive about it, than the square-hole-fitting-only-square-pegs model of relational databases, which require more planning and schema design, which in turn tends to squeeze all the fun out of working with databases.
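For instance (names invented), bolting a sub-document and an array onto an existing record is a single statement, with no migration step:

db.users.update({ email: "you@example.com" },
                { $set: { address: { city: "Boston", zip: "02110" },
                          tags: ["beta", "early-adopter"] } })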
There are pros and cons to both approaches, of course. MongoDB is not as mature as modern relational databases, by far. On the other hand, it has a nice feature which nobody apparently mentions:
With MongoDB, the old relational theorist's pet peeve about the meaning of null values becomes moot, because in MongoDB a missing value is simply a value that is not there, i.e., its key is simply not present in the document. That's much better than null values!
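And you can still ask about absence explicitly when you need to, e.g. (invented field name):

db.users.find({ phone: { $exists: false } })   // documents with no "phone" key at all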
Another advantage is the ability to work with heterogeneous collections of data without having to jump through too many hoops. For example, you can have a collection (table) called "publications". In this table you can store different kinds of publications: books, magazines, comics, newspapers and so on. Each type of publication may have some common fields, but many have type-specific fields -- hence, heterogeneous data.
A relational database designer will tell you that in the relational world, you would normalize: a central "publications" table with all the common columns, and then tables "books", "magazines", etc., each with their type-specific columns and a foreign-key reference back to the "publications" table. Fine. But think of all the joins you will need just to list and query this stuff; if you have only the publication ID, you have to go through all the tables to determine what type of publication it is. And it's not just the performance aspect: the relational model is quite different from how people _think_ about data. MongoDB is easier on the brain, that way.
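As a toy illustration (every name and value here is made up):

db.publications.insert({ type: "book",     title: "Some Novel",  isbn: "0-0000-0000-0" })
db.publications.insert({ type: "magazine", title: "Some Weekly", issue: 42 })
db.publications.findOne({ title: "Some Weekly" })   // one lookup, no joins; the document's own fields tell you it's a magazine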
> Don't get me wrong, I love MongoDB, but there are much better reasons to use MongoDB, such as the fact that every document is a flexible data structure, not a strict collection of columns. You can add keys and values as you choose, and store them as arrays or sub-documents depending on the encapsulation you need, etc.
Yup. I dropped the ball on this one. Should have a list of 4 reasons. :)
>I'm no SQL guru, but won't that subselect break down when users and events grow to the millions?
A moderately decent database server would have no issues in such a case; that said, yes, optimization would be case-specific, just as it would be with MongoDB. I was simply comparing the readability of the SQL for the example that you gave.