Two Reasons You Shouldn't Use MongoDB (ethangunderson.com)
76 points by ethangunderson on Sept 14, 2010 | 26 comments



People should stop looking for silver bullets that are a) "web scale" (intentionally in quotation marks) and b) super secure/durable/consistent/whatever. It's all trade-offs. MongoDB makes sense for some data but not for others. Its weaknesses are its strengths and vice versa. The same goes for SQL databases.

We use MongoDB for storing tons and tons of analytics data, where we don't care if some of it occasionally gets lost in a server crash. The data fits MongoDB really well, and it would have been a nightmare if we were to use an SQL database for this. But for bank transactions we wouldn't even consider MongoDB.

The write lock might be a problem for some people. On the other hand MongoDB supports easy sharding, much easier than with SQL. Sharding allows us to scale horizontally which is a huge plus for our data.


One of the best things to come out of the NoSQL 'movement' is exactly that: no more silver bullets. As much as I like Mongo, I would never blindly recommend it, or any other data store for that matter. It's all about analyzing what your problem space actually needs, and using the best tool to fill that space.

And, yes, the speed improvements to the ruby driver are very much appreciated :)


My problem is that I'm not a data storage expert. Do I now have to specialize in this field to be able to choose the appropriate persistence technology for a given problem? I have yet to see a good explanation of when different technologies are more appropriate, as most of the discussions I see usually devolve into some sort of SQL-NoSQL flame war. I would like a good, fair resource to explain the pros and cons of different persistence technologies more clearly.


The questions you should ask yourself are the following:

* Does it matter if you lose the last 5 seconds' worth of updates? The last 5 minutes? The last day?

If you can lose 5 seconds' worth of updates, a MongoDB replication pair is just fine. If you can lose a day's worth of updates (or can easily reconstruct the database contents from other sources), you can try out pretty much anything without bad repercussions. If you can't lose anything, you're pretty much limited to the most conservative databases (the SQL bunch).

* What's the most obvious unit of data that you're working with?

If you always update single values (or add things to lists/sets), Redis is an excellent choice. If you have fixed-size records, SQL or one of the table-based options (Cassandra, HBase) may be for you. If you have documents with substantial internal structure, a document store (MongoDB, CouchDB, or Lotus Notes if you want something expensive and commercial) would be a good option. (See the sketch after this list.)

* How much data do you have?

If all of your data fits into memory (and for the price of another server, you may well get enough memory to fit all of your data), you can go pretty far with a single server. If it fits on a single set of hard disks, you'd want replication, not sharding, so that the risk of losing data is minimal. If your data is much larger than that, your only hope is a sharding setup - either with SQL+spit+glue, or Cassandra/HBase, or some version of MongoDB where sharding is stable enough for production use (I do remember seeing warnings - so the current version may or may not fit that description).
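
To make the "unit of data" question concrete, here is a rough sketch assuming redis-py and pymongo against local servers (the key, database, and collection names are invented for the example):

    import redis
    from pymongo import MongoClient

    # Single-value updates: Redis increments a counter atomically.
    r = redis.Redis(host="localhost", port=6379)
    r.incr("visits:2010-09-14")  # hypothetical counter key

    # Documents with internal structure: MongoDB stores the nested
    # record as a single unit, no schema declaration required.
    events = MongoClient("localhost", 27017).test.events
    events.insert_one({
        "user": "u123",  # hypothetical document shape
        "actions": [{"page": "/", "ms": 42}],
    })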


Thanks for the response. I will mull that over next time I have a say in the matter of how to store data. I think at some point I will also just need to invest some time to play with a few of the NoSQL choices to get a better feel for them.


I fail to see why analytics data seem to be considered "low quality" data ("we don't care if some stuff occasionally gets lost"). As far as I can tell, most businesses out there are driven by metrics which are derived from analytics data... so I don't agree that "it's OK to lose some".


I use MySQL and Redis for persistence, depending on the type of data. Both get written to on a purchase: MySQL updates the account info, and Redis holds my A/B testing stats, which just had several tests score points. If the MySQL write fails, I have a CS emergency because my customer can't get what she paid for, and I probably just ruined a lesson plan for tomorrow. If one of the Redis writes fails, my A/B test results, which I won't look at for a week anyhow, shift in a way that almost certainly doesn't alter my final decision.
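
In code, that split looks roughly like the sketch below (assuming pymysql and redis-py; the table, column, and key names are invented):

    import logging
    import pymysql
    import redis

    db = pymysql.connect(host="localhost", user="app",
                         password="secret", database="shop")
    r = redis.Redis()

    def record_purchase(account_id, variant):
        # The MySQL write must succeed: the customer paid for this,
        # so let any failure propagate and page someone.
        with db.cursor() as cur:
            cur.execute("UPDATE accounts SET plan = 'paid' WHERE id = %s",
                        (account_id,))
        db.commit()

        # The Redis write is best-effort: losing one A/B data point
        # almost certainly doesn't change the final decision.
        try:
            r.incr("ab:%s:conversions" % variant)
        except redis.RedisError:
            logging.warning("A/B stat write failed; ignoring")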

It is absolutely OK to lose analytics data occasionally, and indeed, given the variety of ways to bork that (JS is off, the user agent prefetches an undisplayed page, bot action, etc.), if your stats aren't robust against it you are screwed anyhow.


If they use it for statistical analysis then the sample size decreases a bit, but likely not in a significant way. If they lost all their data, that would be a bit different.

For instance, I routinely delete web server logs older than 30 days, on the assumption that if I didn't need them in the last 30 days I'll probably never need them. Every now and then this bites me and I need more than 30 days of data to test some assumption (for events that occur infrequently enough); then I just have to wait for a bit.


What jacquesm said. It depends on the kind of analytics data; our data is not as important as someone else's analytics data might be. For us it's more important that writes are reasonably fast, that we can store the data in arbitrary structures in a schemaless way, that we can define arbitrary search indexes, and that the data can be spread horizontally across multiple servers.

Website visit counters are a great example. I don't think many people care if a few visits get lost once in a while.


Unrelated comment: thanks a bunch for the speed optimisations pushed into the ruby mongo driver :)


some humor to validate your point :) http://www.xtranormal.com/watch/6995033/


On the topic of the global write lock, you can run mongostat and see what the write lock percentage is at any given time. How much throughput you can squeeze out of a server is partially dependent on your sync delay setting.

If your disk is saturated during a sync due to a lot of writes, you might see the write lock at 100% for several seconds and everything will momentarily grind to a halt. If you set sync delay to zero and just rely on the OS to sync, you'll never see this happen.
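
mongostat is reading the same counters that the serverStatus command exposes, so you can also poll them from a driver. A minimal sketch with pymongo (the exact field layout varies by server version):

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    status = client.admin.command("serverStatus")

    # Older servers exposed a global lock ratio directly; newer ones
    # break the lock statistics down further.
    print(status.get("globalLock"))

(The sync delay itself is the --syncdelay option to mongod.)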

We're using MongoDB in production and have a few boxes set up. One runs without sync delay and is used to process feed data. If Mongo crashed and the data was irrevocably lost, it wouldn't really be the end of the world, as the data only hangs around for about ten days anyway.

The second box runs with a sync delay and is used to store session data. In either scenario, Mongo is designed to be run with a slave at all times since it's not single-server durable.

Regarding performance, the feeds box without sync delay can consistently write 25,000 records per second without ever going above 75% write lock. I find that pretty amazing. Doing preliminary benchmarks and performing batch writes that completely lock the server, I was getting about 250,000 writes / second.
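
Numbers in that range generally mean batching the writes and/or not waiting for acknowledgement. A rough sketch of the fire-and-forget mode with pymongo (w=0 means unacknowledged; the collection name is invented):

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    coll = MongoClient().test.feeds

    # Unacknowledged, batched writes: the client doesn't wait for the
    # server, which is roughly what benchmarks like this measure.
    fast = coll.with_options(write_concern=WriteConcern(w=0))
    fast.insert_many([{"n": i} for i in range(1000)])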


Good points. I can think of a few companies that appear to be using Mongo because it's 'cool' and don't seem to understand the tradeoffs. For the applications I have in mind, there is absolutely no reason they couldn't just use Postgres, other than that they want to believe they are 'cutting edge'.


I'd argue that there are more people out there who could be benefiting from using Mongo but won't because SQL databases are "standard".


I agree. I have a few applications which I suspect could benefit from a non traditional data store. The reason I haven't used one yet isn't that, though, it's simply that I am much more experienced with installing, tuning, maintaining and programming for relational databases.


The idea should be that you are picking the right tool for the job. Just because you could use Postgres doesn't mean that another data store wouldn't be better suited to your problem space. It could be Mongo, it could be Cassandra, it very well could be Postgres, but you should be doing some up-front analysis and research before you make that decision.


Sure. My point is that as far as I can tell, Postgres would be better suited to the application, but they want to use Mongo because it's new, regardless of the fact that data integrity, not sheer performance, is more important for this application.


We're running MongoDB in an extremely write-heavy environment (web crawling). Another solution for the write issues is to split out a single write connection from all of the other read connections; MongoDB gives reads execution priority over writes (in principle; in practice, and in my experience, it's pretty close).
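
The split itself is nothing exotic; a sketch with pymongo (database and collection names invented):

    from pymongo import MongoClient

    # Two separate connection pools: one dedicated to writes, one for
    # reads, so reads don't queue behind the write connection.
    write_client = MongoClient("localhost", 27017)
    read_client = MongoClient("localhost", 27017)

    write_client.crawl.pages.insert_one({"url": "http://example.com"})
    read_client.crawl.pages.find_one({"url": "http://example.com"})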

Again, this debate comes down to how you structure your data: picking the best model for it and then figuring out how/if you can deal with its idiosyncrasies. The single-server durability issue has been beaten to death, and for any production application it should be planned for from the beginning regardless of the database.


Interesting, I didn't know that priority was given to reads (in theory, anyway). Thanks for the tip!

As for server durability, yes, it has been beaten to death. Yet I still talk to developers who either have no idea that's how Mongo operates, or don't know what safe mode has to offer.
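
For anyone who hasn't seen it: safe mode just means the driver waits for the server to acknowledge each write instead of firing and forgetting. In today's pymongo it is spelled as a write concern; a sketch (collection name invented):

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    db = MongoClient().test

    # w=1 waits for the primary to acknowledge the write (the old
    # "safe mode"); j=True also waits for the journal on servers
    # that support journaling.
    safe = db.get_collection("orders",
                             write_concern=WriteConcern(w=1, j=True))
    safe.insert_one({"sku": "abc", "qty": 1})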


Well, the MongoDB docs make the same points really clear. Run master slave, or replica set. Do reads from the slaves, writes to the master. If you do this, that about covers it. In my blog, I call this MongoDB "good enough practices."
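
With a modern driver those good-enough practices look roughly like this sketch (the hostnames and replica set name are invented):

    from pymongo import MongoClient, ReadPreference

    client = MongoClient(
        "mongodb://db1.example.com,db2.example.com,db3.example.com"
        "/?replicaSet=rs0")

    # Writes always go to the primary; reads can be routed to
    # secondaries to take load off the master.
    pages = client.crawl.pages.with_options(
        read_preference=ReadPreference.SECONDARY_PREFERRED)
    pages.find_one({"url": "http://example.com"})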


If you plan to have no redundancy, you probably aren't building anything valuable enough to care about losing some data anyways. The extra hours of my life I get back thanks to the productivity of using Mongo over something like MySQL are worth it x 100. Not a big fan of the single-server durability argument.


I agree; I don't get all the fuss about "single server durability." Single servers fail all the time. After many hard lessons learned, I just assume that if it's only on a single server it's as good as gone. Bad RAID controllers, etc.

If you're really worried about durability, you should be replicating anyway--ideally across 3+ nodes in at least two datacenters. I've always just assumed MongoDB was designed around that idea.


At the risk of sounding both lazy and naive: if I want to utilize replica sets, do I have to separate each MongoDB instance onto a separate hardware box? Put another way, if I want to deploy a durable MongoDB installation, do I need three or four EC2 instances (for example) running at the same time, all the time?


Yes, you do. And not only in production: if your dev box running Mongo is rebooted unexpectedly, there goes your data. You'll have to do a db.repairDatabase() or restore completely, which can take hours if you have a lot of data.
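
For what it's worth, once the separate mongod processes are running (started with --replSet rs0), the set can be initiated from any driver. A sketch with pymongo, hostnames invented:

    from pymongo import MongoClient

    # Assumes three mongod processes already running with --replSet rs0;
    # connect directly to one member and initiate the set from there.
    client = MongoClient("db1.example.com", 27017)
    client.admin.command("replSetInitiate", {
        "_id": "rs0",
        "members": [
            {"_id": 0, "host": "db1.example.com:27017"},
            {"_id": 1, "host": "db2.example.com:27017"},
            {"_id": 2, "host": "db3.example.com:27017"},
        ],
    })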


Bonus points for spelling "kernel" wrong.


Thanks for pointing that out, it's been fixed.



