Riak and all the Dynamo-style databases are really distributed key/value stores. I've never used Riak in production, but I have no reason to doubt that it's a very good, highly scalable distributed key/value store.
The difference between something like Riak and Mongo is that Mongo tries to solve a more generic problem. A couple of key points. One is consistency: Mongo is fully consistent, while all Dynamo implementations are eventually consistent, and for a lot of developers and a lot of applications, eventual consistency just is not an option. So I think for the default data store for a web site, you need something that's fully consistent.
The other major difference is the data model, query-ability, and being able to manipulate data. For example, with Mongo you can index on any fields you want, you can have compound indexes, you can sort; all the same types of queries you do with a relational database work with Mongo. In addition, you can update individual fields, increment counters, and do a lot of the same kinds of update operations you would do with a relational database. It maps much closer to a relational database than to a key/value store. Key/value stores are great if you've got billions of keys you need to store; they'll work very well. But if you need to replace a relational database with something that's pretty feature-comparable, they're not designed to do that.
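As a concrete sketch in the mongo shell (the `posts` collection and its field names are made up for illustration):

```javascript
// Compound index on two fields, then a filtered, sorted query:
// the same shape you'd write against a relational table.
db.posts.ensureIndex({ author: 1, created_at: -1 });
db.posts.find({ author: "alice" }).sort({ created_at: -1 });

// In-place update of individual fields, including counter increments.
db.posts.update({ _id: 123 }, { $set: { title: "v2" }, $inc: { views: 1 } });
```

(This assumes a running mongod; `ensureIndex` is the shell helper from this era.)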
Can you please explain this for a case where there are multiple replica sets, the database is sharded and nodes are across data centers? What's sacrificed? Something must be.
When we talk about consistency, we're talking about taking the database from one consistent state to another.
With replica sets, we're still only dealing with one master. We can get inconsistent reads from the replicas, but we're always writing to a single master, which allows that master to determine the integrity of a write.
With sharding, we're still only dealing with one canonical home for a specific key (defined by the shard key). (Besides latency, I'm not sure how data centers would affect this.)
What we're giving up in this case is availability. If an entire replica set goes down, we can't read or write any data for the key ranges contained on those machines. This is where Riak shines.
With Riak, any node can accept writes, and nodes contain copies of several other nodes' data. What that means is, as long as we have one node up, we can write to the database. Because of this, there is the possibility of nodes having different views of the data. This is handled in a number of ways (read repairs, vector clocks, etc.). Check out the Amazon Dynamo paper for more info; it's a great read.
I'm sure I'm missing some stuff, but I think that covers the gist of it.
EDIT: One thing that I want to make clear, I don't think that one architecture is better than the other. They each have their own pros and cons, and are really suited to solve different problems.
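The vector-clock reconciliation mentioned above can be sketched in a few lines of JavaScript. This is a deliberate simplification of the scheme in the Dynamo paper, not Riak's actual implementation:

```javascript
// Returns true if clock `a` has seen everything clock `b` has seen.
function descends(a, b) {
  return Object.keys(b).every(node => (a[node] || 0) >= b[node]);
}

// Element-wise max: the clock after reconciling two sibling versions.
function merge(a, b) {
  const out = { ...a };
  for (const node of Object.keys(b)) {
    out[node] = Math.max(out[node] || 0, b[node]);
  }
  return out;
}

const v1 = { A: 2, B: 1 };       // write accepted by node A, then B
const v2 = { A: 1, B: 1, C: 1 }; // concurrent write routed through C

// Neither clock descends from the other, so these are conflicting
// siblings that read repair (or the application) must resolve.
console.log(descends(v1, v2), descends(v2, v1)); // false false
console.log(merge(v1, v2));                      // { A: 2, B: 1, C: 1 }
```

Once a resolved value is written back with the merged clock, it descends from both siblings and the conflict disappears.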
None of this is guaranteed by default. By default, writes are flushed every 60 seconds. By default, there's no journaling. How can one claim full consistency if the former two points are true?
Don't get me wrong, I love Mongo. I'm building a web app backed by it. But the marketing talk is grating, which is what this post nails.
I think those two issues are orthogonal to consistency. In ACID, consistency and durability are two different letters and CAP doesn't even mention durability. Are you referring to another definition of consistency?
How is flushing a write every 60 seconds orthogonal to consistency? If there's a server crash between the write to RAM and the subsequent flush, the data is lost, is it not? How do you guarantee the data is there in that case?
That would mean the data set was not durable; it doesn't speak to consistency at all. DB consistency is about transaction ordering: transaction 1 always comes before transaction 2, but 2 may exist or not as it pleases. Transaction 1 must be present if 2 is present.
If a slave can continue serving reads whilst partitioned from a master that continues to accept writes then you cannot guarantee consistency. If a slave cannot serve reads when partitioned then you aren't available. If a master cannot accept writes when partitioned then you aren't available. See this excellent post from Coda Hale on why it is meaningless to claim a system is partition tolerant http://codahale.com/you-cant-sacrifice-partition-tolerance/.
I interpreted "what is sacrificed?" as asking which letter of CAP MongoDB was giving up. Coda's article actually explains exactly the tradeoffs MongoDB makes for CP:
-------------------
Choosing Consistency Over Availability
If a system chooses to provide Consistency over Availability in the presence of partitions (again, read: failures), it will preserve the guarantees of its atomic reads and writes by refusing to respond to some requests. It may decide to shut down entirely (like the clients of a single-node data store), refuse writes (like Two-Phase Commit), or only respond to reads and writes for pieces of data whose "master" node is inside the partition component (like Membase).
This is perfectly reasonable. There are plenty of things (atomic counters, for one) which are made much easier (or even possible) by strongly consistent systems. They are a perfectly valid type of tool for satisfying a particular set of business requirements.
In a replica set configuration, all reads and writes are routed to the master by default. In this scenario, consistency is guaranteed. (You can optionally mark reads as "slaveOk", but then you admit inconsistency.)
This does sacrifice availability (in the CAP sense), but I haven't heard anyone claim otherwise.
"In a replica set configuration, all reads and writes are routed to the master by default. In this scenario, consistency is guaranteed."
One would hope that reading and writing a single node database was consistent. This is table stakes for something calling itself a persistent store. Claiming partition tolerance in the above is the same as claiming availability. The former claim has been made. Rest left as exercise for the reader.
If a slave is partitioned from its master, it won't be able to serve requests. (Unless the request is a read query marked as "slaveOk", in which case you admit inconsistency.) I highly doubt anyone would claim otherwise.
The implication is that the people for whom eventual consistency is not an option will never reach a data set size or availability requirement that'll require them to use replication and experience the lag (and eventual consistency) involved.
Among the major features touted are auto-sharding and replica sets. I don't know if the implication is that it's only for web apps/websites that won't need those.
In the sharded case, at any given moment each object will still live on exactly one replica set, which will have at most one master. You can do operations (such as findAndModify http://bit.ly/ilomQo) that require a "current" version of an object because all writes are always sent to the master for that object. You can also choose to accept a weaker form of consistency for some reads by directing them to slaves for performance. This decision can be made per-operation from most languages.
As for trade-offs: Relative to a relational db, there is no way to guarantee a consistent view of multiple objects because they could live on different servers which disagree about when "now" is. Relative to an eventually consistent system, you are unable to do writes if you can't contact the master or a majority of nodes are down.
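A mongo-shell sketch of that per-operation choice (the `counters` collection and its fields are made up; this assumes a running deployment):

```javascript
// Goes to the master for this object's shard, so the returned document
// is the atomically updated "current" value.
db.counters.findAndModify({
  query: { _id: "pageviews" },
  update: { $inc: { n: 1 } },
  new: true  // return the post-update document
});

// Opt into weaker consistency for a read by letting slaves serve it.
db.getMongo().setSlaveOk();
db.counters.find({ _id: "pageviews" });  // possibly stale, but offloads the master
```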
(I work on MongoDB, but trying to give a balanced opinion.)
Redis is a great key-value store, MongoDB is more of a fully-featured database. Redis has some nice set operations and is pretty easy to learn (all of the commands are here: http://redis.io/commands). MongoDB is also pretty easy to learn (click the "Try it out" button at http://mongodb.org/), but there are a lot of advanced features to learn about.
So, if you need a key-value store, Redis is a great choice. If you want to do something more complex, MongoDB would probably work better.
Journaling should give you the crash-safety you're looking for. You should combine it with safe writes to get the commit safety you want (not the default, but is easy to choose, see your driver's documentation).
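In mongo-shell terms (MongoDB 1.8+, where journaling is available; the collection name is illustrative), a "safe" write looks roughly like this:

```javascript
db.orders.insert({ _id: 1, total: 9.99 });
// Block until the write has hit the on-disk journal (j: true);
// on a replica set you could instead wait for replication with w: 2.
db.runCommand({ getLastError: 1, j: true });
```

Most drivers wrap this in a "safe mode" option so you don't have to issue getLastError by hand.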
A lot of people like it because it makes development faster. It's like the scripting language of databases: you can get stuff out the door really fast (with the obvious power/responsibility caveats).
This was the main reason why I picked it for a project. It's just sick how much code you DON'T have to write to use this database. I literally have one class called GenericCRUD with five functions that handle all the database operations for all my models. Plus, not having to worry about stored procedures anymore is a heavy weight lifted off your shoulders.
Excellent point. MongoDB doesn't claim to be any faster than other dbs at the simple stuff (in fact, it's often slower because we haven't had years to optimize everything). The speed gains that people usually see are because they can just fetch one document, instead of doing complex joins or aggregations.
Indeed. A lot of the reason I like document databases so much is that you can solve problems differently (and many times, much more easily) than you could in a relational database. Compare:
db.posts.find({tags: {$in: ["foo", "bar"]}})
to:
SELECT * FROM posts JOIN taggings ON taggings.post_id = posts.id JOIN tags ON tags.id = taggings.tag_id WHERE tags.tag IN ('foo', 'bar');
(Single query tag lookup; naively joins the entire taggings and tags tables before limiting with WHERE)
Or a "better" query with two subselects (yikes!)
SELECT * FROM posts WHERE posts.id IN (SELECT taggings.post_id FROM taggings WHERE taggings.tag_id IN (SELECT id FROM tags WHERE tags.name IN ('foo', 'bar')))
And that's the "find where any tag matches" case. Try the "when all tags match" case ($all in MongoDB), and you'll go grey a few years earlier.
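For reference, the MongoDB side of the "all tags match" case stays a one-liner (the `posts` collection is illustrative):

```javascript
// Matches posts whose tags array contains BOTH "foo" and "bar".
db.posts.find({ tags: { $all: ["foo", "bar"] } });
```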
> (Single query tag lookup; naively joins the entire taggings and tags tables before limiting with WHERE)
Really? Which modern RDBMS database actually works like this when indexes are available?
Index selection and optimization is a standard feature of even the simplest relational database system. JOINs on keys with high selectivity (especially unique keys) are extremely efficient in modern database systems - in some common cases just as efficient as pulling two columns out of the same table. Subselects are something that several database systems (MySQL, for example) really suck at - but they are easily avoided in most schemas.
> Try the "when all tags match" case ($all in MongoDB), and you'll go grey a few years earlier.
Depending on your database engine, even very large AND chains can be extremely efficiently computed on an indexed column. For larger use-cases, there are good ways to avoid this problem using careful schema design.
There are legitimate reasons why relational databases aren't the solution to every database problem. Claiming issues which clearly do not exist is not a good way to get points for the other side.
They get strangled by their DBA before they're finished with it.
And: the point of document stores is that you can denormalize (i.e. put lots of stuff into one data item) and _still_ use indices into the lots of stuff. Most commercial RDBMSes have means and ways to do that (e.g. storage of XML content), but afaik it's not standardized and would tie you closely to a single ($$$$$$) database vendor who will not hesitate to make you bleed whenever they can.
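A quick mongo-shell sketch of that: denormalize the tags into the document itself, and still index into them (collection and field names are made up):

```javascript
db.posts.insert({ title: "Hello", tags: ["foo", "bar"] });
// A "multikey" index: one index entry per array element.
db.posts.ensureIndex({ tags: 1 });
db.posts.find({ tags: "foo" });  // served by the index, no join needed
```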
Things like OLAP and intentional denormalization aren't what DBAs are against.
By denormalization I didn't mean keeping XML in a CLOB, but keeping the values in a row as if several tables were already joined into one wide table, thus eliminating the need for joins.
No one is ruling it out as being the default in the future. However 1) it was a pretty major code change and 2) this is the first release to make it publicly available.
Do you really want that enabled by default right off the bat?
I thought the whole idea behind MongoDB is to deploy at scale. In other words, deploy over 100s of boxes and don't worry about RAID or single-server durability, because at that scale any single server is liable to fail, so you'll need to ensure platform-wide durability via replication.
MongoDBs killer feature was/is sharding. If your deployment isn't going to require sharding from the get go, then I'm not sure why you'd be attracted to it instead of any of the other more mature alternatives.
Adding single-server durability just gives MongoDB more possible use cases (e.g. small site, single server, low-overhead environment, etc.).
Currently (MongoDB 1.6.x), sharding is a killer feature that might kill you. It's just marketing propaganda.
I once tried switching to sharding to avoid a high write-lock ratio (MongoDB currently uses a DB-level lock, so only one write operation is allowed at a time). After sharding, MongoDB couldn't even count my collection: db.mycollection.count() returned values at random. mongorestore also failed in the sharded setup; there was no error message while I was restoring millions of documents, but after it reported a successful restore, I checked the DB and no documents were there.
"This (mongodb's write policy) is kind of like using UDP for data that you care about getting somewhere, it’s theoretically faster but the importance of making sure your data gets somewhere and is accessible is almost always more important."
"It's okay to accept trade-offs with whatever database you choose to your own liking. However, in my opinion, the potential of losing all your data when you use kill -9 to stop it should not be one of them, nor should accepting that you always need a slave to achieve any level of durability. The problem is less with the fact that it's MongoDB's current way of doing persistence, it's with people implying that it's a seemingly good choice. I don't accept it as such."
Development agility is a goal of the project too, in addition to scale-out. I've seen a lot of happy users with a single server (or two), and that's all they need.