Having used mongo in a professional context, I'm sort of amused by how much vitriol it gets. It has its flaws, but it's not that bad. I think it's been a bit misbranded as the thing you use to "scale", which ticks people off. To me, when I use mongo, I mostly use it because it's the most convenient option. It's very easy to develop against, awesome for prototyping, and a lot of times it's good enough that you never need to replace it after the prototype phase.
Relational databases are great, but they're almost an optimization -- they're way more useful after the problem set has been well defined and you have a much better sense of data access patterns and how you should lay out your tables and so on. But a lot of times that information isn't obvious upfront, and mongo is great in that context.
> Relational databases are great, but they're almost an optimization -- they're way more useful after the problem set has been well defined and you have a much better sense of data access patterns and how you should lay out your tables and so on. But a lot of times that information isn't obvious upfront, and mongo is great in that context.
I think it's exactly the other way around. I prefer to lay out tables in a way that reflects the meaning and the relationships in the data. Only later, if there is a bottleneck somewhere, I might add a cache (i.e. de-normalize) to better fit the specific data access patterns that were causing trouble.
I think if your understanding of the domain is complete enough that you can map out a reasonably complete definition of the relationships in the data, then you're probably right.
I think the advantage the parent is talking about is when you're exploring a domain that you don't yet understand as deeply. Schema-less data storage can be helpful in that exploration, as it allows you to dive in and get messy without much upfront consideration. Then afterwards you can step back with what you've learned/seen/experienced and build out your conceptual model of the problem domain.
Thanks for spelling that out, I see that it can be a useful exploration tool.
For me though, the act of thinking about a problem in terms of data and relationships helps a lot in exploring and understanding what I'm dealing with. Even more so in an agile setting, where things can and will be changed often - the schema is no exception, it can be changed. No need to cast in stone the first schema you came up with.
But that's just my preferred way of approaching the problem :-)
I share your preference, but I am always curious how others do things. IMO the datastore is the most important piece of a CRUD app, because it is the foundation that everything hangs off of. So from my point of view tracking changes to its structure is extremely important in avoiding major headaches. How do developers manage this without a schema definition? Mongo's popularity has always made me second guess my assumptions about how important I think this is.
See, I disagree. As a developer, the core of a system for me is not the datastore but the code. I have my domain model in code, e.g. User.java, and it gets stored transparently and correctly to the database by the ORM every time I save it. The ORM will never compromise the integrity of the database by, say, trying to store a Date in the wrong format.
So you have to ask yourself: what is the schema getting me? It doesn't help with validation or business-rule enforcement, since I do that in code anyway. And it doesn't help with integrity, since in the case of Java my data is strongly typed, so there's no chance of storing types incorrectly.
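For concreteness, here is roughly what that pattern looks like in Python (the parent's example is Java plus an ORM; MongoEngine is one schemaless-store equivalent, and the class and field names here are made up):

```python
# A minimal sketch of "the domain model lives in code": the ODM enforces
# the declared types on every save, so the database never sees a Date in
# the wrong format. All names are illustrative, not from the thread.
import datetime
from mongoengine import Document, StringField, DateTimeField, connect

connect("app")  # assumes a local mongod on the default port

class User(Document):
    email = StringField(required=True, unique=True)
    name = StringField(max_length=120)
    created_at = DateTimeField()

# Saving goes through the typed model; a wrongly-typed value raises a
# ValidationError instead of silently landing in the collection.
User(email="ada@example.com", name="Ada",
     created_at=datetime.datetime.utcnow()).save()
```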
Thanks for sharing your point of view. I'm still not convinced that it's much easier to change schema with MongoDB. Do you mean before there is any production data that needs to be kept indefinitely? If that's so, then I get it. But if you have some production data and your data model changes, you have two options:
1) Don't migrate any existing data. That means you must forever handle the old form of data when you read it from the db.
2) Actually migrate and transform existing data in the database to the new data model. In which case, it seems easier to do in an RDBMS, because schema changes are at least properly transactional and you actually know exactly what schema you're migrating from (see the sketch below).
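To make option 2 concrete, here's a minimal sketch of such a migration on Postgres with psycopg2 (the `users` table and its columns are hypothetical); the DDL and the data transform commit or roll back as one unit:

```python
# Schema change + data transform in a single transaction: readers never
# see a half-migrated state, and a failure leaves everything untouched.
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn:  # psycopg2 commits on success, rolls back on any exception
    with conn.cursor() as cur:
        cur.execute("ALTER TABLE users ADD COLUMN full_name text")
        cur.execute("UPDATE users "
                    "SET full_name = first_name || ' ' || last_name")
        cur.execute("ALTER TABLE users "
                    "DROP COLUMN first_name, DROP COLUMN last_name")
conn.close()
```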
Additionally, with the relational database, you simply don't have to make as many decisions about how to store your data early on because your data model is less dictated by your usage patterns (because you have joins and other tools at your disposal). In my eyes, that's a big advantage that relational databases have, even for prototyping.
You're putting far too much importance on the DB when really, it's an implementation detail. It has nothing to do with your app, really, and letting your app be a slave to a schema is an anti-pattern. In the end, all a schema really is is a fancier linter.
This is one of the advantages of Mongo. No schema, no longer an issue.
I'm fascinated by this perspective because for any business, it's the data that is treasured more than anything else. For example, as a bank I can afford to lose all the app code but I cannot afford to lose a record of my customers and their balances. Therefore, I would never see data as being inferior to my app. The app can be rebuilt from logic; data not so easily.
In my perspective I don't want my app to even touch the schema. That is not the app's job.
It also means that if a dev decides the user object no longer needs the DOB field, that field will simply be discarded. Even scarier, what precisely happens in those situations varies from implementation to implementation. Someone who is handling the database directly will think many, many times before deleting any db column. Even then, he will take a backup. I don't see the same discipline among developers dealing with the same data when it's abstracted behind an object programmatically.
I would certainly be happier using a schemaless database with a strongly typed language.
I guess it comes down to where you want your dynamism: in the app code, or in the persistence layer. Using a highly dynamic language with a schemaless database feels very unsafe to me. Similarly, using a strongly typed language with a relational DB is sometimes a pain.
I wonder, when one makes a large change to one's app code in the "strongly typed language + schemaless db" scenario, what happens to the existing data that was persisted before the change, which is now in a format that is incompatible with the new model code?
I'm used to writing migrations that change the schema and update the existing data if needed, all inside a transaction.
I just had an aha moment, thank you for that. It looks to me like the trade-off is deciding where to put business logic based on what needs to access it. The setup you describe sounds great as long as all access to it is through your ORM. I usually assume access from multiple applications is going to happen at some point.
We solve this by creating a schema definition for mongo collections, from within the app layer. Now we have easy version control where we want it, not in sqldump files like we used to. That was painful.
Most full-stack frameworks have a way to create migrations. When the schema changes over time, you just get a mess with mongo, that's just my observation. It's far easier to manage change with db migrations.
You'd be right except that we also have the ability to create data migrations based on schema changes over time. I'm glad it's not built into MongoDB or else we wouldn't have the (fairly ideal) solution we built to work with it today.
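I can only guess at their implementation, but one possible shape for app-layer schema versioning over Mongo looks like this (all names hypothetical, pymongo assumed):

```python
# Each document carries a _schema_version; migrations are plain functions
# applied to any document that is behind. This sketch assumes every
# document already has the version field.
from pymongo import MongoClient

CURRENT_VERSION = 2

def migrate_v1_to_v2(doc):
    # e.g. version 2 split a single "name" field into first/last
    first, _, last = doc.pop("name", "").partition(" ")
    doc["first_name"], doc["last_name"] = first, last
    doc["_schema_version"] = 2
    return doc

MIGRATIONS = {1: migrate_v1_to_v2}

db = MongoClient().app
for doc in db.users.find({"_schema_version": {"$lt": CURRENT_VERSION}}):
    while doc["_schema_version"] < CURRENT_VERSION:
        doc = MIGRATIONS[doc["_schema_version"]](doc)
    db.users.replace_one({"_id": doc["_id"]}, doc)
```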
During early prototyping, when I'm still figuring out the problem and how the user thinks, that's when I try not to worry too much about a schema. For me a data model is kinda seductive; of course it's necessary when you know what you're doing, but it draws my attention away from the messy uncontrollable users.
IMO you can't generalize every problem domain with a schema-first approach.
I developed a real-time analytics solution where the system was supposed to receive unstructured web data. Each client of the system could send any type of data and generate an unlimited number of reports with an unlimited number of fields, all of this in real time. When I say real time, I literally mean all processing was supposed to be done in under a second. On top of that, the size of the data was hundreds of GBs every day.
No RDBMS would have modeled this problem as efficiently as MongoDB did.
For most web-related things like analytics and CMS systems, or when you need high performance and horizontal scalability, it's hard to beat document DBs like MongoDB.
I found it to really suck at horizontal scalability and high performance. We are paying 3k per month for mongo hosting because it's such a pain to manage, and it performs so poorly at just 150GB of data that we need 3 shards, which I find incredibly ridiculous. I would pick postgres tables with JSON fields over mongo any day of the week.
Since most companies now are doing Agile development, there isn't the big upfront design process where the data model is clearly understood at the beginning. Instead you have this situation where the schema is continually evolving week by week, and hence this is why schema-less systems can be appealing. It isn't about performance.
I would argue if you don't know how your data is going to be used then you should use the most flexible option - a normalised relational database.
This gives you the most flexibility when querying, whereas with a denormalised database you need to know how you're going to query it ahead of time. Unless you want to abandon any semblance of performance.
IMO, this is the worst argument. There are multiple schema evolution tools for SQL, there's nothing stopping your team from changing the schema every week - plus, it's not hard, certainly less hard than having to maintain code that deals with multiple document schemas at once.
Rails-style migrations (did they invent them? I have no idea. I currently use Sequel's implementation) allow you to change the schema as often as you want. I often write half a dozen of them over the course of a feature branch. The rest of the team merges them in, and when they boot the app they get a message saying they have migrations to run. It's always been supremely reliable, even after complex merges. It gets more involved when you have multiple interacting DBs, but what doesn't?
You have to write sensible migrations of course, or you can get into trouble. This is a feature.
Obviously wide-ranging schema changes may require you to rewrite a lot of your app code, but I don't see how that's different for a schemaless database.
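For anyone outside the Ruby world, Alembic is one Python equivalent of the Rails/Sequel migrations described above (the table, column, and revision names below are invented). It records which revisions have been applied, which is how teammates get told they have migrations to run:

```python
# A typical small Alembic migration file. "alembic upgrade head" applies
# pending revisions; "alembic downgrade -1" reverses this one.
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"       # this migration's id (normally auto-generated)
down_revision = "f6e5d4c3b2a1"  # the migration it builds on

def upgrade():
    op.add_column("users", sa.Column("dob", sa.Date(), nullable=True))

def downgrade():
    op.drop_column("users", "dob")
```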
My bigger worry is that every "schemaless" app still has a schema, just one that isn't well defined, similarly to how an app with a big mess of flags and boolean logic still has a state machine representation, it just isn't codified, or even finite or deterministic for all the developers know.
The point is that if you're using an ORM and have domain classes, then it is an unnecessary and annoying extra step. You have to make changes in two places rather than just one. Most people using MongoDB I know are seasoned enterprise Java developers, and we have used many schema management tools over the last decade. It is a giant relief to finally be rid of them.
IMO, it sounds like the wrong remedy for the right diagnosis. I would never throw away the database because I'm duplicating information between the ORM and domain classes. This seems more related to the particular constraints imposed by your codebase/architecture than to the database.
Right now I'm writing a REST API that will be consumed by web and mobile apps. It would be impractical to duplicate validation across all the codebases. Rather, I'm leveraging the database checks to get form validation on all clients for free. The application layer is thin; adding a field amounts to adding one line and running a command.
I believe it boils down to which component rules the system: application or data layer.
I agree, but this is something else. The parent was talking about evolving schema. You're talking about what is effectively unstructured data. In this case, the main concern is being able to store the data first, and figuring out how to deal with it later, at the application layer, after you've added intelligence to deal with the new schema(s).
The point is that rapid prototyping and rigid database hierarchies are diametrically opposite.
If you can maintain a flexible SQL database, that's great. However, my experience has always been that the 'normalised databases are good' crowd are either a) DBAs trying to remain relevant or b) people who have never actually done that in a project; because it's not flexible and dynamic, it's performant.
It depends on your problem domain; and servers are so ridiculously overspec'd these days (a $20/month Linode is a machine with 2GB of RAM) that performance optimisation is far less important than rapidly developing functionality.
Anyway, NoSQL or SQL, you'll still have migration issues if you change the way your application consumes data.
If you have an array of category strings in a document and then you decide you prefer categories to be a dictionary with title keys, you still need to migrate your old data. NoSQL or SQL, same thing.
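As a sketch of what that migration looks like on the Mongo side (collection and field names are made up), it's the same chore as in SQL, just written in application code instead of an UPDATE statement:

```python
# Convert a list of category strings into a dict keyed by title, one
# document at a time. The $type filter matches only unmigrated documents.
from pymongo import MongoClient

db = MongoClient().app
for doc in db.articles.find({"categories": {"$type": "array"}}):
    as_dict = {c: {"title": c} for c in doc["categories"]}
    db.articles.update_one({"_id": doc["_id"]},
                           {"$set": {"categories": as_dict}})
```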
I think what made MongoDB interesting in the first place is the use of JSON for storing documents, and the aggregation framework.
Then you realize simple SQL queries are almost impossible to write as aggregations, so you end up writing map/reduce code, which is slow and, God forbid, uses JavaScript.
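To illustrate the translation involved: a plain grouping does map onto the pipeline (the collection and field names below are made up); it's the joins and subqueries that pushed you into map/reduce:

```python
# Roughly: SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id
from pymongo import MongoClient

db = MongoClient().shop
pipeline = [
    {"$group": {"_id": "$customer_id", "sum_total": {"$sum": "$total"}}},
]
for row in db.orders.aggregate(pipeline):  # current pymongo returns a cursor
    print(row["_id"], row["sum_total"])
```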
At first you think it solves the impedance mismatch problem, then you realize MongoDB has its own datatypes and you still need an ORM anyway, because at some point you need to implement "single table" inheritance because your code is OO.
Now, about performance: it's good, yes, but only if you use "maybe" writes.
Now in my opinion, CouchDB does far less, but the little it does, it does better than MongoDB. Curious about Couchbase, though.
The only reason I'd use Mongo is when using Node.js, because frankly Mongoose is the most mature ORM on the platform, and that's quite a bad reason.
> most companies now are doing Agile development ... you have this situation where the schema is continually evolving week by week
I haven't seen that level of churn in any of the agile places that I have worked and it is not an inevitable consequence of agile working. If you don't want schema churn, then don't do that.
I agree it has its uses, but I feel like MongoDB (the company) sometimes puts forth use cases for MongoDB (the database) that are untenable. For example, the pre-aggregated reports use case that they tout fails at any reasonable scale (e.g. hundreds of upserts a second against tens of thousands of documents).
Yeah, I can agree with that. A lot of their claims of scalability and benchmarks aren't lies, but they come with a lot of asterisks. I know in my last job we used both mongo and (mainly) microsoft sql server, and frankly, I enjoyed developing against mongodb way better, but if I needed something to be consistently fast, there was no question I'd put it in sql server. (Mongo's query speed can be really variable, and the tools to diagnose query planning kind of suck compared to more established databases)
"Relational databases are great ... after the problem set has been well defined ... But a lot of times that information isn't obvious upfront, and mongo is great in that context."
A lot of people get stuck on the schema because they are trying to make it perfect upfront and hit analysis paralysis. I think that comes from other database systems that make it very hard to change the schema.
In postgres, just define a very simple schema and get started with your prototype. It may just have a couple ordinary columns, and then a JSON or JSONB column as a catchall.
Then, you can add/remove/rename columns and tables (all O(1) operations) or do more sophisticated transformations/migrations. All of this is transactional, meaning that your app can still be running without seeing an inconsistent (half-migrated) state. I believe this level of flexibility far exceeds what document databases offer.
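A sketch of that prototyping setup, assuming Postgres 9.4+ and psycopg2 (the table is hypothetical):

```python
# Two ordinary columns plus a JSONB catchall; a later, cheap,
# transactional schema change once the shape of the data settles.
import json
import psycopg2

conn = psycopg2.connect("dbname=proto")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE events (
            id      serial PRIMARY KEY,
            created timestamptz NOT NULL DEFAULT now(),
            payload jsonb
        )""")
    cur.execute("INSERT INTO events (payload) VALUES (%s::jsonb)",
                [json.dumps({"kind": "signup", "plan": "free"})])
    # Catalog-only change, done inside the same kind of transaction:
    cur.execute("ALTER TABLE events RENAME COLUMN payload TO data")
conn.close()
```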
"MongoDB allows very fast writes and updates by default. The tradeoff is that you are not explicitly notified of failures. By default most drivers do asynchronous, 'unsafe' writes - this means that the driver does not return an error directly, similar to INSERT DELAYED with MySQL. If you want to know if something succeeded, you have to manually check for errors using getLastError."
We have very, very different definitions of safe defaults. Especially when this fact was poorly documented early on and caused people to be surprised by it.
You may wish to consider that other people have different perspectives of what 'safe' means. Safe, to me, means:
1) Fsyncs to journal every write.
2) Returns an error immediately if an error occurred.
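For what it's worth, modern drivers make the distinction explicit; with pymongo it looks like this (the database and collection names are illustrative, and today's default is the acknowledged form):

```python
# w=0 is the old fire-and-forget default criticized above;
# w=1 + j=True waits for an acknowledgement and a journal fsync.
from pymongo import MongoClient, WriteConcern

client = MongoClient()
unsafe = client.app.get_collection(
    "events", write_concern=WriteConcern(w=0))
safe = client.app.get_collection(
    "events", write_concern=WriteConcern(w=1, j=True))

unsafe.insert_one({"x": 1})  # errors are never reported back
safe.insert_one({"x": 1})    # raises on failure, waits for the journal
```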
To me this whole discussion seems equivalent to the static/dynamic typing war in programming languages. And so far, none of the camps have "won" as has been predicted over and over again. There are tradeoffs to both and people will prefer one over the other for what they will believe are rational reasons but probably really just a series of coincidences.
A lot of the vitriol seems to come from the PostgreSQL camp. The enemy used to be MySQL, but since they lost that fight they seem to enjoy spreading FUD about MongoDB. If they just focused on improving their own product, i.e. ease of use, sharding, clustering etc., there wouldn't be a need for MongoDB or even HBase and Cassandra.
Postgres is winning the war with MySQL, if you're going to classify it that way. And from everything I've read and heard on HN, Postgres is constantly improving on clustering/HA fronts.
No, MySQL won many years ago. Every shared hosting provider uses the LAMP stack. And PostgreSQL has a pretty average scalability story. It lacks the simplicity of MongoDB or the flexibility of Cassandra or HBase. It needs a lot of work.
I wouldn't exactly call that winning. It's like saying that Justin Bieber won because he's so popular. The article seems to focus on features rather than popularity. Yes, MySQL will be used mostly by amateurs for a long time, because for all its flaws it somewhat compensates with ease of use. I as a professional, however, will celebrate the day my company finally finishes the move from MySQL to PostgreSQL.
I really don't see how a fixed schema is seen as such a bad thing by many NoSQL advocates. In most databases, altering a schema is an operation that's over quickly and in many databases it's easily reversible by just rolling back a transaction.
The advantages of a fixed schema are similar to the advantages of a static type system in programming: You get a lot of error checking for free, so you can be sure that whenever you read the data, it'll be in a valid and immediately useful format.
Just because your database is bad at altering the schema (for example by taking an exclusive lock on a table) or validating input data (like turning invalid dates into '0000-00-00' and then later returning that) doesn't mean that this is something you need to abolish the schema to solve.
> In most databases, altering a schema is an operation that's over quickly
I think many people were burned by MySQL InnoDB, which modifies (or modified; I'm still on quite an old version) schemas by making a new table and copying the data, and don't realise that all databases don't do that.
InnoDB tables have online, non-copying (for most operations) DDL in 5.6. Also, TokuDB has had online, non-copying DDL (column add and drop, add index) since TokuDB 5.0.
Data migrations are rarely just about adding or removing a few columns; they also involve data manipulation. The schema doesn't help you with this. So at some point you have to write data migration code, in which case (and if you are using an ORM) why not do the schema change at the same time?
And sorry but it is complete nonsense to suggest that data migrations are almost impossible on schema less systems. I've done countless myself and people are doing it every day.
Relational models are well-described and can be changed transactionally, both in structure and content, using a well-known and ubiquitous language to describe the migration. It doesn't get any easier than that.
> Data migrations are rarely just about adding or removing a few columns.
This has not been my experience. Sure, you occasionally need code to do a migration. But at least 90% of the time, you are adding to or refactoring your existing data model. In which case, pure SQL is entirely sufficient, unless you've been doing untoward things with your schema.
Well, I think the static typing analogy pretty much cuts to the heart of it. I wouldn't be shocked if static typing advocates tend to like relational databases, and dynamic typing advocates tend to favor nosql. (I can't prove this -- but I wouldn't be surprised).
I don't think so. I'm a fan of Python and Postgres. Rationale: there's nothing special about type errors in the code. They're just ordinary errors; you fix the code and move on. But if bad data gets into your database, it's too late to just fix the code. By the time you discover the problem, you're already set up for a very bad day, or long sequence thereof. The database has to be treated as sacred.
Nope. MongoDB is very popular in enterprises which are largely dominated by Java i.e. strongly typed. MongoDB actually suits strongly typed languages since you enforce your domain in code rather than in the database.
And since Morphia (https://github.com/mongodb/morphia) is similar to JPA it is trivial to port your existing code to MongoDB. Which then leaves you with the same experience as an SQL database except with a lot of benefits (single binary, easy to query/administer/scale, geo spatial features etc).
I wouldn't count Java as a sign that the people at a place like strongly typed systems: Java is a default. Its type system has so few features, it pales in comparison with the alternatives. Those on the JVM who really care about a type system might be using Scala instead.
I agree, and I'm pretty sure there is a correlation between those two axes, although there are a lot of subtleties that can make someone prefer dynamic+SQL or static+NoSQL. Speaking for myself, I'm completely in the static+SQL camp.
I suspect that it would be somewhat harder to find static+NoSQL combinations than dynamic+SQL, since it's easier to use a dynamic language over an SQL database (types in SQL don't match 1:1 those in languages anyway, so queries are traversed by column index in both static and dynamic languages; this means that SQL's "static typing" won't affect your code) than a static language over a NoSQL database (suddenly you start to lose guarantees about what goes in every field, and checks must be performed).
I agree that the philosophies seem to match up, but I believe most practical people try to look for machine guarantees from somewhere to contain the complexity.
Someone who is comfortable with strong guarantees in the database may be more willing to use a dynamically-typed language for the application, and (less commonly) vice-versa.
I had the same thought about NoSQL / SQL being like dynamic / static languages. But I prefer dynamic languages. I think I like the strictness that a relational database gives me, so that I can worry less at the app code level (which will be in a dynamic language).
A good relational database that implements the normal checks and constraints can serve as an additional check on your dynamically typed code trying to store garbage in the database.
I don't agree with you and I'd like some clarification from anyone who knows for sure.
If the database is small, then sure, schema changes are harmless. But if a database is in production with data shuttling in and out of it and has many records, then ALTERs are non-trivial.
And I don't think an ALTER can survive in a transaction, i.e. it is not something that can be rolled back. It can be undone via another ALTER that is its opposite, but it cannot be rolled back the way INSERT and UPDATE can.
> And I don't think an ALTER can survive in a transaction, i.e. it is not something that can be rolled back. It can be undone via another ALTER that is its opposite, but it cannot be rolled back the way INSERT and UPDATE can.
It can in databases that use transactional DDL (and, really, doing so is part of the relational model, since part of that model is that metadata describing the structure of data is itself stored in relations and -- whatever the syntactic sugar layered over top of it -- can itself be manipulated with the same tools that regular data is manipulated with.)
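This is easy to see for yourself on Postgres (psycopg2, hypothetical table):

```python
# Transactional DDL in action: roll back and the ALTER never happened.
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()
cur.execute("ALTER TABLE users ADD COLUMN nickname text")
conn.rollback()  # the new column vanishes with the transaction

cur.execute("""SELECT column_name FROM information_schema.columns
               WHERE table_name = 'users'""")
print([r[0] for r in cur.fetchall()])  # no "nickname" in the list
conn.close()
```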
Having the DDL be described relationally is only incidental. I would imagine rollback support to be highly dependent on how the storage engine works. It seems non-trivial to me, short of copying the entire table across the ALTER. And that doesn't sound scalable.
I agree. I wonder how many NoSQL advocates should really be called NoMySQL advocates. I have always been a NoMySQL advocate, but not necessarily a NoSQL advocate.
Without a fixed schema, the schema must be stored with each item. Every "row" not only has the data but the metadata as well, including field names.
If you have a lot of repeatable data this can consume very significant amounts of space (regularly on the order of 4x comparing PG rows to BSON docs). This isn't just a disk space issue; it means less effective caching and more pages to scan when doing any operation. Compression only improves the disk space issue while adding CPU overhead.
Relational DB's like PostgreSQL having a fixed schema need only store it one time, each row has minimal overhead (about 23 bytes in PG) to store only the unique data.
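You can see the per-document overhead directly with the bson package that ships with pymongo; it's also why Mongo users end up abbreviating key names by hand:

```python
# Every field name is stored again in every single document.
from bson import BSON

doc = {"first_name": "Ada", "last_name": "Lovelace", "age": 36}
print(len(BSON.encode(doc)))  # ~58 bytes, much of it key names
print(len(BSON.encode({"f": "Ada", "l": "Lovelace", "a": 36})))  # ~39 bytes
```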
Perhaps Mongo could optimize behind the scenes, detecting a repeating schema and making a fixed-schema data structure for it, similar to how the V8 JavaScript engine creates hidden C++ classes for JavaScript objects so it's not a key-value lookup everywhere limiting performance. I haven't seen any NoSQL db attempt that; I don't think any even attempt simple string interning, as suggested in Mongo's issue 863.
PG gives me a choice: I can go nearly schemaless with the JSON type, or I can create a fixed schema when that makes sense, with many optimized data types to choose from. PG's array support gives you a nice middle ground between the two as well.
It's because a fixed schema is unnecessary in most cases. Most developers using a database are doing so via an ORM, in which case they already have a data model in their code. It is often fixed and, depending on the language, strongly typed. It is also centralised, richer, commented, has far better validation, and is in source control.
So again why do I need a second data model that does nothing but get in my way ?
Because querying and aggregating your data with arbitrary conditions becomes much easier. Using SQL you describe what result you want and let the database figure out how to retrieve it.
In many document stores, you have to write manual map/reduce jobs telling the system how to get the data you want.
Most software development projects tend to fall into two camps these days. Either they are simple and just need basic filtering/grouping, in which case you don't need map/reduce. Or they are doing more advanced querying of data, in which case Hadoop/Mahout is a better option.
See my other response about the advantages of schema-less design. I agree that fixed schemas can often be useful - the point is to think about this in advance for every project before deciding on SQL vs. NoSQL. I'd say a fixed schema helps during initial development, but will lead to more and more expensive data migrations later in a system's life cycle, except in the case of systems that are very well defined before anyone writes a line of code - and sometimes it's plainly not possible if you want a flexible data model that's not fixed at programming time. It's simply - as always - a question of choosing the right tool for the job.
> I'd say a fixed schema helps during initial development, but will lead to more and more expensive data migrations later on in a system's life cycle
Yes. The migrations might become expensive (though that varies between databases - some take exclusive locks for a long time, others don't), but you can save a lot of time and money when querying the data.
In a reasonably normalized schema, there's next to nothing that you can't extract with the help of SQL and whatever you throw at it, databases have become very good at planning the optimal way to query for the data you've requested.
If all your data is in huge blobs, querying for arbitrary conditions and aggregating parts of those blobs is complicated and requires manual planning of every query you may come up with.
If your data is well normalized, whatever SQL you throw at it, the database will know how to answer. No specific retrieval code needed.
As such I would argue that a strict schema and a relational database will save you a lot of time as the requirements of your application change.
1.) Yes, ad-hoc querying is usually easier to do and more powerful on relational databases (although the trend in NoSQL databases seems to go in that direction as well, ArangoDB being one example).
2.) The document based approach doesn't say anything about the degree of data normalization - you can have your data as normalized or denormalized as you want, just as in relational databases. Each row can correspond to one document, the containing table usually corresponds to a 'type' property on the document. Following this approach, building views / queries is about as powerful as it can be in a RDBMS, the trade-off is rather between strictness in forbidding/allowing slow algorithms vs. ease of implementing a first working version.
> The advantages of a fixed schema are similar to the advantages of a static type system in programming
Yes, but I already have one of those in the form of a static type system. I dislike database schema for the same reason I dislike XML schema: it tends to enforce only the most trivial constraints, and means you duplicate part of your validation (because you still want validation in code for the things you can't check in the schema). It's just another moving part that can be a potential source of failures.
Are you sure about that? You can validate an awful lot in a PostgreSQL constraint, and I often don't bother validating in the app as a result. Validation in the app can also be extremely difficult to get right in a concurrent environment. There is a high likelihood that many such checks have race conditions that you aren't aware of, if you haven't gone over your application with a dedicated DBA.
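As an example of how much can be pushed into the database (the schema below is hypothetical), the UNIQUE constraint closes the classic race that an app-level "check then insert" has:

```python
# NOT NULL, UNIQUE, and CHECK constraints doing validation work that
# would otherwise be duplicated (racily) in application code.
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE accounts (
            id      serial PRIMARY KEY,
            email   text NOT NULL UNIQUE CHECK (email LIKE '%@%'),
            balance numeric NOT NULL CHECK (balance >= 0),
            opened  date NOT NULL CHECK (opened <= current_date)
        )""")
conn.close()
```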
You're right that a lot is possible, particularly if you're willing to write a trigger at which point you have a programming language at your disposal. But it's not a great environment for writing code with logic, in terms of things like tool support, debug support, library ecosystem.
I do not disagree with your primary PoV that schemas can be a good thing, but:
a. Schema changes are DDL statements and hence can't simply be rolled back via a transaction in most databases (as far as I know).
b. If you have millions or more rows and you need to introduce a schema change, it can get painful and can result in app downtime. It is a well-documented problem with some well-documented workarounds.
On the contrary, coming from the world of mature SQL systems (like SQL Server), I was horrified to learn that MySQL supports neither transactional DDL nor online schema updates. These are taken for granted in premium SQL databases, and to a lesser extent in good free ones like PostgreSQL. (Raise your hand if you've seen the polite suggestion from the Django (South?) team to use a "real" database when your migrate command fails...)
Put simply: MySQL was written by and for people whose time isn't valuable. I.e. the assumption is that it's cheaper to throw low-grade people and brainpower at workarounds (OpEx) than to pay to have the problems avoided in the first place (CapEx). There's a reason people shell out the money for big-iron databases: the math can be easy when you factor in the costs saved (both the direct costs and the opportunity costs).
Luckily, just as not all NoSQL databases are the same, nor are all SQL databases. If NoSQL is an option, standards are obviously less relevant, so you can rely on the behaviour of one SQL database engine of your choice. Postgres is one of a few SQL database engines which implement transactional schema changes.
Companies such as Percona also provide open-source tools which allow for online, non-locking schema changes (for MySQL, in Percona's case), and I assume commercial database engines have options as well.
"This is not to deny that MongoDB offers some compelling advantages. Many users have found that they can get up and running on MongoDB very quickly, an area where PostgreSQL and other relational databases have traditionally struggled."
A few things about this I've never understood.
1. Someone's going to make a technology decision with priority given to a one-time cost over the strengths and weaknesses of running competing products in perpetuity? The ability to avoid having to learn something is a pro in this decision?
2. If you don't have the skill set to install or maintain MySQL or Postgres, you should not be in charge of installing or maintaining production systems, period. You will hit your ceiling the first time you have to do anything non-trivial with a system that happens to have been easier to "get started with."
If you are building a system for a Fortune 500 with a well-defined purpose, that will run for years, your #1 might apply. But if you are building something new that you aren't sure what will become yet, like a startup, I would go with the thing that gets you up and running the fastest. This allows you to validate quicker and then perhaps change if the project grows to require it.
As for #2, I think that technology is now such a big field that this sort of argument is very rarely valid. NoSQL databases might be so young that there are no db admins with that as their primary field just yet, but that will change soon.
The critical point in the life of a startup is very often that short period at the start, when the team is small and stretched in a hundred directions, trying to 'execute' on everything at once. It's really important to choose your battles.
Where I work (wavo.me), we went with MongoDB, and though imperfect, it was the right DB for us. It's quick to get going, and the schemaless document store makes it easy to change directions and iterate very quickly. That was (and still is) super important to us. It's really not about skillset, it's about cycles.
My point is that MySQL/Postgres aren't difficult and this is a pretty low-hanging fruit example of a decision that doesn't need to be made and shouldn't be made with short-term thinking. Maybe it's necessary at times, but you can go too far with that excuse. Code is much, much easier to change/fix than a data store.
Sure. Lots of factors go into deciding a data store, and you're right that you shouldn't just choose at random.
I disagree with the implication that if you do actually think about it, you'll necessarily land on MySQL or Postgres.
We landed on Mongo, and one of the reasons we did was because we were uncertain about the product direction. We knew we were going to be making massive changes to the schema. We didn't want to (couldn't) spend the time up front to think through the precise implications of every table in the system. In a very real way, we didn't even know what data we were going to be storing!
So, we mostly skipped it. We came up with a simple set of documents, our best guess, and ran with that. We made mistakes (we knew we would), and we corrected them easily. We saved time up front, and really haven't regretted it at all.
MySQL/Postgres aren't difficult, but they are more complex than MongoDB.
And in Agile you are taught to only focus on the short term and then move only when you need to. In which case picking the simpler database has merits.
In a non-trivial application, using a schemaless document store doesn't relieve you of having a schema, especially in the face of a relationship change. You either have to write a migration to change all the records, which non-MySQL relational engines happen to be extremely good at, or you litter your ORM with a bunch of version-specific nonsense.
That quote doesn't directly relate to either of your points. It's only pointing out that an advantage is that Mongo is easy to get set up and load data into quickly.
In certain use cases, that advantage is important (rapid prototyping). In other cases, it's easily outweighed by things like the potential lifetime of the system and the needs associated with that.
No one should make a decision on which tech to use by any one strength/weakness in isolation, and I don't think this article is promoting that mindset.
I can't really speak about Mongo, but since the post seems to be talking about relational vs. document based DBs in general, here's my perspective coming from CouchDB:
- Schemaless DBs make sense, when handled correctly within the application framework, for what I'd call information systems with regularly changing requirements. I'm currently building a rapid development platform for these kinds of systems, where users can define an arbitrary relational data model from a Web-UI and get the application with forms and views all pre-made. The user design is changeable at any point without breaking anything and even the underlying static data structures can be changed without any need for an update or data migration process - it's all handled when opening or saving a document with a previous version.
- CouchDB's map/reduce view definitions are interesting when designing a DB system, since they IMO restrict the developer in exactly the right way: One is forced to write the right kind of index views instead of being able to just join willy nilly. Making something slow usually means writing complex views and chaining many lookups together - one has to work for it and, conversely, being lazy and reducing everything to the simplest solution usually results in a fast solution as well. The result usually scales up to a high number of documents in terms of performance.
- Being able to replicate without any precondition, including to mobile devices with TouchDB, is a big plus - and in fact a requirement in our case. Offline access is still important, especially in countries where people spend a lot of time in trains or for systems that manager types want to be accessing in flight.
I think his real point is that on a SQL database like PostgreSQL you can add schemaless parts to your data if you think that it's right for your case, and with other DBMSs "capabilities in this area are improving rapidly".
But starting with a "Schemaless db" it's much harder to go in the other direction. "The advantages of relational databases are not so easily emulated."
Yes, that point is true - it's easier to go from the stricter environment to the more open one - I guess that's basically just how entropy works. Note that I don't necessarily disagree with the article, I just wanted to offer some perspective on why I think the schemaless approach can often make sense, especially in how CouchDB handles it. If at some future point, the worlds of RDBMS and NoSQL fold into one DBMS handling all the use cases perfectly, I'm all for it - it's just that right now I'm still skeptical about going all schemaless on an RDBMS, since it's still a rather new feature, and the underlying system is very complex and has grown for decades.
This is clearly a flamebait title. The article doesn't say anything about why MongoDB is running out of time other than "PostgreSQL is making real progress as a document store". I think unwittingly the author has identified why MongoDB is not running out of time. It still has a huge lead on RDBMS in a very common and useful workload. If they can continue to make progress while other engines catch up on some of the document-oriented features (as Postgres has done), then they will still have compelling features to offer. If they do nothing while other engines make progress then of course they will fail.
If anything all this points out is that Document stores, like MongoDB, have a real market where they excel and other engines are playing catchup.
"It still has a huge lead on RDBMS in a very common and useful workload."
There's very little barrier to entry there, though. With 9.4, PostgreSQL will work better on a lot of JSON workloads than MongoDB. And postgres will continue filling in features to cover essentially all of the use cases MongoDB offers now.
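The kind of workload meant here, sketched with psycopg2 against a hypothetical `docs` table: a GIN index plus the jsonb containment operator gives indexed ad-hoc queries into schemaless documents.

```python
# 9.4-era JSONB: index once, then query by containment.
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute("CREATE INDEX docs_gin ON docs USING GIN (body jsonb_path_ops)")
    cur.execute("SELECT id, body FROM docs WHERE body @> %s::jsonb",
                ['{"status": "active", "plan": "pro"}'])
    rows = cur.fetchall()
conn.close()
```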
MongoDB will have a much harder time supporting things like joins (and normalization which reduces redundancy), new data types, good (and flexible) transaction semantics, sophisticated indexing strategies, cost-based optimizer, useful constraints, etc.
This is one reason why XML databases never went anywhere -- existing SQL database systems just said "we can do that, too" and the XML-only systems were relegated to a small niche. Unless MongoDB really continues to innovate, it will have the same fate. JSON+Sharding is a good combination, but it's just not enough to be a lasting player in my opinion.
As I always say when these sorts of articles come out (and I admit this is my use case for MongoDB and that doesn't necessarily match everyone else's use case), where is the easy to setup HA and sharding for PostgreSQL? I know it's coming, but right now it's not there.
For someone who redistributes a product that relies on end-users setting up HA, this for me is MongoDB's killer feature, easy to configure replication and sharding. I love PostgreSQL, but this is the one big thing that keeps me from using it right now.
Often when I read articles like this I take them with a grain of salt. A lot of hate for MEAN technologies seems to stem from people who don't know how to properly use them. You're not supposed to use MongoDB like *SQL and if you try you're gonna have a bad time.
Let's say I have a big pick-up truck I use to haul xWidgets. My friend gets a little motorized scooter to drive around town hauling his yWidgets. Is it proper for me to believe that my friend's scooter sucks because it doesn't haul xWidgets like my truck? I don't know much about scooters, but my friend says it doesn't have a steering wheel and it only has two wheels. How the heck can it steer without a wheel? He says it has "handle bars" for steering. Seems kind of dumb but I guess it works for him. He touts the fact that he gets 80mpg on gas, but it doesn't matter how efficient it is if he can't haul xWidgets. Scooters suck because they don't work the way my truck does.
Okay. What does this have to do with MongoDB, though? People in this thread have presented plenty of usecases for it, but every single one of them (pretty much--someone correct me if I'm wrong) is also satisfied by jsonb in Postgres 9.4, which in your analogy would make it a truck that gets 85mpg.
My personal "big picture" critique of MongoDB is that I see it evolving into a SQL database with a different syntax. It is a strongly-consistent, distributed-through-replication-and-sharding system. Additions to 'traditional' databases, like Postgres' HStore or MySQL's HandlerSocket show that many of the MongoDB differences are not fundamental.
Much more interesting to me are systems that do something fundamentally different. e.g. explore different parts of the CAP trade-off, like Cassandra. Or systems that are built around the idea that data no longer lives exclusively on a server, like CouchDB.
I think I disagree with you on a few points. MongoDB might be adding features of SQL databases, and heading in that direction, but they are still missing joins, and that's a crucial part of relational databases.
Also, I'd clarify a little by saying it's trying to be a strongly consistent, distributed through replication and sharding database. Right now there are some bugs in the system, some improvements that need to be made, and definitely some default configurations that need to be changed before it's going to be a consistent database. It's definitely going for consistency instead of availability, but I don't think it can call itself consistent yet.
I don't think we disagree as much as you might think!
I'm trying not to critique MongoDB today, but rather thinking of what it will become. Today's implementation problems will hopefully be fixed.
I agree that without joins, MongoDB querying is very limited. I credit them enough to believe they will end up fixing that and supporting joins in some way. But then, they have "SQL with a different syntax".
And I definitely agree on the data-loss bugs. But again, presuming they fix those, they just end up with something very similar to a relational DB, in terms of the CAP theorem etc.
In short, even when/if they fix all the bugs and add all the missing features, I worry that we'll just end up back where we started.
Well, if that's truly the case, there's no reason you couldn't just translate SQL into JSON queries. (It's been a while since I looked at .NET mongo drivers, but LINQ in .NET basically already does this).
I agree that document-oriented databases will probably not replace relational databases in the near future. In my opinion though, the schemaless design of MongoDB paired with its ease of use and its native support of JSON data makes it a perfect choice for prototyping and (in some cases) a viable option for use in production.
What you also have to consider when comparing document-oriented to relational databases is that the former is still a very young technology: MongoDB was founded in 2007, whereas Postgres has been around since 1986! So given what the MongoDB team has achieved in such a short time span, I expect to see some huge improvements in this technology over the next decade, especially given the large amount of funding that 10gen received.
In addition, the root cause for most complaints ("it doesn't scale!", "it loses my data!", "it's not consistent!") is that people try to apply design patterns from relational databases to MongoDB, which often is just a horrible idea. Document databases and relational databases are very different beasts and need to be handled very differently: Most design patterns for relational databases (data normalization, using joins to group data, using flat data structures, scaling vertically instead of horizontally) are actually anti-patterns in the non-relational world. If you take this into account I think MongoDB can be an awesome tool.
There are cases where occasional data loss is an acceptable trade-off for performance. In a particular app I'm working on, it's OK for me to have 99.9% of the data, and all my updates and inserts have the write concern set to 0, journaling off, etc.
> In addition, the root cause for most complaints ("it doesn't scale!", "it loses my data!", "it's not consistent!") is that people try to apply design patterns from relational databases to MongoDB, which often is just a horrible idea.
Is expecting a database to be ACID a horrible idea? If you care about your data, it is not.
No that's probably not a horrible idea, I'm just saying that MongoDB is by design not an ACID database, so I think it's silly to complain about this fact when using it.
I'm starting to form this idea of what constitutes an ideal use case for mongo in my head, and I'm trying to prove the model.
If I were to imagine some kind of "realtime" multiplayer game, like quake or something.
1. You have to have the state be shared between all the parties in a reasonable time.
2. The clients only need the data that is directly relating to the round they are in, so you have the concept of cold and hot data.
3. The data is all kind of ephemeral too, so that you don't specifically care about who was on what bouncy pad when, but you do want to know what the kill score/ratio is afterwards.
4. You have a couple of entities that have some kind of lightweight relationship to each other, which makes it just a bit more complex than what a key-value store like Redis is really suitable for.
5. These entities are a sort of shared state, and thus get updated more often than new unrelated documents get added, and CouchDB's ref-counting and append-only nature makes it really unsuited to constant updates of an existing record.
This seems totally reasonable to me. I was playing around with using Firebase (https://www.firebase.com/) as the datastore for a turn-based multiplayer game on the web and its data policies seemed very well suited.
Essentially, each game was its own document and relevant stats were updated/duplicated on the player documents as well.
There don't seem to be useful relationships here that I'd want to query. Maybe successive game states? But not across different games, right?
Well, I was imagining a situation where you had multiple different pieces in a game, relating to each player and distinct to that game. Maybe like an RTS game or something?
So you might want to query how many pieces a certain player has available, or do a check on who this piece belongs to.
Keep in mind I'm still not saying an actual game, but any problem that had similar data storage and access patterns as a game along these lines would have.
I don't know enough about Firebase to confirm whether it is suited to this, but if I had any kind of backend at all, and I had requirements for really quick responses, I wouldn't be comfortable adding the extra hop to a hosted service onto each call.
From reading Redmond & Wilson's book "Seven Databases in Seven Weeks", my impression is that the CAP theorem is a big idea (essentially, you can choose any two of: consistency, availability, and partition tolerance). Relational databases have to be "consistent", i.e. transactions are atomic and reads always get the latest value. NoSQL databases make it possible to choose availability instead of consistency, so you can get faster response times even if it means you sometimes get out-of-date data.
I would use Mongo or Riak or something like those for "write once, then read-only" applications, for example a twitter or facebook type application, or simple blog. In those cases when the user hits "refresh" they're never going to know or care that they don't yet see a comment someone posted a half second ago. They're just going to be happy that the page refreshed fast.
Given the title I expected a completely different article. The reason being that the opinion that "MongoDB is running out of time" can be supported by a few very strong arguments. There are a few very serious problems with the current technology that give NoSQL alternatives a chance to catch up and grab momentum. For example, MongoDB still has a questionable mmap based storage engine, there are cluster state consistency issues, the cluster topology management is outdated and so on. This article should have been titled "Why I personally like relational databases better than MongoDB". That discussion is incredibly subjective. Source : Very early adopter of MongoDB and use it professionally.
I think many folks confuse normalization as a strategy with the underlying database technologies. Oracle and other RDBMS technologies can create normalized databases too. In the end it's a design judgment. There is a lot of room between fully normalized and one-big-table. Even firms that logically map things out fully normalized frequently decide that for some things that doesn't make sense.
Taking a step back, there are still reasons to abandon Oracle. It may not scale up, or be good for certain time series calculations, but that's another story entirely.
It's funny to see this kind of post now and then predicting the imminent end of MongoDB... they've been appearing for several years now!
MongoDB is here to stay. It's opinionated, PostgreSQL isn't. It's faster out of the box. The client drivers are pretty good. (Don't forget that SQL databases still send raw text as requests and get raw text in return!) It's fitting the bill for a lot of quick-and-dirty web apps and delivers early performance and arguably scalable performance. Don't get me wrong, I still literally love postgres.
...for reads. Once you get any real traffic it hangs on its DB write lock. Yes, I have experienced this, and the solution "just shard!" is especially obnoxious because setting up sharding is a lot harder than they make it seem. Not only is it complicated, you go from one server to three config servers plus two replica sets with 3 servers each (so 9 fucking servers) just so you can make more than 40 writes/s, which most relational DBs wouldn't bat an eyelash at.
It's a poorly designed system, and this becomes painfully apparent once your app moves out of the basement and into the real world. Luckily for MongoDB, most of their users' apps never leave the basement and they think "wow, this is great!!"
> arguably scalable performance
Not scalable. At all. Global locks do not scale. Complicated config setups are hard to scale.
Check out RethinkDB. It's an open-source document DB that fixes just about all the problems MongoDB has, and it has DB-side joins. It's just as easy to get quick-and-dirty apps running against, and it doesn't actively flush your data down the toilet like Mongo has been known to do.
"Dont forget that SQL database still send raw text as request and get raw text in return!"
Depending on your database adapter, this is not the case. With Postgres, you have the Simple Query Protocol and the Extended Query Protocol (http://www.postgresql.org/docs/9.3/static/protocol-flow.html). A Simple Query involves text in and out. However with an Extended Query, properly parsed statements are always executed by the DB, and will always return properly typecasted data. You can also stream data to the client and use Cursors, if you have very large datasets (a life saver when reporting on large tables). Also, you don’t need to worry about SQL injection when using Extended Queries, as they are handled as prepared statements.
JDBC supports the Extended Query Protocol with Postgres, so if you are using Java, Scala, JRuby, or anything running on the JVM, you are good to go. But, I have not had to deal with raw text being returned from my Postgres queries in a long time ;)
Dramatic titles get the reader's attention, so I get the choice of words. However, our industry is so big that there will never be a single solution. MongoDB may never become #1, but as described in many comments, it's a pretty good choice in various situations. So, as they say in Ireland, "Stall da beans der bi!" I don't hear a ticking clock.
> If all order data is lumped together, the user will be forced to retrieve the entirety of each order that contains a relevant order line - or perhaps even to scan the entire database and examine every order to see whether it contains a relevant order line.
Both of these are untrue. Author needs to read up on secondary indexes.
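For example, a multikey index on the embedded order lines finds matching orders directly, with no scan of every order (the field names are invented for the sketch):

```python
from pymongo import MongoClient

db = MongoClient().shop
db.orders.create_index("lines.sku")  # multikey: one entry per array element

# Fetch only the matching orders, projecting just the matching line:
for order in db.orders.find({"lines.sku": "WIDGET-42"},
                            {"lines.$": 1, "customer_id": 1}):
    print(order)
```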
Maybe the clock is ticking because their large production deployments are migrating to other tech? At least we are seeing plenty of folks who realize that a query API on top of mmap isn't really a database. :)
The author is conflating (or just ignoring) the very significant difference between application databases and reporting databases. Not surprising since most of us do this as well when we are building applications. However no comparison of the relative value of database schema styles can responsibly ignore this difference.