like yesterday's submission of "Why you should never use MongoDB (2013)" (https://news.ycombinator.com/item?id=12290739), this is all just plain sad because it was totally unnecessary: the failure of hierarchical databases was obvious in the 1960s and was the background of Edgar F. Codd's work on the relational model.
unfortunately, this industry is dominated by cocky PFYs (of any age) convinced they don't need to study the history of their field and, as a result, don't even recognize they're repeating 50-year-old mistakes. sure, SQL has fallen so short of the promise of the relational model it's not even funny, but don't conflate the model with the query language, folks!
Right tool for the right job. Thinking RDBMS is a silver bullet is just as bad as thinking a document database is a silver bullet. There is room in the world for both.
but we're not talking about silver bullets: we're talking about models which let you derive the most information from your data. mathematics says you will get the most bang for your buck from the relational model.
if you like hierarchical databases (data trees), consider that a relational database gives you a forest: you can treat any datum as your tree root and bloom from there. with a hierarchical database, you're tied to a single pre-designated root.
both relational and hierarchical databases let you go from department to employee, only one lets you go from employee to their department without enumeration. what purpose does precluding the latter serve?
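To make the department/employee point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are all made up for illustration. The relational version answers the question from either end with a query, while the nested structure can only answer "employee to department" by walking every department.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        department_id INTEGER NOT NULL REFERENCES department(id)
    );
    INSERT INTO department VALUES (1, 'Engineering'), (2, 'Sales');
    INSERT INTO employee VALUES (1, 'Ada', 1), (2, 'Grace', 1), (3, 'Linus', 2);
""")

# Department -> employees: start from the department.
employees_in_engineering = [row[0] for row in conn.execute(
    "SELECT e.name FROM employee e "
    "JOIN department d ON d.id = e.department_id "
    "WHERE d.name = ? ORDER BY e.name", ("Engineering",))]

# Employee -> department: same tables, start from the employee instead.
department_of_linus = conn.execute(
    "SELECT d.name FROM department d "
    "JOIN employee e ON e.department_id = d.id "
    "WHERE e.name = ?", ("Linus",)).fetchone()[0]

# Hierarchical version: the department is the pre-designated root.
tree = {"Engineering": ["Ada", "Grace"], "Sales": ["Linus"]}

def department_of(tree, employee):
    """Find an employee's department by enumerating every department."""
    for dept, members in tree.items():
        if employee in members:
            return dept
    return None
```

With an index on `employee.name` the relational lookup does not need to scan anything, whereas `department_of` has no choice but to enumerate.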
> Speed of development and execution when I don't, and will likely never need, to query that way.
The initial joy and speed of development is really nice. However, my experience has been you end up paying significant technical debt when the specs evolve faster than you think.
I agree, and migrations can be a ghetto. With that known, I do sidestep constant SQL migrations and even get to skip simple migrations altogether by specifying defaults in my domain.
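One way to read "specifying defaults in my domain" is sketched below. This is an assumption about what the commenter means, not their actual code: the `User` class, its fields, and `load_user` are hypothetical. Documents written before a field existed simply pick up the default when loaded, so no migration is run for that change.

```python
from dataclasses import dataclass, field, fields

@dataclass
class User:
    # Hypothetical domain object; fields added later carry defaults.
    name: str
    email: str = ""                            # added after the first release
    tags: list = field(default_factory=list)   # added later still

def load_user(doc: dict) -> User:
    """Build a User from a stored document, ignoring unknown keys."""
    known = {f.name for f in fields(User)}
    return User(**{k: v for k, v in doc.items() if k in known})

old_user = load_user({"name": "ada"})  # written before email/tags existed
new_user = load_user({"name": "grace", "email": "grace@example.com",
                      "tags": ["admin"], "_id": "42"})
```

This only covers additive changes with a sensible default. Renames, type changes, and splits still need either a migration or read-time handling.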
i don't think so, i simply refuse to have my arguments framed that way since i'm very well aware of the shortcomings present in what has so far passed for a "RDBMS". i consider velocitypsycho's argument ("Thinking RDBMS is a silver bullet is just as bad as thinking a document database is a silver bullet.") a non-sequitur: the former is a proper superset of the latter, that sentence simply "does not compute".
doubled down on SQL as the silver bullet
where does the post you replied to mention SQL? are you confusing SQL for the relational model?
Weird, speed of development and execution is exactly why/when I would reach for just using a SQL-based RDBMS.
Whereas, going to a non-relational solution would be something I'd do when I needed to trade away simplicity and consistency in exchange for scaling/performance for specific cases where relational can't handle the write load. Losing the schema and ability to do joins and arbitrary queries and assume transactionality is a big loss that requires lots of work.
A dirty secret of NoSQL is that sure, part of their target audience is legitimate needs, but another part of their target audience is just new developers who don't have much experience with either SQL or NoSQL databases, and are easily misled into buying the idea that NoSQL will be easier Because There's No Schema or Joins To Think About.
I'm starting to tune out whenever I hear someone saying "schemaless". There's always schema. The question is whether schema's enforced at write time, or if instead it needs to be dealt with at read time.
Schema on write is a hassle up front because you've got to start imposing a strict data model up front, possibly before you've even got a good idea what your information domain looks like. And every misstep will be immediately punished with a painful migration.
Schema on read is, IMO, a hassle in the long run because now any code that's consuming the data needs to be prepared to have the information come in any of the ways that it has ever been stored in the history of the database. If folks were being disciplined, then hopefully that's a small number. If they weren't, you may end up with either some sort of combinatorial explosion, or a situation where you've seriously got to null-check every little thing. Every misstep will forever be punished with a million tiny little if-statements nagging at you like endless paper cuts.
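A small sketch of what schema-on-read tends to look like in consuming code. The three document shapes below are invented for illustration; the point is only that the reader has to know every shape that was ever written.

```python
def read_full_name(doc: dict) -> str:
    """Return a display name from any of the (hypothetical) historical shapes."""
    name = doc.get("name")
    # Newest shape: {"name": {"first": ..., "last": ...}}
    if isinstance(name, dict):
        return f"{name.get('first') or ''} {name.get('last') or ''}".strip()
    # Middle shape: {"name": "Grace Hopper"}
    if isinstance(name, str):
        return name
    # Oldest shape: {"first_name": ..., "last_name": ...}, either may be null
    first = doc.get("first_name") or ""
    last = doc.get("last_name") or ""
    return f"{first} {last}".strip()

docs = [
    {"first_name": "Ada", "last_name": "Lovelace"},
    {"name": "Grace Hopper"},
    {"name": {"first": "Edgar", "last": "Codd"}},
    {"first_name": "Linus", "last_name": None},
]
names = [read_full_name(d) for d in docs]
```

Every other reader of the same collection needs its own copy of this branching, which is where the paper cuts come from.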
I suppose it's easy to guess where my crass sentiments lie.
Yeah, as for me you're preaching to the choir :-) But that's fun so I'll do it some more too
You don't get to just stop worrying about schemas, joins, and transactions in your database. If your database won't do those things for you, now YOU have to.
* No schema just means (as you nicely describe) your app deals with schema changes, not the database. Have fun.
* No joins just mean your app deals with the complex choreography of keeping denormalized tables up to date. You think that's easy? Have fun deciding what updates to make synchronously or asynchronously, what to compute where and when, making sure you don't screw up and either hose performance or end up with stale data in a dependent table... enough of this and you'll be dying to come back to a proper relational database where you can just CREATE MATERIALIZED VIEW and call it a day
* No transactions just means.... oh, who am I kidding, you're not going to bother to write your app carefully to deal with those consistency semantics (and if you do, you'll probably do it wrong), you're just going to ignore them, call it a day, and hope your product doesn't get popular enough that the race conditions start pissing people off.
* No SQL means instead you're probably going to use some protocol that's much newer, much less popular, and locks your app and your mental knowledge into a single specific database product. Have fun rewriting your whole database layer and relearning the whole API and data model when you realize FooDB might be a better fit than BarDB. And it's way easier to go from SQL to non SQL if you end up having to than the other way around
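For the transactions bullet, here is a minimal sketch of what the database does for you, using Python's sqlite3 module. The `account` table and the `transfer` helper are hypothetical. The credit and the debit either both happen or neither does, and the CHECK constraint stops an overdraft without any application-side bookkeeping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table for illustration.
    CREATE TABLE account (
        id INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    );
    INSERT INTO account VALUES (1, 100), (2, 0);
""")

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return False if rejected."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit violated CHECK (balance >= 0); the credit was rolled back.
        return False

ok = transfer(conn, 1, 2, 60)        # 100 -> 40 and 0 -> 60
rejected = transfer(conn, 1, 2, 60)  # would leave account 1 at -20
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

Without transactions, the application has to notice the failed debit and undo the credit itself, and has to get that right under concurrency.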
NoSQL is driven more by scaling issues than anything else. Joins and strong consistency are awesome but run head first into the CAP theorem and other concerns like single node performance on a sharded cluster.
There is also the fact that no programming language lets you deal with relational data sanely in code, so you have the well known impedance mismatch headache. All popular languages I've seen offer hierarchical (maps of maps etc.) and more primitive structures only.
Joins seriously have nothing to do with the CAP theorem, except inasmuch as multi-key read/write transactions do. Consistent multi-key read-only transactions and write-only transactions are actually not terribly difficult to do under serializability across partitions. Additionally, from a sheer performance and data perspective, centralized relational databases work just fine for real workloads. Unfortunately, nearly every real-world app needs multi-key read/write transactions, and modern businesses expect uptime that is unrealistic for a centralized system, so people are forced to replicate across datacenters. Ultimately what most people end up doing (regardless of database) is partitioning into small enough key groups that they can afford the highly expensive latency cost for maintaining high availability (across multiple datacenters) while maintaining consistency within each group for read/write transactions, and giving up on consistency across partitions for such operations (but usually maintaining consistency on read-only and write-only transactions). NoSQL can sometimes give you a clearer understanding of the cost/consistency tradeoff you're making, and hierarchical keys can make it much easier to partition, but joins really hardly enter into it.
> Joins and strong consistency are awesome but run head first into the CAP theorem and other concerns like single node performance on a sharded cluster.
The CAP theorem doesn't force you to give up consistency within a single node. NoSQL databases often do, though.
> All popular languages I've seen offer hierarchical (maps of maps etc.) and more primitive structures only.
So, instead of fixing programming languages, let's cripple databases?
LINQ makes querying somewhat more pleasant, but it doesn't fix the real problem: relations are inexpressible as first-class values (not the same thing as first-class objects!) in most object-oriented languages.
Mongo doesn't solve the impedance mismatch either: if you have 2 classes which have a many-to-many relationship, you'll still need to think about it when you model your data, without any guarantee of consistency. MongoDB barely makes queries easier to write, and lacks the power of SQL. How would you persist a "recursive" model with MongoDB while being able to do aggregation operations on the collection, like counting the number of models? If you use a single collection for all the models, you'll then have to load all the models in the application code and count them in the code, where a simple COUNT would do the trick. With Mongo you're constantly trying to reinvent SQL in the code. On the other hand, SQL allows one to write recursive queries for tree-like structures.
I understand the trade-offs in order to scale, but you can give up on joins WHEN it's time to scale. Mongo doesn't give one the choice in the first place.
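As a sketch of the recursive-query and COUNT points, here is a self-referencing table queried through Python's sqlite3 module. The `category` table and its rows are invented for illustration, and SQLite stands in for whatever RDBMS you actually use. The same idea is spelled `WITH RECURSIVE` in PostgreSQL as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical "recursive" model: each row points at its parent.
    CREATE TABLE category (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        parent_id INTEGER REFERENCES category(id)
    );
    INSERT INTO category VALUES
        (1, 'root', NULL),
        (2, 'databases', 1),
        (3, 'relational', 2),
        (4, 'document', 2),
        (5, 'languages', 1);
""")

# Counting all models is one aggregate, no application-side loop.
total = conn.execute("SELECT COUNT(*) FROM category").fetchone()[0]

# Walk the subtree rooted at a given node with a recursive CTE.
descendants = [row[0] for row in conn.execute("""
    WITH RECURSIVE subtree(id, name) AS (
        SELECT id, name FROM category WHERE id = ?
        UNION ALL
        SELECT c.id, c.name
        FROM category c JOIN subtree s ON c.parent_id = s.id
    )
    SELECT name FROM subtree ORDER BY name
""", (2,))]
```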
SQL is a relatively popular programming language that lets you deal quite sanely with relational data.
Then there's Prolog, Kanren, Mercury… which are admittedly not especially popular.
The so-called “impedance mismatch” comes mostly from people who don't understand all that SQL can do and inevitably end up replicating it in some other language.
> There is also the fact that no programming language lets you deal with relational data sanely in code, so you have the well known impedance mismatch headache.
The "Object-Relational impedance mismatch" is not a result of the supposed fact that "no programming language lets you deal with relational data sanely in code", it's a result of the fact that industrially popular object-oriented languages, of the time when the term was coined, did not align well with the data model supported by then-existing relational databases, and vice versa.
The thing is, for the majority of deployments out there, developers don't even know what the CAP theorem is, nor do they need the theoretical performance NoSQL databases have over a default-configured RDBMS.
Just like all those big data deployments that can fit on a USB key and be processed by plain UNIX tools.
"NoSQL is simply the result of not wanting to think about the logical structure of data."
NoSQL was an attempt to scale by sacrificing some of the capabilities of the relational model. Key value stores scale great, at the cost of having almost no query capabilities to speak of.
Now, some developers may adopt NoSQL due to the ease of getting a new project started. But I don't think that was the main motivation of the developers of the major NoSQL databases.
(Although, NoSQL is so broad I'm sure there are counter examples.)
> NoSQL was an attempt to scale by sacrificing some of the capabilities of the relational model.
You can give up global consistency without sacrificing local (single-node) consistency. And normalization isn't an all-or-nothing proposition: you can select the kind of schema that best fits your needs. Unlike the case with NoSQL, which just says “lalala... I can't hear you” whenever you bring up consistency.
> Key value stores scale great, at the cost of having almost no query capabilities to speak of.
Far more worrisome is the loss of data integrity guarantees. It's okay to let me selectively disable these guarantees when I don't need them (say, by using a less structured schema), but a “database management system” that doesn't let me enforce the intended structure of my data, under any configuration, is simply not worthy of the name “database management system”.
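A minimal sketch of the kind of integrity guarantee being described, using Python's sqlite3 module. The `customer` and `invoice` tables and the `try_insert` helper are hypothetical. The database itself refuses rows that break the declared structure, whichever client sends them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite specifically leaves foreign key enforcement off unless asked.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Hypothetical tables for illustration.
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE invoice (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
    );
    INSERT INTO customer VALUES (1, 'ada@example.com');
""")

def try_insert(sql, params):
    """Attempt one insert and report whether the database accepted it."""
    try:
        with conn:
            conn.execute(sql, params)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

results = [
    try_insert("INSERT INTO invoice VALUES (?, ?, ?)", (1, 1, 500)),   # valid
    try_insert("INSERT INTO invoice VALUES (?, ?, ?)", (2, 99, 500)),  # no such customer
    try_insert("INSERT INTO invoice VALUES (?, ?, ?)", (3, 1, -5)),    # negative total
    try_insert("INSERT INTO customer VALUES (?, ?)", (2, "ada@example.com")),  # duplicate email
]
```

Dropping a constraint from the schema is the "selectively disable" option; the complaint above is about systems that offer no way to turn such checks on at all.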
To be fair, though, it is genuinely difficult to gauge which technologies are worth adopting.
Clearly some are, otherwise we'd all still be writing hand-crafted assembler to optimize drum memory I/O, or perhaps heaps of Perl CGI scripts.
It's also unrealistic to expect the average programmer - hell, even the pretty good programmer - to be aware of 50+ years of relevant research. For starters, many of them can't even access the relevant papers without paying some gatekeeper a whole lot for the privilege. If they could, they'd need a significant investment of time to understand them, especially without ready access to domain experts who can tell them which papers are most worth reading and in which order. (All of these are benefits of graduate student life that are often taken for granted.)
Intuitively understanding the benefits of the relational model doesn't require access to scientific publications. Just a little bit of mathematical inclination and, most importantly, basic common sense.
Mongo isn't the only bad one but it's kind of humorously become the canonical bad one -- where by "bad one" I mean the kind of NoSQL database you choose just because you don't understand SQL and not because you have a specific scaling need. A good example of a NoSQL database you'd pick to help with write scaling might be Cassandra (which has schemas and is looking more and more SQL-like these days with CQL -- the consistency and data models are just different to allow for writes to scale up).
Not saying Mongo doesn't have any legitimate uses (although I honestly suspect it doesn't, not even the ones listed at the end of TFA, at least not assuming the developer is already well versed in using a good relational database and related tooling. In fact I think I'd be willing to challenge and bet money against a mongo proponent to see who could create "an MVP on a super-tight schedule" faster.)
The author freely admits they tried to do a relational schema in a non-relational database. Then proceeds to say it was the wrong choice of product, instead of design. This confuses me. I use MongoDB as my main database, and it works great for me. However, I did have to learn that it works differently, which meant writing my code differently.
My question is this: if you tried to use Postgres and designed your tables super super wide and embedded everything in large object data types and then it got too complicated to manage and figure out, would you consider that a failure of Postgres, or your failure as a developer/architect?
I will have to chalk this up to a good laugh myself. Lastly, I've used relational databases for most of my career, and I'm very good at writing SQL. But after learning MongoDB and how to use it properly, there isn't a lot that I would want to go back to an RDBMS for. And with the future roadmap of MongoDB, I don't see that changing.
I agree... but at this point, it's tough to see with all the ink that's been spilled on these issues for years how you could think anything else. Maybe I read HN too much, but the manifold problems with MongoDB have been widely publicized for the past 6 years... it seems pretty close to conventional wisdom that you're going to have those problems if you decide to use Mongo.
You have to separate between "Mongo the database" and "Mongo as it's used by companies"; the latter causes far more problems than the former.
I last used MongoDB seriously in 2012-2015. We had myriad operations problems including inconsistent indexing across shards (where some shards had an index created and others didn't, it was baffling), issues with the balancer not moving chunks properly, and more. Also it's just different than other DBs with its lack of transactional consistency (I think they've made progress on building this), but that's part of why it's fast.
However, the bigger problem is that document databases -- in general -- enable a kind of software development where the model sort of emerges over time, rather than being carefully designed from the beginning. Yes, it's flexible, but you pay an absolutely enormous cost down the line dealing with inconsistent documents. It's not like code where if you do something stupid, you can fix it over time with refactoring and "remodeling" -- data has mass. You can get into a situation where, with a large data set, it can take a week or more just to run the migration script required to scan an entire collection and rewrite a few billion documents into a new, better format.
There is no such thing as a "schemaless" database. That's like saying, oh sure, we just have a bunch of 1s and 0s in memory -- our data is "structureless". The question is whether the database enforces the schema, or not. And I think that in a lot of cases, it's a lot worse to have an "uncodified schema" than a rigid, but at least well-defined, one, that's consistent across the data at all points.
Sidenote: It's also occurred to me over the past few years that it's almost impossible to impose a consistent schema on a large enough dataset. If you truly are dealing with "big data" (TB/PB scale) maybe go straight to the document store or columnar store because doing a migration is outright impossible, but don't be so quick to write off a schema for GB-scale datasets.
I should point out that for many organizations that have "big data" sitting somewhere, it usually is structured to begin with because it was collected by a repeatable process; or at the very least each piece of the whole (if it is a collection of stuff from different corners) has its own internal consistency.
A challenge there is determining whether it makes sense to massage the data into a common schema for further analysis or to use an unstructured initial approach from the beginning. Sometimes you get to the former from the latter.
>>> Sidenote: It's also occurred to me over the past few years that it's almost impossible to impose a consistent schema on a large enough dataset
+1.
I agree that NoSQL promotes "careless db design" to an extent.
But yes (as an example) situations do arise where you either need to alter the existing SQL table with bazillions of rows of data in it or refactor your design in an ugly-ass way by adding a 2-column table tied to the 1st table by some FK.
I would say the latter and the former are inextricably intertwined; i.e. Mongo causes problems for companies because it does things you would not expect a database to do, like your example of phantom indexes...and pretty much all bets being off when you try to leverage sharding... the one thing it was supposed to be able to do to scale...
We had more problems with sharding over the years than you could imagine...the distributed locking mechanism didn't work a lot of the time, the balancer didn't work, weird consistency issues between the config servers, configuration that didn't get replicated across all shards, stupid shard key selection (admittedly our fault but there really should be better guidance on this topic), etc.
I agree with your points about typical database usage. Could it be that some of the people who reach for a database don't in fact know what a particular database implementation (like Mongo, or MySQL, or Redis) actually does, and they're just looking for a black-box that holds data at rest and occasionally gives it back out?
Yeah, maybe. But there's a bigger point here and that's that software engineering is starting to become more of a "there's a way we do things" craft profession (like medicine, law, or architecture) vs. a "everything is from first principles all the time" endeavor. There's just too much stuff to know. You can't expect people to have deep experience with more than, say, 2-3 databases every decade.
So we have to rely to some extent on the experience of others and our own intuitions / less than perfect inferences.
Because Mongo is wrongly marketed as "all-in-one" database.
Instead it should be marketed as "good database for dynamic data that requires complex filtering, also we have very good drivers for many languages" because that's the only thing it's good at.
But there are other databases that are good at that, too, and don't drop your data. RethinkDB? RavenDB? Depending on the type of data you're using either of those could be an option, and won't just drop your data randomly.
I've had nothing but positive experience with RethinkDB drivers. RavenDB has good drivers for some languages, but admittedly not every language is well-supported.
Still, I can't see a situation where I would choose a non-working datastore over a working datastore.
I'm having an excellent experience coding against RethinkDB in Tornado Python. A shift in programming style from conventional callbacks to use Tornado's gen.coroutine is necessary. Having made that shift it's become very easy to use coroutines in the server to stream RethinkDB's JSON result sets up to an AngularJS front end. End to end JSON makes for zero impedance and rapid development. I'm constantly rejigging my schema, indexes and joins as I go so it feels like RDBMS based dev in a lot of ways.
Yeah. I remember thinking a few years ago that it tries to be three totally different products at the same time: a straight NoSQL k-v store like S3, a SQL-like OLTP relational thing ("rows are sorta like documents, right") and an offline analytics store with the map/reduce functionality.
It's like they wanted to be all three but couldn't quite commit to one, and the end result is the proverbial "tankicopter" that's neither as strong as a tank, nor as maneuverable as a helicopter.
I think it also might be okay as storage for derived data, but then again, why not shove it into a `uuid-jsonb` Postgres table, perhaps on a second server.
AFAIK quite a few businesses are stuck with it, as the cost of migrating to anything else would be prohibitive. It's usually easy to migrate logic from one language to another; it's harder to change a persistence layer, especially when it involves a paradigm shift such as NoSQL to SQL.
I think Mongo is GREAT at one very important thing:
Storing other people's data.
If you need to consume other services, especially if it's more than one, it's hard to beat Mongo (or other NoSQL databases). You get a lot of search power (not always the easiest to tap, but you get it), and your app won't break when they change their formats. If you need a LOT of it, even better.
I would never use it as my main datastore, though. At least not for any projects I can think of offhand.
PostgreSQL 9.4/9.5 offers nice support for JSON-type data structures. If someone wants to iterate fast, maybe they should start with that as a baseline.
It's absurd to use a schemaless database for the simple sake of "rapid" iteration! That is so, so incredibly inefficient. You need — NEED — to understand your domain data (its structure, volume, usage, etc.) before you even begin writing a single line of code or picking a specific DB vendor. This — and only this — will rightfully inform you of what your actual technical requirements are.
If we treat Mongo as nothing more than "schemaless", then we are sure to abuse it after some time. There are several other factors to consider. Mongo is a NoSQL db and not a Greek god of data storage. So, do not curse it for your sins.
Are there any key/value JSON databases that enforce JSON Schema (or some other schema language)? That seems way better to me in situations where you actually care about data integrity.
I don't think so -- I'm looking for a database that will actually enforce a JSON schema for you -- I don't think Postgres has any built-in support for JSON schematization.
I could always do validation before inserting data, but that opens me up to error on my side which I'd like to avoid=)
You are trying to hammer a screw in. If what you want is a facility for the database to provide you with JSON on query, Postgres has a couple of functions for that, namely array_to_json and row_to_json: https://www.postgresql.org/docs/9.2/static/functions-json.ht...
A little late to reply, but CouchDB has Validate Doc Update functions that can validate based on the old document, the new document, and the user context, in javascript:
In my particular case clients of different quality are connecting to a local datastore. I'd like to make sure that even if they mess up validation the data in the store still matches the schemas it claims to. Of course, I could have them connect to a local process instead and have that process handle validation before the data goes in the store, but it's always nice to avoid intermediaries.
If your data store has existed for more than a year or two, it's very likely that other people in the org have written utilities against it (in different languages) that you personally are unaware of. You can reliably and atomically change the data store's business logic but not all the clients'.
I am coming from an 18-year SQL admin/dev background, and am both an MCDBA and MongoDB certified DBA. I and my DEV team have been incorporating MongoDB into our stack for about 5 years now in multiple use-cases.
I am using MongoDB in my production environment for things like consuming incoming rates from different vendors, storing and serving pay stubs and client invoices to our web customers, controlling MSMQ message queues, archiving client emails for historical audits, etc. The key here is that they are document oriented entities, not normalized relational data.
What it comes down to is a willingness to be agnostic in selection of your database platform. Or more to the point, to let your use-cases drive the platform instead of the reverse. If you are going to develop a use-case that requires frequent partial updates, JOINs between multiple data structures, and traditional entity normalization, then a traditional RDBMS is appropriate. If your case allows for a denormalized mode of storage where all of the relevant information is contained within one document structure, and you can benefit from a fluid design, then MongoDB could be a good fit. If you need ephemeral key-value pair structures such as in session state caching, then something like Redis may be more in line with the requirements. They all have sweet spots that they fill well...
We all tend to have our preferred DB "hammer" to drive developmental "nails". What I propose is that we need to have an entire database toolbox from which to choose the right tool for each job.
Others have commented here about the need to thoughtfully plan before you write one line of code and/or choose your DB platform. I have to agree completely. You can map a typical relational use-case to MongoDB very easily to start with, but you do need to be able to enforce some level of control. MongoDB does this now with document validation.
MongoDB, like the rest of us, is constantly iterating and improving. If you have not looked at it lately and are basing your opinions on earlier versions, I would encourage you to take another look...
Not nodejs, but I am in the process of tracking down an event loop loop (e.g. emit( event_id ) somewhere in the path of an event listener of that name). I assume this can happen within node too.
The code is based on boost asio, so the io_service is on the same thread as your code if you only start 1. Then after that I am in gdb/ldb. Asio's io_service handles the async calls portion and dispatches to my callback. From there on it is no longer necessarily async. The fun is that I have a higher level event handler sitting above this to register/unregister listeners, like one would in node. If one of those listeners happens to emit a message to what it is also listening for, boom.
long story short, it's not the async code per se, but after that.
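The failure mode described above can be sketched in a few lines. This is not the boost asio code in question, just a hypothetical synchronous emitter in Python with a guard added so the loop shows up as an error instead of unbounded recursion. The `Emitter` and `EventLoopError` names are made up.

```python
from collections import defaultdict

class EventLoopError(RuntimeError):
    """Raised when a listener re-emits the event it is currently handling."""

class Emitter:
    def __init__(self):
        self._listeners = defaultdict(list)
        self._active = set()  # events currently being dispatched

    def on(self, event, callback):
        self._listeners[event].append(callback)

    def emit(self, event, payload=None):
        # Dispatch is synchronous, so a nested emit of the same event
        # would recurse forever without this check.
        if event in self._active:
            raise EventLoopError(
                f"listener re-emitted {event!r} while it was being dispatched")
        self._active.add(event)
        try:
            for callback in list(self._listeners[event]):
                callback(payload)
        finally:
            self._active.discard(event)

emitter = Emitter()
seen = []
emitter.on("tick", lambda payload: seen.append(payload))
emitter.on("tick", lambda payload: emitter.emit("tick", payload))  # the bug

try:
    emitter.emit("tick", 1)
    loop_detected = False
except EventLoopError:
    loop_detected = True
```

A guard like this only catches direct re-entry on one emitter. A cycle that bounces through a queue or a second emitter needs a different check, such as a hop counter on the message.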
I'd like to know this too. I've just started a proof of concept s3-lambda-dynamodb project and I'd like to know what others think about it.
Also I know document dbs don't aggregate well, but wouldn't it be appropriate to have your app backend in NoSQL and have etl or other duplication to an rdbms or possibly cassandra?
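A rough sketch of the ETL idea in the question: flatten documents into rows, load them into a relational table, and aggregate there. The order documents, the `order_line` table, and the in-memory SQLite target are all assumptions for illustration. A real pipeline would read from the document store and handle incremental loads.

```python
import sqlite3

# Hypothetical documents as they might come out of a document store.
orders = [
    {"_id": "a1", "customer": "ada",
     "items": [{"sku": "x", "qty": 2, "price_cents": 500}]},
    {"_id": "a2", "customer": "ada",
     "items": [{"sku": "y", "qty": 1, "price_cents": 1200},
               {"sku": "x", "qty": 1, "price_cents": 500}]},
    {"_id": "b1", "customer": "grace", "items": []},
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_line (
        order_id TEXT NOT NULL,
        customer TEXT NOT NULL,
        sku TEXT NOT NULL,
        qty INTEGER NOT NULL,
        price_cents INTEGER NOT NULL
    )
""")

# Transform: one row per embedded line item.
rows = [(doc["_id"], doc["customer"], item["sku"], item["qty"], item["price_cents"])
        for doc in orders for item in doc["items"]]

# Load in a single transaction.
with conn:
    conn.executemany("INSERT INTO order_line VALUES (?, ?, ?, ?, ?)", rows)

# Aggregate on the relational side.
revenue_by_customer = dict(conn.execute(
    "SELECT customer, SUM(qty * price_cents) FROM order_line GROUP BY customer"))
```

Note that the order with no items produces no rows, so that customer is absent from the aggregate. Whether that is right depends on the report, which is the kind of decision the flattening step forces you to make explicitly.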
Why, is there something inherently better in MySQL that makes their second choice also suboptimal? Genuine question, I have very limited knowledge on DBs.