This is wonderful work, and it is the foundation for yet more wonderful work in the future. However, as much as I am a huge PostgreSQL fanboy and a dedicated MongoDB peanut-thrower, I worry that the caption applied to this post is misleading.
Once this work is released, PostgreSQL will be faster than current versions of MongoDB at carrying out queries for documents which look for a given value at a given path (or perhaps more generally, values satisfying a given predicate at paths satisfying a given predicate).
But that has never been one of MongoDB's important strengths, has it? The operations for which MongoDB is optimised are inserting documents, retrieving documents by ID, and conducting map-reduces across the whole database. Yes, it has the ability to query by values in leaves, and it can use indices to make that faster, but that's a bit of a second-tier, bag-on-the-side feature, isn't it? If that was the main thing you needed to do, you wouldn't have chosen MongoDB, right?
Or just a single transaction with synchronous_commit turned off!
> Specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a "success" indication to the client.
> This parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it commits.
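For illustration, a minimal sketch of that per-transaction setting (the table and columns here are made up):

  BEGIN;
  -- only this transaction skips the wait for the WAL flush;
  -- a crash can lose the last few commits, but it cannot corrupt the data
  SET LOCAL synchronous_commit TO OFF;
  INSERT INTO page_view_log (url, viewed_at) VALUES ('/home', now());
  COMMIT;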
The feature works per table and is off by default. Unlogged tables have another nice property: if the server crashes (or anything else happens that might have left the data corrupted), the table comes back completely empty. This makes it clear to users that unlogged tables are for caching or other data you can afford to lose.
It also means that you can be sure that the data you read from an unlogged table is the data that you put there. I much prefer this over the possibility of reading corrupt data without any chance of detecting it.
In my case, the use-case of an unlogged table is a table that practically works like a materialized view (until postgres learns to refresh them without an exclusive lock and maybe even automatically as the underlying data changes). If the table is empty, I can easily recreate the data, so I don't care about any kind of WAL logging or replication. If the table's there, perfect. I'll use it instead of a ton of joins and an equally big amount of application logic.
If the data isn't there, well, then I run that application logic, which causes the joins to happen. That's way, way slower, but the result is the same.
The advantage of using an unlogged table here is that it's cheaper to update and it doesn't generate replication traffic (unlogged tables are not replicated).
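A rough sketch of that pattern, with made-up table names (the cache gets rebuilt from the base tables whenever it turns up empty):

  CREATE UNLOGGED TABLE report_cache AS
    SELECT o.customer_id, sum(o.total) AS lifetime_total
    FROM orders o
    GROUP BY o.customer_id;

  -- after a crash the table survives but is empty, so just repopulate it
  INSERT INTO report_cache
    SELECT o.customer_id, sum(o.total)
    FROM orders o
    GROUP BY o.customer_id;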
It is not stupid; it actually makes sense for ETL workflows, where you load raw data into unlogged tables and transform it in the database into the final (logged) representation.
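Something along these lines, where only the final table is WAL-logged (the names and file path are only illustrative):

  CREATE UNLOGGED TABLE staging_events (happened_at text, user_id text, payload text);
  COPY staging_events FROM '/tmp/events.csv' WITH (FORMAT csv);

  -- transform into the durable, logged representation
  INSERT INTO events (happened_at, user_id, payload)
  SELECT happened_at::timestamptz, user_id::int, payload
  FROM staging_events;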
> I'm now immune to MongoDB bashing now but jesus keep up with updates.
We do want Jesus to keep up with the updates, for sure, but this isn't just about a random slip-up; it goes a bit deeper than that. This wasn't an unfortunate "oh, we meant to do all we could to keep your data safe" bug, it was a deliberate design decision.
Dishonesty and shadiness were the problem. When they finally started shipping with better defaults, did we see an apology? Was anyone responsible terminated? Nope. Not that I remember.
Now, it could all be a statistical fluke: people just randomly decided to bash MongoDB more than, say, Basho, Cassandra, PostgreSQL, HBase and other database products? Odd, isn't it. Or isn't it more rational to conclude that it isn't random at all, and that there is a reason for it?
When developers sell to developers, a certain level of honesty is expected. If I claim I am building a credit card processing API but deliberately add a feature that randomly double-charges to gain extra performance, you should be very upset about it. Yes, it's in the documentation, in fine print on page 156: "oh yeah, enable safe transactions, you should have read that". As a result you should probably never buy or use my product, and should stop trusting me with credit card processing from then on.
The bottom line is, they are probably good guys to have beers with, but I wouldn't trust them with storing data.
Your analogy rests on the absurd assumption that mission critical components don't need to be fully vetted and understood.
Shady? Yes. Dishonest? No. To be sure, both of these attributes suck, but beyond that there is an absolute world of difference between the two. One passively leaves knowledge gaps that even novice consumer programmers would uncover (the source is wide open after all, and speaking from personal experience, any claims of hyper performance should arouse enough suspicion to inspire some basic research), while the other asserts no knowledge gaps and actively inhibits knowledge discovery.
That said, they fell very far from the ideal, and for that they should be faulted. They're now paying the price by living in a dense fog of distaste and disgust. I don't think anyone believes it's a "statistics fluke", but it's likely that a mob/tribal mentality plays some role (this is an industry notoriously susceptible to cargo cults) and that many individuals on the burn-MongoDB-at-the-stake bandwagon do not actually have deep--or moderate!--experience with it and competing products. Also, would you chalk it up to statistical anomaly that they were able to close funding and that many large scale systems use their product?
Isn't it possible that the company had shameful marketing practices that repel the programmer psyche, but that the product itself has a unique enough featureset (schemaless yet semi-structured querying, indexing, dead simple partitioning and replication, non-proprietary management language, simple backups, all at the cost of RAM/SSD) to make it compelling for some problem classes?
Unacknowledged writes are disturbing but this can happen in any system: data hits the disk, and just before the server can send an ack to the client, it crashes.
When map/reduce comes into play, speed isn't usually the main issue. It's a way to do work in parallel - which may be done to increase speed, but more often than not it's done because there's too much work to be done any other way.
Put another way, I see map/reduce as a background task - building indexes, reporting, etc. I don't think speed is as important there as it is for a foreground task like GET, PUT and DELETE.
For these operations, especially by key, Postgres has always been right up there. What's happening now is that Postgres is getting really good at NoSQL features - documents, key/value pairs, deep querying, etc. - good enough that if you have Postgres you can do pretty much whatever you could have done with Mongo, while still making use of everything else it has to offer.
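To make that concrete, here's a small sketch of the document-ish querying that's already possible with hstore plus a GIN index (the table and keys are invented for the example):

  CREATE EXTENSION IF NOT EXISTS hstore;
  CREATE TABLE docs (id serial PRIMARY KEY, attrs hstore);
  CREATE INDEX docs_attrs_idx ON docs USING gin (attrs);

  -- containment query, answered via the GIN index
  SELECT id FROM docs WHERE attrs @> 'status => active, plan => pro';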
You mischaracterize one of MongoDB's main draws. It's not just indexing, but indexing over a schemaless dataset. Mongo isn't unique in providing this, but combined with other features it is compelling for certain scenarios.
That said, if Postgres matches/exceeds enough Mongo features, it could certainly outshine Mongo for those scenarios.
AFAIK, map-reduce is relatively expensive on Mongo... typically recommended only for background execution, nothing synchronous.
I've harped on this many times in the past. I love PostgreSQL, but where is the replication that matches MongoDB for ease of setup? Please answer "you must not be keeping up with PostgreSQL lately" and I'll eat my words!
IMHO this seems like building a better apple by making it taste like an orange. This could definitely be useful for storing JSON from JavaScript frameworks like Backbone/Angular/Ember, but I don't see how this is going to improve the lives of PostgreSQL users in general. Hopefully this leads to some weird, cool, emergent use case that makes PostgreSQL way better than other databases, but I'm not sure at this point that it will. Does anybody have more context on the long game for this approach? I'm not sure what the goal is for adding this feature.
Look at some of the linked PDFs. It's the indexing technique and on-disk format that are the enablers for several notable use cases; personally, I'm looking forward to a time when I can use pg for full-text search without having to resort to another external system like Lucene or Sphinx. What it won't have is the query "smarts" that Lucene or Sphinx have today, but you need speed first.
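For what it's worth, the built-in full-text search already looks roughly like this (a minimal sketch with invented names); the GIN work is what should make the index side of it fast:

  CREATE TABLE articles (id serial PRIMARY KEY, body text);
  CREATE INDEX articles_fts_idx ON articles
    USING gin (to_tsvector('english', body));

  SELECT id FROM articles
  WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & json');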
I'm very excited to put my 3m documents into hstore, as they are currently horrible, unqueryable, (mostly) unwritable blobs of compressed, b64-encoded, serialized Ruby hash structures (long story, predating NoSQL).
This makes me very, very happy.
Edit:
I should point out why:
- my document structures pretty much match 1:1 with hstore now (we have nested arrays of hashes)
- I can query & report on the data with normal tools (sketched below)
- I can update individual fields on my data (currently impossible)
- I can do relational joins & such-like in the db (currently impossible)
- I can have sensible transactions
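A sketch of what the flat-hstore versions of those operations look like today (the nested variant from the talk isn't released yet, and all the names here are invented):

  -- query & report with normal tools
  SELECT count(*) FROM documents WHERE meta -> 'state' = 'archived';

  -- update an individual field in place
  UPDATE documents
  SET meta = meta || hstore('state', 'active')
  WHERE id = 42;

  -- join against relational tables
  SELECT d.id, u.email
  FROM documents d
  JOIN users u ON u.id = (d.meta -> 'owner_id')::int;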
This is impressive but not really surprising given the quality of the PostgreSQL database. Quote:
"We added performance comparison with MongoDB. MongoDB is very slow on loading data (slide 59) - 8 minutes vs 76s,
seqscan speed is the same - about 1s, index scan is very fast - 1ms vs 17 ms with GIN fast-scan patch. But we managed to create new opclass (slides 61-62) for hstore using hashing of full-paths concatenated with values and got 0.6ms, which is faster than mongodb !"
But note it's at the expense of a very large index file - 800MB (vs Mongo's 100MB). Although Pg's index covers the entire JSON structure, whereas Mongo's index only covers the leaves being searched, so Pg's will optimize a larger variety of queries.
800MB is for the old GIN index, the new experimental one is 350MB, and has not yet undergone more space optimizations. They are also not really comparable: Mongo only indexes a few select keys, whereas the new GIN index does many.
It takes a non-0 amount of time to load large indices into memory (upwards of 5 seconds, depending on your choice of disks and arrays), and to iterate through large indices once they are in memory.
Such large indices also limit how many of them can be stored in memory at the same time - making transactions against those indices slower over time if they have to be loaded and unloaded from memory frequently.
You should always choose indexes with care, because indexes not only cost space, they cost time. And as you said, time is expensive.
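If you want to see what an index is actually costing you space-wise, something like this does the trick (the index name is made up):

  SELECT pg_size_pretty(pg_relation_size('docs_attrs_idx'));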
> It takes a non-0 amount of time to load large indices into memory (upwards of 5 seconds, depending on your choice of disks and arrays), and to iterate through large indices once they are in memory.
This is why MongoDB recommends that all indices fit in memory.
Great work and research; horrible submission headline.
The news here is that, in addition to the huge market for traditional DBs, postgres is going to compete in a serious way on MongoDB's home turf. As that becomes more apparent, it will validate postgres's flexibility/adaptability and cast doubt over special-purpose database systems and NoSQL.
MongoDB still has a story around clustering, of course. But that story is somewhat mixed (as all DB clustering stories are); and postgres is not standing still on that front, either.
(Disclaimer: I'm a member of the postgres community.)
Hey Jeff, been a postgresql fan for a long time, but clustering is my number one issue. What is coming down the pipe for clustering/replication for PostgreSQL?
Logical replication will be a big deal over the next couple of releases. I'm not sure exactly what will be available when, but I know that good foundations are being built now.
Jeff, first, thanks for the response and I'm heartened by what you say.
If I could suggest, please look at how easy it is to configure replication in MongoDB. If you make it that easy, even if it's just for the JSON stores (I know that maybe you can't make it as easy for relational tables), I will move my product to PostgreSQL in a heartbeat.
I have used and promoted PostgreSQL throughout my career, but I'm currently stuck with having to deliver HA in a product for redistribution to customers, and Slony can't cut it for our use cases; it's too complicated. I think the key is that we are a vendor and we need to provide a database with our products, so it's not me who's responsible for configuring HA, it's our users. And replication is what I'm concerned about 90% of the time, not necessarily horizontal scale-out (i.e. we're not using sharding in 90% of our cases, we just need HA).
SELECT json_field FROM data WHERE (json_field->>'age')::int > 15
Part of the performance increase for hstore is improvements for GIN indexes, and according to the author can be applied to json. So yes you can use indexes on your hstore or json documents.
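As a sketch of what's already possible for the query above (independent of the new GIN work): 9.3's json operators can be used in a plain expression index, assuming, as I recall, that the extraction operators are immutable:

  CREATE INDEX data_age_idx ON data (((json_field->>'age')::int));

  SELECT json_field FROM data WHERE (json_field->>'age')::int > 15;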
Yes, for a project I'm working on, I had a JSON array of objects that each had a key called name.
I have fast indexed queries that return any rows containing an object with a given name, plus full text search on that name, along with a few other things.
Which is not surprising, as this is a 'real' database. I read a review of MongoDB internals and it did not sound very positive. While you can create a NoSQL store quickly, doing it well with any db takes time, which PostgreSQL has definitely had.
Just to be clear, MongoDB is a fine db for certain scenarios and I am using it in production.
If you want more of the detail and background around this, which came from a talk by the author of the blog post, you can check out his slides at http://www.sai.msu.su/~megera/postgres/talks/hstore-dublin-2.... One of the key things not in the blog post but addressed in the slides is that there will be a new index type that pushes the performance gains even further.
Well, if I make the same sacrifices I make to use Mongo I don't really have a reason to use PostgreSQL.
No, but I really like the recent enhancements in PostgreSQL. Failover is nearly as easy as in MongoDB; however, all this doesn't play so nicely yet if you are using stuff like extensions (PostGIS) or your own functions, and it still isn't really an out-of-the-box experience.
I agree: if you use it with the same limits and the performance gain is worth it, you might as well be using Postgres. However, a lot of this actually only changed in the recent releases. It's all still fairly new.
Also, there are still a number of things that are basically missing, like out-of-the-box upserts (we are using a function for this, but it's more of a hack), and if you are still somewhat in development, a lot of little changes get really hard in PostgreSQL. Converting your data structure, even with stuff like CTEs and the surrounding functionality, can become really challenging, especially when you think there must be an easier way.
While it is easier to modify structures in MongoDB, it is sometimes actually harder to aggregate them. Using stuff like map/reduce (or even the lighter-weight aggregation framework) frequently feels like overkill.
I think, however, it really depends on the kind of data you are dealing with. That's why we are using a hybrid system right now. Both systems are actually evolving really quickly, and if you have the joy of using their most recent versions, you're always excited about new releases.
About the upsert function, have you found a way to do this generically, or are you generating (at least) one function per table from a template? I've found the hackiness of this to be easily the worst part about using Postgres.
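For reference, the kind of per-table function being discussed is roughly the loop-with-exception pattern from the Postgres docs; this is only a sketch against an invented counters(key, value) table:

  CREATE FUNCTION upsert_counter(k text, delta int) RETURNS void AS $$
  BEGIN
      LOOP
          UPDATE counters SET value = value + delta WHERE key = k;
          IF FOUND THEN RETURN; END IF;
          BEGIN
              INSERT INTO counters (key, value) VALUES (k, delta);
              RETURN;
          EXCEPTION WHEN unique_violation THEN
              -- someone else inserted concurrently; loop and retry the UPDATE
          END;
      END LOOP;
  END;
  $$ LANGUAGE plpgsql;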
You didn't answer my question at all. You claimed postgresql was an enormous challenge to scale horizontally. I asked how. There are still lots of reasons to use postgresql, I assumed you knew this since you expressed that you wanted it to be easier to scale horizontally.
If you want to scale out using a replica-strategy you really want to have the most recent version of PostgreSQL. It made things like log shipping pretty much an automatic experience. It still has rough edges, but the latest version really makes a difference here. See the release notes of PostgreSQL.
You will however find plenty of information on the wiki. Sometimes it is not completely up to date because of so much going on on that front.
You can use third-party solutions too, but they usually come with a number of caveats and general problems. However, it depends on what you do with your database. If you don't write your own functions, use extensions, or have fancy transactions, they will work just fine. If you do, sticking to the built-in functionality of Postgres is the safer strategy. If you are dealing with extensions, etc., you should really be ready to get dedicated support for these things.
However, it really depends on what you do. For everything standard it won't cause too much trouble once the first setup is done.
The problem is he doesn't know he is making the sacrifices.
MongoDB's marketing is sloppy: it promises "webscale" while hand-waving partition tolerance away.
After the whole "let's disable durable writes to make our benchmarks faster" episode, I just can't see trusting them with data I actually want to read back from the database. It might be good for probabilistic storage or stats reporting. It sort of became an issue of trust more than anything.
Implementing our doc store on Postgres a while ago in lieu of switching to MongoDB suddenly seems like a very good call.
We had nothing against Mongo (frankly, we never did a deep analysis of how it would have turned out). It was simply the db we already had at hand, and we knew it well and trusted it.
It would be great if they could run these benchmarks with mongodb using TokuMX.
I think Tokutek's fractal trees would make an exciting data store for PostgreSQL, but benchmarks are difficult to find... Which is perceived to be faster these days, generally: MySQL+InnoDB, MySQL+TokuDB or PostgreSQL?
What I did: 1) use the latest Postgres (9.3); 2) use Rails 4, which provides string/integer arrays, hstore and UUIDs out of the box; 3) use post_json (https://github.com/webnuts/post_json).
These tools made it simpler to do things like tagging, user-defined data attributes, hierarchical data, flexible configuration data.
It was always possible to store much of this data using text fields and JSON.parse (or some such), but the new Postgres stuff makes the flexible datatypes queryable. And as built-in support for flexible data types comes online, you can throw away your custom serializers.
There are other Postgres datatypes that I haven't used - for geocoding/geosearching, network addresses, etc.
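On the plain SQL side, the flexible column types mentioned above look something like this (a sketch; it assumes the hstore and uuid-ossp extensions, and the table itself is invented):

  CREATE EXTENSION IF NOT EXISTS hstore;
  CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

  CREATE TABLE widgets (
      id       uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
      tags     text[],   -- string array
      settings hstore,   -- user-defined key/value attributes
      config   json      -- flexible configuration data
  );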
I was in the audience!
It was a great session - you're a funny guy, you explained the concepts well, and there were some really interesting real- (or alternate :p) world use cases for all this stuff.