It's sad, but many developers actually count on JSON key ordering. Why, I don't know, but they do; people just seem to like coding to the implementation rather than to the spec.
I've dealt with one of these. EAN, the Expedia Affiliate Network[0], has an API with what they call an "XML" mode and a "REST" mode, where "REST" means JSON. The "REST" mode is very clearly translated directly from the XML mode. I'm simplifying and doing this from memory, so bear with me, but here is how the translation works:
<Hotel id="1">
<Amenity>Bacon</Amenity>
</Hotel>
turns into
{'Hotel': {'@id': '1',
'Amenity': 'Bacon'}}
Straightforward enough. But here's what happens when you have two <Amenity> elements:
The translation engine appears to have no knowledge of the schema; it just adds ___List and ___s entries. So the schema of the JSON differs based on the presence of a repeated element.
Also, since it doesn't have any knowledge of the schema, all elements are textual except the special @size element.
Because @size is a special element for these lists (which of course you have no need for in JSON), and because the engine turns XML attributes into @____ entries as well, there is no way to get at an actual "size" attribute if one exists.
It has a reasonably decent (highly decent for a header file) description.
One thing I had to take a guess at is that, in JEntry, the header is a combination of a bit mask and an offset. I would also guess that offsets for identical keys (which can happen in nested JSON), and maybe offsets for other identical data, will be identical. I didn't check either.
PostgreSQL was already able to compete with Oracle's RDBMS and Microsoft's SQL Server, and it could soon supplant Mongo for most jobs.
It's great to know that the only required storage components nowadays could be PG and ElasticSearch (as PG's full-text search can't compete with ES), and that the former is a no-brainer to set up (on top of AWS, Rackspace, etc.) or cheap to acquire (with Heroku Postgres, for example).
Out of interest, what would you say the weaknesses are of the full-text search in PostgreSQL?
NB I've been using PostgreSQL for a few months on a side project and I've been hugely impressed. I wanted to add full text searching at some point and rather than using Lucene or Solr (or similar) I thought I would use PostgreSQL's own search capabilities - which certainly makes some things a lot simpler than using a separate search engine.
I can't speak for the original poster, but I have three issues with the postgres offering:
1) It has suboptimal support for handling compound words (like finding the "wurst" in "bratwurst"). If the body you're searching is in a language that uses compounds (like German), you have to use ispell dictionaries, which have only rudimentary support for compounds and which in many cases aren't maintained any more, because ispell has been more or less replaced by hunspell. Hunspell has far superior compound support, but it in turn is not supported by postgres. (A rough sketch of the ispell wiring follows after this list.)
2) If you use a dictionary for FTS (which you have to if you need to support compounds), the dictionary has to be loaded once per connection. Loading a 20MB dictionary takes about 0.5 seconds, so if you use Postgres FTS, you practically have to use persistent connections or some kind of proxy (like pgbouncer). Not a huge issue, but more infrastructure to keep in mind.
3) It's really hard to do google-suggest like query suggestions. In the end I had to resort to a bad hack in my case.
Nothing unsolvable, but not-quite-elastic search either.
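For 1), the ispell wiring looks roughly like this - a sketch only; the dictionary/affix/stopword file names are placeholders for whatever files you drop into $SHAREDIR/tsearch_data, and the configuration name is made up:

  CREATE TEXT SEARCH DICTIONARY german_ispell (
      TEMPLATE  = ispell,
      DictFile  = german,      -- german.dict
      AffFile   = german,      -- german.affix
      StopWords = german       -- german.stop
  );

  CREATE TEXT SEARCH CONFIGURATION de_compound (COPY = pg_catalog.german);
  ALTER TEXT SEARCH CONFIGURATION de_compound
      ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
      WITH german_ispell, german_stem;

  -- if the dictionary knows its compounds, this should split out the parts:
  SELECT ts_lexize('german_ispell', 'bratwurst');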
Thank you very much for 2) - this will allow me to considerably lessen the amount of infrastructure I'm having to take care of (pgbouncer can go now).
About 3: That's what I was using before moving to tsearch, but pg_trgm-based suggestions sometimes were really senseless and confusing to users. Also, using pg_trgm for suggestions will cause suggestions to be shown that then don't return results when I later run the real search using tsearch.
I'm happy with my current hack though, so I just wanted to give a heads-up.
Thanks - I'd be delighted to get to the point where I have those kinds of problems!
What I will do is bear those limitations in mind, and if I ever do have those problems there is, I guess, a reasonable chance that postgres development will have addressed them by then, or I will redesign things to use a separate search engine.
If you use a dictionary for FTS (which you have to if
you need to support compounds), the dictionary has to
be loaded once per connection. Loading a 20MB dictionary
takes about 0.5 seconds
Does this happen for every connection, or just those that use FTS?
I'm thinking of a case where only maybe ~0.5% of queries will use FTS. I'm working on discussion software, and I'd like the discussions to be searchable, but realistically only a small percentage of queries will actually involve full text searching.
Only happens for connections that use FTS and only if you have configured it to use a dictionary (I believe there's no need to in case of an english body). But keep in mind that this means that you'll have to pay at least 0.5 seconds per search. If you do some google suggest style incremental searching on key press, you'll need to be faster than that.
So either use a persistent connection for full text searching (I'm connecting via pgbouncer for FTS-using connections) or, as karavelov recommended below, use http://pgxn.org/dist/shared_ispell/ which will load the dictionary once into shared memory (much less infrastructure needed for this one)
This isn't a problem of the index type (I'm using GIN in this case), but of the tsearch lexizer which only has limited support for hunspell.
The step that causes the trouble is ts_lexize(), not storing and consequently looking it up.
Again, I made it work for my case, but it was some hassle and involved running a home-grown script over an ispell dictionary I've created by converting a hunspell one into an ispell one.
Because using a real full text solution goes much farther than using regexes and LIKE. It also allows the use of indexes which many regexes and many LIKE patterns would not allow.
For example, I can now find the "wurst" in "Weisswürste" which, yes, I could do with a regex, but I can also find the "haus" in "Krankenhäuser" and all other special cases in the language I'm working with without having to write special regexes for every special term I might come up with.
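As a rough sketch of the difference (the docs table is invented, and de_compound is a stand-in for a compound-aware German configuration like the ispell setup sketched further up):

  -- LIKE/regex: no dictionary knowledge, and a leading wildcard can't use a plain btree index
  SELECT * FROM docs WHERE body ILIKE '%wurst%';

  -- full text search: the dictionary handles the morphology, and a GIN index can be used
  CREATE INDEX docs_fts_idx ON docs USING gin (to_tsvector('de_compound', body));
  SELECT * FROM docs
   WHERE to_tsvector('de_compound', body) @@ to_tsquery('de_compound', 'wurst');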
Exactly. The only reason I have to bother with Solr at the moment is to get efficient ngram indexing for sequence data, which consists of lists of 7-8 character strings from various sources. What I have works, but it feels like overkill for my case.
These advances within the GIN inverted index infrastructure will also greatly benefit jsonb, since it has two GIN operator classes (this is more or less the compelling way to query jsonb).
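Concretely (table/column names invented): the default opclass supports the existence and containment operators, while jsonb_path_ops supports only containment (@>) but tends to be smaller and faster for it.

  CREATE INDEX docs_doc_gin      ON docs USING gin (doc);                 -- default jsonb_ops
  CREATE INDEX docs_doc_gin_path ON docs USING gin (doc jsonb_path_ops);  -- @> only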
2. Solr uses TF-IDF out of the box, and PG doesn't.
3. PG is good enough for 90% of cases, but Solr has some advanced stuff that PG doesn't. Like integration with OpenNLP, things like the WordDelimiterFilter. (Andre3000 = Andre 3000)
4. PG is kinda annoying in that it will parse "B.C." as a hostname, even though I want it to be a province (see the ts_debug sketch after this list).
5. Solr is faster than PG, but PG has everything in one server.
6. Solr handles character-grams and word-grams better.
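For 4), ts_debug is the easy way to see which token type the default parser assigns to a given input (I haven't verified what it reports for this exact string; this just shows how to check):

  SELECT token, alias, description
    FROM ts_debug('english', 'B.C. is a province');
  -- per the complaint above, "B.C." reportedly comes back as a host-style token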
Another weakness not mentioned is document similarity search. You have one document and you want to find the N most similar in your collection, to offer suggestions like Amazon does. You can do it with fts in pg, but it is way too slow to be practical. Custom fts engines are much better at that task.
If search is your application's main purpose, you'll want something more flexible.
Otherwise, it's fairly easy to implement, and you get full SQL support (joins, transactions, etc.), so you can always prototype it and see what limitations you run into.
I mean for offline usage. Just using JSON basically means you'll have to reimplement all the database behavior for the offline client. This is where TouchDB comes in handy, which is the reason I'm using CouchDB - you can basically run most of the server on the mobile device.
I still think that performance aside, MongoDB is more interesting for developers using Node.js because the development time is shorter and the API more intuitive.
It depends on what your data looks like. If you've got a collection of JSON documents that gets served verbatim through a JSON API, Mongo is a natural fit. If you've got a domain model with different types of entities and links between them, Mongo is awful.
The app I'm working on right now evolved from the former to the latter, prompting me to switch from Mongo to Postgres, and it's made the code base much, much simpler. Mongo gets really painful when you have to fetch several documents (serially, because you have to follow the links between them) and join them at the application level before rendering the output.
For that kind of application, SQL with joins and sub-selects is so much better.
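e.g. (a sketch, with an invented schema) what took several serial fetches plus an application-level join in Mongo becomes a single query:

  SELECT p.title, c.body, u.name
    FROM posts    p
    JOIN comments c ON c.post_id = p.id
    JOIN users    u ON u.id      = c.author_id
   WHERE p.id = 42;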
The nice thing here is that it's easy enough to write a mongo compatibility layer on top of Postgres (hell, it probably already exists...), allowing you the just-get-to-it ease of mongo combined with the substantially superior engineering of Postgres.
For those interested, the online documentation for the development version of PostgreSQL has been rebuilt and contains documentation about the new feature:
Yes, you can extract values from JSON, filter based on those values, and even create indices - not sure if that counts as "Mongo-style queries" but I've used it and it works pretty well.
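For instance, something along these lines (table and key names invented) - extract a value, filter on it, and index the extracted expression:

  SELECT doc->>'user' AS username
    FROM docs
   WHERE doc->>'isValid' = 'true';

  CREATE INDEX docs_user_idx ON docs ((doc->>'user'));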
The only thing to note about current JSON support in PostgreSQL is that, as far as I know, there is no way to update parts of a JSON document - you have to rewrite the entire document each time. I've not found this to be a big headache YMMV.
One important difference is that you don't need to extract the values to create an index. E.g. in Mongo I believe you still have to decide which keys to index. A GIN index on top of jsonb in PostgreSQL will make it possible to do an indexed search on any key, not just those that are specified at create index time.
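A sketch of what that looks like (names invented) - one GIN index on the whole column, and containment queries on any key can use it:

  CREATE INDEX docs_doc_idx ON docs USING gin (doc);

  SELECT * FROM docs WHERE doc @> '{"user": "test"}';
  SELECT * FROM docs WHERE doc @> '{"isValid": true}';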
Isn't this the case for postgresql data in general? I.e., on updates, new data rows are written containing all the field values, whether or not they have changed.
It's more the case that when you want to reach into an existing stored document and change a single value, you have to re-write the entire document from your application.
On the reading side there are functions and operators that allow you to reach into stored JSON and extract the parts you want. What would be nice would be to be able to do something similar for updates - although this is clearly more complex than reading, so I can see why it has been done this way.
Edit: I guess the most general solution would be to directly support something like JSON Patch:
Yes, but you don't have to specify all the other columns that aren't being updated.
With json/jsonb, you have to provide the entire object graph each time you are updating it. You can't update one field in the graph. Which could be a pain if you have concurrent updates.
Hopefully we'll be able to update parts of the jsonb columns sometime.
You're right, because MVCC works on immutable rows. The structure can be broken up to accommodate large values[1], which saves updating parts that don't change, but json data is a single column and updates need to copy the entire document.
I should have been more clear. What I meant was that in-place partial updates are not going to happen any time soon. This is mostly a performance issue: updating a value inside a JSON document will never be lightning fast, because you need to copy the whole row and clean up the old version after the transaction has committed. JSON Patch would avoid transferring the JSON document back and forth over the network, but the local copy will still need to happen.
True. I think reducing network overhead, removing one SQL call (currently need SELECT + UPDATE), and simplifying application code is a good start and probably fast enough for most use cases.
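i.e. the round trip today looks roughly like this (a sketch; table/column names invented):

  BEGIN;
  SELECT doc FROM docs WHERE id = 42 FOR UPDATE;
  -- ...modify the document in application code...
  UPDATE docs SET doc = $1 WHERE id = 42;   -- write the whole document back
  COMMIT;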
Do you have a reference for that - I had a quick look through the documentation for this new feature and it looks mainly like a better way of storing and accessing JSON data - not sure it helps with updates.
[NB Not being snarky... genuinely curious about this!]
It provides a Mongo-like API directly on Postgres, using normal tables/columns.
I've used it on some projects in the past and it's fairly straightforward, and it makes it easy to port an app from Mongo->Postgres.
Not all of the equivalent operators exist, though those are only slightly interesting. Perhaps most interesting would be all the index types: right now GIN works, but GiST didn't quite make it. GIN was definitely showing the better performance on JSONB, but there are still valuable cases for GiST, which often appears smaller on disk.
I only hope it won't be too long before various "drivers" and frameworks start supporting it, otherwise the acceptance rate will be too low. And Jsonb deserves to be accepted as fast as possible.
I'd suggest opening bugs on drivers (that's the most important one) right now since the feature is officially in the development version. If it doesn't already work that is.
That's how it was done for 9.3's expanded diagnostics[0], and as a result both Python's psycopg2[1] and Ruby's ruby-pg[2] (and maybe others) added extended diagnostics support before 9.3 was even released[3], working from day 1.
Considering Postgres is the first relational database to support jsonb, portability (which is the main benefit of drivers) doesn't really apply. Any decent driver or framework should let you write raw SQL, so you should be able to start using this as soon as it's released.
Being able to use Jsonb through a framework and being able to fully utilize it (through it) are not the same thing. I will definitely be using it soon, but unless it's fully supported in frameworks (...) it won't be widely accepted.
There are ORMs like SQLAlchemy (Python) and Sequel (Ruby) which try to expose as much as possible of the features of all supported databases. It is common for Sequel to add support for new PostgreSQL features while the new PostgreSQL version is still in beta testing.
What's involved in supporting JSONB in Rails? Given that you probably want access to the entire JSON blob at the Rails level, aren't you good to go already?
Sure, but it's also nice to have a layer of abstraction over raw SQL. I personally really enjoy working with Django's ORM, for example, over writing native queries.
BSON actually was considered as the JSON storage format for PostgreSQL, but was discarded once people figured out that BSON stores ["a", "b", "c"] as {0: "a", 1: "b", 2: "c"} which is just silly.
Wow, thanks for that info. I came here to see the difference and am a little shocked by the silliness of BSON (and the team that created it, apparently Mongo) compared to the quality and devotion of the postgres community.
jsonb is the name of a data type used in postgres to store, index and update JSON. BSON is the equivalent used by MongoDB, and it has also been made into a semi-official standard.
Can anyone comment on the storage overhead of JSONB key names? One of the worst things about MongoDB's BSON format is that 8M instances of the following record:
{
"user": "test",
"isValid": true
}
will at a minimum cost 8M * (4 + 7) bytes (~88 MB) simply due to field name overhead. MongoDB isn't smart enough to alias common field names, and this has remained one of its biggest problems.
If only there were a way to store data where each record has the same structure...
The JSON support in postgres allows a hybrid approach. Put the common data in a tabular format, with a JSON column that stores all the extra, variable data.
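Something like this (a sketch, names invented): the stable, frequently queried fields become real columns and the long tail of variable data goes into a jsonb column.

  CREATE TABLE users (
      id        bigserial PRIMARY KEY,
      username  text     NOT NULL,
      is_valid  boolean  NOT NULL,
      extra     jsonb               -- everything else, schema-free
  );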
Internal pseudo-schematized data storage and JSON support aren't mutually exclusive. It's nice to have a checkbox feature like this, but the implementation isn't particularly exciting for large datasets.
PostgreSQL jsonb has the same problem in the actual document storage, with repeated keys. I believe it does better in the index, when you use the correct operator, but the table storage has the same type of overhead.
Document stores trade off some performance and robustness to increase ease of development. When that becomes a problem you can switch to more structured tables. One possibility is to structure frequently accessed fields and stuff the remainder into a json column.
With a power law, only a few keys occur frequently.
Changing the data representation implicitly would cause unpredictable performance as columns get rewritten.
If you want to do more explicit storage optimisations, maybe one of the mongo-like overlays could accommodate that.
It's more than that - virtually all large datasets I've worked with (> 20B documents) had a keyspace of under 128 keys/columns per nested level of a JSON document, meaning an optimizing DB could express most keys as a single byte (or less with packing). I don't think column reordering would be necessary unless you exhausted the first 128 keys and wanted to optimize the base-128 keyspace assignment. Such an operation could be done during compaction or any other operation that rewrites a table.
Is there a trend of making verbs reflexive among developers/managers? I've seen "product X has released", "patch Y has applied", and a lot of similar phrases which I'm still having trouble parsing. When I saw the title I thought "has committed what?"
As others have pointed out, this seems to be due to the writer not learning English as their first language.
I love this, and it reminds me of Steven Pinker's "The Language Instinct" where he mentions some stats about the most common dialects of English world-wide are not actually in "native" English speaking countries, but in regions in south-east Asia, as well as India and, increasingly, China.
It's highly likely that, in two or three generations, all of the English speaking population of the world will be "reflexifying" their verbs... Please excuse my mangling of my native tongue :-)
Native English is the new Latin. Born French, I'm good in English, but I realized I still had something to learn when I noticed I couldn't understand anyone in that country [1], although they could understand each other and were speaking with the same actual rules.
I've also noticed how very common it has become to find questions in the format "How to do [task]?" rather than "How do I do [task]?"
I strongly suspect that this has arisen from Google searching, where people naturally write their queries to complete the sentence "I would like some information about ________."
Setting aside speculations of a second-language speaker, it's common to hear things like "landed" from manager types.
I think it's a trend of increasingly depersonalized corporate-speak.
Sam added a feature. [Sam did or achieved something and has gained some partial responsibility for the end product]
We added a feature. [Sam is lost in the collective, but at least he is part of us]
A feature was added. [name omitted and agency de-emphasized, but the omission is a bit pointed; one still could wonder who added it]
The feature landed. [not even an implied agent here]
This progression removes more and more of the ownership and achievement of the paid worker-bees until they aren't even there. Then, features are simply landing left and right out of the sky, according to the Jobs-like "vision" of designers, PMs, executives.
Edit: it's like "damnatio memoriae" for the people who are actually laying the brick.
Does the JSON type still have a set of use cases as well or is it just obsolete now? If I was writing large quantities of json documents to a table, and I didn't need tag indexing or targeted updating, would using the JSON type be faster than JSONB?
Probably the most important difference is that the old json type preserves whitespace and key ordering, whereas the new one doesn't. If you don't care about those, the old one is also slightly smaller on disk - but I would say it's not worth it, based on the simple fact that you might want to search it some time in the future...
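Easy to see by casting the same literal both ways (the jsonb output shown is indicative: keys reordered, duplicates collapsed, whitespace normalized):

  SELECT '{"b": 1, "a": 2, "a": 3}'::json;   -- comes back exactly as typed
  SELECT '{"b": 1, "a": 2, "a": 3}'::jsonb;  -- comes back as something like {"a": 3, "b": 1}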
Before 9.4, PostgreSQL's support for storing data in JSON format was nothing but a 'text' datatype with an additional built-in syntax check: you couldn't query it or update partial/selected nodes inside the JSON doc. But with the new jsonb (binary JSON) format, you can do simple queries as well as update certain nodes (if they can be reached through the query).
Think of it like a very basic "MongoDB" implementation :)
Just nitpicking, but the json data type was added in 9.2, where it was not much more than syntax-checked text plus one or two functions to convert records and arrays into json.
In 9.3 we got many more useful functions that allow you to query into json documents, though if you wanted to select rows based on the content of a json document, you'd have to resort to functional indexes.
9.4 (which is what the above post is talking about) will provide index support for arbitrary queries and a much more efficient format to store JSON data which will not involve reparsing it whenever you use any of the above functions.
You don't need an index to select rows based on the contents of a json document in 9.3. You may want or need them depending on your queries. (I use them sparingly, the most common pattern for me is that an indexed non-json column filters a query down to a smaller set of rows, and then a further index on the attributes referenced in the query isn't necessary.)
And if you do want to create an index now on a JSON attribute in 9.3, a function isn't necessary.
Yes, you are right. At the moment there are no exposed functions for altering that binary data structure.
Because of the way Postgres works, though, the row will always have to be rewritten, even in the future (all updates to a row cause a new copy to be written - rows are immutable in Postgres). What we might gain in the future is a way to skip the parsing process, but the document will always have to be rewritten.
I don't follow Postgres too much and I saw the tweet by Peter van Hardenberg first. It's great to see the enthusiasm for the project.
It's likely that 9.5 or even 9.6 will see the real significance of this new feature. I don't know how solid the implementation is, but once it's in the wild, the missing features have been added to bring it to parity with MongoDB and others, and the bugs have been fixed, it will be a great NoSQL solution to use instead of MongoDB and the rest.
That is significant and worth celebrating. This will also give MongoDB and others another yardstick, pushing them to stay on top of their game.
tl;dr storing json in a way that doesn't mean repeatedly parsing it to make updates