PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data (postgresql.org)
660 points by ahochhaus on Jan 7, 2016 | 165 comments



Whenever there is a version update, I can't help but be grateful for the documentation ethic of Postgres. For the vast majority of my projects I have to wade through unaffiliated and incomplete blog tutorials that may or may not be relevant to the version I'm trying to use. With anything related to Postgres, I may read about something on a blog post, but I always know that I can count on the Postgres documentation if I need supplemental information, or sometimes I'll skip the post and go straight to the official docs. The PostgreSQL project, in my mind, sets the standard globally for software documentation.

I should add that Postgres was the first database I ever used, and I literally learned pretty much everything I know about Postgres, SQL, as well as Relational and Set Logic from the official docs. And that was with no background in software development and an undergraduate business degree with Excel being my most technologically advanced toolset. That is a documentation success story.


Remarkably similar to my own history with Postgresql, which I started using in ~1998 at the time of their first public release. Postgresql documentation has indeed been the exemplar for all software, open source or not. It's been the SQL textbook I've relied on.

With the steady addition of features, it's gotten much more complex, and there will come a time when using just the documentation won't be enough to learn how to use Postgresql to full advantage. With the release of 9.5 we might be there now.

Perhaps the logical extension of the documentation is some form of coursework to enable users to learn the DB systematically. I haven't looked into it, this might already be offered.


(self plug) for coursework on the read-only SQL side, you might want to give http://pgexercises.com a try. I have ambitions to expand it to include arrays and json, but alas haven't found the time so far...


This is awesome, thanks for making it!


Cheers, hope it's useful :-)


I call that BSD culture


Speaking of great documentation, I've been blown away by the quality of documentation for sequel - a Ruby ORM that has some more advanced features (than ActiveRecord) for working with Postgres:

- http://sequel.jeremyevans.net/documentation.html

- http://sequel.jeremyevans.net/rdoc/files/doc/postgresql_rdoc...

- http://github.com/jeremyevans/sequel


An interesting point about pgsql is that the team wrote the documentation first and coded it afterwards. When the code was finished, they didn't need to write the docs.


> PostgreSQL project, in my mind, sets the standard globally for software documentation.

Sure, if you already understand PostgreSQL. If you don't, then it isn't particularly user-friendly. It's awkward to navigate, doesn't explain the basics (like how to actually install it), and tends to put information aimed at advanced users on the same page as beginner material.

Compare your standard with mine (MongoDB):

http://www.postgresql.org/docs/9.5/static/

https://docs.mongodb.org/manual/


> ...doesn't explain the basics (how to actually install it)...

Er, http://www.postgresql.org/download/ and/or http://www.postgresql.org/docs/current/static/installation.h... ?

> ...tends to combine information that is more for advanced users on the same page as beginners.

There's a tutorial for beginners: http://www.postgresql.org/docs/9.4/static/tutorial.html

Then there's reference material in the rest of the manual. The tutorial even suggests the order in which you should read the rest of the manual. :)


The documentation doesn't explain how to install binaries and the download pages don't either. Everything defers responsibility to whatever package manager you're using, e.g. "PostgreSQL can also be installed on Mac OS X using Homebrew. Please see the Homebrew documentation for information on how to install packages". And the documentation only explains installing from source code, which I can't imagine most people are doing.

The point is that every other database explains precisely the recommended approach for how to install it on every platform.


> The documentation doesn't explain how to install binaries and the download pages don't either.

Hmm.

"The graphical installer for PostgreSQL includes the PostgreSQL server, pgAdmin III; a graphical tool for managing and developing your databases, and StackBuilder; a package manager that can be used to download and install additional PostgreSQL applications and drivers.

The installer is designed to be as straightforward as possible and the fastest way to get up and running with PostgreSQL on Windows." http://www.postgresql.org/download/windows/

"Binary packages for Solaris can be downloaded from the solaris subdirectory of the version you require from our file browser.

Packages for Solaris 10 and 11 are available for Sparc and i386 platforms.

Although produced by Oracle (previously Sun), these packages are not officially supported by them.

Solaris packages are installed by unpacking the compressed tar files directly into the install directory; see the README files for details." http://www.postgresql.org/download/solaris/

"These distributions all include PostgreSQL by default. To install PostgreSQL from these repositories, use the yum command:

yum install postgresql-server

Which version of PostgreSQL you get will depend on the version of the distribution: ..." http://www.postgresql.org/download/linux/redhat/

"Debian includes PostgreSQL by default. To install PostgreSQL on Debian, use the apt-get (or other apt-driving) command: apt-get install postgresql-9.4

The repository contains many different packages including third party addons. The most common and important packages are (substitute the version number as required): ..." http://www.postgresql.org/download/linux/debian/

"Ubuntu includes PostgreSQL by default. To install PostgreSQL on Ubuntu, use the apt-get (or other apt-driving) command: apt-get install postgresql-9.4

The repository contains many different packages including third party addons. The most common and important packages are (substitute the version number as required):" http://www.postgresql.org/download/linux/ubuntu/

"RPMs for SUSE Linux and openSUSE are available from the openSUSE Build Service in the project server:database:postgresql. Platform-specific RPM packages are available for PostgreSQL as well as a variety of related software. Use the search facility to find suitable packages. Documentation is also available there." http://www.postgresql.org/download/linux/suse/

"Note! These are the generic Linux download instructions. If you are using one of the major Linux distributions, you should read the distribution specific instructions: ... PostgreSQL is available integrated with the package management on most Linux platforms. When available, this is the recommended way to install PostgreSQL, since it provides proper integration with the operating system, including automatic patching and other management functionality.

Should packages not be available for your distribution, or there are issues with your package manager, there are graphical installers available.

Finally, most Linux systems make it easy to build from source. ... [Also, i]nstallers are available for 32 and 64-bit Linux distributions and include PostgreSQL, pgAdmin and the StackBuilder utility for installation of additional packages.

Download the installer from EnterpriseDB for all supported versions." http://www.postgresql.org/download/linux/

> The point is that every other database explains precisely the recommended approach for how to install it on every platform.

It really looks like Postgresql does that, too. Do you disagree?


The first Mac OS X package is the Graphical Installer. [0]

It looks like it takes you to a link that lets you download a .dmg file. I assume those are basically one-click installers for Mac OS X programs?

[0] http://www.postgresql.org/download/macosx/


Even easier is the Postgres.app for Mac. It's a Mac application that wraps the Postgres server. Nothing to compile, just download and start the app.


Yes, it is a one-click installer for OS X.

However, the EnterpriseDB installer for OS X is basically broken. The good news is that a user who needs step-by-step instructions to download and install will never find that out. On the other hand, a user who does find it out (because he is unable to build any pg extension with their pgxs, for example) does not need instructions at this level.


They are. I don't think it's in any way more arcane to install than, say, MySQL. Not to say the latter has stellar documentation, but those complaints are picky if not invalid.


With a REST layer like https://github.com/begriffs/postgrest you don't need any extra application server layer to serve your data, securely (RLS). Bye bye Java EE ?


You just got an accidental downvote as I came back after bookmarking your link. Sorry.

Dang/mods: Can we please please get something to avoid accidentally downvoting good stuff on mobile phones?


Happens all the time and has been the most requested feature of all time. I like to take the charitable view and say the current design is there to let your ego explain away down votes - i.e. I am only being down voted by accident, not my post is terrible.


That feature is included with every new account!


It's especially bad on mobile, but desktop can be a problem too. I know I've misclicked a downvote on rare occasion, to my regret.


I gotta plug OpenResty at this point. It's based on Nginx, and it's really easy to work with. It is my go-to REST solution for PostgreSQL.


It's not going to be as easy as you make it sound - getting rid of the application tier. Think of all the goodness that the Spring framework provides, for example. Do you anticipate all the modules will be taken over by a database?


Postgrest has a very clever design. I want to see other tools going this route.


It would be kinda cool to find a way to stuff a fast HTTP server into Postgres and run it all directly from the Postgres process serving the request.


There is an nginx module for this.


People seem to dislike this because they want separate components in order to 'scale out', but honestly, having the option for most trivial applications would be extremely nice. Even more so when you consider Postgres supports routines written in non-sql programming languages.



I looked at Postgrest, and it was a little more opinionated re URLs and associations than I would have preferred.


I'm not that familiar with REST. Care to explain what you mean?


REST just means exposing your business objects via HTTP verbs. In addition to the normal Get and Post HTTP verbs used by your browser, there's also Delete and Put. Put is effectively equivalent to an Upsert command, so with Get, Put, and Delete you've got all the operations to manage an object (create/read/update/delete), with Post for special procedures.

Those also obviously correspond to the Insert/update/select/delete SQL operations.

So hooking a RESTful interface direct to your db lets JavaScript on the client talk directly to the database via an http server with no business layer in between. It's obviously a nifty workflow for the "JavaScript all the things" developer.

That said, I think restfulness is a bit of a fad tied to the popularity of javascript-everything. But it has created a limited, simple standard for http services, so that's nice.


Thanks but I was referring to what the person meant by saying that something was not like he would have preferred.


Postgrest deviates from some REST URL conventions that REST users are familiar with. As an example, one very common convention gives each item a distinct path with the 'id' (the primary key of some table, typically) as the last path element: /some_path/123. Notice that the 'key' column name is implied and simple (not compound).

Postgrest does not adopt this convention. This decision is not arbitrary, however. Postgrest is conceptually a function that computes a REST API for a PostgreSQL database, and tables in a PostgreSQL database can (for better or worse) have compound primary keys, which do not map well to a simple URL path. So Postgrest uses a more general URL syntax. The equivalent of the above is: /some_path?id=eq.123.

Postgrest represents some novel thinking about the use of certain HTTP verbs as well. A lot of this is discussed in the Postgrest introduction video here: http://postgrest.com/


> Postgrest deviates from some REST URL conventions that REST users are familiar with.

Those are common API conventions, but they have nothing to do with REST; REST only requires unique URIs for resources, it has nothing to say about the format of those URIs. The Postgrest version does not deviate from REST, it deviates from a popular API pattern, that's all.


I like the idea of REST as a language-agnostic datasource; adhering to conventions facilitates this.


If you're actually doing REST, the URI convention is irrelevant since you won't be constructing URIs to begin with, but following the URIs given to you by previous entry points. Actual REST requires you to follow links, not construct them. So having well-defined URI conventions like the above actually harms the REST API by encouraging clients to construct links manually rather than use the links given to them, and it hurts the API by not allowing the server to change the URI structure over time, because clients become bound to a particular format. Those conventions are harmful under the principles of REST as they encourage tight coupling and prevent change. URIs are not supposed to matter for a reason; REST is supposed to be hypermedia, i.e. follow links, don't build them.


REST specifically rejects the need for URL conventions because of HATEOAS. If URL conventions matter -- i.e., if resource identifiers are communicated out-of-band rather than client and server being decoupled by way of resource identifiers being communicated in-band through hypermedia -- what you are doing is not REST, but instead exactly the problem for which REST is the solution.


noob here .. if the database is in charge of serving rest, where does the business logic go? am i right in thinking that this would eliminate the need for a django/express?


The database stores the data, the rest API handles the business logic. Django has multiple plugins for providing a REST API, such as Tastypie. http://tastypieapi.org/ With this, django is the API. A rest API is good for when you want to share data between multiple clients (eg a website and mobile apps) and want to have more complicated business logic than the database can provide by itself.


Really nice to see the direction things are moving in... I do feel that the replication/failover story needs a lot of work, but there's been some progress towards getting it in the box. Even digging into any kind of HA strategy is cumbersome to say the least (short of a 5-6 figure support contract). It's one of the things that generally stops me from considering PostgreSQL for a lot of projects. As a side note, I really like RethinkDB's administrative interface and how it handles failover. It would be great to see something similar reach an integration point for PostgreSQL.

I also think that PLv8 should probably make it in the box in the next release or two. With the addition of JSON and Binary JSON data options, having a procedural interface that leverages JS in the box would be a huge win IMHO. Though I know some would be adamantly opposed to this idea.


> I do feel that the replication/failover story needs a lot of work, but there's been some progress towards getting it in the box. Even digging into any kind of HA strategy is cumbersome to say the least (short of a 5-6 figure support contract)

Eeeh...a simple HA solution can be developed in about a week (I was able to do so on 9.3, and so far, it held its ground). Also, now with 9.5's pg_rewind you can easily switch back and forth between nodes (http://www.postgresql.org/docs/9.5/static/app-pgrewind.html), simplifying things a great deal. Can't imagine that's 5-6 figures.

I agree that you don't get a Plug&Play-Solution out of the box, but from anecdotal evidence they often don't quite work as advertised anyway (remember 1995? And I'm sure your friendly DBA has some stories to share as well).


> Even digging into any kind of HA strategy is cumbersome to say the least (short of a 5-6 figure support contract). It's one of the things that generally stops me from considering PostgreSQL for a lot of projects.

If you're going to be hosting your db on something like AWS EC2 anyways, then just buy a db product like AWS RDS, and pay for the HA option. Ends up around the same price as if you'd set up everything yourself (assuming you were going to host on AWS anyways, and not going with a low cost option), and is very easy.


Ya I'd love to see PLV8 (with as many ES6 features as possible) as a stock language


There are plenty of nice minor improvements in the release notes. One of my favorites is "Allow array_agg() and ARRAY() to take arrays as inputs (Ali Akbar, Tom Lane)", it will come in handy when writing ad hoc queries to understand stored data. Right now I have to build a string which I use as input to array_agg().
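
A minimal sketch of the new behaviour (illustrative values only; the input arrays need matching dimensions):

  -- 9.5: array_agg() accepts array inputs and builds a multidimensional array
  SELECT array_agg(a)
  FROM (VALUES (ARRAY[1,2]), (ARRAY[3,4])) AS t(a);
  -- => {{1,2},{3,4}}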


I'm so pleased to see that I was not the only freak out here doing that!


I'm very excited about CUBE and ROLLUP -- I am just about to start a project that barely requires an OLAP shim. Now it looks like I can just do it all with just database features. Yay for fewer dependencies!


I'm not familiar with those statements, can you provide an example?


They're useful for providing summaries like row totals and column totals. Rather than just get an aggregated count for the total GROUP, you can also get aggregated counts for each unique combination of columns within the GROUP.

  col1 col2 count  
  ---- ---- -----  
  a    b    10  
  a    null 5  
  null b    5  
There's more to it than that obviously, but you can read about them here: http://www.postgresql.org/docs/devel/static/queries-table-ex... (7.2.4. GROUPING SETS, CUBE, and ROLLUP)
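
A minimal sketch of the ROLLUP form (hypothetical sales table; CUBE and GROUPING SETS follow the same pattern):

  -- per-(region, product) counts, plus per-region subtotals and a grand total
  SELECT region, product, count(*)
  FROM sales
  GROUP BY ROLLUP (region, product);

  -- GROUP BY CUBE (region, product) would additionally add per-product subtotals,
  -- i.e. every combination of the grouped columns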


I might get some hate, but I also think upsert was one of the best features that MongoDB offered that PG didn't, so this is a big win from that perspective too.


Which makes sense, because the atomicity of the upsert is really the tricky part.


Not at all. UPSERT is the feature everybody's excited about!


Regarding Row Security, yes you can use it with a web application and still use connection pooling.

From the web server, connect to the db as one user, then SET ROLE to the database user. This gives you Column Security and easier auditing as well. See http://stackoverflow.com/questions/2998597/switch-role-after...
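
Roughly like this (role names are made up; the pool role needs to be granted membership in the target roles):

  -- the pool connects once as a low-privilege login role, e.g. webapp_pool
  SET ROLE alice;           -- per request: policies now see current_user = 'alice'
  SELECT * FROM invoices;
  RESET ROLE;               -- hand the connection back to the pool cleanly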


The thing is that each application user now needs a corresponding DB user to use RLS. While this isn't a huge problem, it's different than how most (if not 99%?) of applications work.


No, RLS does not necessarily require separate database users. Using database users is one relatively obvious way to use the feature, but you can very well do something like 'SELECT myapp_set_current_user(...)' or something, and use a variable securely set therein for the row restrictions.
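
A minimal sketch of that approach, assuming a hypothetical setting name and schema (nothing here is built into Postgres):

  -- the application sets a session variable at the start of each request,
  -- e.g. via a SECURITY DEFINER helper, or directly:
  SET myapp.current_user = '42';

  -- the policy keys off the variable instead of a database role
  ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
  CREATE POLICY doc_owner ON documents
    USING (owner_id = current_setting('myapp.current_user')::int);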


Interesting. I didn't think about doing that. 9.2 makes that much easier to do http://dba.stackexchange.com/questions/97095/set-session-cus...


Yep. We do this where I work (the 9.4 equivalent using SECURITY BARRIER views) and it is extremely useful.


It's probably unsuitable for public facing sites, but for line of business applications it could be a real win. I would much rather have enforcement down to the data level.


It probably is suitable for public-facing sites, just not globally.

Little bobby tables doesn't need the ability to dump your users or credit cards, even if he does need access to all the blog posts.

I have to read up on the feature a bit more, but it sounds like a potentially massive win if you build the support into an ORM / framework.


Yes, but it's not harder to have many db users vs many app users. There's even an extension to sync pg users with ldap


I'm not saying it's _hard_, it's just _different_ and would be a challenging migration for established apps as I don't know of any framework's authentication system that works like that.

Also, why would you sync pg users in ldap? pg can auth against ldap.


> Also, why would you sync pg users in ldap? pg can auth against ldap.

I figured out why: To pull roles/groups into the database from ldap.


Great that Heroku sponsored upsert and have support for 9.5 right now (in beta): https://blog.heroku.com/archives/2016/1/7/postgres-95-now-av...?


Interesting, I wasn't aware PostgreSQL didn't have UPSERT until now. MySQL has INSERT on DUPLICATE functionality that is similar.
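
For anyone comparing the two, the syntax looks roughly like this (table and columns hypothetical):

  -- MySQL
  INSERT INTO counters (name, hits) VALUES ('home', 1)
  ON DUPLICATE KEY UPDATE hits = hits + 1;

  -- PostgreSQL 9.5
  INSERT INTO counters (name, hits) VALUES ('home', 1)
  ON CONFLICT (name) DO UPDATE SET hits = counters.hits + EXCLUDED.hits;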


Yep. Been a long requested feature. One of the reasons why the post states: "This feature also removes the last significant barrier to migrating legacy MySQL applications to PostgreSQL."


Roger that.


What do they mean by "legacy" in this context? INSERT ... ODKU is not a legacy feature of MySQL, it's a currently supported first-class feature of the database, nor is MySQL itself a "legacy" database.


If you are switching from MySQL to Postgres, then it's legacy by definition rather than intrinsic properties of MySQL.


Legacy applications, not legacy MySQL feature.


I think they're referring to mysql. The sentence would have worked just as well without the word "legacy". I say this as someone who prefers PostgreSQL.


No, they are referring to an application that is being moved from MySQL to PostgreSQL. The application is "legacy" in that it is an older version and the new version is "current".

Due to the English language however there can be some debate as to what they meant. In this case "legacy" is most likely meant to describe the "MySQL application" not "MySQL" itself.


You're wrong about which meaning is more likely.

A meaning that adds something to a sentence is a more likely meaning than one that adds nothing to a sentence. If you take "legacy" to mean "being migrated from" then the sentence becomes

This feature also removes the last significant barrier to migrating being-migrated-from MySQL applications to PostgreSQL.

It's more likely that if "being migrated from" was the intended meaning, they would have simply left the word out.


Your comment assumes the author of the release notes has perfect command of the English language and they thought through in detail what the word "legacy" would mean in this context.


Not at all.

First, my comment says "more likely" so it isn't assuming anything.

Second, if we change "more likely" to "definitely", the assumption is merely that the sentence in question is written with the same command of the English language as the rest of the announcement, i.e. no egregiously redundant words.


From Google: "denoting software or hardware that has been superseded but is difficult to replace because of its wide use"

Maybe not a perfect word but very easy to understand in the context.


That definition does not help the argument that "legacy application" is a more likely meaning than "legacy mysql". In the sentence in question, the application hasn't been migrated yet, so clearly it has not been superseded.


Replace MySQL with Django. Do you assume Django itself legacy, or the Django based application? I parse it correctly as "a legacy application that uses MySQL".


That's a good question, but it's hard for me to objectively say how I'd immediately parse that, due to the close examination I've given that sentence.

You definitely have a point, though. How the phrase "legacy X applications" is interpreted likely depends on whether the context is an announcement by a competitor to X.


Ironically, the first two Google results for "UPSERT" are from wiki.postgresql.org.


Shows how long it's been in the works, and requested.


Can anyone suggest a good "getting started" tutorial for PostgreSQL/debian/php? I've been using Mysql for years and would like to give Postgres a try.


http://www.postgresql.org/docs/9.5/static/tutorial-start.htm...

What particularly are you looking for in a "getting started" tutorial? Honestly, you should just plunge in, on some side project (or a mirror of whatever projects you've used MySQL on) and just compare.

This is a lot easier to say than do/live by, but I think you shouldn't invest in one tool choice when you haven't given the others a fair shake (once you have enough time to step back and think about your decision).


Install it, and then build something.

A few things you will want to look at that are different:

1. data types are much richer and more useful than in mysql

2. transactional DDL means migrations are atomic.

3. schemas are what mysql refers to as databases. Remember to set `search_path` (see the quick sketch after this list).

4. roles and grants are somewhat more expressive and work differently than in mysql, but not that differently for the simpler use cases

5. database functions ( aka stored procedures ) are awesome as are extension languages.
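
On point 3, setting the search path looks like this (schema name is just an example):

  CREATE SCHEMA app;
  SET search_path TO app, public;                   -- unqualified names resolve to app first
  ALTER ROLE myuser SET search_path TO app, public; -- or persist it per role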


On point 5... love PLv8, which imho makes working with the newer JSON data types really nice.


Yay! I like the changes.

But they are doing absolutely nothing about my biggest beef with PostgreSQL. Which is that there is absolutely no way to lock in good query plans. It always reserves the right to switch plans on you, and sometimes gives much, much, much worse ones. No other database does this to me. Even MySQL's stupid optimizer can be reliably channeled into specific query plans with the right use of temporary tables and indexes.

This is a problem because improvements don't matter if the query plan is "good enough". But they will care if you screw up. PostgreSQL usually does well, but sometimes screws up spectacularly.

The example that I have been struggling with most often in the last few months is a logging table that I create summaries from. Normally I only query minutes to hours, but I set it up as a series of SQL statements, so I first put the range in a table and then filter with `happened BETWEEN range_start AND range_end`. PostgreSQL really, Really, REALLY wants to decide that the index on the timestamp is a bad idea, and wants to instead do a full table scan. Every time it does, summarization goes from under a second to taking hours.

Hopefully the new BRIN indexes will be understood by the optimizer in a way that makes it happier to use the index. But I'm not optimistic. And if I lean on it harder, I'm sure from past experience that I'll find something else that breaks.


There is some discussion on why the Postgres team dislikes query hints here: https://wiki.postgresql.org/wiki/OptimizerHintsDiscussion

But perhaps I also have some practical advice to try.

I had a similar issue: I have a few tables with sensor data, 300-500 million rows, indexed among other things by event type. Some counting queries kept defaulting to full table scans. It turned out that this was because of limited statistics on the distribution of counts by event type.

The default_statistics_target config parameter sets how many entries Postgres keeps in the histogram of possible values per column, the default is 100 I think. Because my event types were not evenly distributed, the less frequent ones were missing from the statistics histogram altogether, and somehow this resulted in bad query plans.

As a fix, I upped the default_statistics_target to 1000, and set it to 5000 for the biggest tables. Then after a vacuum analyze, the query planner started making sensible choices.

Another thing to try is perhaps reducing the random_page_cost config parameter from its default of 4.0. On SSDs, random page costs are much closer to 1 than they are to 4 (compared to long sequential reads).
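
For reference, the knobs mentioned above look roughly like this (table/column names hypothetical, values illustrative):

  -- raise the histogram size globally (postgresql.conf, or per session before ANALYZE)
  SET default_statistics_target = 1000;

  -- or only for the skewed column on the big table
  ALTER TABLE sensor_events ALTER COLUMN event_type SET STATISTICS 5000;
  ANALYZE sensor_events;

  -- assume cheaper random I/O on SSDs
  SET random_page_cost = 1.5;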


This is all good optimization advice, but I don't think it is applicable to my specific case.

My problem is not that PostgreSQL does not understand the distribution of my data. It does. The problem is that it comes up with a query plan without realizing that I'm only querying for a very small range of timestamps.

If this happens again, I'll have to try rewriting code to send it queries with hard-coded timestamps, cross fingers and pray. I find prayer quite essential with PostgreSQL sometimes because as ineffective as it is, at times I've got nothing else.


> My problem is not that PostgreSQL does not understand the distribution of my data. It does. The problem is that it comes up with a query plan without realizing that I'm only querying for a very small range of timestamps.

Which version did you reproduce that on? While the problem has not been generally addressed, the specifically bad case of looking up values at the "growing" end of a monotonically increasing data range has been improved a bit over the years (9.0 and then some incremental improvement in 9.0.4).


PostgreSQL 9.4.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit

If it matters, it is an Amazon RDS instance.


Have you talked about this in the dev mailing list? I'm sure they would help and likely consider adding something to improve the next version.


I think this is a pretty important area to work on. Not in the direction of query hints, but rather have "approved" query plans, and an interface to see new query plans, and how much their costs differ.

That'd not only make production scenarios more reliable, but it'd also make it much easier to test tweaks to the cost model in practice.

The problem is that that's a rather significant project, and it's hard to get funding for that kind of work. The postgres companies aren't that big, and it's not an all that sexy feature marketing wise.


It may not look like a sexy marketing feature. But it is the #1 reason why I don't recommend PostgreSQL.

I also don't think that query hints are a good way to do it. And I don't mind if the way to do it is somewhat cumbersome. This is very much a case where 20% of the work can give 99% of the benefit.

For example what about the following approach?

1. Add an option to EXPLAIN that will cause PostgreSQL's optimizer to spit out multiple plans it considered, with costs, and with a description of the plan that PostgreSQL can easily parse and fit to a query.

2. Add a PLAN command that can be applied to a prepared statement and will set its plan. It is an error to submit a plan that does not match the query.

And now in the rare case where I don't like a query's plan I can:

    EXPLAIN PLANS=3 (query);
Pick my desired plan from the list (hopefully)

Then in my code I:

    PREPARE foo AS (query);
    PLAN foo (selected plan);
    EXECUTE foo;
And now if I notice that a query performs worse than I think it should, I can make it do what I want it to.


> 1. Add an option to EXPLAIN that will cause PostgreSQL's optimizer to spit out multiple plans it considered, with costs, and with a description of the plan that PostgreSQL can easily parse and fit to a query.

The biggest problem with that approach is that the way query planning works isn't that a 100 different plans are fully built, cost evaluated, and then compared. Instead it's more like a dynamic programming approach where you iteratively build pieces of a query plan from ground up, and then combine those pieces to build the layer one up. Given the space of possible query plans, especially with several relations and indexes on each relation involved, such an approach is required to actually ever finish planning.

> Add a PLAN command that can be applied to a prepared statement and will set its plan. It is an error to submit a plan that does not match the query.

It's not easy (if possible at all in the generic case!) to prove that a specific plan matches a query. You could obviously try to build every possible plan and match against each of those, but that's computationally infeasible (we're talking a factorial number of plans, depending on the relations here).

So I think such an approach has no chance of working.

What's more realistic is running queries in a "training" mode. That training mode would, matching on the specific parsetree, store the resulting plans in a table. Before exiting training mode you'd mark all these plans approved (after looking for bad cases, obviously). After that, preparing a new query still does the original query planning, but by default the plan stored in the "approved plans" table would be used. The cost differential and the new plan would then be associated with the currently approved plan. Regularly the DBA (or whoever fulfills that role) checks the potential plans and approves new ones.

Based on a configuration option queries without approved plans would error out, raise a log message, or just work.

Now even that has significant problems because e.g. DDL will have the tendency to "invalidate" all the approved plans. But that's manageable in comparison to being woken up Friday night.


I don't see how my objections are impossible.

On EXPLAIN, if you've passed my example PLANS=3 it would first do the plan in its usual way, and then try optimizing several more times, with some chance of randomly making suboptimal decisions at various decision points. It would keep doing this until it either had enough plans or else was making essentially random decisions and still couldn't find more.

I can see this requiring a significant refactor of existing code, but stochastically exploring "pretty good" plans doesn't require facing the whole tree of possible plans.

The DBA hopefully can recognize the desired plan if it turns up.

As for the PLAN command, I do not see the problem. I can look at the query and the EXPLAIN output, and I can figure out exactly how and where that query's conditions are being incorporated. You might need extra output from EXPLAIN to make it always possible to do in an automated way, but it should be possible.

Put another way, you propose that the database has a way to look at the query and a plan stored in the table and figure out how to execute that plan for that query. What would that table contain that couldn't be represented as a chunk of text supplied with a PLAN command?


> On EXPLAIN, if you've passed my example PLANS=3 it would first do the plan in its usual way, and then try optimizing several more times, with some chance of randomly making suboptimal decisions at various decision points. It would keep doing this until it either had enough plans or else was making essentially random decisions and still couldn't find more.

You'd not get anything useful by doing that. If you really consider this as a dynamic programming problem, at which place in the 'pyramid' of steps would you choose the worse plan? To quote the source:

        /*
	 * We employ a simple "dynamic programming" algorithm: we first find all
	 * ways to build joins of two jointree items, then all ways to build joins
	 * of three items (from two-item joins and single items), then four-item
	 * joins, and so on until we have considered all ways to join all the
	 * items into one rel.
	 *
If you'd just make some random 'bad' decisions, you'll not have a significant likelihood of finding actually useful good plans.

> As for the PLAN command, I do not see the problem. I can look at the query and the EXPLAIN output, and I can figure out exactly how and where that query's conditions are being incorporated. You might need extra output from EXPLAIN to make it always possible to do in an automated way, but it should be possible.

Good luck. Besides generating all possible plans postgres knows to generate, you very quickly essentially run into something equivalent to the halting problem.

> Put another way, you propose that the database has a way to look at the query and a plan stored in the table and figure out how to execute that plan for that query.

What I'm proposing is matching on the parse tree of the user supplied query. For each query there's exactly one parsetree the postgres parser will generate. We have a way (for the awesome pg_stat_statements module) of building a 'hash' of that parse tree, and thus can build a fairly efficient mapping of parsetrees to additional data. In contrast to that, for plans you can have a humongous number of plans for each user supplied query.

> What would that table contain that couldn't be represented as a chunk of text supplied with a PLAN command?

It would contain a less ambiguous version of the user-supplied query. Now you could argue that you could add that to the PLAN command for matching purposes - but then we'd need to guarantee that you could supply arbitrarily corrupt plan trees to postgres without being able to cause harm. Something that'd cause significant slowdown during execution.

EDIT: Formatting


You just use the approach used in simulated annealing. You set a threshold for making random decisions, and adjust that probability up and down as you try to optimize. For example you might say that in each step there is a 10% chance of adjusting the COST factor randomly by a factor between 0.2 and 5. If you've got 3 tables, most of the time you will come to the current optimal decision. Most of the rest of the time you will make one suboptimal choice. Run that a few times and you should get several plans that are suboptimal but not actually horrible. Keep running and playing with the randomness factors until you either have enough plans, or are basically making entirely random choices and can't get enough.

As to the algorithm described in the comment, it is actually able to search through all plans. The dynamic programming bit just means that you don't have to unnecessarily traverse all possible logic paths that a naive recursive algorithm might. But if there is a good plan, it can be discovered.

It is unclear from the comment whether the algorithm can only find plans which involve adding one table at a time. This would not be unreasonable - for example I know that Oracle circa a decade ago did that to avoid having to consider too many query plans on large joins. I have no idea whether that ever got changed.

As to the parse tree, that seems like overkill to me. You have a query. It has a list of tables, and a list of join conditions. A query plan includes tables, indexes, filters, and so on. It could easily be augmented by stating which query condition was applied where.

To validate it you have to see that the lists of tables match, the list of applied conditions match, and each plan step could actually have the effect of doing that condition. If it all ties out, it is a valid plan. The fact that the plan might be crap is something that is explicitly left in the hands of the user - that's the whole point of the feature. But at that point it is clearly valid.


I've always thought that there should be an option to allow the optimiser to try out different alternative plans during slack periods, in the background. It should then compare the performance of these alternatives to the original, then "change its mind" if it finds a faster plan. It should use statistics to choose which plans to test, i.e. prioritise queries which are often called and which are slow.


This sounds like the classic Postgres problems on large, insert-only tables. The default settings for statistics gathering just aren't tuned for tables like this. Now, normally the problem is not full table scans, but using extremely inefficient join strategies, but chances are it's the same problem.

The typical solution is to modify the autovacuum settings for that table to recalculate the statistics a lot more often, and maybe even with much higher resolution, depending on your case.
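
Per-table autovacuum settings look roughly like this (table name and thresholds are just illustrative):

  -- re-analyze after ~1% of rows change instead of the default 10%
  ALTER TABLE big_log_table SET (
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_analyze_threshold    = 1000
  );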

You can also convince it that indices are the way to go by changing more basic settings about costs of reading a random page on disk vs reading sequentially, making full table scans more expensive, but tuning those settings away from realistic costs might have negative side effects for you.

I was able to have great success running complex queries on 100M+ row tables that were insert-only using this kind of trick, but YMMV. If all else fails, really experienced people are more than willing to help on the performance mailing list. They sure helped me quite a few times.


The first thing that I tried was a full VACUUM ANALYZE and then re-ran the same query. It didn't help. Therefore modifying autovacuum won't help.

Adjusting internal costs is promising, but I'd like to avoid going there exactly because of the possible negative side effects that you mention.


You can adjust the costs on a per-session (or even per-transaction) basis. Depending on the nature of your query, it might be worth it.
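
Something like this keeps the change scoped to a single transaction (value illustrative):

  BEGIN;
  SET LOCAL random_page_cost = 1.1;   -- reverts automatically at COMMIT/ROLLBACK
  -- run the problematic query here
  COMMIT;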


You can push it in favor of certain options by disabling the one you don't want, such as sequential scan.

SET enable_seqscan = OFF;

We use these options a lot on tables that result in odd query plans to get them doing the best option..


That's a random sledgehammer, but that is how I have been solving the problem. I've set enable_seqscan, enable_nestloop and enable_material to false and it is working at the moment. At first I only turned off enable_seqscan, but then I turned off the other two after we switched database hosts and the query went belly up.

What scares me is that this is unreliable, and according to the documentation, the optimizer is free to choose to ignore everything that I say whenever it wants. The fact that it already HAS done that to me does not provide me comfort.


Disabling sequential scans is generally reliable. It sets the planner cost of seq_scan excessively high, so the planner will only choose it if there's no other choice (at least, that's been my experience). Of course, if your query is complex enough to trigger GEQO that might not work out.


Congrats to everyone involved, indeed an amazing opensource project that displays true integrity and discipline.


Thanks


I am looking forward to Row Level Security, though cell-level security would be even more awesome to have. It's difficult to achieve in SQL; only Apache Accumulo in the NoSQL space has it. But once we have it, we can make sure no one has access to the SSN column, and we will be protected to one degree in data breaches.


Wouldn't you be able to do this in Postgres now ?

"RLS implements true per-row and per-column data access control"


Any practical use cases for row-level security?


Imagine you're working in a place with a rather large sales/marketing department. Now, sales is a pretty cutthroat job, and some people will do any cheat possible to get ahead of their co-workers. In a typical sales database, someone who manages to borrow a bit of SQL from a techie friend could potentially go in and get sales leads from co-workers, poaching their deals. With row level security, the sleazier co-worker isn't allowed to look at those rows, so he can't poach.
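
In RLS terms that could look something like this (table and column names invented):

  ALTER TABLE leads ENABLE ROW LEVEL SECURITY;

  CREATE POLICY own_leads ON leads
    USING (sales_rep = current_user);   -- each rep sees only their own rows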


Do you mean as another security layer? I guess the sales people don't normally have direct access to the databases and the software already restricts which data they can see.

Maybe you mean just in case they manage to bypass the software restrictions.


In some cases, people will defer to database-level security restrictions. It really depends on how much application logic is in the database. Some applications are designed with as much logic as possible in the DB, including each user being a db user with credentialed access. Others will treat the DB as dumb storage with all access through a programmatic API... with thin API shims over the DBMS, the db security is paramount.


Imagine how much simpler every single query becomes, when you don't have to do complex access checking as part of the criteria. You can just write a normal select/join! The database handles access checking for you.

Suddenly, it's not that scary to ask the intern to generate a rather complex report - he can't accidentally show anyone data they're not allowed to see.


Exactly. Defense in depth. Most people are going to play by the rules, but making it harder to break the rules is useful.



To be more specific, with Postgrest you're supposed to be able to use database-level permissions as your client-side permissions. (I personally haven't tried it so can't say how well it works)


What can you do with row-level security that you can't do by setting permissions on an updatable view?


Normal views aren't designed for security. There are a number of ways that they can "leak" information that is supposed to be hidden.

The reason is that the optimizer reorders operations. So, a tricky person can write the query in a way that, for example, throws a divide-by-zero error if someone's account balance is within a certain range, even if they don't have permission to see the balance. Then they can run a few queries to determine the exact balance.

RLS builds on top of something called a "security barrier view" which prevents certain kinds of optimizations that could cause this problem.

It also offers a nicer interface that's easier to manage.


I may be wrong in the level of separation that pgsql provides in such situations, but it appears that a materialized view offers another level of isolation that would make such leaks more difficult to handle.


For one, you don't need a view. And you'd probably need (in 9.4) insert, update, and delete triggers for anything beyond trivial row security.


HBase has also had cell level security for a couple of years now.

https://blogs.apache.org/hbase/entry/hbase_cell_security


Why would you have a SSN column if no one has access to it?


I meant no human user has access to it. But an application user that connects to a credit score engine, map-reduce job, or something similar has access.

I know applications can be compromised, but now you can share your DB freely with other teams to analyze or play around with.


http://www.postgresql.org/docs/9.5/static/brin-intro.html - if, like me, you were looking for what BRIN indexes are all about.


Out of curiosity, has anyone managed to find something more detailed about the guts?

It sounds like it's basically a BTree that stops branching at a certain threshold, but I'm almost certainly wrong.


No, that's not really it, although you could see it as a very degenerate form of a btree. Basically it's using clustering inherent to the data - say a mostly increasing timestamp, autoincrement id, model number ... - to build a coarse map of the contents. E.g. saying "pages from 0 to 16 have the date range 2011-11 to 2011-12" and "pages from 16 to 48 have the date range 2012-01-01 to 2012-01-13". With such range maps (where obviously several overlapping ranges can exist) you can build a small index over large amounts of data.

Obviously single row accesses in a fully cached workload are going to be faster if done via a btree rather than such range maps, even if there's perfect clustering. But the price for having such an index is much lower, allowing you to have many more indexes. Additionally it can even be more efficient to access via BRIN if you access more than one row, due to fewer pages needing to be touched.


There was a talk about index internals at pgconf.eu that also covered the new BRIN indexes.

Slides are here: http://hlinnaka.iki.fi/presentations/Index-internals-Vienna2...


Anybody know how quickly RDS upgrades to newer versions of Postgres?

I'm really, really keen to use 9.5 jsonb with its insert/update changes.


While you wait, check out http://stackoverflow.com/a/23500670/229006

My plan is to rely on these functions for now, and switch to the native implementations once 9.5 is production ready on RDS.


this is so cool... thanks!!


They have a policy not to say anything or make any promises. That said, using history as a guide, it seems to take them about 2.5 months after a major release to add support.


[deleted]


That isn't true, according to their own documentation:

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_U...


I swear... I will redesign and develop Postgres' site for free.


I'm curious, what do you think needs to be redesigned?


I wish I could say it's circa Web 2.0 but it's still stuck even farther back than that. I mean it works which is great but I loathe spending time on it because it's such a poor experience.


Send a message to pgsql-www


How does PostgreSQL's upsert compare to MS SQL's MERGE statement?

I want to look deeper into this but didn't have the time; from the little I read, it seems MS SQL's MERGE is more powerful.


The new PostgreSQL syntax is more convenient to use in the UPSERT use case, while the MERGE syntax is more convenient when doing complicated operations on many rows of data (for example when merging one table into another, with non-trivial merge logic).

The reason PostgreSQL went with this syntax is that the goal was to create a good UPSERT and getting the concurrency considerations right with MERGE is hard (I am not sure of the current status, but when MERGE was new in MS SQL it was unusable for UPSERT) and even when you have done that it would still be cumbersome to use for UPSERT.

EDIT: The huge difference is that PostgreSQL's UPSERT always requires a unique constraint (or PK) to work, while MERGE does not. PostgreSQL relies on the unique constraint to implement the UPSERT logic.
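
Concretely, the conflict target has to match a unique index or primary key (schema hypothetical):

  CREATE TABLE users (email text PRIMARY KEY, visits int DEFAULT 0);

  INSERT INTO users (email, visits) VALUES ('a@example.com', 1)
  ON CONFLICT (email)                        -- must correspond to a unique index / PK
  DO UPDATE SET visits = users.visits + 1;

  -- or just skip conflicting rows: ON CONFLICT (email) DO NOTHING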


I am not sure of the current status, but when MERGE was new in MS SQL it was unusable for UPSERT

I've used MERGE as an UPSERT using MATCHED/NOT MATCHED and SERIALIZABLE/HOLDLOCK since it was introduced in mssql 2008. It was one of the first features I upgraded my code to use, and it worked out of the box with no issues.


See this blog post for what I am talking about: https://www.mssqltips.com/sqlservertip/3074/use-caution-with...

If PostgreSQL had gone the same route as MS SQL I would have expected a similar set of bugs. I suspect all of this have been fixed by now, but I do not follow MS SQL.


Lots of databases have MERGE but it's different from the typical UPDATE OR INSERT logic in terms of use cases, table requirements and concurrency control.

Here's a great post from Postgres team showing why they didn't just implement merge themselves:

http://www.postgresql.org/message-id/CAM3SWZRP0c3g6+aJ=YYDGY...


I wonder how much time it will take to appear in the Ubuntu apt repo? Should it be already there (I don't see it yet)?

Edit: I meant apt.postgresql.org of course, not the official Ubuntu repo..


No idea, but the PostgreSQL community distributes official Debian and Ubuntu packages at apt.postgresql.org. They should already have 9.5, or if not have it very soon.


Every time there's a minor version update I have to remind myself the sequence of upgrade incantations. It's pretty simple, but here's a gist that might help anyone upgrading from 9.4 to 9.5 with Homebrew:

https://gist.github.com/chbrown/647a54dc3e1c2e8c7395


pglogical (http://2ndquadrant.com/en-us/resources/pglogical/) claims to allow cross version upgrades from 9.4 to 9.5 with minimal downtime, but the documentation seems fairly light-on.

Has anyone come across a guide to using it for upgrades?


Upsert is something I expect so commonly in modern databases nowadays. Happy to see it here with Postgres.


this is excellent news. I really enjoy using postgresql in one of my current projects. I look forward to using upsert and the new indexes


What an absolutely fantastic project. The recent releases have all been very exciting.


Bye bye mongodb.


Search in 5% of the time it would take to search a btree? Can anyone show that with actual data?


http://pythonsweetness.tumblr.com/post/119568339102/block-ra...

the very first example points to BRIN indexes resulting in a smaller index than btree but with a much longer search time... so I guess the 5% time figure was very use-case specific?


Unless I'm mistaken MySQL has had this for almost a decade with "ON DUPLICATE KEY UPDATE". I'm seeing a lot more about PSQL here and in the news. I've always found it to be unfriendly and slow. Why the new attention? Is there really something about PSQL that makes it better than MySQL these days? It used to be transactions, but InnoDB made that moot years ago.

We do over 20,000 queries per second on one of our production mysql DB's and I'm not sure I'd trust anything else with that: http://i.imgur.com/sLZzXhS.png

Just curious if I'm missing out on some new awesomeness that PostgreSQL has or if it's just marketing.


> Is there really something about PSQL that makes it better than MySQL these days?

In a word: correctness.

Yes, MySQL has an UPSERT implementation. Like so many things MySQL rushed out the door, it's also buggy and unpredictable. Did you know UPSERTing into a MySQL table with multiple unique indexes can result in duplicate records? Did you know MySQL's INSERT IGNORE will insert records that violate other NOT NULL constraints? [1]

I've used both MySQL and PostgreSQL for over a decade, and working around the many MANY misbehaviors and surprises in MySQL requires continuous dev effort. PostgreSQL on the other hand is correct, unsurprising, and just as performant these days.

MySQL is what happens when you build a database out of pure WAT [2].

[1] https://wiki.postgresql.org/wiki/UPSERT#MySQL.27s_INSERT_......

[2] https://www.destroyallsoftware.com/talks/wat


Nah. MySQL rocks. I've been using it since 1998 at Credit Suisse, 2000 at eToys.com where we used it to run the entire company from warehouse to web. I used it at the BBC in 2003 for a high traffic Radio 1 application and I've used it since then on my own companies with serious volume including a job search engine featured in NYTimes and Time Mag in 2005 and feedjit.com doing real-time traffic on over 700,000 sites. We use it for Wordfence now which is where the image link comes from I posted earlier with 20K TPS. All very high traffic with consequences if it screws up. I've never run into any of the issues you mention.

You say "upserting into a mysql table". Which storage engine? MyISAM? InnoDB? I find MySQL to be both reliable and incredible durable i.e. it handles yanking the power cord quite well. The performance also scales up linearly for InnoDB even for very high traffic and concurrency applications.

We use redis, memcached and other storage engines - by no means are we tied to mysql. But for what it does, it does it incredibly well.

I'm also completely open to using PostgreSQL and I was hoping someone could give me a compelling reason to switch to it or to use it.


I've worked at a company that was doing > 20,000 TPS (transactions/sec) on a single PG instance with no problem. That is baby-tier usage for a DB.

As far as why you should give up MySQL like I did four years ago: http://grimoire.ca/mysql/choose-something-else


> I find MySQL to be both reliable and incredible durable i.e. it handles yanking the power cord quite well.

Mmm. Assuming that your underlying disks don't lie, losing only the data that was in-flight to the WAL (or whatever is the equivalent in your DB of choice) is the absolute worst data loss you should see from a real SQL database in that situation. [0]

Think about the case where an error causes the DB software to crash... You really can't make good data robustness guarantees if an unexpected crash endangers more than just the data that's in-flight to the WAL.

I personally have had the disks that back a rather large (but relatively low-update-volume) Postgres DB drop out on multiple occasions and never lost any data at all. This shouldn't be unexpected behavior. :)

> You say "upserting into a mysql table". Which storage engine? MyISAM? InnoDB?

"In addition, an INSERT ... ON DUPLICATE KEY UPDATE statement against a table having more than one unique or primary key is also marked as unsafe. (Bug #11765650, Bug #58637)" [1]

Given that this is an unqualified assertion, straight from the MySQL docs, it seems safe to say that this is a weakness in MySQL's UPSERT-alike, rather than any limitation of a particular underlying storage engine. (Here are some bug reports that also make unqualified statements about the behavior at issue: [2][3])

> ...I was hoping someone could give me a compelling reason to switch to it or to use it.

shrug Both MySQL and Postgres are capable databases. Postgresql has better documentation, appears to have substantially better internal architecture, and -from what I remember of my early days with MySQL- has far fewer hidden sharp edges than MySQL.

I'm not here to tell you to change what DB you're currently using for your production stuff... I don't think anyone is. For my projects, I have had substantially better experiences with Postgres than MySQL.

EDIT: It occurred to me in the shower that I didn't make "loss of in-flight WAL data" sufficiently clear. Unless you explicitly ask for a mode where writes are acknowledged before the DB believes that they're safe on disk, [4] then the only "data loss" in the scenarios I described in the comment would be data that had been transmitted to the server, but not acknowledged as committed to the DB. So, correctly-written client programs would experience this "data loss" only as a transaction that failed to commit, maybe followed by DB unavailability. Sorry for the ambiguity. :(

[0] http://www.postgresql.org/docs/9.4/static/wal-reliability.ht...

[1] http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.h...

[2] http://bugs.mysql.com/bug.php?id=58637

[3] http://bugs.mysql.com/bug.php?id=72921

[4] I say "believes they're safe" because underlying storage can always lie, and there's not a goddamn thing you can reliably do about it if it does.


If all you care about is QPS, by all means, stick with MySQL.

People like myself use Postgres because it has a much richer feature set. See http://stackoverflow.com/a/5023936/270610 for some examples. Personally I find MySQL beyond frustrating due to its lack of… well almost all of those. Recursive CTEs in particular, but arrays and rich indexing are pretty core too.

Postgres's query optimizer is far more advanced too. MySQL doesn't even optimize across views, which discourages good coding practices.

The documentation is fantastic. Complete and well-written, covers the nuances of every command, expression, and type. MySQL's doesn't hold a candle to it.

Don't know what you mean about "unfriendly". Help is built into the command-line tool, and like I said, the documentation is fantastic. Maybe MySQL is a little more "hand-holdy", but I don't care for such things so I wouldn't know.


It's a long time since MySQL was faster than Postgres.

Back in the early 2000s, LAMP people on Slashdot were benchmarking MyISAM tables against Postgres 6.5/7.x's fully transactional engine. Unfortunately, the reputation for being slow stuck among developers.

Postgres particularly shines on multicore systems, thanks to some clever internal design choices. Having a sophisticated cost-based query planner also helps.

As for unfriendly: Care to amplify? In my work, I've found the opposite to be true.

For example, the very first thing you tend to encounter as a new developer is "how to create a user". For MySQL, it turns out that using GRANT to grant a permission creates a user, which is counterintuitive; GRANT also sets the password, and promotes the use of cleartext passwords. By comparison, Postgres has "createuser", as well as a full-featured set of ALTER USER commands. The difference between "mysql" and "mysqladmin" is also completely unclear.
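
For instance, the Postgres side of user management is just plain SQL (names hypothetical):

  CREATE ROLE appuser LOGIN PASSWORD 'secret';   -- or: createuser --pwprompt appuser
  ALTER ROLE appuser CONNECTION LIMIT 20;
  GRANT SELECT, INSERT ON mytable TO appuser;    -- grants are separate from user creation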

The almost complete lack of warts and legacy cruft in Postgres significantly removes the possibility of confusion, uncertainty and information overload. MySQL's manuals are littered with "if X is enabled then this behavior is different, and in versions since 5.7.3.5 this behavior has been changed slightly, and 5.7.3.6 has a bug that silently swallows errors", etc. MySQL's historical date and NULL handling alone is worth a chapter of any book.

Postgres also has a level of strictness above MySQL, which is in itself instructive. You know when you're doing something wrong. Postgres never accepts bad input. It always requires a strictly correct configuration setup.

Plus: Just type \h in psql. It has a complete reference of the entire SQL syntax.


> LAMP people on Slashdot were benchmarking MyISAM tables to Postgres' 6.5/7.x's fully transactional engine.

It wasn't just the /. crowd. Back in the 3.5 days the MySQL devs were doing that too, and writing long discourses on why transaction safety was a bunch of crap and a crutch for bad application developers.


Off the top of my head:

  - CTE's
  - Arrays/JSON type
  - partial indexes
  - transactional DDL
  - NOTIFY
  - Materialized views
  - Schemas
  - PostGIS
  - Row level security (which is new in PG)


I'm no expert but in my (limited) experience, there are some super handy datatypes and features that PostgreSQL supports that MySQL doesn't:

- Arrays, particularly with GIN indexes. This makes things like tagging fantastic in Postgres. Instead of putting your tags in another table, you throw them in an array and you can do all kinds of things like set intersection-like queries.

- JSON. Postgres can store data as JSON and index and query the JSON. This essentially gives you MongoDB type queries.

I'm sure there's more but these are my favourite Postgres features.
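
A tiny sketch of both, assuming a hypothetical posts table:

  CREATE TABLE posts (id serial PRIMARY KEY, tags text[], doc jsonb);
  CREATE INDEX posts_tags_gin ON posts USING gin (tags);
  CREATE INDEX posts_doc_gin  ON posts USING gin (doc);

  -- posts tagged with BOTH 'postgres' and 'release'
  SELECT id FROM posts WHERE tags @> ARRAY['postgres','release'];

  -- posts whose JSON document contains {"status": "published"}
  SELECT id FROM posts WHERE doc @> '{"status": "published"}';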


Mysql always seemed to be fast like a bike going downhill with no brakes.

Postgres has always taken a more 'solid' approach. One instance that made my jaw drop when I realized it: in the past (has this been fixed?), DDL (alter table, create table, etc...) were not transactional in Mysql. You could get 50% through a series of them, and find your database 100% fucked up.

That said, over the years Mysql has been improving too, for sure.


MySQL still does not have transactional DDL.


Neither does Oracle!


Some of it is historic - Mysql has gotten much better in recent years at supporting the parts of being an RDBMS that matters the most when money is riding on it.

So honestly, at this point I do think some of it is impressions from the past which are no longer valid. But still, Mysql has done/does all sorts of things that defy the spec, convention or just common sense (I don't know if this has been fixed, but at least for many years, April 30th was treated as a valid date, and there was some profound weirdness of which I can't quite recall the details involving locale stuff).

Postgres generally takes the position that data should always be safe first and speedy sometime later. It also assumes the operator understands their tools. That second one means in comparison with Mysql, people think it is unfriendly. It isn't (if you want to see unfriendly, go work with Oracle), it just expects that its friends learn about it. Which is of course good advice when you're dealing with complicated software on which a lot of money tends to ride.

And as the PG devs have said for years, they don't compete with Mysql. They compete with Oracle. There's no reason to switch if you're happy with Mysql.


What's wrong with April 30th? And operator understanding their tools applies to MySQL as well. You can set it be more strict, but default is lax. Not sure if reverse is possible with PG.


PP probably intends to refer to February 30th (or 31st or 32nd), a popular handle to MySQL's (ahem) surprising date handling.



