Hacker News new | past | comments | ask | show | jobs | submit login
A Hairy PostgreSQL Incident (ardentperf.com)
155 points by tanelpoder on Feb 11, 2022 | hide | past | favorite | 41 comments



To avoid the issue of queries with IN() polluting pg stat statements, one can in most of the case replace

WHERE a IN()

With

WHERE a = ANY()

No matter the number of options in the ANY operator, it'll hash to the same query id. Semantics between ANY and IN are very slightly different, but 95% of the case it'll get the job done and provide better stats (by having all queries analyzed together).


> Semantics between ANY and IN are very slightly different, but 95% of the case it'll get the job done and provide better stats

And it’ll be more reliable, because `any` will be happy to take an empty array while `IN` will be very cross if given an empty tuple.


> `any` will be happy

> `in` will be very cross

I like this. People should anthropomorphize their operators more.


This is awesomely well written. There's a certain self-congratulatory style that a lot of more senior developers sometimes find themselves falling into, that I really feel like this avoids; it's wonderfully engaging.

Unfortunately I still don't find myself understanding it very well... And I'm a professional backend engineer. I just spend all day writing Python; I keep the database hits of my code to the minimum that I can, and if something is broken, I figure out how to fix it, and I call it a day. I don't know very much about SQL. Where should I start?


I would suggest that learning SQL (the syntax) isn't the crux--it's just syntax you learn over time based off need, just like learning python--what you need is motivation. I'd suggest watching this video to get a flavor of how much leverage using the database can give you: https://www.youtube.com/watch?v=wTPGW1PNy_Y If the database can make you more productive and more efficient and make your life easier, you now have incentive to use it more.

Another thing that took my use of databases to the next level, was using PostGraphile (https://www.graphile.org/postgraphile/). It takes a PostgreSQL database and provides a GraphQL endpoint for it. Use that as your only backend. You can use PostgreSQL row-level security to do access control (who can see or modify what). You can use PL/pgSQL to write business logic for higher level calculations. You can use constraints and triggers to do data validation. Each of these areas has an initially steep learning curve, and wonky syntax, but once you get over them you can implement many backends extremely quickly with them. And using PostGraphile where you only have SQL available (well, you can also add javascript extensions, but try to limit those) will be a great forcing function that will require you to scour stackoverflow and the excellent PostgreSQL docs to learn many of the useful corners of SQL you'd never learn otherwise.


Thank you for writing this. This is one of the most helpful comments I've read in my 7 years on HN.

A lot of this high-level engineering stuff feels so opaque. You outlined a clear heuristic: find a motivation to learn the tool, add artificial limits if necessary for your learning goals, then scout relevant resources to help you get unstuck. It seems simple when written out, but I always find myself trying to overload on uncontextualized information first, then feel confused about why I still don't know how to "make a neural net" or what have you.


Interesting, didn't know about PostGraphile. Having recently learned about EdgeDB [1] launch here in HN [2], seems the latter is going to each the former's lunch.

[1] https://www.edgedb.com/

[2] https://news.ycombinator.com/item?id=30290225


I had to use the CTE trick few times to solve same nasty query plan in Postgres. While it worked well for me I would still prefer query hints. They are explicit, so the intent is obvious: “Dear future developer/me mind the execution plan”. A CTE might get removed by mistake or not be reevaluated if it’s still needed after Postgres version/statistics/index changes. Also as mentioned if you are still before pg14 there is this issue that you will have to materialize your CTE to keep the hack working.

I remember that the most irritating brownout I got due to a query plan change was when a 50 millisecond query started to use a seq scan, but on a table under 10 million rows, so it wasn’t obviously hanging like in the article, but took like 2-3s. It was just enough to fly under radar for our monitoring, but to make some important process much slower.


If you explicitly add 'materialized' to the CTE declaration -- that is:

  with dummy_table_name as materialized
    (select stuff from table ...)
the intent becomes explicit. (FWIW, while the nonstandard "materialized" keyword here doesn't have the grammatical form of hints in other DBs, I've still described it in my org as "the only hint Postgres supports" because the closest equivalent in, say, Oracle, is a hint -- the apparently undocumented /+ materialize /)


Heh, it’s probably been over a decade since I regularly had to do this sort of thing, but literally my first thought on seeing that query was “Hmmm, I’d try rewriting that to force a different execution plan” and lo and behold, that was the answer. It’s kind of comforting that not all my experience is obsolete!


There's only one PostgreSQL incident story, retold countless times. There's a bad query plan, we know what the right plan is, the tools to detect the problem are poor, and PostgreSQL won't let us force a fixed plan, so we have to do some unholy hack to nudge it towards the right plan. They even relied upon another pretty substandard aspect of PostgreSQL: the fact that CTEs are optimization barriers. For god's sake, it's 2022.

On SQL Server, we see it building the bitmap in the graphical query plan basically right away. We can see it in Query Store, and we can see it in sp_whoisactive. The actual query plan being used in production is listed. Then, a couple minutes later, you've deployed a query hint and the bitmap is gone. This is not a hypothetical scenario I am imagining; I have done exactly this, for exactly this problem with the query plan trying to build a bitmap on a huge table. This is a minor blip for SQL Server users and not a disaster that turns into a blog post.

At what point does PostgreSQL change their minds about officially supporting query hints in the base system? This isn't the first time I've read basically this exact blog post and then written this exact comment on HN.


This. I have years of experience on Sql Server and just recent months on Postgresql, but I found the latter lacks simple and easy troubleshooting tooling. Bad query plan issues in Sql Server are not really an issue, something like the problem of this post would have take minutes to discover and fix using query store, that you can turn on and off when needed (I usually left it on, space is capped). Also if there were multiple plans computed and cached you could have forced sql server to use a specific one until a proper fix was deployed.


Actually there's two ;-) The other one of course being the notorious txid wraparound!


Yup I’ve had to debug an issue exactly like this. An IN statement that when populated with <7 parameters would happily use an index but when populated with >=7 parameters would decide to do a full table scan. I was quite dismayed to discover Postgres doesn’t support index hints like MySQL does when I came to fix it.


Does MySQL run into these types of issues? ... I've yet to run into this, but I believe MySQL supports query hints as well.


Anekdote:

Recently fixed an issue for performance ( MS SQL ) by just using "With INDEX index-name".

Very small PR actually. I couldn't write an blog post about it actually.


I mentioned it in an earlier thread, but why can't we just select from indexes directly as if they were tables or views?

I mean sure, most of the time the DB is clever. But sometimes the developer knows best, why not provide the tools?

    select i.OrderNumber, i.OrderDate, o.OrderShipped
    from ix_Orders_OrderDate i
    join Orders o on o.OrderNumber = i.OrderNumber
    where i.OrderDate > '2022-02-01'


Index names aren't global; they are table-scoped. So you have to specify the table name. If you squint your eyes, there's not much difference between typing "ix_Orders_OrderDate i" and typing "Orders i WITH (INDEX(ix_Orders_OrderDate))". SQL Server lets you do this.


No, index named are global in PostgreSQL and the SQL standard.


The ANSI SQL standard doesn't talk about indexes at all. I'm sure you're right about PostgreSQL; I'm no expert on that implementation. I took his post to be asking about SQL Server since that's what we were talking about in the subthread (hence my SQL Server-specific response). In SQL Server indexes are table-scoped.


I hit this recently too, where a complex query was taking ~300ms, and I decided to fold some sub queries into a CTE, and it promptly (pun intended) went down to ~30ms.

Plus point, it made my query easier to understand too!


Is there a way to force a particular plan instead of letting the db decide? It's unfortunate the author needed to use a hack to trick the db into using the correct plan.


Other databases have optimizer hints for that, but apparently there’s resistance to adding hints into Postgres [1]. I find hints very useful (in Oracle world) for quickly working around some optimizer glitch (and figure out the long term solution later).

There actually is a “pg_hint_plan” [2] Postgres module that adds some hints to Postgres optimizer (but it’s not part of Postgres core as I understand):

[1] https://wiki.postgresql.org/wiki/OptimizerHintsDiscussion

[2] https://github.com/ossc-db/pg_hint_plan


I always use the rule of thumb that if you need optimizer hints for your query, its time to fix the query or the indexes. Of course in the teeth of an outage that generally isn't an option.

The tools on SQL server for query optimization are better then postgres, but that doesn't always mean problems like the OP don't still occur. I have had more than one SQL server query plan go to hell over time as the tables grew and the optimizer started to make odd decisions.

Edit: to make this a little clearer, I regard the SQL server query optimizer system as "lawful evil". It follows its own rules relentlessly.


Yep, when the app code and the schema/partitioning/index design works _with_ the database as intended, not _against_ the database engine’s intended use, a whole class of SQL performance and plan stability problems tends to go away.

Yep, agreed, hints should be temporary workarounds (or just experimentation aid when developing/testing SQL performance to validate some hypothesis). But many of these temporary workarounds end up being permanent temporary workarounds :-) That’s why Oracle now has hints and parameters to make the optimizer ignore all other hints in the SQL code :-)


Working out which hints are operating must be difficult.


For a cost-based-optimizer to work, the cost tuneables must be good. These are very difficult to get right in practice.


How would you fix the query or indexes here? The CTE seems clearly a hack, effectively functioning as a hint. It doesn’t seem like it counts as fixing the query.

I’m skeptical of this rule of thumb. I’m sure people misuse hints. But if someone thinks they “need” hints, my guess is they’ve tried changing the query and indexes.


Use named prepared statements (the ones where you have to manually DEALLOCATE to free up resources on the server). It is to my understanding that a named prepared statement runs five times using both a custom query plan and a generic query plan, and from the sixth execution onwards it picks the faster of the two and sticks to it for the rest of the prepared statement's lifetime (the query planner never gets called again, for better or worse).

[Thread: SELECT slows down on sixth execution](https://postgrespro.com/list/thread-id/2070689)

If I'm wrong on this, I would desperately like someone more informed on this to correct me because it seems to me that you can force postgres to stick to a query plan, thus obviating the need for query hints


Reading up on this (https://www.postgresql.org/docs/current/sql-prepare.html), it seems that there are a number of exceptions where the statement will be re-planned:

- DDL modifications to the used objects

- updated planner statistics for the used objects

- modification to search path

I would expect a major Postgres version upgrade to hit these conditions, so the OP problem would still occur. Nevertheless, thanks for the pointer - I had mostly forgotten about this possibility.


I'd like to correct myself on this point, it seems that if Postgres picks custom_plan over generic_plan it will always re-plan (that is what custom_plan is, it takes the prepared statement parameters into account while generic_plan always uses the same plan regardless of the parameters). So it's a toss up whether Postgres will actually lock in the static query plan over the dynamic query plan.


Goodness, I absolutely love hearing war stories like these. They do so much more in your imagination than some dry architecture description. And they show up the usual bleating of DevOps types.

At Amazon they do a portion of the onboarding training like this, describing how things went south, often for trivial reasons, what happened and how it was fixed. More of this on HN please!


This reminds me of my course project from a database class. ui-optimizer-postgres [1]. It's more of learning tool. It used gtk to allow you to modify the plan in real time. This was 6 years ago, so my memory is a fuzzy.

Write up: https://github.com/josephmate/ui-optimizer-postgres/blob/REL...

[1] https://github.com/josephmate/ui-optimizer-postgres/tree/REL...

PS: Definitely don't use this in production because the DB blocks, waiting for someone to confirm the plan ;)


I've hit this before, sometimes without an upgrade, where the postgres planner does something unreasonable and since I'm not a DB expert, it's hard to debug.

The solution was to confine the planner's ability to do bad things. CTEs (like the article) are one way. Another is to pull work up into the application layer (for instance, keep the IN filter in the query but overfetch and filter the < piece in the app... Not always feasible but has saved me in the past).


So postgres still has this bug? This is an unsatisfying ending.


A bit funny, one never reads about such bizarre problems with commercial databases. DB2, SQLServer, Oracle...


All the commercial enterprise database performance consultants are busy spending the money they make from fixing these critical apps and expensive problems ;-)

Edit: Can’t speak much for MSSQL and DB2, but there are lots of similar problems in the Oracle world too, sometimes due to an app design that works against the intended use of the DB/optimizer, sometimes due to a database optimizer bug and sometimes just due to the complexity of estimating the optimal plan based on limited statistical summaries of the “shape of your data” and expectation that the optimization won’t take longer than a few milliseconds.


I regularly fight bad execution plan problems like that on Oracle (more often than not after minor upgrades). And yes you can read about problems like that looking at the documented incidents on the Oracle support site.

And no, hints are not always the solution. Sometimes they even are the culprit.

If I had a wish for Postgres, I'd rather have something like SQL profiles (or similar) where you can register an alternative plan with an existing query, not hints. Very often a full deployment in order to change a single SQL isn't an option.


Selection bias. You don't read many stories about them in general. I can't remember the last time DB2 hit hn front page for any reason. Search shows most of them are 7+ years old and the most recent is "Db2, where are you? Big Blue is oddly reluctant to discuss recent enhancements to its flagship database"


There are enough bizarre problems in Oracle, including random ORA-600 errors, but at least you can nail down the execution plan you want.


some day I'll share my war stories with oracle and rac. definitely on par with this story.

that said, I'd still much rather spend my days inside Postgres than inside oracle.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: