PostgreSQL EXPLAIN Output Explained (cybertec-postgresql.com)
222 points by PhilipTrauner on May 28, 2021 | 28 comments



The article touched on some caveats but missed what I think is a big one: you really want to capture any detailed explains from an environment as close to production as possible. Different table statistics can cause the planner to go in wildly different directions, and while faster is always better, it's very easy to accidentally sink a lot of effort into making a query more performant when it was only running slow because of thrashing in RAM on a dev box.

EXPLAIN (with ANALYZE at least, which you should always use) is a lot less theoretical than you might assume. That can make it a bit more onerous to execute, but it adds a lot of value to the statistics when you gather them.
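
For anyone who hasn't used it, the invocation is just a prefix on the query (the table and filter here are made up for illustration):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM orders
    WHERE customer_id = 42
    ORDER BY created_at DESC
    LIMIT 20;

    -- ANALYZE actually executes the statement, so for writes wrap it
    -- in a transaction you can roll back:
    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS) DELETE FROM orders WHERE customer_id = 42;
    ROLLBACK;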

Oh also: query caching on Postgres is a thing, so if you're worried about performance from a cold state, don't forget to clear caches before executing. And if anyone has good suggestions for tools to screw up table statistics, I haven't found one that I like yet.


> Different table statistics can cause the planner to go in wildly different directions

Exactly. That's why my team and I (Postgres.ai) have developed Database Lab Engine [1] and a chatops tool for SQL optimization, Joe bot [2]. Both are open source (AGPLv3).

EXPLAIN (ANALYZE, BUFFERS) has to be executed on a same-size DB, with a properly adjusted Postgres configuration.

Interestingly, the machine you use for query plan troubleshooting can have less RAM and different hardware in general – it doesn't matter to the planner. Even shared_buffers doesn't matter – you can set effective_cache_size to match production (we use this trick in Database Lab when the hardware is weaker than production's).
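
If I understand the trick right, it's roughly this per session (the value is just an example; copy whatever production uses):

    -- effective_cache_size is only a planner hint, not an allocation,
    -- so it can be set higher than the RAM the test box actually has
    SET effective_cache_size = '96GB';   -- match the production setting
    -- then run EXPLAIN (ANALYZE, BUFFERS) on the query as usual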

As for cache state – a very good point as well. I advocate a buffers- or rows-centric approach: first, optimize to reduce the number of buffers touched (or rows, if you're working with a "logical" copy of the database from dump/restore rather than a "physical" PGDATA copy that keeps the same data layout, bloat included) – the fewer, the better. Only then pay attention to timing, keeping in mind what can happen under the worst conditions (everything read from disk), if that scenario is relevant.

[1] https://postgres.ai/products/how-it-works

[2] https://postgres.ai/products/joe


> tools to screw up table statistics

Perhaps you already know these, but just in case:

- https://github.com/ossc-db: pg_dbms_stats, pg_store_plans, pg_hint_plan

- https://github.com/HypoPG/hypopg
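
For the "what if I used this index?" case specifically, hypopg lets you test a hypothetical index without building it; a minimal sketch (the table and column are invented):

    CREATE EXTENSION IF NOT EXISTS hypopg;

    -- register a hypothetical index: no disk space, no build time
    SELECT * FROM hypopg_create_index(
        'CREATE INDEX ON orders (customer_id)');

    -- plain EXPLAIN (not ANALYZE) will show the hypothetical index
    -- in the plan if the planner would choose it
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

    -- drop all hypothetical indexes for this session
    SELECT hypopg_reset();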


To get production EXPLAIN plans for problematic queries, you can activate auto_explain on a Postgres instance. For my transactional system I have set it up to log plans for all queries that take more than 2000 ms.
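
The quickest way to try it is per session; production setups usually load it via shared_preload_libraries in postgresql.conf instead (the 2000 ms threshold mirrors the setup above):

    LOAD 'auto_explain';                            -- needs superuser
    SET auto_explain.log_min_duration = '2000ms';   -- log plans for queries slower than 2 s
    SET auto_explain.log_analyze = on;              -- include actual rows and timings
    SET auto_explain.log_buffers = on;              -- include buffer usage (requires log_analyze)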


auto_explain is a pretty great tool to spread knowledge of, yeah. I've actually built out a lot of functionality around our DB handle where I work, and one of the features I added was a software configuration to establish a threshold that could also be influenced by other runtime variables. We've used this to track specific classes of queries over time and figure out what's going wrong. It can also be advantageous (if you know a query sometimes runs long) to capture explains of it executing quickly - sometimes you'll get really helpful information, like the query planner changing its mind when passing a threshold of so many rows, and then you know clearly what you want the planner to decide to do.

If you're a small enough shop to consider it, I highly recommend setting up something to automatically explain queries meeting some criteria on production, or using an analysis stack (like New Relic) to just capture all the query executions within certain time windows.

These tools all come with costs and should never just run continuously on production if you're getting no benefit from them, but the value can be quite significant.


Great extension, yes. There is overhead when enabling the timing and buffers options, but sometimes it's not big [1].

But auto_explain solves only part of the task – you can see what happened, but cannot see the answers to "what if" questions. ("What if I used this index?")

[1] https://www.pgmustard.com/blog/auto-explain-overhead-with-ti...


It's good to auto-explain, but I would also add that going in and running EXPLAIN (ANALYZE, BUFFERS) manually is really beneficial for seeing how heavily the query uses the buffers and how many pages it has to load from disk.
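
To illustrate, the part to look at is the Buffers line attached to each node (the numbers here are invented):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 42;

    -- in the output, each node gets a line like:
    --   Buffers: shared hit=120 read=4380
    -- "hit" is pages found in shared_buffers; "read" is pages pulled
    -- from the OS cache or disk. A big "read" count on a supposedly
    -- hot query means a cold cache or far more data touched than expected.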


I would say you need it on the production environment.

The exact same configuration is not enough. You want the shared buffers and disk cache to look the same as they do on production, and you also want the same common queries running in the background.

I mean "need" in the case of a busy database, where you're at a high level of optimization and small details matter. You can catch the more obvious stuff with much less care.


Do you know how to clear the cache inside a running Postgres instance? All of the articles online say to just restart the DB, but that isn't feasible in some cases I've come across, such as when testing against a remote DB spun up with more prod-like data. Like you said, query perf against a cold cache vs. something that has had a lot of rows loaded into shared buffers can be quite different!


Great writeup. I use EXPLAIN a lot in development as a gut check — “does this descending index do what I thought it would? how expensive is that subquery?” Highly recommend looking at it early on; it helps me catch silly mistakes well before production.
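
That gut check can be as small as this (the schema is hypothetical):

    CREATE INDEX idx_events_created_at_desc ON events (created_at DESC);

    EXPLAIN
    SELECT * FROM events ORDER BY created_at DESC LIMIT 10;
    -- hoping to see an Index Scan on idx_events_created_at_desc with
    -- no separate Sort node; a Seq Scan followed by a Sort means the
    -- index isn't doing what I thought it would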


I use an awesome service called PgMustard [0] for parsing and debugging slow queries. It has saved me a lot of time, and has helped me resolve some pretty big (and complicated) bottlenecks.

[0]: https://pgmustard.com


Thanks for the shout-out. I'm half the team behind pgMustard; happy to answer questions here if anyone has any.


Warning: it's paid and requires sign-up with GitHub/Google even for a test.


The site mentioned in the article also has a series that goes more in depth on how to read and understand EXPLAIN output:

https://www.depesz.com/tag/unexplainable/


An alternative to https://explain.dalibo.com/ is https://tatiyants.com/pev

Both have pros and cons about how they visualize things.


Came here to suggest the same. If you hit the gear icon on the left and choose view: compact and graph: cost, you can quickly get a decent overview of the hot spots in complicated queries.


PEV is great; it helped me figure out and fix specific query issues that only happened in production.


I appreciate that the first image in TFA is supposed to just be funny, but it would actually be useful to have output like that. Some of those analyses are tougher than others to code, but a subset of them are not entirely out of the realm of possibility.


I hope we’re not truly a consultant’s nightmare, but we’ve got quite a few of these covered in pgMustard (15+ tip types) and are working to add more.


I've found Postgres EXPLAIN output completely unhelpful for as long as I've been using Postgres, and... this article didn't help.

> Find the lowest node where the estimated row count is significantly different from the actual row count.

> ...

> Under the heading “rows x” you see by what factor PostgreSQL overestimated or underestimated the row count. Bad estimates are highlighted with a red background.

Am I missing something? Everything actually shown displays identical estimated and actual row counts, and the red/yellow/orange highlighting is associated with accurate estimates. What am I not seeing??


You’re quite right, the example given doesn’t have bad row estimates, and other cells are highlighted in red/orange/yellow for different reasons (proportion of time taken, in the case shown).

For an intro to this that goes through several examples, I highly recommend a conference talk[1] by Josh Berkus in 2015/16 that he gave a few times. It has aged pretty well and I’ve not yet seen the basics covered better.

[1]: https://youtu.be/mCwwFAl1pBU


Thank you for validating that I’m not crazy for not seeing the things described in the article in its examples! It’s become kind of a running non-joke of “I don’t think I should be feeling impostor syndrome but I keep being scolded to read the EXPLAIN, it keeps being mystery meat every time I try, and I keep having better outcomes applying what I’ve learned every other way”.

If I have time this weekend I’ll check out the video.


The numbers can be useful, but I mainly pay attention to the steps the query planner takes. Is it doing an expensive loop over data? Is it using an index? Those can often be more illuminating than the numbers – you get a feel over time for what operations are actually expensive.
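
As a sketch of what that looks like in practice (the tables are invented; the node names are what I scan the plan for):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT o.id, c.name
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.status = 'pending';

    -- rough reading of the node types:
    --   Seq Scan on orders       - whole-table scan; fine for tiny tables,
    --                              a red flag with a selective filter on a big one
    --   Index Scan / Index Only Scan - an index is actually being used
    --   Nested Loop              - cheap for a handful of outer rows,
    --                              painful when the outer side returns thousands
    --   Hash Join / Merge Join   - usually what you want for large joins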


I’m glad it’s helped you. I’m saying that I never got that feel over time, and I was hoping for something illuminating in the article, but it describes things that aren’t actually in the examples it provides... or is there something I’m not seeing? It does me no good to be told “look for where the planner and execution are different” when they’re the same in the example, or “you’ll see this tool highlight those differences in red” when there’s no difference I can find. I’m open to the possibility that I’m missing something, but I’ve stared at these and similar EXPLAIN results and tooling analyses for endless hours and can’t see what I’m supposed to learn from them.


One thing I learned about EXPLAIN this week is that it doesn't show constraint checks. I was trying to delete about 40k rows from a table and it was taking hours, and I couldn't figure out why. EXPLAIN ANALYZE showed nothing indicating that it was reading any tables other than the FROM and the USING table.

The table I was deleting from had 20 foreign key constraints referencing it, and a couple of them didn't have an index on the referencing column and were big (a few million rows). Added indexes to all of them, took a couple of minutes to build, and the DELETE ran in a few seconds.

Sometimes the answer to a performance issue can't be found in EXPLAIN. And always remember to properly index your foreign key constraints.
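
In case it saves someone else the same hours, the fix boiled down to this pattern (names invented):

    -- the referencing side of a foreign key is NOT indexed automatically;
    -- without an index, every deleted row forces a scan of the
    -- referencing table to enforce the constraint
    CREATE INDEX CONCURRENTLY idx_order_items_order_id
        ON order_items (order_id);

    -- after indexing each referencing column, the DELETE went from
    -- hours to seconds
    DELETE FROM orders WHERE created_at < now() - interval '2 years';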


EXPLAIN ANALYZE would have shown you the referential integrity (RI) triggers taking most of the time, but it's still a bit of a leap to work out that it's due to missing foreign key indexes if you don't already know.
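
For anyone hitting this, the giveaway is at the very bottom of the EXPLAIN ANALYZE output, after the plan tree (the constraint name and numbers here are made up):

    EXPLAIN ANALYZE
    DELETE FROM orders WHERE created_at < now() - interval '2 years';

    -- below the plan nodes you get per-trigger timings, e.g.:
    --   Trigger for constraint order_items_order_id_fkey: time=1843210.512 calls=40000
    -- a huge time= value there points at the unindexed referencing column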


Probably one of the best resources to understand indexes and the output of `EXPLAIN ANALYZE` would be https://use-the-index-luke.com/


Wish I had a tool that could suggest things to do like the “cartoon” at the top of the article.



