Fun with SQL: Window Functions in Postgres

btilly · on June 4, 2018

Window functions are actually part of the SQL 2003 standard.

Here is the best way that I've found to think about them. Sometimes you run across a SQL problem that makes you think, "This would be easy if I just sorted my table, and ran a program over it breaking it into partitions then processing each partition in order." Whenever you do that, window functions will solve that problem.

Note that in practice I've found that queries using window functions often are best done by something of a form: SELECT ... FROM (SELECT ...) foo WHERE ... GROUP BY ... where the window functions are all in the inner query, and the outer query's group by collapses each partition into a single row.

blattimwind · on June 4, 2018

Window functions are super handy to turn a correlated subquery into a join, because they make group-wise limits fairly easy (just put a ROW_NUMBER() concerning the relevant partition, i.e. what would be your GROUP BY in the correlated subquery, in your query and select from that query filtering on rank; then join on whatever you previously correlated on).

Basically: SELECT ... FROM (SELECT ..., ROW_NUMBER() OVER (PARTITION BY my_group [ORDER BY...]) AS rank) AS foo WHERE foo.rank <= 10

Without window functions you can usually only do this using a correlated subquery, which usually is much slower for fairly obvious reasons.

da_chicken · on June 4, 2018

Yup, if all you ever learn is one windowing function, the "greatest N per group" pattern with ROW_NUMBER() is easily the most common.

btilly · on June 4, 2018

That's another common variation, using a where in the outer query to select only a few of the rows from the window function.

And I've also used them as a better performing version of a full outer join - do a select of the combined fields from a window function query that is selected off of a UNION ALL. You have to jump a lot of hoops to do it, but the performance win can be worthwhile.

dbkaplun · on June 4, 2018

There's a nice set of challenges for learning window functions and testing your window function knowledge: https://www.windowfunctions.com/

pjungwir · on June 4, 2018

I've noticed that ntile(n) can give unexpected results when there are fewer input rows than n, so cume_dist might be a safer choice (or percentile_disc or percentile_cont). I wrote some more details here:

https://stackoverflow.com/questions/27883999/how-to-calculat...

This behavior means that ntile(100) is not the same thing as percentile.

But I've always wondered if I was just misunderstanding something. I wonder what others think?

da_chicken · on June 4, 2018

ntile() is one of those functions that just never seems to work the way users expect regardless of the RDBMS. I don't know if the spec's definition was just poorly thought out or what, but it's often flaky, poorly performing, or both in many RDBMSs.

In PostgreSQL, you should generally use cume_dist(), percentile_cont() and percentile_disc(). Most later releases of other RDBMS have similar window functions that work more like users expect. I'm not certain what's up, but it's definitely led me to believe that ntile() was just poorly thought out at the spec level.

pfarrell · on June 5, 2018

I had a project where I had to build a tool to relocate customer data to another database (for customers who had grown large enough to prevent them from sharing db resources with other customers). This was on SQL Server. I used NTile to help create constant size batches to feed into my etl tool. Worked great for that. Haven't used it for statistical bucketing purposes.