Hacker News new | past | comments | ask | show | jobs | submit login
More on Postgres Performance (craigkerstiens.com)
154 points by icey on Jan 10, 2013 | hide | past | favorite | 42 comments



Worth reading. I had no idea you can do conditional indexes. Is that something other dbs support? (looks like mysql doesn't)


I think postgres calls them "partial indexes". See http://www.postgresql.org/docs/9.2/static/indexes-partial.ht...

... and they're great! Postgres is very smart about using them, too, so they can be a lifesaver when you have a large table of which you only [typically] use a small bit.

I've created partial indexes over geospatial data. There are 1M doctors in the US, so answering the question "how many doctors are in San Francisco" requires a join between indexes of all doctors in the US and all businesses in SF. Using partial indexes, we carved up the country into a bunch of polygons and were able to frame the question as "how many doctors are in San Francisco in polygon 34?". Much faster.


I've actually seen cases where a partial index can cause query performance to fall on its face.

Just recently, we had a case with a partial index where the predicate was a boolean field, but the indexed field was a timestamp, sorted descending. The planner chose that index when given a query that had an ORDER BY ... DESC on the timestamp, and query times spiked several orders of magnitude. Dropping the index (which was only used by that query) caused it to drop from 2.5s to .7ms.

Yes, indexes are that powerful.


Because of irritating query planner outcomes like these, I've always wished postgresql just offered an API where you hand it the query plan. It seems like the bulk of questions from experienced users of postgresql (and all other SQL databases, really) relate to how to express the query to the planner does what you wanted it to do in the first place. Seems easier to just hand it the plan when you know what you want.


Oracle and DB2 both offer ways to provide hints to the database about how you want the query processed, and the ability to store those hints independently of the query - Stored Outlines and Plan Stability being the features in question. This is particularly useful when you can't risk the plan changing in production and screwing you over - in that situation it's preferable to experience a gradual degradation->optimisation cycle personally.

It's one thing that I find it frustrating that Postgres (which I otherwise love) doesn't offer - I understand that they don't like optimiser hints, but at least the ability to say 'keep using the current plan until I give you permission to do otherwise' would be invaluable.


I've seen that on non-partial indexes when doing and order by query that has an index. The query planner sometimes preferentially chooses the ordering index to do an index scan, then filters out all the items that don't matche the query predicate. This is great when you're likely to take the last 10 timestamps, but falls flat when you have to check 100k items.

One sneaky way to deal with this is to order by an expression that can't be indexed, like id+0. I've had 4 order of magnitude speed ups when that sort of query tweaking pointed the planner at the right index. (And in that case, a partial index)


Interesting. Did it choose a better suited index? If so, could you have created a combination of the two to get optimal results? (I guess at .7ms you wouldn't even bother looking)


Was it "several orders of magnitude", or just the factor of ~3.5 represented by the 2.5s -> 0.7s change; it can't be both.


0.7 milliseconds.


I'm impressed at my ability to misread that. My bad.


I wonder how the performance of your partial index would compare to a pgsql materialized view implementation? [0] Did you by chance explore such an option?

[0]: http://www.pgcon.org/2008/schedule/attachments/63_BSDCan2008...


One stuff to keep in mind is that Partial index, index are internal implementation of the database. If you start using materialized view it is visible to the application.

I think you are adding another entity that you must now support and that the application must specifically call.

If your application supports multiple databases, i don't think you want to have specific code to support optimization. Of course you need to do this :)


Materialized views are inherent to oracle. They might refer to them as snapshots. In mssql they are known as indexed views. There is no implementation in pgsql but it is often discussed [0]. I gave up on data portability for nontrivial applications a long time ago. If you mean instead a view across multiple databases, it might be possible using pgsql foreign data wrappers but definitely not performant.

[0]: http://wiki.postgresql.org/wiki/Materialized_Views


you could try PostGIS's spatial indices for a less manual/static approach.

spatial indices of the R-tree variety work quite well as long as the dimensionality isn't too high (i.e., your 2D data would be good).

note: the original 1984 paper on r-trees is a good one (although the performance section is a bit weak): http://postgis.refractions.net/support/rtree.pdf


That's a really nice use-case, thanks for detailing it.


You can also do "functional" indexes, where the index predicate is the result of running a function against the indexed column (e.g., UPPER(text_field), or the like). One important caveat: the function must be defined as "immutable" — that is, it returns the same value for a given input whenever it's invoked.

Even better: you can combine functional and conditional indexes.


(Just for reference/googleability, some other conventional names for such functions are pure or referentially transparent.)


(In this case, "immutable" is not just a property of the function, but the exact keyword you must use while defining the function.)


I learned yesterday that Oracle also supports this. It's a VERY useful feature!


Idempotent, not immutable.


A function f is idempotent if

  f(f(x)) = f(x)
Which is not what's needed here.


Yes, we call that idempotent, but in PostgreSQL, such functions are called (and must be declared) "immutable".


Actually, the docs explain better, no modification + same result, so it's actually immutable and idempotent.

IMMUTABLE indicates that the function cannot modify the database and always returns the same result when given the same argument values.


Idempotent is still the wrong word. Idempotent means that the function can be composed multiple times with no effect. The term I think you are looking for is that it must be a pure function.


"Idempotent means that the function can be composed multiple times with no effect."

Close, but not quite. The first application may have an effect. The second application and so on may not.

For example, the HTTP spec defines GET as idempotent. An initial GET request for a URI may have a side-effect, but, per the spec, subsequent requests return the same content and have no further side-effects.

Edit - reading the GP, I want to clarify that an immutable function in PostgreSQL is idempotent, since the N+1 applications have no side effect and return the same result as the 1st. Also, the 1st has no side effect.


    per the spec, subsequent requests return the same content
That's not the sense in which HTTP GET is idempotent. The spec says nothing about returning the same content. In fact, servers are free to return uncacheable content that changes on every GET request, and many do so (for example, a page that contains the current time is a valid HTTP GET response). The HTTP definition of idempotence only applies to side-effects on the server.

You can formally reconcile this definition with the mathematical definition as follows: an HTTP request is expected to cause some operation on the server (an operation is a function with the same domain and range). This operation takes the server from one state X to a new state Y. Since HTTP GET is supposed to be idempotent, if you make a GET request to a server in state X, it is free to transition into some state Y != X, but if you reissue the request to the server in state Y, it should remain in state Y.

So basically, an IMMUTABLE function in PostgreSQL is trivially idempotent in terms of operations on the server, in that it can have no effect at all on the state of the database. But as far as mathematical properties of the function go, it does not need to be idempotent.


I agree, and I misspoke on the point which you corrected me on.


eh!?

A pure function is an idempotent function which has no side-effects. For example no logging.

You can have an impure idempotent function: a function always producing the same output for given input but that is impure because it has side-effects, like logging or sending a heartbeat on the network or whatever.

But you cannot have a pure non-idempotent function.

Any idempotent function --and hence any pure function-- shall always be giving the same output given some input (and hence can be cached etc.) but the idempotent function can have side-effects (but not side-effects that would modify the "idempotentness" of the function itself.


A pure function does not need to be idempotent. For example f(x) = x * 2 is a pure function, but applying it twice multiplies by four instead of two.

In computer science, another sense of the word idempotent has arisen, which relates to operations on a mutable identity. In this sense, you might say "UPDATE table SET col1=0 WHERE id=25" is an idempotent operation on databases: applying it to a database in some state results in a new database state, but applying it again to that new database state results in that same database state again (this is the sense in which HTTP GET requests are idempotent, for example).

An IMMUTABLE function is a pure function, at least with respect to the database, since it cannot mutate the database which is the side effect we care about. It is not necessarily idempotent: given that f is an IMMUTABLE function, and f(2) == 4, I can't say for sure that f(4) == 4.


Getting support for partial indexes from within Rails is one of the reasons I wrote db_leftovers, which lets you manage indexes, foreign keys, and CHECK constraints from a simple DSL:

https://github.com/pjungwir/db_leftovers

Although db_leftovers works with Postgres and MySQL, partial indexes only work with Postgres, because MySQL doesn't have them.


Looks very cool. Is it stable and ready for wide use or still a bit experimental?

p.s. Definitely agree about not changing migrations after 'git push'. Better to make a new reversing migration to change things back if needed and yet another to get it right.


I've used it regularly on a few projects for the last few months or so, but the Postgres code is definitely better exercised than the MySQL code. There are pretty good tests for both though.


Oracle does not support partial indexes but since Oracle does not index NULL values you can emulate it with a functional index. (Only really useful if used for unique indexes, otherwise to kludgey.) I do not think DB2 supports partial indexes either, and I have no idea if they can be emulated.

The only two I know support them are PostgreSQL and MSSQL.


Oracle has index sharding though - would that help in the use case of a very large table where typically only a specific, small chunk of it is accessed?


I believe support for this is fairly common among the usual suspects but is by no means universal, as you point out.


There's a variety of great features similar to this detailed in a couple of Why Postgres posts. The conditional index is one which is mentioned in here - http://www.craigkerstiens.com/2012/05/07/why-postgres-part-2...


That's a good run down that I had skimmed through before. I must have missed the conditional index section.


MSSQL calls them filtered indexes. Best thing since sliced bread.


Tangent: I met Craig yesterday randomly in a coffee shop. He seemed really smart and gave me a few Postgres performance tips. I didn't know he was a big deal. I guess those are the cool twists of fate people talk about happening in the Bay Area.


I met him a few months ago in Oxford, UK when he was speaking. Really enjoyed his talk and he was very approachable in the bar afterwards.


I thought ratio based tuning is old-fashioned.


tl;dr

Use EXPLAIN ANALYZE and read the manual.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: