Representing Trees in PostgreSQL

Todd · on Nov 21, 2014

A good book on the subject is Joe Celko's Trees and Hierarchies in SQL for Smarties.

http://www.amazon.com/Hierarchies-Smarties-Edition-Kaufmann-...

He spends a chapter on each of the models outlined in this post: adjacency, path, and nested set models.

mceachen · on Nov 22, 2014

If you're storing a tree in an RDBMS, please look into the closure table algorithm rather than adjacency, nested set, or materialized paths.

If you're using Rails (3.2 through 4.1), try this: http://mceachen.github.io/closure_tree/

Like it says in the README:

* Fetch your whole ancestor lineage in 1 SELECT. * Grab all your descendants in 1 SELECT. * Get all your siblings in 1 SELECT. * Fetch all descendants as a nested hash in 1 SELECT. * Find a node by ancestry path in 1 SELECT. * 2 SQL INSERTs on node creation * 3 SQL INSERT/UPDATEs on node reparenting

None of the other approaches above have even remotely similar performance characteristics. If your tree is small (tens of nodes), you won't care. If it's bigger, you will.

Todd · on Nov 23, 2014

I wasn't aware of closure trees before, thanks. The presentation that you link to by Bill Karwin, along with a few other resources are in the comments by coleifer and chdir.

platz · on Nov 21, 2014

I've done nested set before; it's interesting - very fast for queries, but requires a lengthy insert/update cost. Also, team members were absolutely clueless as to what was really going on. I'm not sure nested sets are really much faster than what modern rdbms's can provide today.

tempVariable · on Nov 21, 2014

In addition to expensive insert/update, it is necessary to keep the left and right boundaries across ALL of your entries in perfect order. If the boundaries get out of whack, fixing the tree is a nightmare scenario.

I worked on a multi-tenant application with distinct trees present in one table and with one tree per table and so on. Fun fun fun!

alistairbayley · on Nov 21, 2014

I think the nested intervals model, a refinement of nested sets, solves this slow update problem: http://www.rampant-books.com/art_vadim_nested_sets_sql_trees...

But I've never had to use it, so I am just guessing.

Same article, different site: https://communities.bmc.com/docs/DOC-9902

And a paper: http://www.sigmod.org/publications/sigmod-record/0506/p47-ar...

Here's a comparison of the different approaches in a matrix: http://vadimtropashko.wordpress.com/2008/08/09/one-more-nest...

rickmode · on Nov 22, 2014

One just needs to be aware that nested sets eventually hit a limit.

I experimented with a variation by Vadim Tropashko using something called "Farey fractions" [1]. These represent the intervals as a 2x2 matrix of four integers rather than two floating point values.

The numbers are effectively limited to 32-bit values in the matrix since a multiplication is required (resulting in 64-bit intermediate results).

It was very interesting, but couldn't model a file system hierarchy well. It could roughly 10^32 items in the best case, but a very small number (hundreds) in edge cases.

For example, it maxes out at a depth of 17 with 100 items at each level, or a depth of 34 with 10 items each. This might be fine for modeling some hierarchies, but definitely not a file system. The edge case is if the fraction extends along one edge. So if we have 10 items per level, and add a child at the leftmost edge each time, we create the most costly fractional subdivision. Do this 35 times and you hit an math overflow.

[1] Check out the Chapter 5 and Errata links: http://vadimtropashko.wordpress.com/“sql-design-patterns”-bo...

kpmah · on Nov 21, 2014

There is also ltree http://www.postgresql.org/docs/9.3/static/ltree.html

Adjacency lists also don't perform that badly with recursive queries in my experience.

rickmode · on Nov 21, 2014

This.

The project I'm on used materialized paths, which lead to great pain. I investigated nested intervals ... and they could't achieve the tree depths we needed (we were modeling a file system tree).

We are back to adjacency lists (using a parent ID) but redesigned to avoid the need for recursive ancestor and descendant queries.

_But_ the RDBMS doesn't support recursive queries and I've been curious about PostgreSQL's recursive query support. I played with it, but not on a fully loaded database with deep trees of data.

Does PostgreSQL recursive query support work well with deep trees (> 100 levels) on tables with tens of millions or more rows?

andrewflnr · on Nov 21, 2014

Can you elaborate on how materialized path caused pain? Was it more performance or maintenance? I would have expected MP to be a good fit for modeling a file system.

rickmode · on Nov 22, 2014

First, renames and moves require updating all descendants.

Second, the materialized path's length exceeded the database's indexable length limit. (MySQL, the DB in question, has a default index limit of 767 bytes, so only first 767 bytes are indexed.)

There are ways around these issues (like using "UPDATE ... WHERE" rather than using an ORM to walk the tree and update... sigh). We also had other app-specific / design-specific issues too that swamped these issues performance wise.

I'm hopefully we won't need ancestor and descendant queries again in our re-design. But I'm keeping PostgreSQL with its Common Table Expression stuff--the thing that facilities recursive queries--in my back pocket. It's that or use a stored procedure to build the capability by hand.

pjungwir · on Nov 21, 2014

Here is a blog post I wrote about using a recursive query to pull out this kind of tree:

http://illuminatedcomputing.com/posts/2014/09/postgres-cte-f...

The problem I was tackling was making the tree sorted like you see in threaded, scored comments.

_pgmf · on Nov 21, 2014

Another option for storing tree structures is the closure table [1]. It acts as a sort of "many-to-many" junction table between the tree nodes, storing the relationships between each.

[1] https://coderwall.com/p/lixing/closure-tables-for-browsing-t...

chdir · on Nov 22, 2014

+1, Related reading (all from Bill Karwin of SQL Antipatterns fame):

http://stackoverflow.com/questions/192220/what-is-the-most-e...

http://www.slideshare.net/billkarwin/models-for-hierarchical...

http://karwin.blogspot.in/2010/03/rendering-trees-with-closu...

rosser · on Nov 21, 2014

Other comments have mentioned this as well, but recursive CTEs are a very effective technique for representing trees, and they work very well in PostgreSQL.

gtaylor · on Nov 21, 2014

We have used these for our discussion features on Pathwright. With ltree, you don't have as much flexibility when sorting by multiple values. With the Recursive CTE, you can go nuts and still end up very efficient.

I believe Disqus uses this, as well: http://cramer.io/2010/05/30/scaling-threaded-comments-on-dja...

crescentfresh · on Nov 21, 2014

We are using CTEs in SQL Server (2008 pre R2), with a not too large object graph, and find the query performance inadequate for our realtime needs. We ended up flattening and caching graph representations in order to remove the bottleneck. The CTE query was the starting point to further queries throughout our applications so it benefitted all applications to introduce this layer.

Of course, then we had two problems on our hands. :)

techtalsky · on Nov 21, 2014

There's a great Ruby on Rails gem called acts_as_sane_tree (after the non-recursive acts_as_tree) that uses postgresql's recursive queries. I'm using it on a project and have found it useful with good performance: https://github.com/chrisroberts/acts_as_sane_tree

btown · on Nov 21, 2014

Looks like the main fork hasn't been updated since 2012. Is there an updated version you recommend for Rails 4 compatibility?

techtalsky · on Nov 21, 2014

Hm, I didn't realize that. I'm using it on a Rails 4 project and haven't had any problems.

elliotec · on Nov 21, 2014

I've used gem awesome_nested_set with lots of success in the past. https://github.com/collectiveidea/awesome_nested_set. Updated quite often, Rails 4 supported

techtalsky · on Nov 21, 2014

Yeah, the Nested Set model is cool but it does not use recursive postgresql queries under the hood: http://mikehillyer.com/articles/managing-hierarchical-data-i...

savagej · on Nov 21, 2014

I've used django-mptt to represent a tree of human phenotypes before. It was fast and quite an easy api to use. I was using PostgreSQL as the backend.

https://www.djangopackages.com/packages/p/django-mptt/

robertfw · on Nov 21, 2014

I've also used django-mptt and just wanted to throw another vote behind it in case anyone is considering using it. Our data structure was a good fit for MPTT (many reads, few writes) so I can't comment on how it would behave in the opposite scenario. Nice API, very easy to use from a developer standpoint. We were using MySQL.

jhgg · on Nov 21, 2014

Likewise. I've done some heavy modification ontop of django-mptt to support limiting tree depth when querying, calculating total children, querying siblings, etc...

https://gist.github.com/jhgg/32a379e34c8a56303295

batbomb · on Nov 21, 2014

First off: The first example is exactly what a relational database is for. It's a "tree" structure only because it's several 1:many joins. It is true that ORMs aren't that amenable to composition and lots of dynamic joins, but that's not the fault of the database, it's the fault of the ORMs.

That being said, I've tried all these methods before across a few different DBMS.

IMO a good way to go about this all is to actually just reimplement a file system; You have a caching virtual file system in your application, and store data as key:parentKey:name (equivalent-ish dentry:parentDentry:fileName) in every table which contains child nodes. It's fast, often more predictable, and definitely more portable (as you aren't relying on DMBS-specific constructs). It's also amenable to partitioning/sharding by parent key. You can drastically reduce the amount of queries that are being sent. Also, if you use a b+tree as an index for your paths, you can invalidate cache/subtrees pretty fast in your application.

Of course, you end up duplicating functionality of the database, and if there is a lot of latency between you and the DB, this might not be the best method (or it might, depending).

It would be nice if MySQL finally supported CTEs.

dharbin · on Nov 21, 2014

There's also the ltree module which is built-in to postgres: http://www.postgresql.org/docs/9.1/static/ltree.html

barosl · on Nov 21, 2014

Does anyone remember how Drupal handles the hierarchical comments? It maintains a "sort key" field for each comment, which consists of multiple index numbers for all levels, much like the section numbers in Wikipedia.

1

1.1

1.2

2

2.1

2.1.1

2.1.2

2.1.2.1

In this way, displaying comments in a tree form is trivial. Just ORDER BY the sort key. I find it brilliant for applying to such an application.

twerquie · on Nov 21, 2014

This pattern is commonly called "materialized path", if anyone is trying to search for it.

diplomatpuppy · on Nov 21, 2014

A similar system is also heavily used in the construction world in their "Work Breakdown Structure" - a borrowing on how they document airplane parts that are part of an assembly, that are part of a module, that part of a structure.

http://en.wikipedia.org/wiki/Work_breakdown_structure

vkjv · on Nov 21, 2014

Thanks for the name! We do this internally to store various levels of product data in a single mongo collection (e.g., style, variant).

barosl · on Nov 21, 2014

Thanks! I wanted to know what it is called.

Someone · on Nov 21, 2014

But beware of what may happen when you get over 9:

To support that, you will have to be careful when picking collation order, and binary collation will not do it. It may be better to use fixed-width numbers with leading zeroes instead, but that, of course, may unnecessarily grow the space used by the table.

And of course, that also is what RCS and SCCS use for numbering revisions.

Retra · on Nov 21, 2014

Or use your decimal as a place extension instead of a hierarchal separator. Your tree would look like this:

Then you just ensure that '.' sorts after all digits. (Probably by using a different symbol.) Basically, the '.' means "insert after."

bojanz · on Nov 21, 2014

I've done research on this and other problems, based on Drupal's solutions: https://bojanz.wordpress.com/2014/04/25/storing-hierarchical...

ketralnis · on Nov 21, 2014

In postgres specifically you can use an int array field rather than a string

jeltz · on Nov 21, 2014

PostgreSQL has a specific type for this, ltree.

m0ppers · on Nov 21, 2014

with recursive is actually VERY quick in postgresql (adjacency model). currently using that and thoroughly tested it. simple, very quick and no hassle to update it (as the other options).

Compared to a mysql with nested set postgres using with recursive is a life changer :D

netghost · on Nov 21, 2014

There are some more tricks you can do using postgres' arrays to efficiently query the data: http://monkeyandcrow.com/blog/hierarchies_with_rails/

Namely using the && operator let's you make use of indices on the materialized paths. We've been using it in production for years to great effect.

joesmo · on Nov 21, 2014

> We could have retrieved all the appropriate data with a single query, but that means reconstructing the tree from a flat set of rows we got back from PostgreSQL. Doing that sort of thing in the view would be hienous.

Implement a view helper to take the flat data and return a hierarchical structure that you can render.

smoyer · on Nov 21, 2014

And store the data in PostgreSQL as a modified-preordered list so that the data is already in the order you need to efficiently recreate the object graph (note that this implies many reads for each write, since tree modifications become the costly operation).

jimbokun · on Nov 21, 2014

If it's a small tree, you could encode it as a single JSON object and query on it using PostgreSQL's JSON support.

http://schinckel.net/2014/05/25/querying-json-in-postgres/

mappu · on Nov 21, 2014

Random data point: I have a hierarchal authentication system under MySQL (no recursive query support). My schema does the most obvious thing of having user.parent as a user_id (adjacency model).

My most common query is to find whether user X is a child of user Y.

My solution was to use application-level triggers to maintain a separate lookup table, so i can simply do `user_id IN (SELECT child_id FROM lookup_table WHERE parent_id = Y)` as an additional search clause. With appropriate indexes it's very fast to query, and i can do partial updates to maintain the lookup table.

If my most common tree query was something else (e.g. enumerate children in sorted order) then i'd need some other data structure.

kissgyorgy · on Nov 21, 2014

I don't know if ActiveRecord is capable of doing this, but here is how to do it with Django ORM:

in models.py:

    class Node(models.Model):
        parent = models.ForeignKey('self', related_name='children')

in the view:

    nodes = Node.objects.prefetch_related('children')

in template:

    {% for node in nodes %}
        {% for subnode in node.children.all %}
        ...
        {% endfor %}
    {% endfor %}

This will make exactly two queries; one for all the parent nodes, and one for the children nodes. Django will make the pairing automagically, so you don't have to do it in view code.

caust1c · on Nov 21, 2014

Unfortunately, this still only gets the children of the root. If you want to be able to recursively get all the children and grandchildren of a root, you're going to be doing (N - leaves) queries where N is the number of Nodes in the tree.

However, there is of course a django package to help with this and gathers all the nodes you care about into one query:

https://github.com/django-mptt/django-mptt

stephen_mcd · on Nov 21, 2014

You can easily render a tree many levels deep in O(n) time using a single DB query as an adjacency list with a lookup dictionary:

http://blog.jupo.org/2010/01/26/linear-traversal-of-adjacenc...

No need for mptt either.

herge · on Nov 21, 2014

If you ever need to deal with trees deeper than one level in django, I'd highly recommend looking at (as mentioned in another comment):

https://www.djangopackages.com/packages/p/django-mptt/

mattdeboard · on Nov 21, 2014

What are the upsides to this vice using a graph database like Neo4j?

mikesname · on Nov 21, 2014

You get all the benefits of a mature and very full-featured RMDBS.

Neo4j is a great database, and I use it myself for a project that involves lots of deep hierarchies. But it has a fairly sparse feature-set where schema enforcement and data integrity checking are concerned; you basically have to add all that stuff yourself at the application level which can amount to a lot of work.

btown · on Nov 21, 2014

To second this, if you're using multiple languages to write to your database, you almost certainly want that schema enforcement at the DB level. We decided not to go with Neo as our canonical database for exactly that reason. Its main strength is probably as a follower or slave to a RDBMS or log file, which is how it seems to be used in enterprise - when you want to do graph-based analytics, you bulk-load a snapshot of your non-graph-stored DB into Neo, then run read-only workloads.

threeseed · on Nov 22, 2014

Another option is to add PostgreSQL support to Titan (http://thinkaurelius.github.io/titan/).

ashmud · on Nov 22, 2014

I used to use code based on Kendall Willets' code. His original site is offline. Archive here: https://web.archive.org/web/20110928135313/http://willets.or...

Line from the page: "I just picked up a copy and it looks great! You are right about the whole approach and my stuff stinks." - Joe Celko, author of SQL for Smarties.

jeremyevans · on Nov 21, 2014

Not sure about ActiveRecord, but Sequel has supported using recursive common table expressions for loading all descendants in a given branch (or branches) of a tree for over 4 years using the rcte_tree plugin: http://sequel.jeremyevans.net/rdoc-plugins/classes/Sequel/Pl...

maaku · on Nov 21, 2014

> There’s also a twist on the Nested Sets model, called the nested interval model, where the nodes are given two numbers that represent the numerator and denominator of a fraction, but it doesn’t seem so popular, and was too complex to wrap my head around!

...and the author just wrote-off the best performing generic tree solution, because "it was too complex to wrap my head around!"

:facepalm:

chx · on Nov 21, 2014

Bojan Živanović and I have a writeup on this whole tree storage business https://bojanz.wordpress.com/2014/04/25/storing-hierarchical... here.

quipo · on Nov 22, 2014

Covered this and several other techniques here: http://www.slideshare.net/quipo/trees-in-the-database-advanc...

jaunkst · on Nov 22, 2014

A graph database comes to mind. It's a different design problem when dealing with deep relationships vs top level relationships. Sometimes you need more deep inherent relational mappings for alogithms to be effecient.

glibgil · on Nov 21, 2014

Should be titled representing paths. Trees are easy: id, parent_id. Paths are tricky.