Hacker News new | past | comments | ask | show | jobs | submit login
Representing Trees in PostgreSQL (woss.name)
164 points by mathie on Nov 21, 2014 | hide | past | favorite | 58 comments



A good book on the subject is Joe Celko's Trees and Hierarchies in SQL for Smarties.

http://www.amazon.com/Hierarchies-Smarties-Edition-Kaufmann-...

He spends a chapter on each of the models outlined in this post: adjacency, path, and nested set models.


If you're storing a tree in an RDBMS, please look into the closure table algorithm rather than adjacency, nested set, or materialized paths.

If you're using Rails (3.2 through 4.1), try this: http://mceachen.github.io/closure_tree/

Like it says in the README:

* Fetch your whole ancestor lineage in 1 SELECT. * Grab all your descendants in 1 SELECT. * Get all your siblings in 1 SELECT. * Fetch all descendants as a nested hash in 1 SELECT. * Find a node by ancestry path in 1 SELECT. * 2 SQL INSERTs on node creation * 3 SQL INSERT/UPDATEs on node reparenting

None of the other approaches above have even remotely similar performance characteristics. If your tree is small (tens of nodes), you won't care. If it's bigger, you will.


I wasn't aware of closure trees before, thanks. The presentation that you link to by Bill Karwin, along with a few other resources are in the comments by coleifer and chdir.


I've done nested set before; it's interesting - very fast for queries, but requires a lengthy insert/update cost. Also, team members were absolutely clueless as to what was really going on. I'm not sure nested sets are really much faster than what modern rdbms's can provide today.


In addition to expensive insert/update, it is necessary to keep the left and right boundaries across ALL of your entries in perfect order. If the boundaries get out of whack, fixing the tree is a nightmare scenario.

I worked on a multi-tenant application with distinct trees present in one table and with one tree per table and so on. Fun fun fun!


I think the nested intervals model, a refinement of nested sets, solves this slow update problem: http://www.rampant-books.com/art_vadim_nested_sets_sql_trees...

But I've never had to use it, so I am just guessing.

Same article, different site: https://communities.bmc.com/docs/DOC-9902

And a paper: http://www.sigmod.org/publications/sigmod-record/0506/p47-ar...

Here's a comparison of the different approaches in a matrix: http://vadimtropashko.wordpress.com/2008/08/09/one-more-nest...


One just needs to be aware that nested sets eventually hit a limit.

I experimented with a variation by Vadim Tropashko using something called "Farey fractions" [1]. These represent the intervals as a 2x2 matrix of four integers rather than two floating point values.

The numbers are effectively limited to 32-bit values in the matrix since a multiplication is required (resulting in 64-bit intermediate results).

It was very interesting, but couldn't model a file system hierarchy well. It could roughly 10^32 items in the best case, but a very small number (hundreds) in edge cases.

For example, it maxes out at a depth of 17 with 100 items at each level, or a depth of 34 with 10 items each. This might be fine for modeling some hierarchies, but definitely not a file system. The edge case is if the fraction extends along one edge. So if we have 10 items per level, and add a child at the leftmost edge each time, we create the most costly fractional subdivision. Do this 35 times and you hit an math overflow.

[1] Check out the Chapter 5 and Errata links: http://vadimtropashko.wordpress.com/“sql-design-patterns”-bo...


There is also ltree http://www.postgresql.org/docs/9.3/static/ltree.html

Adjacency lists also don't perform that badly with recursive queries in my experience.


This.

The project I'm on used materialized paths, which lead to great pain. I investigated nested intervals ... and they could't achieve the tree depths we needed (we were modeling a file system tree).

We are back to adjacency lists (using a parent ID) but redesigned to avoid the need for recursive ancestor and descendant queries.

_But_ the RDBMS doesn't support recursive queries and I've been curious about PostgreSQL's recursive query support. I played with it, but not on a fully loaded database with deep trees of data.

Does PostgreSQL recursive query support work well with deep trees (> 100 levels) on tables with tens of millions or more rows?


Can you elaborate on how materialized path caused pain? Was it more performance or maintenance? I would have expected MP to be a good fit for modeling a file system.


First, renames and moves require updating all descendants.

Second, the materialized path's length exceeded the database's indexable length limit. (MySQL, the DB in question, has a default index limit of 767 bytes, so only first 767 bytes are indexed.)

There are ways around these issues (like using "UPDATE ... WHERE" rather than using an ORM to walk the tree and update... sigh). We also had other app-specific / design-specific issues too that swamped these issues performance wise.

I'm hopefully we won't need ancestor and descendant queries again in our re-design. But I'm keeping PostgreSQL with its Common Table Expression stuff--the thing that facilities recursive queries--in my back pocket. It's that or use a stored procedure to build the capability by hand.


Here is a blog post I wrote about using a recursive query to pull out this kind of tree:

http://illuminatedcomputing.com/posts/2014/09/postgres-cte-f...

The problem I was tackling was making the tree sorted like you see in threaded, scored comments.


Another option for storing tree structures is the closure table [1]. It acts as a sort of "many-to-many" junction table between the tree nodes, storing the relationships between each.

[1] https://coderwall.com/p/lixing/closure-tables-for-browsing-t...



Other comments have mentioned this as well, but recursive CTEs are a very effective technique for representing trees, and they work very well in PostgreSQL.


We have used these for our discussion features on Pathwright. With ltree, you don't have as much flexibility when sorting by multiple values. With the Recursive CTE, you can go nuts and still end up very efficient.

I believe Disqus uses this, as well: http://cramer.io/2010/05/30/scaling-threaded-comments-on-dja...


We are using CTEs in SQL Server (2008 pre R2), with a not too large object graph, and find the query performance inadequate for our realtime needs. We ended up flattening and caching graph representations in order to remove the bottleneck. The CTE query was the starting point to further queries throughout our applications so it benefitted all applications to introduce this layer.

Of course, then we had two problems on our hands. :)


There's a great Ruby on Rails gem called acts_as_sane_tree (after the non-recursive acts_as_tree) that uses postgresql's recursive queries. I'm using it on a project and have found it useful with good performance: https://github.com/chrisroberts/acts_as_sane_tree


Looks like the main fork hasn't been updated since 2012. Is there an updated version you recommend for Rails 4 compatibility?


Hm, I didn't realize that. I'm using it on a Rails 4 project and haven't had any problems.


I've used gem awesome_nested_set with lots of success in the past. https://github.com/collectiveidea/awesome_nested_set. Updated quite often, Rails 4 supported


Yeah, the Nested Set model is cool but it does not use recursive postgresql queries under the hood: http://mikehillyer.com/articles/managing-hierarchical-data-i...


I've used django-mptt to represent a tree of human phenotypes before. It was fast and quite an easy api to use. I was using PostgreSQL as the backend.

https://www.djangopackages.com/packages/p/django-mptt/


I've also used django-mptt and just wanted to throw another vote behind it in case anyone is considering using it. Our data structure was a good fit for MPTT (many reads, few writes) so I can't comment on how it would behave in the opposite scenario. Nice API, very easy to use from a developer standpoint. We were using MySQL.


Likewise. I've done some heavy modification ontop of django-mptt to support limiting tree depth when querying, calculating total children, querying siblings, etc...

https://gist.github.com/jhgg/32a379e34c8a56303295


First off: The first example is exactly what a relational database is for. It's a "tree" structure only because it's several 1:many joins. It is true that ORMs aren't that amenable to composition and lots of dynamic joins, but that's not the fault of the database, it's the fault of the ORMs.

That being said, I've tried all these methods before across a few different DBMS.

IMO a good way to go about this all is to actually just reimplement a file system; You have a caching virtual file system in your application, and store data as key:parentKey:name (equivalent-ish dentry:parentDentry:fileName) in every table which contains child nodes. It's fast, often more predictable, and definitely more portable (as you aren't relying on DMBS-specific constructs). It's also amenable to partitioning/sharding by parent key. You can drastically reduce the amount of queries that are being sent. Also, if you use a b+tree as an index for your paths, you can invalidate cache/subtrees pretty fast in your application.

Of course, you end up duplicating functionality of the database, and if there is a lot of latency between you and the DB, this might not be the best method (or it might, depending).

It would be nice if MySQL finally supported CTEs.


There's also the ltree module which is built-in to postgres: http://www.postgresql.org/docs/9.1/static/ltree.html


Does anyone remember how Drupal handles the hierarchical comments? It maintains a "sort key" field for each comment, which consists of multiple index numbers for all levels, much like the section numbers in Wikipedia.

1

1.1

1.2

2

2.1

2.1.1

2.1.2

2.1.2.1

In this way, displaying comments in a tree form is trivial. Just ORDER BY the sort key. I find it brilliant for applying to such an application.


This pattern is commonly called "materialized path", if anyone is trying to search for it.


A similar system is also heavily used in the construction world in their "Work Breakdown Structure" - a borrowing on how they document airplane parts that are part of an assembly, that are part of a module, that part of a structure.

http://en.wikipedia.org/wiki/Work_breakdown_structure


Thanks for the name! We do this internally to store various levels of product data in a single mongo collection (e.g., style, variant).


Thanks! I wanted to know what it is called.


But beware of what may happen when you get over 9:

  1
  1.1
  1.2
  10
  2
  2.1
To support that, you will have to be careful when picking collation order, and binary collation will not do it. It may be better to use fixed-width numbers with leading zeroes instead, but that, of course, may unnecessarily grow the space used by the table.

And of course, that also is what RCS and SCCS use for numbering revisions.


Or use your decimal as a place extension instead of a hierarchal separator. Your tree would look like this:

  1 
  11
  12
  1.0
  2
  21
Then you just ensure that '.' sorts after all digits. (Probably by using a different symbol.) Basically, the '.' means "insert after."


I've done research on this and other problems, based on Drupal's solutions: https://bojanz.wordpress.com/2014/04/25/storing-hierarchical...


In postgres specifically you can use an int array field rather than a string


PostgreSQL has a specific type for this, ltree.


with recursive is actually VERY quick in postgresql (adjacency model). currently using that and thoroughly tested it. simple, very quick and no hassle to update it (as the other options).

Compared to a mysql with nested set postgres using with recursive is a life changer :D


There are some more tricks you can do using postgres' arrays to efficiently query the data: http://monkeyandcrow.com/blog/hierarchies_with_rails/

Namely using the && operator let's you make use of indices on the materialized paths. We've been using it in production for years to great effect.


> We could have retrieved all the appropriate data with a single query, but that means reconstructing the tree from a flat set of rows we got back from PostgreSQL. Doing that sort of thing in the view would be hienous.

Implement a view helper to take the flat data and return a hierarchical structure that you can render.


And store the data in PostgreSQL as a modified-preordered list so that the data is already in the order you need to efficiently recreate the object graph (note that this implies many reads for each write, since tree modifications become the costly operation).


If it's a small tree, you could encode it as a single JSON object and query on it using PostgreSQL's JSON support.

http://schinckel.net/2014/05/25/querying-json-in-postgres/


Random data point: I have a hierarchal authentication system under MySQL (no recursive query support). My schema does the most obvious thing of having user.parent as a user_id (adjacency model).

My most common query is to find whether user X is a child of user Y.

My solution was to use application-level triggers to maintain a separate lookup table, so i can simply do `user_id IN (SELECT child_id FROM lookup_table WHERE parent_id = Y)` as an additional search clause. With appropriate indexes it's very fast to query, and i can do partial updates to maintain the lookup table.

If my most common tree query was something else (e.g. enumerate children in sorted order) then i'd need some other data structure.


I don't know if ActiveRecord is capable of doing this, but here is how to do it with Django ORM:

in models.py:

    class Node(models.Model):
        parent = models.ForeignKey('self', related_name='children')

in the view:

    nodes = Node.objects.prefetch_related('children')

in template:

    {% for node in nodes %}
        {% for subnode in node.children.all %}
        ...
        {% endfor %}
    {% endfor %}

This will make exactly two queries; one for all the parent nodes, and one for the children nodes. Django will make the pairing automagically, so you don't have to do it in view code.


Unfortunately, this still only gets the children of the root. If you want to be able to recursively get all the children and grandchildren of a root, you're going to be doing (N - leaves) queries where N is the number of Nodes in the tree.

However, there is of course a django package to help with this and gathers all the nodes you care about into one query:

https://github.com/django-mptt/django-mptt


You can easily render a tree many levels deep in O(n) time using a single DB query as an adjacency list with a lookup dictionary:

http://blog.jupo.org/2010/01/26/linear-traversal-of-adjacenc...

No need for mptt either.


If you ever need to deal with trees deeper than one level in django, I'd highly recommend looking at (as mentioned in another comment):

https://www.djangopackages.com/packages/p/django-mptt/


What are the upsides to this vice using a graph database like Neo4j?


You get all the benefits of a mature and very full-featured RMDBS.

Neo4j is a great database, and I use it myself for a project that involves lots of deep hierarchies. But it has a fairly sparse feature-set where schema enforcement and data integrity checking are concerned; you basically have to add all that stuff yourself at the application level which can amount to a lot of work.


To second this, if you're using multiple languages to write to your database, you almost certainly want that schema enforcement at the DB level. We decided not to go with Neo as our canonical database for exactly that reason. Its main strength is probably as a follower or slave to a RDBMS or log file, which is how it seems to be used in enterprise - when you want to do graph-based analytics, you bulk-load a snapshot of your non-graph-stored DB into Neo, then run read-only workloads.


Another option is to add PostgreSQL support to Titan (http://thinkaurelius.github.io/titan/).


I used to use code based on Kendall Willets' code. His original site is offline. Archive here: https://web.archive.org/web/20110928135313/http://willets.or...

Line from the page: "I just picked up a copy and it looks great! You are right about the whole approach and my stuff stinks." - Joe Celko, author of SQL for Smarties.


Not sure about ActiveRecord, but Sequel has supported using recursive common table expressions for loading all descendants in a given branch (or branches) of a tree for over 4 years using the rcte_tree plugin: http://sequel.jeremyevans.net/rdoc-plugins/classes/Sequel/Pl...


> There’s also a twist on the Nested Sets model, called the nested interval model, where the nodes are given two numbers that represent the numerator and denominator of a fraction, but it doesn’t seem so popular, and was too complex to wrap my head around!

...and the author just wrote-off the best performing generic tree solution, because "it was too complex to wrap my head around!"

:facepalm:


Bojan Živanović and I have a writeup on this whole tree storage business https://bojanz.wordpress.com/2014/04/25/storing-hierarchical... here.


Covered this and several other techniques here: http://www.slideshare.net/quipo/trees-in-the-database-advanc...


A graph database comes to mind. It's a different design problem when dealing with deep relationships vs top level relationships. Sometimes you need more deep inherent relational mappings for alogithms to be effecient.


Should be titled representing paths. Trees are easy: id, parent_id. Paths are tricky.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: