It's always good to know at what point "Postgres as X" breaks down. For instance, I know from experience that Postgres as a time-series DB (without add-ons) starts to break down in the low billions of rows. It would be great to know that for graph DBs as well. I think a lot of people would prefer to just use Postgres if they can get away with it.
I've done almost exactly the kind of thing described in the article for a couple million rows. It broke down when I needed to find "friends of friends" to a depth of 6.
None of the optimizations I could come up with helped. Neither did Apache AGE.
Maybe I'm just not skilled enough to handle such a workload with Postgres. But Neo4j handled it easily.
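For anyone curious, the query shape that falls over here is roughly the following (a minimal sketch, assuming a hypothetical friendships(user_id, friend_id) edge table):

    -- Friends-of-friends to depth 6 with a plain recursive CTE. The working
    -- set grows roughly as (average friend count)^depth, which is where this
    -- kind of query falls over on a large social graph.
    WITH RECURSIVE fof AS (
        SELECT friend_id, 1 AS depth
        FROM friendships
        WHERE user_id = 42                      -- starting user (example id)
      UNION
        SELECT f.friend_id, fof.depth + 1
        FROM friendships f
        JOIN fof ON f.user_id = fof.friend_id
        WHERE fof.depth < 6
    )
    SELECT DISTINCT friend_id FROM fof;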
I'm not sure if you are serious or joking. "6 degrees of separation" is the famous radius for how many steps are needed such that everyone is connected.
Graph databases have a very narrow use case, and it's almost always in relation to people - at least in my experience.
Though the data type isn't really important for the performance question, the amount of data selected is. So a 6-level-deep graph of connections where each node only connects to 2-3 entities would never run into performance issues. You'd be able to go way beyond that too with such a narrow window. (3 connections per entity over a 6-level join comes out to 3^6 = 729 rows.)
If you're modeling something like movies instead, with >10 actors per production, you're looking at millions of rows.
We ended up using ClickHouse after trying Timescale and InfluxDB. ClickHouse is great, but it's important to spend a day or two understanding the data model to make sure it fits what you are trying to do. I have no affiliation with ClickHouse (or any company mentioned).
We’ve been using InfluxDB 1.x since 2017 and I’m itching to get off of it. Currently we’re at about a trillion rows, and we have to ship our data to Snowflake to do large-scale aggregations (like hourly averages). I put a bunch of effort into building this on top of InfluxDB and it’s not up to the task.
Any sense if ClickHouse can scale to several TB of data, and serve giant queries on the last hour’s data across all sensors without getting OOMKilled? We’re also looking at some hacked together DuckDB abomination, or pg_duck.
A trillion rows should be no issue at all for ClickHouse. Your use case sounds a bit more typical for it (our data is simulated so there is no single / monotonically increasing clock). I don't know though, is this ~100 kS/s data acquisition type stuff (e.g. sound or vibration data)? If so, it wouldn't be possible to push that into ClickHouse without pre-processing.
Yes, it is a columnar DB. A lot of things feel pretty familiar since it has a SQL-like query language. The data model is different, though.
The nice thing is that once you understand the data model, it becomes very easy to predict whether it will fit your use case, as there is really no magic to it.
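For a rough idea of what "understanding the data model" means in practice: almost everything hinges on the sorting key of a MergeTree table, since data is stored sorted by it and queries that filter on its prefix become range scans. A minimal sketch with made-up names:

    CREATE TABLE sensor_readings
    (
        sensor_id UInt32,
        ts        DateTime,
        value     Float64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)       -- one set of parts per month
    ORDER BY (sensor_id, ts);       -- the sparse primary index; filter on this prefix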
Timescale is definitely worth a look. Pg_partman gets you part of the way. We ended up going with bigquery for our workload because it solved a bigger bag of problems for our needs (data warehouse). It’s very hard to beat for big… queries.
I never understood the rationale behind TimescaleDB — if you’re building a time series database using row-oriented storage, you’ve already got one hand tied behind your back.
What does your testing strategy look like with bigquery? We use snowflake, but the only way to iterate and develop is using snowflake itself, which is so painful as to impact the set of features that we have the stomach to build.
Testing strategy? What’s that? I kid, but just a bit. Our use case is a data warehouse. We use DBT to build everything. Each commit is built in CI to a CI target project. Each commit gets its own hash prefixed in front of dataset names. Each developer also has their own prefix for local development. The dev and ci datasets expire and are deleted after like a week. We use data tests on the actual data for “foreign keys”, checking for duplicates and allowed values. But that’s pretty much it. It’s very difficult to do TDD for a data warehouse in sql.
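If it helps anyone, the per-commit/per-developer prefix can be wired up by overriding dbt's generate_schema_name macro; a rough sketch (the DATASET_PREFIX variable name is made up, the macro override itself is a standard dbt hook):

    -- macros/generate_schema_name.sql
    {% macro generate_schema_name(custom_schema_name, node) -%}
        {%- set base = custom_schema_name if custom_schema_name else target.schema -%}
        {#- DATASET_PREFIX is the commit hash in CI, or the developer's name locally -#}
        {{ env_var('DATASET_PREFIX', 'dev') ~ '_' ~ base | trim }}
    {%- endmacro %}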
My current headache is what to do with an actually big table, 25 billion rows of json, for development. It’s going to be some DBT hacks I think.
God help you if you want to unit test application code that relies on bigquery. I’m sure there are ways, but I doubt any of them are painless.
Interesting strategy with appending the commit hash to the dataset name. If one of those commits is known to be good and you want to “ship” it, do you then rename it?
What are you doing with that JSON? What’s the reason why you can’t get a representative sample of rows onto your dev machine and hack on that?
"Postgres as graph DB" starts to break down when you try to do serious network analysis with it, using specialized algorithms that are heavy on math - as opposed to merely using 'graphs' as the foundion of your data model, which is what graph databases mostly get used for. It's more about "what your actual use case is" than "how much data you have".
I also found this. Computing a transitive closure on a dataset that's not a pure DAG, for example, was absurdly painful in any SQL database I stored the data in. Where I could fit the data, it ended up being cheaper and much, much faster to just buy an old workstation with more than half a terabyte of RAM and do all the computations in memory using networkx and graph-tool.
I presume that for larger real-world graph datasets there are better algorithms and storage methods. I couldn't figure out neo4j fast enough, and it wasn't clear that I could map all of the stuff like block modeling into it anyway, but it would be very useful for someone to figure out a better production-ready storage backend for networkx, at least where some of the data could be cached in SQLite3.
I don’t think there’s a set, predictable level at which "Postgres as X" breaks down.
If you plan for it, for example by storing your graph in an export-friendly format, it should be OK.
Using Postgres as long as you can get away with it is a no-brainer. It’s had so many smart people working on it for so long that it really comes through. It’s nice to see the beginner/intro-level tools for it evolving quickly, similar to what has made MySQL approachable for so many years.
I had to use neo4j on a project some years ago and hated it (it didn’t help that the client forced its use for really dumb and bad reasons, wasting tons of time in the process) but Cypher is great. Only thing about it I liked. It’s very good.
Yeah Cypher was so good and made everything so intuitive. Loved working with Neo4J for a prototype and then had to face the licensing issues and huge costs and had to abandon it.
There are open source alternatives to Neo4J, and also very good, scalable open source graph databases with query languages similar to Cypher. I think we are in agreement that it is a good idea to choose a platform that scales both in compute and in all other costs, like licensing.
Not sure if everyone reading is aware, but Neo4j did end up turning Cypher into the openCypher spec, and that allowed a lot of other graph stores to adopt Cypher as a query language.
I got bitten by the same thing. We were looking at switching to Memgraph (also Cypher, but C++ rather than Java), but sadly abandoned it for other reasons ...
Unrelated to the tech, just the graph-based algorithm (recommender stuff) that we had in mind did not give us the results we were hoping for ... and other things were queued up to be tried ...
I agree with you about Cypher. For a couple of decades I was an RDF and then RDF + SPARQL advocate, but I am a fairly recent convert to property graphs and Cypher.
ltree is the one I ended up using. I don't like it much. The way it works allows you to set up a tree, but you can't easily move things around within the tree once they're in.
I forgot to say that it's also obviously only a tree, not a graph. It's still useful for some cases.
It has a very good way of querying, which is probably fast if you have very deep trees, but in the end we didn't need the deep trees, so it was chosen for the wrong reasons.
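For reference, the querying it is good at looks like this (a sketch against a hypothetical labels(path ltree) table with a GiST index on path):

    -- everything in the subtree under 'top.colors'
    SELECT * FROM labels WHERE path <@ 'top.colors';
    -- pattern across levels: any 'red' label anywhere under 'top'
    SELECT * FROM labels WHERE path ~ 'top.*.red'::lquery;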
I've been a big fan of graph databases for about 10 years. I even wrote my own "in-memory graph storage" in Golang for a specific use case that none of the big graph databases could cover at the time.
That said - I WISH people would embrace the existing graph databases more and push hosting providers to support them as standard, rather than abusing existing relational databases for graph purposes.
And to make it clear, I'm not talking about my own experimental one; I mean stuff like Neo4j, OrientDB, etc.
> rather than abusing existing relational databases for graph purposes.
In my experience, the vast majority of graphs can be embedded in relational databases just fine and most people don't want general graph querying. People just don't like optimizing queries (or equivalently the schema to enable such queries).
I personally have never seen a pitch for graph databases that makes them seem attractive for more than data exploration on your local machine.
Yep, something like adjacency lists have solved a lot of dumb recursive queries/"graph queries" for me, usually because the graph doesn't change much and is just a few nodes deep.
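For anyone unfamiliar, the adjacency-list pattern is nothing fancier than each row pointing at its parent, plus a short recursive CTE to walk a few levels (table and column names here are made up):

    CREATE TABLE categories (
        id        bigint PRIMARY KEY,
        parent_id bigint REFERENCES categories(id),
        name      text NOT NULL
    );
    CREATE INDEX categories_parent_idx ON categories (parent_id);

    -- fetch a node and everything below it
    WITH RECURSIVE subtree AS (
        SELECT id, name FROM categories WHERE id = 1
      UNION ALL
        SELECT c.id, c.name
        FROM categories c
        JOIN subtree s ON c.parent_id = s.id
    )
    SELECT * FROM subtree;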
Also, the last time I checked Neptune, it could only be run in AWS; there was no public application image for local tinkering/dev. But that was some years ago, to be fair - it may have changed.
I started with OrientDB, then switched to Neo4j. OrientDB was good for starting, tbf, but we're talking 10+ years back. Now I would definitely default to Neo4j.
That's like saying: why do we need JSON when we can also represent it in a text file, even though accessing values is trickier?
Sure, you can represent a graph DB in an RDBMS - but graph DBs are optimized for their specific use case. Locking of resources, for example, is optimized for that purpose. When going with an "Edge" and "Node" table approach, locking is by default rather unoptimized in many RDBMSes. Just one simple example.
> That's like saying: why do we need JSON when we can also represent it in a text file, even though accessing values is trickier?
If your text-processing stack works well enough, and JSON only solves edge cases, then I think it's a legitimate argument against adding another tech to the stack.
It's not like everyone immediately switched over to JSON, even if they could have.
And the analogy may not be the best - in the case of graph databases, they don't do what an RDBMS can do, at least not as efficiently.
So if we need some graph functionality in our RDBMS, do we split the database and use a "real" graph database, or just incorporate the graph functionality into the existing DB? The choice seems quite easy to me, because there is a whole host of issues with splitting DBs (duplication of information, inconsistency, etc.).
Well, I took the best example that came to mind at that moment. It may not be the best analogy, but it is what it is.
My point was not about people enriching an existing RDBMS setup with graph functionality; it was about using an RDBMS purely as a graph database. So maybe I wasn't exact enough in my definition. Therefore:
"If your purpose is to use the benefits of a graph and you want to use it for that purpose only, use a graph DB; don't take an RDBMS and make it a graph DB."
The issue with ltree is that it requires fanout on write when moving a node: you have to update every recursive descendant of the moved node to rewrite its ancestor path for the new location in the tree, so a move is O(subtree nodes). In exchange, you get to scan a subtree using a btree prefix search, which is great.
That makes it good for relatively static and flat data like your label hierarchy, which probably receives a new label or a reparenting within the tree relatively infrequently, and might have a p95 depth of 10. It's also a good fit for actual human (parent, child) relationships, since parent-child changes are usually infrequent and the number of children per layer of the tree is also small.
For a tree like a Notion workspace, where depth can be quite deep and layers can easily have hundreds or hundreds of thousands of siblings, the worst-case write cost of reparenting an ancestor isn’t feasible.
For something like a social graph I’m not sure how to use ltree at all, since a social graph isn’t a DAG.
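To make the fanout concrete: moving the subtree rooted at 'top.a.b' so it lives under 'top.x' looks roughly like this (a sketch against a hypothetical tree(path ltree) table), and the UPDATE touches every descendant row:

    -- rewrite the ancestor prefix of the moved node and all of its descendants
    UPDATE tree
    SET    path = 'top.x'::ltree || subpath(path, nlevel('top.a'::ltree))
    WHERE  path <@ 'top.a.b'::ltree;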
I've found the main advantage to this over a more specialized graph database is integration and maintenance alongside the rest of your application. Real world data is rarely just (or even mostly) graph data.
This is where you start to see major limitations of OLTP, and need to look at specialized graph DBs (or indexes really) that take advantage of index-free adjacency
This doesn't seem to cover infinite recursion, which is one of the ugliest parts of doing this type of query with CTEs. You effectively need to concatenate all of the primary keys you've seen into a string, and check it on each recursive iteration.
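Concretely, that guard usually ends up looking something like this (a sketch against a hypothetical edges(src, dst) table, using an array of visited keys instead of a string - same idea):

    WITH RECURSIVE walk AS (
        SELECT e.dst, ARRAY[e.src, e.dst] AS seen
        FROM edges e
        WHERE e.src = 1                        -- starting node (example id)
      UNION ALL
        SELECT e.dst, w.seen || e.dst
        FROM walk w
        JOIN edges e ON e.src = w.dst
        WHERE e.dst <> ALL (w.seen)            -- the infinite-recursion guard
    )
    SELECT DISTINCT dst FROM walk;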
The approach I came up with was to instead use MSSQL hierarchy IDs (there's a paper describing them, so they can be ported to any database). This specifically optimizes for ancestor-of/descendant-of queries. The gist of it is:
1. Find all cycles/strong components in the graph. Store those in a separate table, identifying each cycle with an ID. Remove all the nodes in the cycle, and insert a single node with their cycle ID instead (turning the graph into a DAG). The reason we do this is because we will always visit every node in a cycle when doing an ancestor-of/descendant-of query.
            +---+
  +-------> | B | ----+
  |         +---+     |
+-+-+         ^       v
| A |         |     +---+
+---+         |     | C |
              |     +---+
            +---+     |
            | D | <---+
            +---+
Becomes:
+---+     +---+
| A | --> | 1 |
+---+     +---+

+---+---+
| 1 | B |
| 1 | C |
| 1 | D |
+---+---+
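The bookkeeping for step 1 can be as small as a membership table mapping each condensed node back to the original nodes it stands for (a sketch, names made up):

    -- one row per (cycle, member); B, C and D above all map to cycle 1
    CREATE TABLE cycle_members (
        cycle_id bigint NOT NULL,
        node_id  bigint NOT NULL,
        PRIMARY KEY (cycle_id, node_id)
    );
    -- reverse lookup for expanding query results back to real nodes
    CREATE INDEX cycle_members_node_idx ON cycle_members (node_id);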
2. You effectively want to decompose the DAG into a forest of trees. Establish a worklist and place all the nodes into it. The order of this worklist can affect performance (you want the deepest/largest tree first). Grab nodes out of the worklist and perform a depth-first search, removing nodes from the worklist as you traverse them. These nodes can then be inserted into the hierarchy ID table. You should insert all child nodes of each node, even if a child has already been inserted before, but only continue traversing downwards if the node hasn't yet been inserted.
        +---+
  +---> | B | ---+
  |     +---+    |
+-+-+            v
| A |          +---+     +---+
+-+-+          | D | --> | E |
  |            +---+     +---+
  |     +---+    ^
  +---> | C | ---+
        +---+
Becomes:
        +---+     +---+     +---+
  +---> | B | --> | D | --> | E |
+-+-+   +---+     +---+     +---+
| A |
+-+-+   +---+     +---+
  +---> | C | --> | D |
        +---+     +---+
Or, as hierarchy IDs
A /1
B /1/1
D /1/1/1
E /1/1/1/1
C /1/2
D /1/2/1
Now, you do still need a recursive CTE - but it's guaranteed to terminate. The "interior" of the CTE would be a range operation over the hierarchy IDs. Basically other.hierarchy >= origin.hierarchy && other.hierarchy < next-sibling(origin.hierarchy) (for descendant-of queries). You need to recurse to handle situations involving nodes like D in the example above (if we started at C, we'd get D in the first iteration, but would need to recurse from D's other placement in order to find E).
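In Postgres terms, with the hierarchy IDs stored as strings like '/1/1/1' in a hypothetical hierarchy(node_id, path) table (one row per placement of a node), the descendant-of query comes out roughly as below; the LIKE 'prefix%' predicate is the same btree range as the >= / < next-sibling comparison, given a C collation or text_pattern_ops index:

    WITH RECURSIVE reachable(node_id) AS (
        SELECT 42::bigint                       -- the origin node (example id)
      UNION
        -- for each placement of an already-reached node, pull in everything
        -- stored under it via a prefix range scan on path
        SELECT child.node_id
        FROM reachable r
        JOIN hierarchy parent ON parent.node_id = r.node_id
        JOIN hierarchy child  ON child.path LIKE parent.path || '/%'
    )
    SELECT node_id FROM reachable WHERE node_id <> 42;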
This was two orders of magnitude faster than a recursive CTE using strings to prevent infinite recursion.
The major disadvantage of this representation is that you can't modify it incrementally. It has to be recalculated from scratch each time a node in the graph changes. The dataset I was dealing with was 100,000s of nodes per customer and the rebuild took under a second, so that wasn't a problem. You could probably also identify changes that don't require a full rebuild (probably any change that doesn't involve a back edge or cross edge), but I never had to bother, so I didn't solve it.