It's always good to know at what point "Postgres as X" breaks down. For instance, I know from experience that Postgres as a time-series DB (without add-ons) starts to break down in the low billions of rows. It would be great to know that for graph DBs as well. I think a lot of people would prefer to just use Postgres if they can get away with it.
I've done almost exactly the kind of thing described in the article for a couple million rows. It broke down when I needed to find "friends of friends" to a depth of 6.
None of the optimizations I could come up with helped. Neither did Apache AGE.
Maybe I'm just not skilled enough to handle such a workload with Postgres. But Neo4j handled it easily.
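For anyone curious, the query shape that falls over here is roughly the following (a minimal sketch, assuming a hypothetical friendships(user_id, friend_id) edge table):

    -- Friends-of-friends to depth 6 with a plain recursive CTE. The working
    -- set grows roughly as (average friend count)^depth, which is where this
    -- kind of query falls over on a large social graph.
    WITH RECURSIVE fof AS (
        SELECT friend_id, 1 AS depth
        FROM friendships
        WHERE user_id = 42                      -- starting user (example id)
      UNION
        SELECT f.friend_id, fof.depth + 1
        FROM friendships f
        JOIN fof ON f.user_id = fof.friend_id
        WHERE fof.depth < 6
    )
    SELECT DISTINCT friend_id FROM fof;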
I'm not sure if you are serious or joking. "6 degrees of separation" is the famous radius for how many steps are needed such that everyone is connected.
Graph databases have a very narrow use case, and it's almost always in relation to people - at least in my experience.
Though the data type isn't really important for the performance question, the amount of data selected is. So a 6-level-deep graph of connections where each node only connects to 2-3 entities would never run into performance issues. You'd be able to go way beyond that too with such a narrow window. (3 connections per entity over a 6-level join comes out to 3^6 = 729 rows.)
If you're modeling something like movies instead, with >10 actors per production, you're looking at millions of rows.
We ended up using ClickHouse after trying Timescale and InfluxDB. ClickHouse is great, but it's important to spend a day or two understanding the data model to make sure it fits what you are trying to do. I have no affiliation with ClickHouse (or any company mentioned).
We’ve been using InfluxDB 1.x since 2017 and I’m itching to get off of it. Currently we’re at about a trillion rows, and we have to ship our data to Snowflake to do large-scale aggregations (like hourly averages). I put a bunch of effort into building this on top of InfluxDB and it’s not up to the task.
Any sense if ClickHouse can scale to several TB of data, and serve giant queries on the last hour’s data across all sensors without getting OOMKilled? We’re also looking at some hacked together DuckDB abomination, or pg_duck.
A trillion rows should be no issue at all for ClickHouse. Your use case sounds a bit more typical for it (our data is simulated so there is no single / monotonically increasing clock). I don't know though, is this ~100 kS/s data acquisition type stuff (e.g. sound or vibration data)? If so, it wouldn't be possible to push that into ClickHouse without pre-processing.
Yes, it is a columnar DB. A lot of things feel pretty familiar since it has a SQL-like query language. The data model is different, though.
The nice thing is that once you understand the data model, it becomes very easy to predict whether it will fit your use case, as there is really no magic to it.
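For a rough idea of what "understanding the data model" means in practice: almost everything hinges on the sorting key of a MergeTree table, since data is stored sorted by it and queries that filter on its prefix become range scans. A minimal sketch with made-up names:

    CREATE TABLE sensor_readings
    (
        sensor_id UInt32,
        ts        DateTime,
        value     Float64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)       -- one set of parts per month
    ORDER BY (sensor_id, ts);       -- the sparse primary index; filter on this prefix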
Timescale is definitely worth a look. Pg_partman gets you part of the way. We ended up going with bigquery for our workload because it solved a bigger bag of problems for our needs (data warehouse). It’s very hard to beat for big… queries.
I never understood the rationale behind TimescaleDB — if you’re building a time series database using row-oriented storage, you’ve already got one hand tied behind your back.
What does your testing strategy look like with bigquery? We use snowflake, but the only way to iterate and develop is using snowflake itself, which is so painful as to impact the set of features that we have the stomach to build.
Testing strategy? What’s that? I kid, but just a bit. Our use case is a data warehouse. We use DBT to build everything. Each commit is built in CI to a CI target project. Each commit gets its own hash prefixed in front of dataset names. Each developer also has their own prefix for local development. The dev and ci datasets expire and are deleted after like a week. We use data tests on the actual data for “foreign keys”, checking for duplicates and allowed values. But that’s pretty much it. It’s very difficult to do TDD for a data warehouse in sql.
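If it helps anyone, the per-commit/per-developer prefix can be wired up by overriding dbt's generate_schema_name macro; a rough sketch (the DATASET_PREFIX variable name is made up, the macro override itself is a standard dbt hook):

    -- macros/generate_schema_name.sql
    {% macro generate_schema_name(custom_schema_name, node) -%}
        {%- set base = custom_schema_name if custom_schema_name else target.schema -%}
        {#- DATASET_PREFIX is the commit hash in CI, or the developer's name locally -#}
        {{ env_var('DATASET_PREFIX', 'dev') ~ '_' ~ base | trim }}
    {%- endmacro %}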
My current headache is what to do with an actually big table, 25 billion rows of json, for development. It’s going to be some DBT hacks I think.
God help you if you want to unit test application code that relies on bigquery. I’m sure there are ways, but I doubt any of them are painless.
Interesting strategy with appending the commit hash to the dataset name. If one of those commits is known to be good and you want to “ship” it, do you then rename it?
What are you doing with that JSON? What’s the reason why you can’t get a representative sample of rows onto your dev machine and hack on that?
"Postgres as graph DB" starts to break down when you try to do serious network analysis with it, using specialized algorithms that are heavy on math - as opposed to merely using 'graphs' as the foundion of your data model, which is what graph databases mostly get used for. It's more about "what your actual use case is" than "how much data you have".
I also found this. Computing a transitive closure on a dataset that's not a pure DAG, for example, was absurdly painful in any SQL database I stored the data in. Where I could fit the data, it ended up being cheaper and much, much faster to just buy an old workstation with more than half a terabyte of RAM and do all the computations in memory using networkx and graph-tool.
I presume that for larger real-world graph datasets there are better algorithms and storage methods. I couldn't figure out neo4j fast enough, and it wasn't clear that I could map all of the stuff like block modeling into it anyway, but it would be very useful for someone to figure out a better production-ready storage backend for networkx, at least where some of the data could be cached in SQLite3.
I don’t think there’s a set, predictable level at which "Postgres as X" breaks down.
If you plan for it, for example by storing your graph in an export-friendly format, it should be OK.
Using Postgres as long as you can get away with it is a no-brainer. It’s had so many smart people working on it for so long that it really comes through. It’s nice to see the beginner/intro-level tools for it evolving quickly, similar to what has made MySQL approachable for so many years.
I had to use neo4j on a project some years ago and hated it (it didn’t help that the client forced its use for really dumb and bad reasons, wasting tons of time in the process) but Cypher is great. Only thing about it I liked. It’s very good.
Yeah Cypher was so good and made everything so intuitive. Loved working with Neo4J for a prototype and then had to face the licensing issues and huge costs and had to abandon it.
There are open source alternatives to Neo4J, and also very good, scalable open source graph databases with query languages similar to Cypher. I think we are in agreement that it is a good idea to choose a platform that scales both in compute and in all other costs, like licensing.
Not sure if everyone reading is aware, but Neo4j did end up turning Cypher into the openCypher spec, and that allowed a lot of other graph stores to adopt Cypher as a query language.
I got bitten by the same thing. We were looking at switching to Memgraph (also Cypher, but C++ rather than Java), but sadly abandoned it for other reasons ...
Unrelated to the tech, just the graph-based algorithm (recommender stuff) that we had in mind did not give us the results we were hoping for ... and other things were queued up to be tried ...
I agree with you about Cypher. For a couple of decades I was an RDF and then RDF + SPARQL advocate, but I am a fairly recent convert to property graphs and Cypher.
ltree is the one I ended up using. I don't like it much. The way it works allows you to set up a tree, but you can't easily move things around within the tree once they're in.
I forgot to say that it's also obviously only a tree, not a graph. It's still useful for some cases.
It has a very good way of querying, which is probably fast if you have very deep trees, but in the end we didn't need the deep trees, so it was chosen for the wrong reasons.
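For reference, the querying it is good at looks like this (a sketch against a hypothetical labels(path ltree) table with a GiST index on path):

    -- everything in the subtree under 'top.colors'
    SELECT * FROM labels WHERE path <@ 'top.colors';
    -- pattern across levels: any 'red' label anywhere under 'top'
    SELECT * FROM labels WHERE path ~ 'top.*.red'::lquery;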
I've been a big fan of graph databases for about 10 years. I even wrote my own "in-memory graph storage" in Golang for a specific use case that none of the big graph databases could cover at the time.
That said - I WISH people would embrace the existing graph databases more and push hosting providers to support them as standard, rather than abusing existing relational databases for graph purposes.
And to make it clear, I'm not talking about my own experimental one; I mean stuff like Neo4j, OrientDB, etc.
> rather than abusing existing relational databases for graph purposes.
In my experience, the vast majority of graphs can be embedded in relational databases just fine and most people don't want general graph querying. People just don't like optimizing queries (or equivalently the schema to enable such queries).
I personally have never seen a pitch for graph databases that makes them seem attractive for more than data exploration on your local machine.
Yep, something like adjacency lists have solved a lot of dumb recursive queries/"graph queries" for me, usually because the graph doesn't change much and is just a few nodes deep.
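For anyone unfamiliar, the adjacency-list pattern is nothing fancier than each row pointing at its parent, plus a short recursive CTE to walk a few levels (table and column names here are made up):

    CREATE TABLE categories (
        id        bigint PRIMARY KEY,
        parent_id bigint REFERENCES categories(id),
        name      text NOT NULL
    );
    CREATE INDEX categories_parent_idx ON categories (parent_id);

    -- fetch a node and everything below it
    WITH RECURSIVE subtree AS (
        SELECT id, name FROM categories WHERE id = 1
      UNION ALL
        SELECT c.id, c.name
        FROM categories c
        JOIN subtree s ON c.parent_id = s.id
    )
    SELECT * FROM subtree;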
Also, the last time I checked Neptune, it could only be run in AWS; there was no public application image for local tinkering/dev. But that was some years ago, to be fair - it may have changed.
I started with OrientDB, then switched to Neo4j. OrientDB was good for starting, tbf, but we're talking 10+ years back. Now I would definitely default to Neo4j.
That's like saying: why do we need JSON when we can also represent it in a text file, even though accessing values is trickier?
Sure, you can represent a graph DB in an RDBMS - but graph DBs are optimized for their specific use case. Locking of resources, for example, is optimized for that purpose. When going with an "Edge" and "Node" table approach, locking is by default rather unoptimized in many RDBMSes. Just one simple example.
> That's like saying: why do we need JSON when we can also represent it in a text file, even though accessing values is trickier?
If your text-processing stack works well enough, and JSON only solves edge cases, then I think it's a legitimate argument against adding another tech to the stack.
It's not like everyone immediately switched over to JSON, even if they could have.
And the analogy may not be the best - in the case of graph databases, they don't do what an RDBMS can do, at least not as efficiently.
So if we need some graph functionality in our RDBMS, do we split the database and use a "real" graph database, or just incorporate the graph functionality into the existing DB? The choice seems quite easy to me, because there is a whole host of issues with splitting DBs (duplication of information, inconsistency, etc.).
Well, I took the best example that came to mind at that moment. It may not be the best analogy, but it is what it is.
My point was not about people enriching an existing RDBMS setup with graph functionality; it was about using an RDBMS purely as a graph database. So maybe I wasn't exact enough in my definition. Therefore:
"If your purpose is to use the benefits of a graph and you want to use it for that purpose only, use a graph DB; don't take an RDBMS and make it a graph DB."
The issue with ltree is that it requires fanout on write when moving a node: you have to update every recursive descendant of the moved node to rewrite its ancestor path for the new location in the tree, so a move is O(subtree nodes). In exchange, you get to scan a subtree using a btree prefix search, which is great.
That makes it good for relatively static and flat data like your label hierarchy, which probably receives a new label or a reparenting within the tree relatively infrequently, and might have a p95 depth of 10. It's also a good fit for actual human (parent, child) relationships, since parent-child changes are usually infrequent and the number of children per layer of the tree is also small.
For a tree like a Notion workspace, where depth can be quite deep and layers can easily have hundreds or hundreds of thousands of siblings, the worst-case write cost of reparenting an ancestor isn’t feasible.
For something like a social graph I’m not sure how to use ltree at all, since a social graph isn’t a DAG.
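To make the fanout concrete: moving the subtree rooted at 'top.a.b' so it lives under 'top.x' looks roughly like this (a sketch against a hypothetical tree(path ltree) table), and the UPDATE touches every descendant row:

    -- rewrite the ancestor prefix of the moved node and all of its descendants
    UPDATE tree
    SET    path = 'top.x'::ltree || subpath(path, nlevel('top.a'::ltree))
    WHERE  path <@ 'top.a.b'::ltree;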
I've found the main advantage to this over a more specialized graph database is integration and maintenance alongside the rest of your application. Real world data is rarely just (or even mostly) graph data.
This is where you start to see major limitations of OLTP, and need to look at specialized graph DBs (or indexes really) that take advantage of index-free adjacency
This doesn't seem to cover infinite recursion, which is one of the ugliest parts of doing this type of query with CTEs. You effectively need to concatenate all of the primary keys you've seen into a string, and check it on each recursive iteration.
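Concretely, that guard usually ends up looking something like this (a sketch against a hypothetical edges(src, dst) table, using an array of visited keys instead of a string - same idea):

    WITH RECURSIVE walk AS (
        SELECT e.dst, ARRAY[e.src, e.dst] AS seen
        FROM edges e
        WHERE e.src = 1                        -- starting node (example id)
      UNION ALL
        SELECT e.dst, w.seen || e.dst
        FROM walk w
        JOIN edges e ON e.src = w.dst
        WHERE e.dst <> ALL (w.seen)            -- the infinite-recursion guard
    )
    SELECT DISTINCT dst FROM walk;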
The approach I came up with was to instead use MSSQL hierarchy IDs (there's a paper describing them, so they can be ported to any database). This specifically optimizes for ancestor-of/descendant-of queries. The gist of it is:
1. Find all cycles/strong components in the graph. Store those in a separate table, identifying each cycle with an ID. Remove all the nodes in the cycle, and insert a single node with their cycle ID instead (turning the graph into a DAG). The reason we do this is because we will always visit every node in a cycle when doing an ancestor-of/descendant-of query.
            +---+
  +-------> | B | ----+
  |         +---+     |
+-+-+         ^       v
| A |         |     +---+
+---+         |     | C |
              |     +---+
            +---+     |
            | D | <---+
            +---+
Becomes:
+---+     +---+
| A | --> | 1 |
+---+     +---+

+---+---+
| 1 | B |
| 1 | C |
| 1 | D |
+---+---+
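The bookkeeping for step 1 can be as small as a membership table mapping each condensed node back to the original nodes it stands for (a sketch, names made up):

    -- one row per (cycle, member); B, C and D above all map to cycle 1
    CREATE TABLE cycle_members (
        cycle_id bigint NOT NULL,
        node_id  bigint NOT NULL,
        PRIMARY KEY (cycle_id, node_id)
    );
    -- reverse lookup for expanding query results back to real nodes
    CREATE INDEX cycle_members_node_idx ON cycle_members (node_id);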
2. You effectively want to decompose the DAG into a forest of trees. Establish a worklist and place all the nodes into it. The order of this worklist can affect performance (you want the deepest/largest tree first). Grab nodes out of the worklist and perform a depth-first search, removing nodes from the worklist as you traverse them. These nodes can then be inserted into the hierarchy ID table. You should insert all child nodes of each node, even if a child has already been inserted before, but only continue traversing downwards if the node hasn't yet been inserted.
        +---+
  +---> | B | ---+
  |     +---+    |
+-+-+            v
| A |          +---+     +---+
+-+-+          | D | --> | E |
  |            +---+     +---+
  |     +---+    ^
  +---> | C | ---+
        +---+
Becomes:
        +---+     +---+     +---+
  +---> | B | --> | D | --> | E |
+-+-+   +---+     +---+     +---+
| A |
+-+-+   +---+     +---+
  +---> | C | --> | D |
        +---+     +---+
Or, as hierarchy IDs
A /1
B /1/1
D /1/1/1
E /1/1/1/1
C /1/2
D /1/2/1
Now, you do still need a recursive CTE - but it's guaranteed to terminate. The "interior" of the CTE would be a range operation over the hierarchy IDs. Basically other.hierarchy >= origin.hierarchy && other.hierarchy < next-sibling(origin.hierarchy) (for descendant-of queries). You need to recurse to handle situations involving nodes like D in the example above (if we started at C, we'd get D in the first iteration, but would need to recurse from D's other placement in order to find E).
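In Postgres terms, with the hierarchy IDs stored as strings like '/1/1/1' in a hypothetical hierarchy(node_id, path) table (one row per placement of a node), the descendant-of query comes out roughly as below; the LIKE 'prefix%' predicate is the same btree range as the >= / < next-sibling comparison, given a C collation or text_pattern_ops index:

    WITH RECURSIVE reachable(node_id) AS (
        SELECT 42::bigint                       -- the origin node (example id)
      UNION
        -- for each placement of an already-reached node, pull in everything
        -- stored under it via a prefix range scan on path
        SELECT child.node_id
        FROM reachable r
        JOIN hierarchy parent ON parent.node_id = r.node_id
        JOIN hierarchy child  ON child.path LIKE parent.path || '/%'
    )
    SELECT node_id FROM reachable WHERE node_id <> 42;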
This was two orders of magnitude faster than a recursive CTE using strings to prevent infinite recursion.
The major disadvantage of this representation is that you can't modify it incrementally. It has to be recalculated from scratch each time a node in the graph changes. The dataset I was dealing with was 100,000s of nodes per customer and the rebuild took under a second, so that wasn't a problem. You could probably also identify changes that don't require a full rebuild (probably any change that doesn't involve a back edge or cross edge), but I never had to bother, so I didn't solve it.