
It is always good to know at what point "Postgres as X" breaks down. For instance, I know from experience that Postgres as a timeseries DB (without add-ons) starts to break down in the low billions of rows. It would be great to know that for graph DBs as well. I think a lot of people would prefer just to use Postgres if they can get away with it.



I've done almost exactly the kind of thing described in the article, for a couple million rows. It broke down when I needed to find "friends of friends" at depth 6. None of the optimizations I could come up with helped. Neither did Apache AGE.

Maybe I'm just not skilled enough to handle such a workload with Postgres. But Neo4j handled it easily.
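For context, the query shape I mean is roughly the naive recursive CTE below (table and column names are made up); this is the sort of thing that fell over for me at depth 6:

  WITH RECURSIVE fof AS (
      -- depth 1: direct friends of the starting person (id 123 is hypothetical)
      SELECT f.friend_id, 1 AS depth
      FROM friendships f
      WHERE f.person_id = 123
    UNION
      -- expand one more hop, up to depth 6
      SELECT f.friend_id, fof.depth + 1
      FROM fof
      JOIN friendships f ON f.person_id = fof.friend_id
      WHERE fof.depth < 6
  )
  SELECT DISTINCT friend_id FROM fof;

Even with UNION deduplicating each iteration, the intermediate set grows combinatorially, and no amount of indexing saved it for me.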


I'm not sure if you are serious or joking. "6 degrees of separation" is the famous radius for how many steps are needed such that everyone is connected.

https://en.wikipedia.org/wiki/Six_degrees_of_separation


Except that it wasn't about people at all? I was just referring to a _type_ of query


What kind of data was it, if I may ask?

Graph databases have a very narrow usecase, and it's almost always in relation to people - at least ime.

Though the data type isn't really important for the performance question, the amount of data selected is. So a 6-level-deep graph of connections that only connect 2-3 entities would never run into performance issues. You'd be able to go way beyond that too with such a narrow window. (3-entity connections over a 6-level join come out to, I believe, ~750 rows: 3^6 = 729.)

If you're modeling something like movies instead, with >10 actors per production you're looking at millions of rows.


Network analysis of financial transactions to detect "layering" (onion-routed money laundering) paths, probably.


Too funny, enjoy the vote


I tested PostgreSQL vs Neo4j with a friends-of-friends query two years ago, and to my surprise it was Neo4j that crashed, not PostgreSQL.

https://github.com/joelonsql/graph-query-benchmarks

I haven't tried the latest versions of both databases though.


I've been running into exactly that problem. Which time series add-on would you recommend looking into?


We ended up using ClickHouse after trying Timescale and InfluxDB. ClickHouse is great, but it's important to spend a day or two understanding the data model to make sure it fits what you're trying to do. I have no affiliation with ClickHouse (or any company mentioned).
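To give a flavour of what "understanding the data model" means in practice, the design work mostly revolves around the MergeTree table definition, roughly like this sketch (invented table and column names):

  CREATE TABLE sensor_readings
  (
      sensor_id UInt32,
      ts        DateTime64(3),
      value     Float64
  )
  ENGINE = MergeTree
  -- monthly partitions; the ORDER BY key doubles as the primary index
  PARTITION BY toYYYYMM(ts)
  ORDER BY (sensor_id, ts);

Whether your typical queries filter on a prefix of that ORDER BY key is more or less what decides if ClickHouse fits your use case.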


We’ve been using InfluxDB 1.x since 2017 and I’m itching to get off of it. Currently we’re at about a trillion rows, and we have to ship our data to Snowflake to do large-scale aggregations (like hourly averages). I put a bunch of effort into building this on top of InfluxDB and it’s not up to the task.

Any sense if ClickHouse can scale to several TB of data, and serve giant queries on the last hour’s data across all sensors without getting OOMKilled? We’re also looking at some hacked together DuckDB abomination, or pg_duck.


A trillion rows should be no issue at all for ClickHouse. Your use case sounds a bit more typical for it (our data is simulated so there is no single / monotonically increasing clock). I don't know though, is this ~100KS/s data acquisition type stuff (i.e. sound or vibration data for instance)? If so, it wouldn't be possible to push that into ClickHouse without pre-processing.
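For the hourly-averages-over-the-last-hour case specifically, the query shape is what ClickHouse is built for; with a hypothetical sensor_readings(sensor_id, ts, value) table it would look something like:

  SELECT
      sensor_id,
      toStartOfHour(ts) AS hour,
      avg(value)        AS avg_value
  FROM sensor_readings
  WHERE ts >= now() - INTERVAL 1 HOUR   -- scan only the recent range
  GROUP BY sensor_id, hour;

If even that gets too heavy, the usual next step is a materialized view that pre-aggregates per hour as rows arrive.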


Interesting... But ClickHouse is an entirely different database engine, no? I will look into it, thank you!


Yes, it is a columnar DB. A lot of things feel pretty familiar, as it has a SQL-like query language. The data model is different, though.

The nice thing is, once you understand the data model it becomes very easy to predict if it will fit your use case or not as there is really no magic to it.


Timescale is definitely worth a look. Pg_partman gets you part of the way. We ended up going with bigquery for our workload because it solved a bigger bag of problems for our needs (data warehouse). It’s very hard to beat for big… queries.
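If you do want to stay on Postgres, the Timescale side of that is fairly low-ceremony; a minimal sketch (hypothetical table) looks like:

  CREATE EXTENSION IF NOT EXISTS timescaledb;

  CREATE TABLE metrics (
      time      timestamptz NOT NULL,
      sensor_id integer     NOT NULL,
      value     double precision
  );

  -- turns the plain table into a time-partitioned hypertable
  SELECT create_hypertable('metrics', 'time');

After that it's still just Postgres from the application's point of view, which is most of the appeal.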


I never understood the rationale behind TimescaleDB — if you’re building a time series database using row-oriented storage, you’ve already got one hand tied behind your back.

What does your testing strategy look like with bigquery? We use snowflake, but the only way to iterate and develop is using snowflake itself, which is so painful as to impact the set of features that we have the stomach to build.


Testing strategy? What's that? I kid, but just a bit. Our use case is a data warehouse. We use DBT to build everything. Each commit is built in CI to a CI target project. Each commit gets its own hash prefixed to the dataset names. Each developer also has their own prefix for local development. The dev and CI datasets expire and are deleted after about a week. We use data tests on the actual data for “foreign keys”, checking for duplicates and allowed values. But that's pretty much it. It's very difficult to do TDD for a data warehouse in SQL.
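For concreteness, the "data tests for foreign keys" bit is just SQL files that have to return zero rows to pass; a sketch with hypothetical model and column names:

  -- tests/orders_have_known_customers.sql
  -- dbt fails this test if the query returns any rows
  select o.order_id
  from {{ ref('orders') }} o
  left join {{ ref('customers') }} c
    on c.customer_id = o.customer_id
  where c.customer_id is null

The duplicate and allowed-value checks are the same idea, just different queries.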

My current headache is what to do with an actually big table, 25 billion rows of json, for development. It’s going to be some DBT hacks I think.

God help you if you want to unit test application code that relies on bigquery. I'm sure there are ways, but I suspect they all hurt a lot.


Interesting strategy with appending the commit hash to the dataset name. If one of those commits is known to be good and you want to “ship” it, do you then rename it?

What are you doing with that JSON? What’s the reason why you can’t get a representative sample of rows onto your dev machine and hack on that?


"Postgres as graph DB" starts to break down when you try to do serious network analysis with it, using specialized algorithms that are heavy on math - as opposed to merely using 'graphs' as the foundion of your data model, which is what graph databases mostly get used for. It's more about "what your actual use case is" than "how much data you have".


I also found this. Computing a transitive closure, for example, on a dataset that's not a pure DAG was absurdly painful in any SQL database I stored the data in. Where I could fit the data, it ended up being cheaper and much, much faster to just buy an old workstation with more than half a terabyte of RAM and do all the computations in memory using networkx and graph-tool.

I presume that for larger real-world graph datasets there are better algorithms and storage methods. I couldn't figure out Neo4j fast enough, and it wasn't clear that I could map all of the stuff like block modeling onto it anyway, but it would be very useful for someone to figure out a better production-ready storage backend for networkx, at least one where some of the data could be cached in SQLite3.
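For anyone curious why the non-DAG part hurts so much in SQL: you have to carry the visited path along to guard against cycles, roughly like this (hypothetical edges(src, dst) table):

  WITH RECURSIVE reach AS (
      SELECT e.src, e.dst, ARRAY[e.src, e.dst] AS path
      FROM edges e
    UNION ALL
      SELECT r.src, e.dst, r.path || e.dst
      FROM reach r
      JOIN edges e ON e.src = r.dst
      WHERE e.dst <> ALL (r.path)   -- cycle guard: never revisit a node
  )
  SELECT DISTINCT src, dst FROM reach;

Postgres 14 added SEARCH/CYCLE clauses that tidy up the bookkeeping, but the intermediate rows (with their path arrays) are still huge on a dense cyclic graph, so it doesn't surprise me that doing it in RAM won.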


> "Postgres as X"

This reminds me of "Just Use Postgres for Everything": https://news.ycombinator.com/item?id=33934139


Any recursive query is absurdly slow, so I think it scales wonderfully only if you match your representation to your workload.
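One concrete version of matching representation to workload: if reads are mostly reachability lookups, precompute the expensive recursion into a plain table (refreshed offline or on write) so the hot path is an ordinary index scan. A sketch with hypothetical names:

  -- precomputed reachability ("closure table"), rebuilt in a batch job
  CREATE TABLE reachability (
      src int NOT NULL,
      dst int NOT NULL,
      PRIMARY KEY (src, dst)
  );

  -- the read side is now a plain index lookup, no recursion
  SELECT dst FROM reachability WHERE src = 123;

You pay for it in write/refresh cost and storage instead.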


Do you have more details about it?

My first thought is sharding and materialized views.


I mean, there are a lot of extensions and optimizations you can pull in on a Postgres instance before you truly have to abandon it.

For timeseries, I'd argue that with Timescale there are very few use cases that ever outgrow Postgres. But I might be biased.


I don't think there's a set, predictable level at which Postgres as X breaks down.

If you plan for it, for example by storing your graph in an export-friendly format, it should be OK.

Using Postgres for as long as you can get away with it is a no-brainer. It's had so many smart people working on it for so long that it really comes through. It's nice to see the beginner/intro-level tools for it evolving quickly, similar to what has made MySQL approachable for so many years.



