What Every Competent Graph DBMS Should Do (kuzudb.com)
85 points by semihsalihoglu on Jan 12, 2023 | 35 comments



I've now been conditioned to say "ok, when can we expect the Jepsen?" when I see a new database. In this case the phrase "in-process" (and it apparently being built on top of Apache Arrow: https://github.com/kuzudb/kuzu/blob/master/external/arrow/ap... ) may make that moot, but the readme does also say "Serializable ACID transactions"


I don't quite follow the question here, but to clarify one thing: Kùzu is currently integrating Arrow not as a core storage structure but as a file format from which we can ingest data. We are writing Kùzu's entire storage layer ourselves, so it's our own design. It has three components: vanilla columns for node properties, columnar compressed sparse row (https://tinyurl.com/2r2s4wpe) join indices and relationship properties, and a hash index for the primary keys of node records. We don't use Arrow to store db files.

For serializability: yes, we support serializable transactions. So when you insert, delete, or update node and rel records, you get all-or-nothing behavior (e.g., if you roll back, none of your updates will be visible).
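
To make the all-or-nothing behavior concrete, here is a rough sketch (the exact syntax for starting and rolling back a transaction may differ depending on which API you use):

    BEGIN TRANSACTION;
    CREATE (a:Account {id: 1});
    CREATE (b:Account {id: 2});
    ROLLBACK;
    // A subsequent MATCH (a:Account) RETURN a sees neither of the
    // two records created above.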

That said, supporting ACID transactions is a completely separate design decision in DBMSs, so our (or other systems') mechanisms for supporting transactions (for example, whether they are based on write-ahead logging or not) and our storage design are largely orthogonal decisions.


As sibling comments have said, "if you roll back none of your updates will be visible" is a claim, but Jepsen is designed to test exactly such claims. The same goes for the claims of transactional behavior and of serializability.

My _personal_ opinion is that databases written in footgun languages need more formal verification of claims, not less. But I guess one also needs to take into account the size of the audience that would be impacted by any mistake when deciding whether it's worth the investment.

---

Please don't use link shorteners on HN; we're adults here and are capable of seeing long URLs, whereas link shorteners (a) die (b) obscure the destination

https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_spars...


This is actually a very good point, and unfortunately formal verification is not the current norm when developing DBMSs or complex systems software. If you asked how we know DBMS X correctly implements snapshot isolation, no one would be able to provide a lesson for you. This is true for every system out there that I'm aware of. This actually goes deeper, to questions like: "how do you know you implement SQL or Cypher correctly?" In fact, sometimes we can't even formally answer "what is the meaning of this query X?" Back in the day my PhD advisor Jennifer Widom, who had a background in programming languages, talked a lot about how we put DBMSs everywhere, e.g., in planes flying up in the air, yet we can't really precisely state what the outputs of queries should be. She had some excellent writing on this regarding the trigger implementations that were coming out of different groups. She was one of the pioneers of triggers before triggers existed in SQL. And as I assume you might know better than me, Leslie Lamport is known as a pioneer in the distributed systems space for formal verification of programs.

Your comment on shortened URLs is noted. This is my first HN post ever :), so I might need to pick up the customs.


provide a lesson -> provide a proof.


I don't think Jepsen has any relevance to embedded (in-process) databases, nor to any single-node database. The properties it's testing for are related to network partition tolerance.


But the tool that Jepsen uses to check whether a schedule is serializable would still be useful for any database that supports concurrent operations.


I agree with that. For example, Jepsen has analyzed non-replicated PostgreSQL: https://jepsen.io/analyses/postgresql-12.3

There are plenty of mistakes you can make in single-process code while implementing transactions.


Agreed.


I love seeing Cypher growing.

IMO, it's the best database query language I've used (compared with various SQL dialects, document DBs, and graph DBs).


Hard agree. Cypher is the easiest way to read complex graph operations.


SQL is getting a Property Graph Query extension that should finally match the ease of querying of a typical graph DB. That said, an RDBMS should cover the basic use cases quite well unless you are doing very exotic things with network analytics and such.


Yes, this will be interesting to see. This is the SQL/PGQ extension that's coming in 2023. On the one hand, as I articulate in my post, GDBMSs are based on relational principles, so building a system that can seamlessly query in both the relational and graph models is a good idea. On the other hand, there is this: (1) no SQL extension into a graph-based model has been successful historically (e.g., into RDF or XML or even JSON). That seems to confuse users, so sticking to a single model seems like a good idea. (2) RDBMSs that implement SQL/PGQ will not be very performant unless they change their cores with several techniques my research group and I have been advocating for (e.g., how to do predefined joins, worst-case optimal joins, factorization, etc.).

One of the core developers of Kùzu, Guodong Jin, actually wrote his PhD thesis on a project called GRainDB, which showed how to integrate predefined joins into DuckDB, so that an RDBMS can be made efficient on "graph workloads". His excellent work (https://www.vldb.org/pvldb/vol15/p1011-jin.pdf, https://www.cidrdb.org/cidr2022/papers/p57-jin.pdf) articulates a good plan for how RDBMSs can be efficient in a mixed relational-graph model.

Well, it'll be very interesting to see where SQL/PGQ goes...


Nicely written article, but it seems to conflate "what it should do" with "how it should do it". So, for example, it talks a lot about optimizations (which feel like implementation decisions).

I'm not saying they aren't important in this DB, but they might not be as relevant in another. For example, Datomic doesn't go this direction, but its ability to avoid read locks on long-running queries might dance around many optimization issues.

FWIW, I like the use of Cypher, which I first encountered in Neo4j. It has always felt pretty intuitive to me, though I don't know all of its limitations and strengths.


I tried to talk about "what it should do" at a higher level. As a comparison, if you want to read something that is similar in spirit but discusses "how GDBMSs should do" certain things, you can look at the link I have in footnote 4. This is what I call the CWI list of how GDBMSs can be architected to be competent. The CWI list's items are specific implementation and optimization "techniques", like columnar storage, vectorized processing, etc.

I can happily say that this list comes from Peter Boncz, who was an early pioneer of analytical RDBMSs, and Peter and I have similar opinions on these core architectural decisions. Kùzu currently adopts (or we are in the process of implementing) 11 of the 12 techniques there.

FYI, in my next posts, I will talk about "how a GDBMS should do" features 1 and 2 in my list, using specific query processing techniques we have implemented in Kùzu.


What I still want is a neat way to solve data deduplication. Say you've got a whole lot of tuples saying x = y, then it would be nice to be able to reason about all the equivalence classes (returning the minimum id or w/e).

This is kind of the opposite of a join (where the join is the pullback and the operation described above is more like a pushout).

Symmetric + Transitive closure gets you some of the way there, but aggregation is messy and the paths between nodes get complicated if your tuples aren't ordered nicely (say you have x -> y <- z -> w <- q -> r <- etc.)


Interesting. I think the required computation is different for the 2 different cases you have in your post.

Case 1: You have = relationships (let's call the relationship type eq), so what you need is to identify weak connectivity. If you want to find the equivalence class of a single node, say the one with primary key "foo", you can do this with a Cypher query using a Kleene star:

"MATCH (a)-[:eq*]->(b) WHERE a.pk = "foo" RETURN DISTINCT b"

This finds the "weakly connected component" of node "foo". If you want to find all equivalence classes, that's equivalent to finding all weakly connected components and you should probably call that algorithm from the algorithms library built over the GDBMS (e.g., https://tinyurl.com/mtx98c5s). You can do it in Cypher but it will be very slow.
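
Purely as an illustration of what calling such a library can look like: if it is something like Neo4j's Graph Data Science plugin (the graph and label names below are placeholders), the all-components version would be roughly:

    CALL gds.graph.project('eqGraph', 'Item', 'eq');
    CALL gds.wcc.stream('eqGraph')
    YIELD nodeId, componentId
    RETURN gds.util.asNode(nodeId).pk AS pk, componentId
    ORDER BY componentId;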

Case 2: You have <= relationships, so the relationships are not symmetric, and you want to find strong connectivity of nodes. You could do this for a single node with a Cypher query (a rough sketch is below), but for all nodes you can again call the strongly connected components algorithm from the algorithms library of the GDBMS.
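
A rough sketch of the single-node version, keeping only the nodes b that are reachable from "foo" and from which "foo" is reachable back:

    MATCH (a)-[:eq*0..]->(b), (b)-[:eq*0..]->(a)
    WHERE a.pk = 'foo'
    RETURN DISTINCT b;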

You can express some of these natively in Datalog, but if you implement it yourself in the system's native query language, the computation is likely to be quite slow. So it might still be better to call a specialized algorithm that's implemented in a library on top of the DBMS you are using.

Hope this helps.


Great read, thank you. Also nice that you brought in RDF databases to help expand and refine the problem space. When I come across a graph database topic here at HN and it leads with an LPG approach, 99% of the time, there is hardly a mention of RDF. That leaves everyone a little worse off.


I agree. As I say in the blog, there is no perfect data model. RDF is a great fit for certain things, e.g., storing DBpedia-like datasets and doing "automatic" advanced OWL reasoning over them by implementing OWL rules. For example, if a record t is of type Y and there is a rule that says every Y is also a Z, then by OWL reasoning rules t is also of type Z. So if a user scans all Z's, a system that does automatic OWL reasoning should return record t. RDFox is an interesting system that attempts this and has a niche space. Overall, the semantic web community has thought deeply about these types of problems and pioneered these types of computations.

Property GDBMSs can do such reasoning manually if you ask questions with Kleene stars, but to do it automatically they would have to implement OWL rules, an "OWL rule" data type, a URI data type, etc. There is no other way I'm aware of. I am convinced that with the right architecture this should be possible, and if we see interest in this, we might work on it in Kùzu. But currently we definitely plan to support URIs as first-class data types, so at least common queries over URI-heavy datasets and manual reasoning can be done efficiently, and that should cover most use cases for applications that want to model their data as triples of URIs.
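
As a concrete illustration of what I mean by manual reasoning (all table and property names here are made up), if subclass relationships are themselves stored as edges, a Kleene-star query can emulate the subclass inference from my example above:

    // Everything that is a Z, either directly or through a chain of
    // subClassOf edges (t is of type Y, Y subClassOf Z => t is a Z).
    MATCH (t)-[:isOfType]->(:Class)-[:subClassOf*0..]->(z:Class)
    WHERE z.uri = 'http://example.org/Z'
    RETURN DISTINCT t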


I made the mistake of learning Neo4j with Cypher before learning SQL, and every interaction with SQL since feels like I'm using some outdated monster


If it makes you feel any better, I don't know Cypher and SQL still feels like an outdated monster.


Not just SQL but MongoDB and Elasticsearch also seem horrible by comparison. The only other query language that IMHO is as intuitive as Cypher is Splunk.


Every time I need to do a nested sub-select, a little part of me dies.


Just one question for the author: why does this system feel more like a relational system than a graph database system? You ask users to define a schema (which I think Neo4j doesn't do), and I also think the concept of the rel table that I find in your blog is not present in any other graph DB.


I think it's a mistake for any DBMS not to support a schema. I bet this hurts Neo4j a lot, and there is no question that it is a mistake. In fact, some GDBMSs, including Kùzu and TigerGraph, support a schema; I think Memgraph does too, though I might be wrong. A schema allows systems to do many core computations efficiently, most importantly scans of data, which are the backbone operation of DBMSs. In fact, because of this, every semi-structured graph model has historically been extended with a schema.

In practice, if you want DBMSs to be performant, you need to structure your data. It's one thing to optionally support a semi-structured model, which is, for example, great for building initial proofs of concept when you want to develop something quickly. It's another thing not to support putting a structure on the data at all, which you'll want when you finally take your application to production and care about performance.
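
To make this concrete, here is roughly what putting structure on your data looks like in Kùzu's DDL (a simplified sketch; the column names are just for illustration):

    CREATE NODE TABLE Account(id INT64, owner STRING, PRIMARY KEY (id));
    CREATE REL TABLE Transfer(FROM Account TO Account, amount DOUBLE);

With the schema known up front, scans of Account and Transfer records can be laid out and executed as efficiently as scans of relational tables.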


I realized I forgot to complete my sentence here: "...because of this, every semi-structured graph model has historically been extended with a schema." Examples include XML and RDF. More relevant to this discussion: there is an ongoing effort to define a graph schema in GQL, which is an industry-academia effort including, I think, all the major players: Neo, Tiger, Oracle, etc. (https://www.gqlstandards.org/home).

You can search for this on the link: "GQL will incorporate this prior work, as part of an expanded set of features including regular path queries, graph compositional queries (enabling views) and schema support."


A schema isn’t only for performance, but also for constraints. In Neo4j, people often add redundant data for performance (for example, if you often need the number of nodes up to 3 hops away), or they want a certain type of link not to have cycles, but there’s little to no help from the system in maintaining those constraints.

Whenever I hear the claim any “noSQL” store is so much simpler than a SQL database, I mention performance (efficient storage and querying) and constraints and then say “give it 50 years”.


The lack of a schema does hurt Neo4j's performance. Properties are stored "willy nilly" on a linked list of bytes per node/relationship. There is no ordering, and an "age" property can be 45, 38.5, "adult", [18,19], false... which makes a terrible mess when aggregating, sorting, filtering, searching, etc.


In the MATCH clause, I don't see the foreign key. How do those tables know how to join with each other?


When you insert the Transfer records in my example, you indicate that they are "edges/relationships" between "Account" node records. The system implicitly interprets the values in those records as foreign keys to the Account records. This is what I mean by "predefined joins" in my example. When you ingest your relationship records, you predefine a join for the system, and the GDBMS uses those Transfer records to join node records.
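
Schematically, using the Account/Transfer example (the property names here are illustrative), the join is never spelled out with foreign keys in the query; the relationship pattern is the join:

    // Transfer records were ingested as edges between existing Account
    // nodes; their endpoints act as the implicit foreign keys.
    MATCH (a:Account)-[t:Transfer]->(b:Account)
    WHERE a.id = 123
    RETURN b.id, t.amount;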

Hope this helps.


In GDBMSs, there is the concept of "predefined joins", where you define a relationship edge that directly connects 2 nodes. I don't know how these guys do it, but at least in Neo4j that's the concept.

Edit: just saw their blog. They ask you to define a "rel table" that defines all the joins, and they load it from there.


What graph DB do the cool kids use today to start dabbling in the art?


We are building one (https://github.com/kuzudb/kuzu)! :)


What are your thoughts on EdgeDB?


I can't say anything very insightful, as I don't know it closely. My shallow understanding is that it is a modern JSON/XML-like DBMS supporting a nested-object (tree-like) model and the path-expression-based query languages of such systems. If that's accurate, such models are graph-based as well, so the feature set in my blog would largely apply to them too.





