I think this is a cop-out. Someone in the 1990s could have written the same thing about collections, or dictionaries, but we eventually came up with good-enough compromises, and Python, Ruby, and JavaScript now all basically do the same thing. They don't implement literally every case, but they are good enough for "small" data, where the definition of "small" has grown to be quite large by human standards.
I think the real problem is syntax. Someone needs to come up with a textual graph literal that fits in source code, and it'll be good enough for the small case. Switch to a custom ubergraph when you exceed, idk, ten million entries.
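To sketch what I mean: even today you can fake the small case with a plain dict, so the literal mostly needs to be sugar over something like this (a hypothetical shape, nothing standard):

```python
# A hypothetical "graph literal": an adjacency map with edge attributes.
follows = {
    "alice": {"bob": {"since": 2019}, "carol": {}},
    "bob":   {"carol": {"since": 2021}},
    "carol": {},
}

print("bob" in follows["alice"])  # edge lookup
print(list(follows["bob"]))       # neighbors of "bob"
```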
Two problems I see here, based on the research I've done in high-performance graph algorithms:
- It's hard to find a "good-enough" graph implementation. The best hashtable is only a handful of percent better than the built-in ones. The best graph impl is 1000x or more better than any generic built-in one could be, so there's much more incentive to specialize (and people already specialize hashtables for just a handful of percent speedup!)
- The baseline complexity level of implementing a reasonable hashtable is fairly high, even if for a small dataset. The baseline complexity of implementing a graph algorithm for a small dataset is pretty low, and the real problems come in later / at larger scale (see the BFS sketch below this list). So in graphs there's less incentive to learn a complex library's API when "I could just hack it myself," unlike for hashtables, where the API is simple and doing it myself is much harder.
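To illustrate the second point: a workable BFS over a small in-memory graph is only a few lines (a minimal Python sketch), which is exactly why "hack it myself" is so tempting:

```python
from collections import deque

def bfs(adj, start):
    """Visit order of a breadth-first traversal over a
    dict-of-lists adjacency map, e.g. {0: [1, 2], 1: [], 2: []}."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj.get(v, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order
```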
> The baseline complexity level of implementing a reasonable hashtable is fairly high, even if for a small dataset.
I would disagree with this; it's actually really easy to make one if you're willing to do away with many features (which aren't essential, but provide performance benefits). Implementing one is just something you never have to do in most modern languages.
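To make that concrete, here's a minimal sketch of the stripped-down kind of table I mean: fixed capacity, open addressing with linear probing, no resizing, no deletion, and it assumes the table never fills.

```python
class MiniMap:
    """Toy hash table: open addressing with linear probing.
    No resizing, no deletion -- the non-essential features stripped."""

    def __init__(self, capacity=64):
        self.slots = [None] * capacity  # each slot: None or (key, value)

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)  # assumes a free slot exists
        return i

    def put(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def get(self, key, default=None):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot is not None else default
```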
The work you cited is very impressive and very welcome.
But you seem to be implying that `std::unordered_map` is the default choice one would use, which in my experience is not accurate -- it is well-known to have serious perf shortcomings, and everyone I know uses some other implementation by default. Even so, the delta from `std::unordered_map` to the improved hashtable in the blog post is impressive, and just shy of 10x.
Graph algorithms frequently have 10x improvements from one state-of-the-art approach to the next -- for example, here's one from my own research[1]. The delta between state-of-the-art and "good default" in graph algorithms would often be around 100-1000x. And comparing state-of-the-art to the equivalent of an `std::unordered_map` would be another 10-100x on top of that, so 1000-100000x total.
Whoa, thank you for sharing. I only knew that, just like dictionaries, there are quite a few implementation choices when making a graph, depending on what operations the algorithm needs to do often and how sparse the data is.
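For concreteness, the two textbook choices (a minimal sketch):

```python
# The same directed graph (edges 0->1 and 0->2) two ways.

# Adjacency list: space proportional to edge count; good for sparse
# graphs and for iterating over a vertex's neighbors.
adj_list = {0: [1, 2], 1: [], 2: []}

# Adjacency matrix: O(1) edge test, O(V^2) space; better when dense.
adj_matrix = [
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
]

assert (2 in adj_list[0]) == bool(adj_matrix[0][2])
```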
I once looked at building a system where RDF graphs (roughly like a JSON document) get passed over the "lines", but found that the heavy hitters in this space believed this sort of product has to use columnar execution to be "fast enough". You can certainly build something that can do operations on a dynamic RDF graph (really a set of triples), and in principle you could compile code that treats native data structures as if they were in RDF... You might get pretty good speed, but it won't be as fast as native, and it's hard to make it easier to code for than native.
RDF is not dependent on XML. There is an XML representation of RDF, but alternatives include JSON (using JSON-LD) and simple textual formats such as N3 and Turtle.
One trouble w/ RDF is that there are two official ways to do ordered collections: the RDF list, which is basically a LISP list, and RDF containers (rdf:Seq and friends), where the order is encoded in numbered predicates. Neither is well-supported in most tools; for instance, SPARQL lacks the list-handling capabilities you'd see in LISP.
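To show the two side by side (a sketch using rdflib; the ex: namespace is made up):

```python
from rdflib import Graph

# Two "official" ways to order things in RDF, written in Turtle.
ttl = """
@prefix ex:  <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# 1. rdf:List (LISP-style cons cells; Turtle's ( ... ) sugar):
ex:doc ex:authors ( ex:alice ex:bob ) .

# 2. rdf:Seq container: the order lives in numbered predicates.
ex:doc2 ex:authors [ a rdf:Seq ; rdf:_1 ex:alice ; rdf:_2 ex:bob ] .
"""

g = Graph()
g.parse(data=ttl, format="turtle")
print(len(g))  # the ( ... ) form expands into rdf:first/rdf:rest triples
```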
The XMP spec, for instance, hacks Dublin Core by adding ordering information because... it matters what order the authors are in. Dublin Core, on the other hand, seems to have been developed for cataloging elementary school libraries; they were huge fans of doing the easy stuff and leaving out anything moderately hard, so Dublin Core looks like it has a 1968 level of sophistication and MARC looks so much more 2000s. People come to RDF, see all these problems being ignored, and conclude RDF is not for them.
While there are (commonly used) alternative serialization formats, RDF is married to the XML data model: all of RDF's core datatypes are defined in terms of XSD (XML Schema Definition) datatypes, and RDF very closely mimics XSD's data model.
If you want to have an RDF that is independent of that (e.g. based on Apache Arrow so that it's compatible with modern big data tooling) you might as well start from scratch.
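To illustrate how deep the coupling goes (a minimal rdflib sketch; the ex: namespace is made up):

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()
g.add((EX.thing, EX.count, Literal(42)))

# Even a plain Python integer becomes an XSD-typed literal:
print(g.serialize(format="nt"))
# <http://example.org/thing> <http://example.org/count>
#   "42"^^<http://www.w3.org/2001/XMLSchema#integer> .
```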
And some languages (e.g. Java, Scala) have standard libraries with interfaces that describe ordered collections, dictionaries, etc., but offer multiple implementations, so the programmer can pick based on their specific considerations while still benefiting from library code written around the interface.
It also uses hint/marker interfaces (e.g. `RandomAccess`) so the library code can choose among algorithms internally while still presenting a uniform API to the client.
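Python has no `RandomAccess`, but you can sketch the same capability-based dispatch (a loose analog, not Java's actual mechanism):

```python
from bisect import bisect_left
from collections.abc import Sequence

def contains_sorted(xs, target):
    """Membership test over sorted data: binary search when cheap
    indexing is available, linear scan otherwise -- in the spirit of
    how java.util.Collections picks strategies via RandomAccess."""
    if isinstance(xs, Sequence):           # supports len() and indexing
        i = bisect_left(xs, target)
        return i < len(xs) and xs[i] == target
    return any(x == target for x in xs)    # e.g. a generator
```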
As somebody who's actually done the work, I don't think you've really spent time with the question. The problem is that graphs, in general, aren't special. They're a structural dumping ground.
Data structures are specialized graph representations. The source code you write is in a specialized graph representation. For the overwhelming majority of programmers, the fact that something can be viewed as a graph is much like saying "oh well that file is just a really long sequence of bits, so it's effectively an integer." It's not wrong, but is it useful?
We've got a couple common graph literals today: Graphviz's "dot" language [0] and the Mermaid language (in part inspired from "dot") [1]. Mermaid is increasingly common in documentation today, given Github supports it almost everywhere Markdown is supported. Parsing dot or Mermaid as embedded literals in another language doesn't seem that complicated (Mermaid's parser is written in JS and already runs anywhere you can run a JS parser).
Though one interesting thing to note here is that both languages are edge list languages and are optimized for algorithms that are useful for edge lists, particularly display/diagramming. That gets back to the article's point that there are other useful representations and those can matter for efficiency of algorithms.
They can also matter for the efficiency of a graph literal in a source document. Edge lists are great for sparse graphs, but if you have a dense graph it might be easier to write it as a literal in the form of a massive adjacency-matrix spreadsheet. Especially if it can just be a spreadsheet with modern affordances like scrolling and sticky rows/columns. There's no perfect answer for "every" graph.
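As a quick illustration of how simple the edge-list style is to parse (a rough sketch that handles only Mermaid's bare `A --> B` arrows, nothing else):

```python
import re

def parse_edges(text):
    """Collect 'A --> B' arrows (a tiny subset of Mermaid's
    flowchart syntax) into an adjacency dict."""
    adj = {}
    for m in re.finditer(r"(\w+)\s*-->\s*(\w+)", text):
        src, dst = m.groups()
        adj.setdefault(src, []).append(dst)
        adj.setdefault(dst, [])
    return adj

diagram = """
flowchart LR
  A --> B
  A --> C
  B --> C
"""
print(parse_edges(diagram))  # {'A': ['B', 'C'], 'B': ['C'], 'C': []}
```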
Same - DOT is a great way to go from zero to functional in near minimal time. It's also pretty trivial to generate since it's not order dependent, which is great for lightweight automation.
Fine-tuning layout can be a real hassle though, sadly. I haven't found any quick tools for that yet.
The way I approach fine-tuning DOT layout is to add subgroups where it seems appropriate, give the subgroup a 1px border, and see where the layout engine situates it next to other nearby vertices/edges. Sometimes I may have to put a border around a few subgroups, then adjust the size of vertices and the overall area to nudge it to a local minimum. Note: I don't attempt to adjust the size of the subgroups; I'm not sure that even works anyway, though maybe it depends on the choice of layout algorithm, too. Padding and other styling on the vertex box may help, too. It's been a few years for me, tbh.
Deciding where the appropriate subgroups are is a bit of an art. Sometimes it's obvious, as in bipartite graphs that are intentionally bipartite. Or, if there is a staged layout like for pipeline architectures. Sometimes it's not obvious even when it seems it should be, like when graphviz really wants to make a certain edge really short. Be ready to backtrack sometimes. Then I usually remove the subgroup border after I'm done, but a few times they have been useful to leave there.
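Roughly what those temporary borders look like in DOT (a small sketch, wrapped in a Python string; the attribute choices are just my habit):

```python
# DOT sketch: put a temporary 1px border on each subgroup (cluster) to
# see where the layout engine places it; delete the border when done.
dot_source = """
digraph G {
  subgraph cluster_stage1 {   // names must start with "cluster"
    color=black; penwidth=1;  // the temporary border
    a; b;
  }
  subgraph cluster_stage2 {
    color=black; penwidth=1;
    c; d;
  }
  a -> c; b -> d; a -> d;
}
"""
print(dot_source)  # render with e.g. `dot -Tsvg`
```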
One thing I really like about DOT: hyperlinks added to the vertices and edges translate decently into the compiled output. I had an oncall dashboard that made liberal use of this feature that I still think back on fondly sometimes.
Can you point to where I can learn more about this?
E.g., does an example and explanation exist for hyperlink embedding?
I am especially interested in syntax suitable for creating input to https://www.viz-js.com and producing SVGs with embedded hyperlinks.
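From what I've been able to dig up, Graphviz's URL attribute seems to be the mechanism: the SVG output wraps nodes/edges in links. A minimal, untested sketch (in a Python string; paste the contents into https://www.viz-js.com and export as SVG; the link targets are placeholders):

```python
# Graphviz turns the URL attribute on nodes/edges into hyperlinks when
# rendering to SVG.
dot_source = """
digraph G {
  a [label="Graphviz docs", URL="https://graphviz.org/docs/attrs/URL/"];
  b [label="viz-js",        URL="https://www.viz-js.com"];
  a -> b [URL="https://example.com/edge-target"];
}
"""
print(dot_source)
```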
If you're using subgraphs, with an edge across subgraph boundaries, it is slightly order dependent - a node will appear in the area it was first mentioned in. If you define all the nodes/subgraphs first and all the edges at the bottom you'd never notice this though.
For a three-node graph with edges between vertices 0 and 1, and 0 and 2, plus vertex labels 'A' and 'B' and edge labels 'C' and 'D': not great to parse (as I sadly found), but possible to read.
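For illustration, one hypothetical shape such a literal could take (not the actual syntax referred to above):

```python
# Three vertices; edges 0-1 and 0-2; vertex labels 'A'/'B';
# edge labels 'C'/'D'. Purely illustrative.
graph = {
    "vertices": {0: "A", 1: "B", 2: None},   # third vertex unlabeled
    "edges": [(0, 1, "C"), (0, 2, "D")],
}
```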
> matrix multiplication and permanents are known to be non-cheap to compute, requiring worse-than-quadratic time! So any sort of good graph algorithm must dig deeper and be more specialized for the task at hand