I think this is a cop-out. Someone in the 1990s could have written the same thing about collections, or dictionaries, but we eventually came up with good-enough compromises, and Python, Ruby, and JavaScript now all basically do the same thing. They don't implement literally every case, but they are good enough for "small" data, where the definition of "small" has grown to be quite large by human standards.
I think the real problem is syntax. Someone needs to come up with a textual graph literal that fits in source code, and it'll be good enough for the small case. Switch to a custom ubergraph when you exceed, idk, ten million entries.
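To sketch what I mean: even today you can fake the small case with a plain dict, so the literal mostly needs to be sugar over something like this (a hypothetical shape, nothing standard):

```python
# A hypothetical "graph literal": an adjacency map with edge attributes.
follows = {
    "alice": {"bob": {"since": 2019}, "carol": {}},
    "bob":   {"carol": {"since": 2021}},
    "carol": {},
}

print("bob" in follows["alice"])  # edge lookup
print(list(follows["bob"]))       # neighbors of "bob"
```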
Two problems I see here, based on the research I've done in high-performance graph algorithms:
- It's hard to find a "good-enough" graph implementation. The best hashtable is only a handful of percent better than the built-in ones. The best graph impl is 1000x or more better than any generic built-in one could be, so there's much more incentive to specialize (and people already specialize hashtables for just a handful of percent speedup!)
- The baseline complexity level of implementing a reasonable hashtable is fairly high, even if for a small dataset. The baseline complexity of implementing a graph algorithm for a small dataset is pretty low, and the real problems come in later / at larger scale (see the BFS sketch below this list). So in graphs there's less incentive to learn a complex library's API when "I could just hack it myself," unlike for hashtables, where the API is simple and doing it myself is much harder.
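To illustrate the second point: a workable BFS over a small in-memory graph is only a few lines (a minimal Python sketch), which is exactly why "hack it myself" is so tempting:

```python
from collections import deque

def bfs(adj, start):
    """Visit order of a breadth-first traversal over a
    dict-of-lists adjacency map, e.g. {0: [1, 2], 1: [], 2: []}."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj.get(v, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order
```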
> The baseline complexity level of implementing a reasonable hashtable is fairly high, even if for a small dataset.
I would disagree with this; it's actually really easy to make one if you're willing to do away with many features (which aren't essential, but provide performance benefits). Implementing one is just something you never have to do in most modern languages.
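To make that concrete, here's a minimal sketch of the stripped-down kind of table I mean: fixed capacity, open addressing with linear probing, no resizing, no deletion, and it assumes the table never fills.

```python
class MiniMap:
    """Toy hash table: open addressing with linear probing.
    No resizing, no deletion -- the non-essential features stripped."""

    def __init__(self, capacity=64):
        self.slots = [None] * capacity  # each slot: None or (key, value)

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)  # assumes a free slot exists
        return i

    def put(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def get(self, key, default=None):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot is not None else default
```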
The work you cited is very impressive and very welcome.
But you seem to be implying that `std::unordered_map` is the default choice one would use, which in my experience is not accurate -- it is well-known to have serious perf shortcomings, and everyone I know uses some other implementation by default. Even so, the delta from `std::unordered_map` to the improved hashtable in the blog post is impressive, and just shy of 10x.
Graph algorithms frequently have 10x improvements from one state-of-the-art approach to the next -- for example, here's one from my own research[1]. The delta between state-of-the-art and "good default" in graph algorithms would often be around 100-1000x. And comparing state-of-the-art to the equivalent of an `std::unordered_map` would be another 10-100x on top of that, so 1000-100000x total.
Whoa, thank you for sharing. I only knew that, just like dictionaries, there are quite a few implementation choices when making a graph, depending on what operations the algorithm needs to do often and how sparse the data is.
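For concreteness, the two textbook choices (a minimal sketch):

```python
# The same directed graph (edges 0->1 and 0->2) two ways.

# Adjacency list: space proportional to edge count; good for sparse
# graphs and for iterating over a vertex's neighbors.
adj_list = {0: [1, 2], 1: [], 2: []}

# Adjacency matrix: O(1) edge test, O(V^2) space; better when dense.
adj_matrix = [
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
]

assert (2 in adj_list[0]) == bool(adj_matrix[0][2])
```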
I once looked at building a system where RDF graphs (roughly like a JSON document) get passed over the "lines", but found that the heavy hitters in this space believed this sort of product has to use columnar execution to be "fast enough". You can certainly build something that can do operations on a dynamic RDF graph (really a set of triples), and in principle you could compile code that treats native data structures as if they were in RDF... You might get pretty good speed, but it won't be as fast as native, and it's hard to make it easier to code for than native.
RDF is not dependent on XML. There is an XML representation of RDF, but alternatives include JSON (using JSON-LD) and simple textual formats such as N3 and Turtle.
One trouble w/ RDF is that there are two official ways to do ordered collections: the RDF list, which is basically a LISP list, and RDF containers (rdf:Seq and friends), where the order is encoded in numbered predicates. Neither is well-supported in most tools; for instance, SPARQL lacks the list-handling capabilities you'd see in LISP.
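To show the two side by side (a sketch using rdflib; the ex: namespace is made up):

```python
from rdflib import Graph

# Two "official" ways to order things in RDF, written in Turtle.
ttl = """
@prefix ex:  <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# 1. rdf:List (LISP-style cons cells; Turtle's ( ... ) sugar):
ex:doc ex:authors ( ex:alice ex:bob ) .

# 2. rdf:Seq container: the order lives in numbered predicates.
ex:doc2 ex:authors [ a rdf:Seq ; rdf:_1 ex:alice ; rdf:_2 ex:bob ] .
"""

g = Graph()
g.parse(data=ttl, format="turtle")
print(len(g))  # the ( ... ) form expands into rdf:first/rdf:rest triples
```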
The XMP spec, for instance, hacks Dublin Core by adding ordering information because... it matters what order the authors are in. Dublin Core, on the other hand, seems to have been developed for cataloging elementary school libraries; they were huge fans of doing the easy stuff and leaving out anything moderately hard, so Dublin Core looks like it has a 1968 level of sophistication and MARC looks so much more 2000s. People come to RDF, see all these problems being ignored, and conclude RDF is not for them.
While there are (commonly used) alternative serialization formats, RDF is married to the XML data model: all of RDF's core datatypes are defined in terms of XSD (XML Schema Definition) datatypes, and RDF very closely mimics XSD's data model.
If you want to have an RDF that is independent of that (e.g. based on Apache Arrow so that it's compatible with modern big data tooling) you might as well start from scratch.
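To illustrate how deep the coupling goes (a minimal rdflib sketch; the ex: namespace is made up):

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()
g.add((EX.thing, EX.count, Literal(42)))

# Even a plain Python integer becomes an XSD-typed literal:
print(g.serialize(format="nt"))
# <http://example.org/thing> <http://example.org/count>
#   "42"^^<http://www.w3.org/2001/XMLSchema#integer> .
```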
And some languages (e.g. Java, Scala) have standard libraries with interfaces that describe ordered collections, dictionaries, etc., but offer multiple implementations, so the programmer can pick based on their specific considerations while still benefiting from library code written around the interface.
It also uses hint/marker interfaces (e.g. `RandomAccess`) so the library code can choose among algorithms internally while still presenting a uniform API to the client.
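Python has no `RandomAccess`, but you can sketch the same capability-based dispatch (a loose analog, not Java's actual mechanism):

```python
from bisect import bisect_left
from collections.abc import Sequence

def contains_sorted(xs, target):
    """Membership test over sorted data: binary search when cheap
    indexing is available, linear scan otherwise -- in the spirit of
    how java.util.Collections picks strategies via RandomAccess."""
    if isinstance(xs, Sequence):           # supports len() and indexing
        i = bisect_left(xs, target)
        return i < len(xs) and xs[i] == target
    return any(x == target for x in xs)    # e.g. a generator
```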
As somebody who's actually done the work, I don't think you've really spent time with the question. The problem is that graphs, in general, aren't special. They're a structural dumping ground.
Data structures are specialized graph representations. The source code you write is in a specialized graph representation. For the overwhelming majority of programmers, the fact that something can be viewed as a graph is much like saying "oh well that file is just a really long sequence of bits, so it's effectively an integer." It's not wrong, but is it useful?
We've got a couple common graph literals today: Graphviz's "dot" language [0] and the Mermaid language (in part inspired from "dot") [1]. Mermaid is increasingly common in documentation today, given Github supports it almost everywhere Markdown is supported. Parsing dot or Mermaid as embedded literals in another language doesn't seem that complicated (Mermaid's parser is written in JS and already runs anywhere you can run a JS parser).
Though one interesting thing to note here is that both languages are edge list languages and are optimized for algorithms that are useful for edge lists, particularly display/diagramming. That gets back to the article's point that there are other useful representations and those can matter for efficiency of algorithms.
They can also matter for the efficiency of a graph literal in a source document. Edge lists are great for sparse graphs, but if you have a dense graph it might be easier to write it as a literal in the form of a massive adjacency-matrix spreadsheet. Especially if it can just be a spreadsheet with modern affordances like scrolling and sticky rows/columns. There's no perfect answer for "every" graph.
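As a quick illustration of how simple the edge-list style is to parse (a rough sketch that handles only Mermaid's bare `A --> B` arrows, nothing else):

```python
import re

def parse_edges(text):
    """Collect 'A --> B' arrows (a tiny subset of Mermaid's
    flowchart syntax) into an adjacency dict."""
    adj = {}
    for m in re.finditer(r"(\w+)\s*-->\s*(\w+)", text):
        src, dst = m.groups()
        adj.setdefault(src, []).append(dst)
        adj.setdefault(dst, [])
    return adj

diagram = """
flowchart LR
  A --> B
  A --> C
  B --> C
"""
print(parse_edges(diagram))  # {'A': ['B', 'C'], 'B': ['C'], 'C': []}
```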
Same - DOT is a great way to go from zero to functional in near minimal time. It's also pretty trivial to generate since it's not order dependent, which is great for lightweight automation.
Fine-tuning layout can be a real hassle though, sadly. I haven't found any quick tools for that yet.
The way I approach fine-tuning DOT layout is to add subgroups where it seems appropriate, give the subgroup a 1px border, and see where the layout engine situates it next to other nearby vertices/edges. Sometimes I may have to put a border around a few subgroups, then adjust the size of vertices and the overall area to nudge it to a local minimum. Note: I don't attempt to adjust the size of the subgroups; I'm not sure that even works anyway, though maybe it depends on the choice of layout algorithm, too. Padding and other styling on the vertex box may help, too. It's been a few years for me, tbh.
Deciding where the appropriate subgroups are is a bit of an art. Sometimes it's obvious, as in bipartite graphs that are intentionally bipartite. Or, if there is a staged layout like for pipeline architectures. Sometimes it's not obvious even when it seems it should be, like when graphviz really wants to make a certain edge really short. Be ready to backtrack sometimes. Then I usually remove the subgroup border after I'm done, but a few times they have been useful to leave there.
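Roughly what those temporary borders look like in DOT (a small sketch, wrapped in a Python string; the attribute choices are just my habit):

```python
# DOT sketch: put a temporary 1px border on each subgroup (cluster) to
# see where the layout engine places it; delete the border when done.
dot_source = """
digraph G {
  subgraph cluster_stage1 {   // names must start with "cluster"
    color=black; penwidth=1;  // the temporary border
    a; b;
  }
  subgraph cluster_stage2 {
    color=black; penwidth=1;
    c; d;
  }
  a -> c; b -> d; a -> d;
}
"""
print(dot_source)  # render with e.g. `dot -Tsvg`
```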
One thing I really like about DOT: hyperlinks added to the vertices and edges translate decently into the compiled output. I had an oncall dashboard that made liberal use of this feature that I still think back on fondly sometimes.
Can you point to where I can learn more about this?
E.g., does an example and explanation exist for hyperlink embedding?
I am especially interested in syntax suitable for creating input to https://www.viz-js.com and producing SVGs with embedded hyperlinks.
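From what I've been able to dig up, Graphviz's URL attribute seems to be the mechanism: the SVG output wraps nodes/edges in links. A minimal, untested sketch (in a Python string; paste the contents into https://www.viz-js.com and export as SVG; the link targets are placeholders):

```python
# Graphviz turns the URL attribute on nodes/edges into hyperlinks when
# rendering to SVG.
dot_source = """
digraph G {
  a [label="Graphviz docs", URL="https://graphviz.org/docs/attrs/URL/"];
  b [label="viz-js",        URL="https://www.viz-js.com"];
  a -> b [URL="https://example.com/edge-target"];
}
"""
print(dot_source)
```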
If you're using subgraphs, with an edge across subgraph boundaries, it is slightly order dependent - a node will appear in the area it was first mentioned in. If you define all the nodes/subgraphs first and all the edges at the bottom you'd never notice this though.
For a three-node graph with edges between vertices 0 and 1, and 0 and 2, plus vertex labels 'A' and 'B' and edge labels 'C' and 'D': not great to parse (as I sadly found), but possible to read.
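For illustration, one hypothetical shape such a literal could take (not the actual syntax referred to above):

```python
# Three vertices; edges 0-1 and 0-2; vertex labels 'A'/'B';
# edge labels 'C'/'D'. Purely illustrative.
graph = {
    "vertices": {0: "A", 1: "B", 2: None},   # third vertex unlabeled
    "edges": [(0, 1, "C"), (0, 2, "D")],
}
```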
> matrix multiplication and permanents are known to be non-cheap to compute, requiring worse-than-quadratic time! So any sort of good graph algorithm must dig deeper and be more specialized for the task at hand