How big data carried graph theory into new dimensions (quantamagazine.org)
150 points by rbanffy on Nov 2, 2023 | 23 comments



I find these hands-on examples of hypergraphs quite informative [0][1]. The first is useful for intuiting some mathematical foundations and the second for encoding a 'time' dimension in a flat plane (although the latter is not, strictly speaking, a hypergraph).

A hypergraph example of Les Miserables provided a good demonstration of the d3-hypergraph package as well, but unfortunately the bl.ocks.org live examples aren't up anymore [2].

[0]: https://observablehq.com/@toja/hypercube-graphs

[1]: https://observablehq.com/@mootari/force-graph-trails

[2]: https://gist.github.com/AndreaSimeone/1e9ed38b46b95b7848c714...


D3 blocks can be viewed by replacing gist.github.com or bl.ocks.org with blocks.roadtolarissa.com, e.g. for your link: https://blocks.roadtolarissa.com/AndreaSimeone/1e9ed38b46b95... (the home page https://roadtolarissa.com/ has a collection of visualisations, with an RSS link for updates, and the author posts some here as 1wheel).

The simulation playback in your link "Force Graph Trails" would be useful when trying to refine and balance the forces in a force graph; the small amount of that I've done felt like juggling parameters, searching in a space with too many dimensions.


I'm somewhat reminded of Bayesian networks, which may be drawn as regular graphs with directed edges connecting a parent to a child, but which mathematically consist of one hyperedge per node, joining each child and all of its parents (in a process oddly called "moralization," though this term would suggest that polygamy is somehow moral, and that any set of parents can only have a single child... I'm not sure of the true source of the term, honestly).

Of course, Bayesian networks are acyclic, which hinders many conceptual applications.


You often do that when encoding hypergraphs: store the hypergraph as the graph of its construction.


Just wondering what is particularly immoral about the principle and practice of polygamy; and mind you, it is not a form of slavery. The usual form, a single husband with many wives, has been practiced since time immemorial and is not condemned as immoral in religious scriptures. For example, Abraham, a forefather of Christianity, Islam and Judaism, is well known to have had two wives, with Ishak and Ismail being sons from different wives. In fact I do believe that where polygamy is legally practiced the divorce rate is significantly lower, but then I'm not an expert in sociology. Don't get me wrong, divorce is not immoral or a sin; it's just that it creates a less conducive upbringing environment for the children.


In the social environment where Bayesian networks were invented, polygamy is taboo, so "moralization" struck me as an odd term for a situation like that. The rest of your comment diverges pretty far afield, but so did my aside.


I do a fair bit of work with graph data, nothing fancy -- social networks and the like. I'm starting to combine them with NLP processes, running pairwise semantic similarity search across a corpus and building graphs that way. Seems really promising, certainly fun.
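
A minimal sketch of that kind of pipeline, assuming the sentence-transformers and networkx packages (the model name and the 0.6 threshold are illustrative, not recommendations):

    import itertools
    import networkx as nx
    from sentence_transformers import SentenceTransformer, util

    docs = ["graphs model relationships",
            "hypergraphs generalize graphs",
            "it rained all day yesterday"]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(docs, convert_to_tensor=True)

    G = nx.Graph()
    G.add_nodes_from(range(len(docs)))
    for i, j in itertools.combinations(range(len(docs)), 2):
        sim = float(util.cos_sim(emb[i], emb[j]))
        if sim > 0.6:                      # keep only strong semantic links
            G.add_edge(i, j, weight=sim)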

The problem I most often come across is an apparent lack of guidance and out-of-the-box tooling on applying a time dimension to graphs. Has anyone here seen anything cool done on that front?


I have worked on graph systems with direct support for temporal relationships. Time is typically handled in the same way it is with ordinary systems that handle a lot of space-time data. These were all closed bespoke systems, as tends to be the case for high-scale graph apps. I would agree that off-the-shelf graph systems tend to have weak or nonexistent temporal relationship support.

While it may not be obvious, graph and spatial data models are often represented with the same data structures at scale, but use somewhat different algorithms over those structures. The temporal bits are pretty vanilla unless you need bitemporal support or need to find unusually complex temporal patterns.
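
As an illustration of that "vanilla" approach (my own sketch, not drawn from any of the systems mentioned above): store each edge with a validity interval and materialize a snapshot for a point in time. The schema and data here are made up; networkx is used for the graph.

    import networkx as nx

    # (src, dst, valid_from, valid_to) as half-open intervals
    edges = [
        ("a", "b", 0, 10),
        ("b", "c", 5, 15),
        ("a", "c", 12, 20),
    ]

    def snapshot(t):
        """Graph of edges whose validity interval contains time t."""
        G = nx.Graph()
        G.add_edges_from((u, v, {"valid": (lo, hi)})
                         for u, v, lo, hi in edges if lo <= t < hi)
        return G

    print(sorted(snapshot(7).edges()))   # [('a', 'b'), ('b', 'c')]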


I think space-time/spatial-time/spatiotemporal is the next frontier for AI. For spatial data you can use a GCN and for time series a Transformer, or a combination thereof. You can find many recent works in this area, especially in the ITS (intelligent transportation systems) field, for forecasting and prediction. One of the seminal papers using a GCN:

Graph Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting:

https://www.researchgate.net/publication/318316069_Graph_Con...


I agree, combining GNNs with temporal signals and dynamic graphs makes them so expressive. Another great recent paper on this topic: https://arxiv.org/abs/2310.15978


What are some interesting kinds of graphs one can build and use this way?


Anything you want to look at over time. A social network is a classic: how a person's connections evolve over time or respond to temporal covariates.

Loads of stuff can be expressed as a graph, though. I'm interested in natural language corpora as graphs, but as others have mentioned, there are neural networks here too. It's whatever you fancy these days.


The 'big data' craze did boost graph theory funding, but the whole thing is deeply flawed at the scientific level. Most of the results concern sequences of graphs that tend to infinity. It's the same as with probability: you can handwave that 'results should be close enough' provided 'the numbers are big enough'. But that isn't easy to justify; most results in graph theory are not sharp enough to provide accurate lower/upper bounds. Furthermore, in probability there is the issue of determining whether a theoretical distribution actually matches the data. With graphs, there are so many properties to look at that the question has no answer without asserting a hypothesis such as "only assortativity and node distribution matter".

One time a researcher was presenting a connectivity result "showing the small-world hypothesis", which is that for any two individuals there is a chain of at most 5 people linking them, each knowing the next on a first-name basis (formulations differ). When I asked about the logarithmic factor in the equation, he said "it's basically the same as constant".

But even more fundamentally, there are things such as economies of scale. The 'cost' of linking two nodes in a graph typically matters more when there are few edges than when there are many. So you shouldn't even consider a typical sequence of graphs, but one whose parameters change value with n, which is completely out of reach of current research.


It's not a flaw/bug, it's a feature, especially from an engineering point of view. If you think about it, an LLM is a true form of big data: not only is the data big, the memory and processing requirements are big as well, and that's the very definition of big data. Because of this big data requirement, Intel's latest Xeon CPUs have a memory fabric very close to the processor, and with the advent of CXL 3.0 this kind of high-speed memory access is going to be the norm rather than the exception.

Another example is that big data is now trending in seismic research [1]. I'm not an author of this paper, but we are currently working on offline big data for earthquakes, and the results are very promising for prediction (several hours/days in advance, i.e. not forecasting over months/years). Not only do we use recorded big data for earthquakes in many earthquake-prone countries, e.g. Turkey, but the technique we are using, namely time-frequency analysis, requires big-memory processing. Hopefully we can get funding for further research into real-time big data collection and processing based on our offline results. The main aim is to provide early detection and warning to residents ahead of an upcoming big earthquake (magnitude > 7) with very good accuracy and in a timely manner, to reduce potential casualties.

[1] Big Data Seismology:

https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2021...


I think Wolfram came up with some new theory of everything based on hypergraphs a couple of years ago; not sure what the current status is, though. I would be surprised if it ever becomes a useful theory, but hypergraphs are certainly interesting and worth studying.


No, Wolfram's theory is based on rewriting directed graphs.

Start with some seed graph. A rewriting rule is a transformation (swap) of a small-ish input graph to small-ish output. If you have rules that consume small simple inputs, and produce slightly larger outputs, then your seed can be small, and in the end, probably arbitrary. You always get a Big Bang.

Test for a (directed) subgraph isomorphism match of the input, then replace the matched subgraph with the output of the rule, which has the same directed boundary (interface) as the input. Graph Rewriting is an old subject that has a big literature developing in the 1990s [also see the Structure Mapping idea for analogy].
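
A toy sketch of a single rewriting step (my own illustration, not Wolfram's rules or code), using networkx's subgraph-isomorphism matcher: match a one-edge pattern, then splice a new node into the matched edge while keeping the same boundary nodes.

    import networkx as nx
    from networkx.algorithms import isomorphism

    G = nx.DiGraph([(1, 2), (2, 3)])        # seed graph
    pattern = nx.DiGraph([("a", "b")])      # rule input: a single directed edge

    matcher = isomorphism.DiGraphMatcher(G, pattern)
    match = next(matcher.subgraph_isomorphisms_iter())  # one match: G node -> pattern node
    inv = {v: k for k, v in match.items()}              # pattern node -> G node

    # Rule output: replace a -> b with a -> new -> b (same boundary a, b)
    a, b = inv["a"], inv["b"]
    new = max(G.nodes) + 1
    G.remove_edge(a, b)
    G.add_edges_from([(a, new), (new, b)])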

So which rewriting rules do we choose to build our universe? Well, that's a difficult decision that only a god could answer, so Wolfram on Olympus trumps them all and says let all rules apply. The number of directed graphs gets very big, very quickly, for any number of nodes.

https://oeis.org/A000273

For n=10, there are 341,260,431,952,972,580,352 graphs. The number of rules is roughly quadratic in that. Perhaps Wolfram expects the probability (rate) of application of a rule to depend inversely on its size and complexity (i.e. the computational complexity of isomorphism, not quite the same thing, but related) - almost as if the world really is a simulation, and someone has to pay the AWS bill. So maybe n = 2, 3, 4, 5, 6 is the practical limit, with a little unexpected tunneling from n > 7.

You might notice that such graph rewriting depends on the order in which you make replacements when the input matches overlap, which is quite often, in any slightly dense graph. Wolfram handles this indeterminism by letting all matches happen, and branching the output, into what he calls a multi-way graph. This leads to a mind-boggling explosion in the size of the mind-bogglingly large graph.

Later, he allows paths that converge to the same outcome to recombine. So, if you weren't mind-boggled enough by (A000273)^2 explosion of digraph rules, NP-complete subgraph-isomorphism, followed by combinatorial branching, you now have to add always-and-everywhere subgraph isomorphism to match and collapse common outcomes in the multi-way graph, such that it becomes a DAGgy multiway graph.

That is all assumed, without any explanation, or any worries about NP-completeness, or computational resources. Then he starts his physics project ...

You may have noticed that indeterminism in graph rewriting has a relationship to indeterminism in QM. Branching the multiway graph is a bit like (no really exactly like) the Many Worlds Interpretation of QM.

If you consider rewriting computational steps as clicks of your compute-clock, then time is just distance down the multiway graph. Light cones are just the incoming here-reachable-from relation (past) and reachable-from-here relation (future) regions of your space-time graph. If you slice the graph by local time distance, then you get spacelike separation and physical distance [similar to Causal Graphs and Causal Dynamical Triangulation putative QG theories].

If you look at different converging paths, you get something like the Feynman path-integral formulation of QM. You might think the path integral depends on combining complex numbers, but we haven't mentioned complex numbers yet. In Wolfram's World (TM), the magnitude comes from the number of converging paths, and the phase comes from the different time steps along those paths.

Understanding that fascinating fact was the trigger for me to get Wolfram-curious - it is very interesting that the magnitude and phase of what we think of as complex numbers can come from the number and length of converging paths in a DAG.

Then energy and momentum (all relativistic 4-vectors?) can be generated by fluxes of various paths through time-like and space-like surfaces cutting the graph. Commuting variables of quantum measurement also come from knowledge within the graph. Then you are off to the races for generating all of physics ... (according to SW) ...


Not an expert, but analysis of very large graphs is already computationally expensive.

Does generalizing to hypergraphs ease this in any way? The cost-benefit trade-off seems to be greater computational complexity for potentially unobserved insights.


Hypergraphs do two things:

- they provide a type for conclusions that span several nodes, e.g. a ring's interior is a node that represents the existence of a ring (and so caches your analysis conclusions)

- they allow for generalization by placing nodes within a cell, e.g. "people who make $50k-$150k/yr" can be a collection of people within a cell (and we can talk about aggregated edges from that cell's interior to other objects; see the sketch after this list)

A third (but useful in a different context):

- logic can be represented as hypergraphs and deduction rules on them
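
A minimal sketch of the second point, using plain networkx (the names, incomes and income band are made up): collapse a set of people into one cell and aggregate its edges to the rest of the graph.

    import networkx as nx

    G = nx.Graph([("alice", "acme"), ("bob", "acme"), ("carol", "globex")])
    incomes = {"alice": 60_000, "bob": 120_000, "carol": 300_000}

    cell = {p for p, inc in incomes.items() if 50_000 <= inc <= 150_000}

    # Count edges from the cell's interior to each outside neighbour
    agg = {}
    for p in cell:
        for nbr in G[p]:
            if nbr not in cell:
                agg[nbr] = agg.get(nbr, 0) + 1

    H = nx.Graph(G)
    H.remove_nodes_from(cell)
    H.add_node("cell:$50k-$150k")
    for nbr, weight in agg.items():
        H.add_edge("cell:$50k-$150k", nbr, weight=weight)   # aggregated edge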


For the first two, these relationships can be modelled from the derived community graph of a binary network, so I am interested to see the performance gains over the binary approach.

See for instance, the ngraph coarsen procedure that reduces communities and their interrelations to single nodes connected to other community nodes [0].

[0]: https://github.com/anvaka/ngraph.coarsen
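
A rough Python analogue of that coarsening step (not the ngraph.coarsen API; this sketch assumes networkx's Louvain community detection and a stock example graph): detect communities, then collapse each one to a single node with weighted inter-community edges.

    import networkx as nx

    G = nx.karate_club_graph()
    communities = nx.community.louvain_communities(G, seed=42)

    # Collapse each community into one node; edges between communities are
    # merged, with the number of underlying edges kept as the weight.
    membership = {v: i for i, comm in enumerate(communities) for v in comm}
    C = nx.Graph()
    C.add_nodes_from(range(len(communities)))
    for u, v in G.edges():
        cu, cv = membership[u], membership[v]
        if cu != cv:
            w = C.get_edge_data(cu, cv, {"weight": 0})["weight"]
            C.add_edge(cu, cv, weight=w + 1)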


I too would be curious to see the comparison of different approaches.

But my intuition is that caching derived graphs and caching computed cells will be similar — and might even be two conceptual frameworks on equivalent data.


Not only that but there are lots of good results and algorithms on 'bipartite graphs', which are the kind that encode hypergraphs, so in practice it should work very well.


Existing graph-theoretic analysis would still apply to hypergraphs after essentially redefining the graph using coloring algorithms, no?


I'm not sure there is a neat relation between labelled (coloured) simple graphs and hypergraphs; however, there is this (from https://en.wikipedia.org/wiki/Hypergraph):

> Hypergraphs can be viewed as incidence structures. In particular, there is a bipartite "incidence graph" or "Levi graph" corresponding to every hypergraph

The relation is one-to-one, so there is a Levi graph for every hypergraph, and the reverse is also true.
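
A minimal sketch of that correspondence (the hyperedges here are arbitrary): build the bipartite Levi graph from a hypergraph, then recover the hypergraph from it.

    import networkx as nx

    hyperedges = {"e1": {"a", "b", "c"}, "e2": {"c", "d"}}

    levi = nx.Graph()
    levi.add_nodes_from(hyperedges, bipartite="hyperedge")
    for e, members in hyperedges.items():
        for v in members:
            levi.add_node(v, bipartite="vertex")
            levi.add_edge(e, v)

    # Each hyperedge is exactly the neighbourhood of its edge-node
    recovered = {e: set(levi[e]) for e in hyperedges}
    assert recovered == hyperedges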



