Can anyone working in this field comment on how it's applied? For example, what is computed? Homology groups, just the ranks of these groups, or something else? Given these, what does one learn about the data set under analysis? https://en.wikipedia.org/wiki/Topological_data_analysis#Appl... is pretty scant on detail.
There are a number of approaches; since it's still relatively new, there's a lot of "playing around" with techniques.
1) The first idea is to just see which topological features persist as the filtration/threshold parameter varies. I personally find the barcode illustration, rather than the birth-death diagram, to be more intuitive in this respect. For instance, say you had LIDAR data about a vehicle. Looking at the persistent homology of this point cloud could allow you to ascertain the size and number of windows (smaller windows would have shorter persistence than larger windows, and the number of persistent bars would correspond to the number of windows or "openings" on the vehicle). This might allow you to figure out if the vehicle is a van or an SUV.
2) For applications like those in neuroscience (I like, for instance, the talk on Youtube titled "Kathryn Hess (6/27/17) Bedlewo: Topology meets neuroscience"), the focus is instead on how these ranks behave over time. A rough sketch: look at the topology of the pattern of activation in neurons as a mouse (or an AI) learns something. As the learning process unfolds, what happens to the Betti numbers?
3) Sometimes one might want explicit generators. A thought-provoking small case of this can be found in "Mind the Gap: A Study in Global Development through Persistent Homology" https://arxiv.org/abs/1702.08593 This paper looks at statistics like GDP and infant mortality of countries around the world and finds explicit oddities in the data. It's a proof of concept, to me; I'm interested to see where it'll go.
There is a lot more, of course, but that gives you three interesting yet quite different directions.
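To make the persistence idea in (1) concrete in the simplest case, here is a toy sketch of 0-dimensional persistent homology (connected components) for a Rips filtration, done with a Kruskal-style union-find sweep. The point cloud is made up for illustration; real tools (Ripser, GUDHI, Dionysus) handle this, and higher dimensions, far more efficiently.

```python
# Toy H_0 barcode for the Rips filtration of a point cloud: every
# component is born at scale 0, and one component dies each time two
# components touch, i.e. this is exactly single-linkage clustering.
import math

def h0_barcode(points):
    """Return H_0 bars (birth, death) for the Rips filtration."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path compression
            i = parent[i]
        return i

    # Sweep all pairwise edges in order of increasing length.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # merge: one component dies at scale d
            bars.append((0.0, d))
    bars.append((0.0, math.inf))     # the last surviving component never dies
    return bars

# Two tight clusters: two short bars from intra-cluster merges, one bar
# dying at the inter-cluster gap (4.9), and one infinite bar. The two
# long-lived bars are the persistent signal: "two clusters".
print(h0_barcode([(0, 0), (0.1, 0), (5, 0), (5.1, 0)]))
```

The short bars here play the role of noise and the long bars the role of real features, which is the same judgment the barcode illustration supports visually.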
We recently applied TDA to get more information about the 'relevance' of nodes in graphs, which in turn we employed to improve graph classification. Let me know whether you want to discuss this some more!
I don't work in the field, but I did a little reading about the theory last year. (I can't say anything about how any of this is applied---I have been wondering for a while whether it does anything beyond giving really nice looking diagrams!)
The basic input is a filtered simplicial complex. There are many ways to build one, but at least in what I read there are two main constructions from point clouds: the Čech and the Vietoris-Rips complexes. Roughly, a set of points spans a simplex when the points are close together: for the Rips complex, when all pairwise distances are at most epsilon; for the Čech complex, when the epsilon-balls around the points have a common point of intersection. The filtration is indexed by the epsilons.
(From what I remember, the Čech complex is better behaved in terms of convergence properties, but the Rips complex is easier to compute with. That said, they interleave: each is sandwiched between two instances of the other in the filtration. There are some papers on detecting holes in sensor networks that exploit this fact, using Rips complexes to infer the existence of a relative 2-cycle and thereby prove a sensor network is hole-free.)
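One reason the Rips complex is easier to compute with is that membership is a purely pairwise test, while the Čech condition is a genuinely geometric one. A toy sketch of the Rips construction (pure Python, not a real TDA library; the point cloud is made up):

```python
# Toy Vietoris-Rips construction: a k-simplex is present at scale eps
# iff all pairwise distances among its k+1 vertices are <= eps. (The
# Cech complex would instead require the eps-balls to share a common
# point, a harder geometric test.)
import math
from itertools import combinations

def rips_simplices(points, eps, max_dim=2):
    """All simplices of the Rips complex at scale eps, up to max_dim."""
    n = len(points)
    close = [[math.dist(points[i], points[j]) <= eps for j in range(n)]
             for i in range(n)]
    simplices = []
    for k in range(max_dim + 1):
        for verts in combinations(range(n), k + 1):
            if all(close[i][j] for i, j in combinations(verts, 2)):
                simplices.append(verts)
    return simplices

# Right triangle with legs of length 1: the hypotenuse (length sqrt(2))
# is the last edge to appear, so the filled 2-simplex only shows up
# once eps reaches sqrt(2).
pts = [(0, 0), (1, 0), (0, 1)]
print(rips_simplices(pts, 1.0))    # three vertices plus the two short edges
print(rips_simplices(pts, 1.5))    # all faces, including (0, 1, 2)
```

Between eps = 1 and eps = sqrt(2), this complex has a 1-cycle that then gets filled in, which is exactly the kind of short-lived feature a barcode records.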
Since homology is functorial, the inclusions in the filtration induce maps on all the corresponding homology groups. Given a homological generator at a particular epsilon, one can push it forward through increasing epsilons and watch how long it persists, until, potentially, it lands in the kernel of one of these maps and dies. Conversely, any particular homological generator might or might not be in the image from a previous epsilon, and there is some value of epsilon at which the generator is born, in the sense that it is not the image of a generator from any previous epsilon.
A barcode (equivalently, a persistence diagram, which plots each interval as a (birth, death) point) is the collection of intervals representing all the homological generators across all values of epsilon, where each interval records how long its generator persists. Practitioners then judge which generators persist long enough to count as genuine topological features of the data set, rather than sampling noise.
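In the simplest form, that judgment is just a threshold on bar length. A toy illustration (both the barcode and the cutoff here are made up; choosing the cutoff is a modeling decision, not something canonical):

```python
# Treat bars whose persistence (death - birth) exceeds a cutoff as
# signal and the rest as noise.
def significant_bars(barcode, cutoff):
    """Keep intervals (birth, death) whose length exceeds cutoff."""
    return [(b, d) for (b, d) in barcode if d - b > cutoff]

# Hypothetical H_1 barcode: one long-lived loop and two short-lived ones.
bars = [(0.1, 2.5), (0.3, 0.4), (0.9, 1.0)]
print(significant_bars(bars, 0.5))   # only the long bar survives
```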
It seems most of the time the homology computations are done with field coefficients. This, in particular, implies the homology groups themselves are vector spaces, and thus by knowing the rank you know the group. So, the barcode is a perfect representation of all the homology groups.
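Here is a sketch of why field coefficients make life easy: over GF(2) each chain group is a vector space, so betti_k = dim C_k - rank(d_k) - rank(d_{k+1}), and the ranks come from Gaussian elimination. Toy code, not a real library, and the hollow-triangle complex is made up for illustration:

```python
# Betti numbers over GF(2) via boundary-matrix ranks. Chains are
# bitmask vectors; mod 2, the boundary has no signs to track.

def gf2_rank(vectors):
    """Rank over GF(2) of the span of int bitmasks."""
    rank = 0
    vectors = list(vectors)
    while vectors:
        pivot = vectors.pop()
        if pivot:
            rank += 1
            lsb = pivot & -pivot       # eliminate this bit from the rest
            vectors = [v ^ pivot if v & lsb else v for v in vectors]
    return rank

def boundary_columns(k_simplices, faces):
    """Columns of the boundary map d_k as bitmasks over the (k-1)-simplices."""
    index = {f: i for i, f in enumerate(faces)}
    cols = []
    for s in k_simplices:
        bits = 0
        for i in range(len(s)):
            bits |= 1 << index[s[:i] + s[i + 1:]]   # drop one vertex per face
        cols.append(bits)
    return cols

def betti_numbers(simplices_by_dim):
    """Betti numbers over GF(2); simplices_by_dim[k] lists the k-simplices
    as sorted vertex tuples."""
    ranks = [gf2_rank(boundary_columns(simplices_by_dim[k],
                                       simplices_by_dim[k - 1]))
             for k in range(1, len(simplices_by_dim))] + [0]
    return [len(sk) - (ranks[k - 1] if k else 0) - ranks[k]
            for k, sk in enumerate(simplices_by_dim)]

# Hollow triangle, i.e. a circle: one component, one loop.
circle = [[(0,), (1,), (2,)], [(0, 1), (0, 2), (1, 2)]]
print(betti_numbers(circle))   # [1, 1]
```

Filling in the 2-simplex (0, 1, 2) kills the loop, which is exactly a "death" event in the filtration language above.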
In a way I do not comprehend, what is "really" going on is that, since the point cloud has only finitely many points, the filtration has only finitely many discrete steps, so there is an associated spectral sequence that relates all the homology groups to each other. I think this might give a finer-grained barcode, where births and deaths sometimes come from events in higher and lower dimensions.
On the off chance you already have a complex, there is another kind of filtration to try: the one coming from sublevel sets of a function. Persistent homology techniques still apply, and I have heard of symplectic geometers using them to prove things about symplectomorphisms, where the sublevel sets come from a symplectic flow.
As an example of sublevel set persistent homology---and a point about what "birth" and "death" really mean---imagine two circles moving toward each other and then merging. Each moment of this movie is a level set of a pair of pants, with time as the height function. At the beginning, with two circles, there are two H_0 generators, one per circle, and two H_1 generators, one per circle. At the exact moment the two circles touch, there is a single H_0 generator but still two H_1 generators (the level set is a figure eight). Just after the merge, there is a single H_0 generator and a single H_1 generator for the resulting circle. The two generators of H_0 "merged" in a sense, and likewise the two of H_1. For H_0, it is arbitrary which of the two generators persisted and which died at the merge event. For H_1, it is somewhat strange: it is the sum of the two obvious generators that persists for all time, and either one of them can be taken as the other interval in the barcode, dying at the merge event.
This illustrates that the actual generators in a barcode diagram are somewhat arbitrary. It seems they are only as arbitrary as a basis for a vector space might be (though sometimes finding the "right" basis is key to understanding a pattern!).
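The sublevel-set story is easiest to compute in one dimension. Here is a toy sketch for a sampled function on a line: components of the sublevel sets are born at local minima and die at local maxima, and the standard "elder rule" resolves exactly the arbitrariness described above by killing the younger (higher-birth) component at each merge. The sample values are made up:

```python
# Toy H_0 sublevel-set persistence for a sampled 1-D function,
# via a union-find sweep in order of increasing function value.
def sublevel_h0(values):
    """Return H_0 persistence pairs (birth, death) for sublevel sets."""
    comp, birth, pairs = {}, {}, []

    def find(i):
        while comp[i] != i:
            comp[i] = comp[comp[i]]    # path compression
            i = comp[i]
        return i

    for i in sorted(range(len(values)), key=lambda i: values[i]):
        comp[i], birth[i] = i, values[i]
        for j in (i - 1, i + 1):       # neighbors on the line
            if j in comp:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Elder rule: the younger component dies at this merge.
                    young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                    pairs.append((birth[young], values[i]))
                    comp[young] = old
    pairs = [(b, d) for b, d in pairs if d > b]    # drop zero-length bars
    pairs.append((min(values), float("inf")))      # global min never dies
    return pairs

# W-shaped sample: minima at 0.0 and 0.5 separated by a bump of height 2.0.
# The younger component (born at 0.5) dies when the bump is submerged.
print(sublevel_h0([3.0, 0.0, 2.0, 0.5, 3.0]))   # [(0.5, 2.0), (0.0, inf)]
```

The choice of which bar dies at 2.0 is exactly the basis-choice arbitrariness above; the elder rule is just the conventional way to pick.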