Topic modeling with semantic graphs: a different approach

txtai · on Nov 5, 2022

Dimensionality reduction with UMAP combined with HDBSCAN is a popular topic modeling method found in a number of libraries. txtai takes a different approach with a semantic graph.

When enabled, txtai builds a semantic graph at index time as it's vectorizing data. These vector embeddings are then used to create relationships in the graph. Finally, community detection algorithms build topic clusters.

This approach has the advantage of only having to vectorize data once. It should also give better topic precision, since topics are built from the full embeddings rather than from the output of a dimensionality reduction step (UMAP).
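To make the pipeline above concrete, here is a minimal pure-Python sketch: vectors become graph edges when their cosine similarity clears a threshold, and a community detection pass groups the nodes into topics. This is not txtai's implementation. The vectors are hand-made toy data, the `minscore` value is an assumption, and simple label propagation stands in for whichever community detection algorithm txtai actually runs.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_graph(vectors, minscore=0.8):
    """Connect every pair of nodes whose similarity is at least minscore."""
    graph = {i: {} for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine(vectors[i], vectors[j])
            if score >= minscore:
                graph[i][j] = score
                graph[j][i] = score
    return graph


def label_propagation(graph, iterations=10):
    """Each node repeatedly adopts the label with the highest total edge weight
    among its neighbors. Nodes that end up sharing a label form one topic."""
    labels = {node: node for node in graph}
    for _ in range(iterations):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated node keeps its own label
            weights = Counter()
            for neighbor, score in graph[node].items():
                weights[labels[neighbor]] += score
            # Ties resolve to the smallest label so the result is deterministic
            best = max(sorted(weights), key=lambda label: weights[label])
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break

    topics = {}
    for node, label in labels.items():
        topics.setdefault(label, []).append(node)
    return sorted(topics.values())


# Toy "embeddings": rows 0-1 point one way, rows 2-3 another, row 4 is on its own
vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
    [0.0, 1.0, 0.0],
]

topics = label_propagation(build_graph(vectors))
print(topics)
```

Note that the vectors are computed once and reused for both the graph edges and the topic clusters, which is the "vectorize once" advantage described above.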

Read more here: https://neuml.hashnode.dev/introducing-the-semantic-graph

leobg · on Nov 6, 2022

I’m looking for a way to build a hierarchy of categories from a text corpus, unsupervised.

Anyone have an idea how to accomplish this?

For instance, using it on a corpus about animals, it should generate a tree like

mammals
mammals>mammals on land
birds
birds>birds of prey

In an ideal world, the system would extract both the category labels themselves and their hierarchy from the text. But even if you had to provide a list of categories, and the system merely used the text corpus to figure out their hierarchy, this would be helpful.

txtai · on Nov 6, 2022

With the method discussed here you can add a list of labels to categorize generated topics. This isn't quite what you're looking for but it does create a two level hierarchy.

Reference: https://neuml.github.io/txtai/embeddings/configuration/#topi...
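As a rough illustration of the label list mentioned above, a configuration along these lines should enable topic categories. The key names (`graph`, `limit`, `minscore`, `topics`, `categories`) reflect my reading of the txtai configuration docs and should be checked against the reference. The model path and the category values are assumptions chosen to match the animal example in this thread.

```python
# Sketch of an embeddings configuration with a semantic graph and topic categories.
# Key names are believed to match the txtai docs; verify them before relying on this.
config = {
    # Assumed vector model, any sentence embeddings model should work here
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,
    "graph": {
        "limit": 15,       # max edges per node (illustrative value)
        "minscore": 0.1,   # min similarity to add an edge (illustrative value)
        "topics": {
            # Each generated topic is assigned to the most similar category,
            # giving the two levels: category > topic
            "categories": ["mammals", "birds"],
        },
    },
}

# With txtai installed, this dict would be passed to the Embeddings constructor:
#   from txtai.embeddings import Embeddings
#   embeddings = Embeddings(config)
```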

leobg · on Nov 6, 2022

Thanks. Will look into it. If it can do two levels, why not three or more? Sounds as if there’s just a variable that needs to be changed from 2 to n. Or is there, I’m wondering, anything fundamentally different between n and 2?

txtai · on Nov 7, 2022

Currently, topic categories are used to label the generated topics, hence the single additional level.

Expanding functionality to include something like this is possible though. The community detection algorithms already do something like this: each iteration builds smaller and smaller communities.
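One way to picture an n-level hierarchy is to split each community again under a stricter similarity threshold and keep the nesting. The sketch below does that with connected components at increasing thresholds. It is an illustrative stand-in, not how txtai's community detection works, and the vectors, thresholds, and function names are all made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def components(nodes, vectors, minscore):
    """Connected components among nodes, linking pairs with cosine >= minscore."""
    remaining = set(nodes)
    groups = []
    while remaining:
        start = min(remaining)
        remaining.remove(start)
        group, stack = [start], [start]
        while stack:
            current = stack.pop()
            linked = [n for n in remaining
                      if cosine(vectors[current], vectors[n]) >= minscore]
            for n in linked:
                remaining.remove(n)
                group.append(n)
                stack.append(n)
        groups.append(sorted(group))
    return sorted(groups)


def hierarchy(nodes, vectors, thresholds):
    """Recursively split nodes: one tree level per threshold, strictest last."""
    if not thresholds:
        return []
    return [
        {"members": group, "children": hierarchy(group, vectors, thresholds[1:])}
        for group in components(nodes, vectors, thresholds[0])
    ]


# Toy 2D unit vectors placed at hand-picked angles (degrees)
angles = [0, 5, 20, 25, 80, 85]
vectors = [[math.cos(math.radians(a)), math.sin(math.radians(a))] for a in angles]

# A loose threshold for top-level groups, a strict one for subgroups
tree = hierarchy(list(range(len(vectors))), vectors, [0.85, 0.98])

for top in tree:
    print(top["members"], "->", [child["members"] for child in top["children"]])
```

Adding a third threshold adds a third level, so nothing here is fundamentally tied to two. The harder part, which this sketch does not address, is naming each level.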