Dimensionality reduction with UMAP combined with HDBSCAN is a popular topic modeling method found in a number of libraries. txtai takes a different approach with a semantic graph.
When enabled, txtai builds a semantic graph at index time as it's vectorizing data. These vector embeddings are then used to create relationships in the graph. Finally, community detection algorithms build topic clusters.
This approach only has to vectorize the data once. It can also give better topic precision, since there is no dimensionality reduction step (UMAP) in between.
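To make the pipeline above concrete, here is a minimal pure-Python sketch of the same idea: vectors become graph edges, then a community detection pass groups nodes into topics. It is not txtai's implementation. The toy vectors, the `minscore` threshold and the simple label propagation routine are all assumptions picked for illustration.

```python
import math
from collections import Counter

# Toy 2-D "embeddings" (made-up values; a real model produces hundreds of dimensions)
vectors = {
    "lion": (1.0, 0.1),
    "tiger": (0.9, 0.2),
    "wolf": (0.95, 0.0),
    "eagle": (0.1, 1.0),
    "hawk": (0.2, 0.9),
    "owl": (0.0, 0.95),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_graph(vectors, minscore):
    """Adds an edge between every pair of nodes with similarity >= minscore."""
    graph = {node: {} for node in vectors}
    nodes = list(vectors)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            score = cosine(vectors[a], vectors[b])
            if score >= minscore:
                graph[a][b] = score
                graph[b][a] = score
    return graph

def label_propagation(graph, rounds=10):
    """Each node repeatedly adopts the label with the highest total edge weight among its neighbors."""
    labels = {node: node for node in graph}
    for _ in range(rounds):
        changed = False
        for node in sorted(graph):
            weights = Counter()
            for neighbor, score in graph[node].items():
                weights[labels[neighbor]] += score
            if weights:
                best = max(sorted(weights), key=weights.get)
                if best != labels[node]:
                    labels[node] = best
                    changed = True
        if not changed:
            break

    # Group nodes sharing a label into topics
    topics = {}
    for node, label in labels.items():
        topics.setdefault(label, []).append(node)
    return sorted(sorted(members) for members in topics.values())

graph = build_graph(vectors, minscore=0.8)
topics = label_propagation(graph)
print(topics)
```

The vectors are computed once and reused both for the edges and for anything else that needs them, which is the "vectorize once" point made above.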
I’m looking for a way to build a hierarchy of categories from a text corpus, unsupervised.
Anyone have an idea how to accomplish this?
For instance, run on a corpus about animals, it should generate a tree like:
mammals
mammals>mammals on land
birds
birds>birds of prey
In an ideal world, the system would extract both the category labels themselves and their hierarchy from the text. But even if you had to provide a list of categories, and the system merely used the text corpus to figure out their hierarchy, this would be helpful.
With the method discussed here you can add a list of labels to categorize generated topics. This isn't quite what you're looking for but it does create a two level hierarchy.
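As a rough sketch, a configuration with topic categories might look like the following. The key names (`graph`, `limit`, `minscore`, `topics`, `categories`) follow the semantic graph article as best recalled, and the model path and numeric values are placeholders, so check them against the txtai documentation before relying on them.

```python
# Hypothetical txtai configuration sketch; verify key names against the txtai docs.
config = {
    # Placeholder model path, any sentence embeddings model should work here
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,
    "graph": {
        # Assumed values: max edges per node and minimum similarity for an edge
        "limit": 15,
        "minscore": 0.1,
        "topics": {
            # User-supplied labels that generated topics get assigned to
            "categories": ["mammals", "birds"],
        },
    },
}
```

A dictionary like this would then be passed to `txtai.embeddings.Embeddings` before indexing the corpus.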
Thanks. Will look into it. If it can do two levels, why not three or more? Sounds as if there’s just a variable that needs to be changed from 2 to n. Or is there, I’m wondering, anything fundamentally different between n and 2?
Currently, topic categories are used to label the generated topics, hence the single additional level.
Expanding the functionality to support something like this is possible, though. The community detection algorithms already work somewhat like this: each iteration builds smaller and smaller communities.
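To illustrate how going from 2 levels to n could work in principle, here is a small sketch that re-clusters each community with a stricter similarity threshold at every level. This is not how txtai does it today. Connected components stand in for a real community detection algorithm, and the vectors and thresholds are made up for the example.

```python
import math

# Toy 3-D "embeddings" (made-up values for illustration only)
vectors = {
    "lion": (1.0, 0.0, 0.0),
    "wolf": (0.95, 0.1, 0.0),
    "whale": (0.7, 0.0, 0.7),
    "dolphin": (0.65, 0.05, 0.75),
    "eagle": (0.0, 1.0, 0.0),
    "hawk": (0.05, 0.95, 0.0),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def components(nodes, minscore):
    """Connected components of the graph whose edges have similarity >= minscore."""
    remaining, groups = set(nodes), []
    while remaining:
        start = min(remaining)
        remaining.discard(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            linked = {n for n in remaining if cosine(vectors[node], vectors[n]) >= minscore}
            remaining -= linked
            stack.extend(linked)
        groups.append(sorted(group))
    return sorted(groups)

def hierarchy(nodes, thresholds):
    """One tree level per threshold; each level splits its parent group further."""
    if not thresholds:
        return []
    return [
        {"members": group, "children": hierarchy(group, thresholds[1:])}
        for group in components(nodes, thresholds[0])
    ]

def show(tree, depth=0):
    """Prints the tree with one indentation step per level."""
    for node in tree:
        print("  " * depth + ", ".join(node["members"]))
        show(node["children"], depth + 1)

# Two thresholds give two levels; adding more thresholds adds more levels
tree = hierarchy(list(vectors), [0.5, 0.9])
show(tree)
```

The depth is just the length of the threshold list, so nothing is fundamentally different between 2 and n here. The harder part, which this sketch skips, is generating a meaningful label for each level.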
Read more here: https://neuml.hashnode.dev/introducing-the-semantic-graph