The Sorry State of Trie Implementations in Python

jeffreychang · on Jan 31, 2013

Original author of the Biopython implementation here. Out of the maybe 100,000 lines of code that I have out on the internet somewhere, I never would have thought that this little module would be what makes the front page of Hacker News! Unfortunately, it's for a bug, but hey, I guess you can't win em all.

I've submitted the fix back to Biopython. Thanks for reporting the problem, and please let me know if there are more issues.

kevinwuhoo · on Jan 31, 2013

I'm was quite surprised (and happy) at the speed this fix came! I was mucking around with C for a few hours today and was elated to see your email! I thought that the original author would be long gone when I saw that the it was created in 2002.

Everything works as advertised now and I've amended the entry. Using your trie implementation cut my memory usage down from ~22gb to ~10gb. I just mapped hundreds of thousands of peptides. Thanks!

textminer · on Jan 30, 2013

Not useful for Kevin here, but for those with fat, frozen Python dictionaries that're useful to some vital process (e.g., feature lookup), the Python wrapper for the MARISA trie is wonderful. Tend to save 80-90% of memory whenever I capture a dict with it, and it works with memory mapping.

Quite efficient even for something silly, like casting a long_int -> short_int mapping by first converting the keys to strings. Hurrah, prefix trees!

sillywabbit · on Jan 31, 2013

It's even sorrier if the trie you want to search doesn't fit into memory. There are compressed suffix trees, but at the time all I could find was a C++ implementation[1] and I had to write my own bindings[2] using Cython. After figuring all that out, it turned out tries weren't very well suited the problem I wanted to solve. Oops.

[1] http://www.cs.helsinki.fi/group/suds/cst

[2] [link redacted]

shabble · on Jan 31, 2013

A while back I wrote a perl wrapper[1] for the CProps[2] Character-Trie, and found the library interface quite nice to deal with, even for a complete noob at perl-xs.

They have a bit-trie listed as well (I only ever needed strings) which might be a reasonable fit for more complicated serialised objects. I have no clue how easy the python FFI/native library stuff is though. I started out with Inline::C[3], which is both amazing and horrifying in near-equal measure.

[1] Sadly neglected, and never properly packaged for CPAN, but still around at https://github.com/shabble/cprops-perl/

[2] http://cprops.sourceforge.net/

[3] http://search.cpan.org/~sisyphus/Inline-0.51/C/C.pod

andrewcooke · on Jan 30, 2013

the first comment there is reasonable - suffix arrays seem to be preferable anyway (see http://en.wikipedia.org/wiki/Suffix_array). but i am not a biologist so perhaps i am missing something? (what?)

(i am not questioning the wider point about it being nice to have a good suffix tree library; just curious why it's important here).

[thanks for the replies]

kevinwuhoo · on Jan 30, 2013

You're right, generally suffix arrays are more efficient than suffix tries, but biologically there are two reasons why suffix tries are preferred.

1. There are a lot of repeated sequences in biology and many of them are slightly different than each other. For example, we need to insert "difference" and "different" in the structure. The suffix trie would have shared nodes while the suffix array needs an entry for each suffix.

2. Mutations are how evolution takes action. Biologically, in sequences, mutations are just misspellings. If you're looking for a sequence with arbitrary misspellings at position i, a tree structure is easier and faster to traverse.

But in reality, most indexing, especially in DNA, uses BWT [1] for indexing. However, there are more complex implementations of suffix arrays that may be suitable [2].

[1]: http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transfo...

[2]: http://www.mi.fu-berlin.de/wiki/pub/ABI/AdvancedAlgorithms11...

attractivechaos · on Jan 31, 2013

Actually just notice that you use "suffix trie". Note that there is trie (which is also called prefix tree) and suffix tree. There is no suffix trie. Suffix array usually uses less memory, but I am not sure in theory it is faster than suffix tree. We can use FM-index and enhanced suffix array (ESA) to simulate trie traversal. You can use ESA to achieve almost everything that you can do with a suffix tree.

Trie and suffix tree are natural structures for multiple strings. The typical suffix array is for one string. It is possible to encode multiple strings with suffix array, but that part of theory is not widely used.

Also, ESA was introduced in ~2002. It has been well known in CS. Because ESA is in some way a super set of FM-index, you can of course implement those popular mappers with ESA. The key reason that ESA is not widely seen is because it uses much more memory than FM-index. In 2008 when these mappers were greatly needed, no so many machines have 16GB or maybe more needed by ESA.

justin66 · on Feb 6, 2013

> Actually just notice that you use "suffix trie". Note that there is trie (which is also called prefix tree) and suffix tree. There is no suffix trie.

At the risk of being snotty: someone should tell Robert Sedgwick. He uses the term in his newest book, "An Introduction to the Analysis of Algorithms, Second Edition," which just came out. Sorry if I haven't been fair to your post due to a quick reading but Kevin isn't making the term up.

mbreese · on Jan 30, 2013

I'm fairly certain that it's more efficient to find matches with a certain number of allowed mismatches. I don't think that suffix arrays have as good of performance when you allow for some ambiguity.

attractivechaos · on Jan 31, 2013

Suffix array is usually built for ONE string, but in his work, OP probably needs multiple strings, for which trie is a conceptually better representation.

styluss · on Jan 30, 2013

Does anyone know a good implementation in Ruby?

alxndr · on Jan 30, 2013

http://stackoverflow.com/a/9042547/303896

https://github.com/dustin/ruby-trie

daragh · on Jan 31, 2013

https://github.com/harukizaemon/hamster is apparently quite good.