
A paper [1] we wrote in 2015 (cited by the authors) uses more sophisticated data structures (compressed suffix trees) and Kneser–Ney smoothing to get the same "unlimited" context. I imagine that with better smoothing and corpora as large as the ones the authors use, this approach could improve on some of their results.
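
For anyone curious, here is a brute-force sketch of the unbounded-context idea: back off to the longest suffix of the history that actually occurs in the corpus, then read the next-token distribution off the raw counts. The paper does this with compressed suffix trees and Kneser–Ney smoothing rather than unsmoothed counts and linear scans, so treat this (including the made-up function name and toy corpus) purely as an illustration.

    from collections import Counter

    def longest_suffix_next(corpus, history):
        """Next-token distribution given the longest matching context.
        corpus and history are lists of tokens; brute force, illustration only."""
        for start in range(len(history) + 1):      # shrink the context until it matches
            context = history[start:]
            counts = Counter(
                corpus[i + len(context)]
                for i in range(len(corpus) - len(context))
                if corpus[i:i + len(context)] == context
            )
            if counts:
                total = sum(counts.values())
                return {tok: c / total for tok, c in counts.items()}
        return {}

    corpus = "the cat sat on the mat the cat sat on the hat".split()
    print(longest_suffix_next(corpus, "sat on the".split()))  # {'mat': 0.5, 'hat': 0.5}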

Back then, neural LMs were just beginning to emerge, and we briefly experimented with using one of these unlimited n-gram models for pre-training, but never got any results.

[1] https://aclanthology.org/D15-1288.pdf




The formulation of a succinct full-text index as an "infinite n-gram model" is genius in connecting with the ML literature, but also a blind spot, in that it ignores all the work on using exactly this data structure (well, the FM-index) over DNA. bwa mem is the OG ∞-gram aligner.
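
To spell out the connection: counting a pattern with FM-index backward search is exactly an n-gram count query for arbitrary n, answered in time proportional to the pattern length. A toy sketch below, with naive rank done by string counting instead of real succinct structures, and the BWT string hardcoded from the standard "banana" example; it only shows the shape of the query bwa's index answers, not how bwa implements it.

    from bisect import bisect_left

    def fm_count(bwt, pattern):
        """Count occurrences of pattern in the original text, given only its BWT."""
        chars = sorted(bwt)
        C = {c: bisect_left(chars, c) for c in set(bwt)}  # chars smaller than c
        rank = lambda c, i: bwt[:i].count(c)              # occurrences of c in bwt[:i]
        lo, hi = 0, len(bwt)                              # current suffix-array interval
        for c in reversed(pattern):                       # extend the match to the left
            if c not in C:
                return 0
            lo, hi = C[c] + rank(c, lo), C[c] + rank(c, hi)
            if lo >= hi:
                return 0
        return hi - lo

    print(fm_count("annb$aa", "ana"))  # BWT of "banana$"; "ana" occurs 2 times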


The idea is far older than BWA. The entire point of the Burrows-Wheeler transform is to create a compact order-k "language model" for every value of k simultaneously, and then use that for data compression.
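
A minimal sketch of what that means (naive rotation sort, nothing like a production implementation): sorting the rotations groups every symbol by the context that follows it, for every context length at once, which is the "order-k model for all k" property.

    def bwt(text, sentinel="$"):
        """Burrows-Wheeler transform via naive rotation sorting (O(n^2 log n))."""
        s = text + sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    print(bwt("banana"))  # 'annb$aa' -- each output symbol sits next to symbols
                          # that share an (arbitrarily long) right context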

David Wheeler reportedly had the idea in the early 80s but rejected it as impractical. He then tried publishing it with Michael Burrows in the early 90s, but it was rejected by the Data Compression Conference. Algorithms researchers found the BWT more interesting than data compression researchers did, especially after the discovery of the FM-index in 2000. There was a lot of theoretical work done in the 2000s, and the paper mentioned by GP cites some of that.

The primary application imagined for the FM-index was information retrieval, mostly because of the people involved. But some people considered using it for DNA sequences and started focusing on efficient construction and approximate pattern matching instead of better compression. And when sequencing technology advanced to the point where BWT-based aligners became relevant, a few of them were published almost simultaneously.

And if you ask Heng Li, he gives credit to Tak-Wah Lam: https://lh3.github.io/2024/04/12/where-did-bwa-come-from



