Hacker News new | past | comments | ask | show | jobs | submit login

It's fascinating to me that xz's algorithm[1] beats bzip2 (Burrows-Wheeler) in both time and space, but it's a much simpler algorithm.

1: https://en.m.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93M...




That is because of other weaknesses of bzip2, not because BWT is worse. BWT generally beats LZ-based algorithms. I would also dispute that LZMA is simpler, at least in compression. Good LZ parsing is not easy.


Can you provide examples of other BWT implementations, and instructions and perhaps some of the weaknesses of bzip2? Genuinely interested.


Check out http://mattmahoney.net/dc/text.html . The top BWT performer is #22. What it comes down to is the way the BWT output is transformed and entropy-coded afterwards, and various other tricks.

bzip2 only does very basic MTF transformation and Huffman coding, which are not very bit efficient.


bzip2 often beats xz for highly regular data, like a big csv or jsonl file. So I usually try both approaches on those sorts of files. For everything else my go to is zstd or xz. Xz usually wins by a little, but is a lot slower. Zstd is great in practice.

Just yesterday I had a file where bzip2 won. I would love to see a modern version with the first few stages of bwt feeding into xz or zstd.


bzip2 beats xz for 1gb of zeros as well.


Did anyone ever combine BWT with LZMA?




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: