It's fascinating to me that xz's algorithm[1] beats bzip2 (Burrows-Wheeler) in b...

gliptic · on Sept 24, 2022

That is because of other weaknesses of bzip2, not because BWT is worse. BWT generally beats LZ-based algorithms. I would also dispute that LZMA is simpler, at least in compression. Good LZ parsing is not easy.

djha-skin · on Sept 25, 2022

Can you provide examples of other BWT implementations, and instructions and perhaps some of the weaknesses of bzip2? Genuinely interested.

gliptic · on Sept 28, 2022

Check out http://mattmahoney.net/dc/text.html . The top BWT performer is #22. What it comes down to is the way the BWT output is transformed and entropy-coded afterwards, and various other tricks.

bzip2 only does very basic MTF transformation and Huffman coding, which are not very bit efficient.

quag · on Sept 24, 2022

bzip2 often beats xz for highly regular data, like a big csv or jsonl file. So I usually try both approaches on those sorts of files. For everything else my go to is zstd or xz. Xz usually wins by a little, but is a lot slower. Zstd is great in practice.

Just yesterday I had a file where bzip2 won. I would love to see a modern version with the first few stages of bwt feeding into xz or zstd.

aidenn0 · on Sept 24, 2022

bzip2 beats xz for 1gb of zeros as well.

Dwedit · on Sept 24, 2022

Did anyone ever combine BWT with LZMA?