Burrows–Wheeler Transform (wikipedia.org)
162 points by bshanks on Sept 24, 2022 | 56 comments



Honestly, the inverse Burrows-Wheeler transform seems like some sort of Voodoo black magic to me.

It reminds me of the 100 prisoners problem [0]. Yes, I understand why it works. I can see how it works mathematically. But it still feels like it shouldn't work, that we're somehow getting something for free.

[0]: https://en.wikipedia.org/wiki/100_prisoners_problem


Explained here with Go code:

https://bugfix-66.com/7f0f425d3eee8def16e1102d054fd45394d027...

Wikipedia gives an impractical, inefficient account of the algorithm. It's almost comically bad.

But the BWT has an efficient inverse transform with very short, simple code, as seen above. This is the classic linked list algorithm.

The forward transform is also simple if done right. See the Go code here:

https://bugfix-66.com/72949624ebbb28b3d0ce5d700970a8857d354b...

Together, that's full code for industrial-strength forward and inverse BWT. All that's missing is some method of avoiding degeneracy in the forward transform string sort (e.g., noising long runs in the input or a suffix array).
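
For readers who don't want to click through, here's a rough standalone sketch of the linked-list inversion in Go (not the code behind the link, just the same idea: count bytes, build the next links, then walk them from the stored root row):

  // inverseBWT rebuilds the original data from its BWT. bwt is the last
  // column of the sorted rotation matrix; root is the row at which the
  // original string appears (it has to be stored alongside the output).
  func inverseBWT(bwt []byte, root int) []byte {
      n := len(bwt)

      // base[c] = number of bytes in bwt strictly smaller than c, i.e. the
      // first row of the sorted matrix whose rotation starts with c.
      var count, base [256]int
      for _, c := range bwt {
          count[c]++
      }
      sum := 0
      for c := 0; c < 256; c++ {
          base[c] = sum
          sum += count[c]
      }

      // next[j] points at the row whose last-column byte comes right after
      // bwt[j] in the original text: the k-th occurrence of byte c in the
      // last column pairs with the k-th row starting with c in the first.
      next := make([]int, n)
      var seen [256]int
      for i, c := range bwt {
          next[base[c]+seen[c]] = i
          seen[c]++
      }

      // Walk the links: bwt[root] is the LAST byte of the original string,
      // bwt[next[root]] is the FIRST.
      out := make([]byte, n)
      at := root
      for k := 0; k < n; k++ {
          at = next[at]
          out[k] = bwt[at]
      }
      return out
  }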


It's frustrating that your link to an "explanation" is an intentionally buggy implementation of the algorithm :(

I understand the point of the debugging game, but in this context, a link to the solutions page would be more helpful....


Replace

  at = root
with

  at = next[root]
and you're done. Once you understand the forward transform, you understand that root is the LAST byte and next[root] is the FIRST byte.

All the fixes in BUGFIX-66 are trivial if you understand what the code is doing; that's part of the game concept. Also, if you submit broken code, it gives you a useful hint to direct your attention to the bug.

Similarly, the fix for the forward transform is to add

  i1, i2 = x[i1], x[i2]
at the top of the less() function.
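
For anyone who would rather see a standalone version, here's a rough forward-transform sketch in Go with that indirection spelled out (a guess at the same shape, not the code behind the link; it uses a naive rotation sort via the standard library's sort.Slice, so degenerate inputs still need one of the mitigations mentioned earlier, such as noising or a suffix array):

  // forwardBWT computes the BWT with a naive rotation sort: sort the
  // rotation start offsets, then read off the byte preceding each
  // sorted rotation. Returns the last column and the root row index.
  func forwardBWT(s []byte) (bwt []byte, root int) {
      n := len(s)
      x := make([]int, n) // x[i] = starting offset of the i-th rotation
      for i := range x {
          x[i] = i
      }
      sort.Slice(x, func(a, b int) bool {
          i1, i2 := x[a], x[b] // redirect through the permutation first
          for k := 0; k < n; k++ {
              c1, c2 := s[(i1+k)%n], s[(i2+k)%n]
              if c1 != c2 {
                  return c1 < c2
              }
          }
          return false // identical rotations
      })
      bwt = make([]byte, n)
      for i, r := range x {
          if r == 0 {
              root = i // row where the original string landed
          }
          bwt[i] = s[(r+n-1)%n] // last byte of the i-th sorted rotation
      }
      return bwt, root
  }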


> Wikipedia gives an impractical, inefficient account of the algorithm. It's almost comically bad.

I encourage you to edit and fix it.


I love your site! Hope you don't mind that I sent you an email.

It's a fantastic example of something minimal yet fully functional!


Thanks. I just recently started building it.

Understanding small, elegant algorithms (enough to make tiny fixes) is something I consider fun. Hopefully it's fun for others.

Imagine the book Hacker's Delight, but turned into a simple debugging game. That's the concept.


> that we're somehow getting something for free.

The splay tree also falls squarely into this category for me. The core bit:

> [...] splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern.

I've been iterating a kind of append-only log + database system using this as a technique to optimize I/O patterns. Many forms of storage are very happy when you feed them sequential bytes and prefer reads from more-recently-written physical blocks.


Having read through the (very good!) Wikipedia explanations a few times, I'm left with a different troubling intuition: if this works to improve compression, then I feel like there must be something wrong with the compression algorithms!


General-purpose compression algorithms tend to be conceptually simple. They must be fast, so they can't try many different approaches to see what works best. And because they are general-purpose, they can't assume much about the data.

The typical data compression algorithm starts with a combinatorial transformation of the data. Then it creates a statistical model of the transformed data and encodes it according to the model. There are two successful families of combinatorial transformations: Lempel–Ziv and Burrows–Wheeler. Neither of them is superior to the other, but the compression performance depends on the properties of the data.

Lempel–Ziv parsing replaces copies of repeated substrings with references to an earlier copy. It works best when there are long repeated substrings in the data. The modeling/encoding parts of an LZ compressor are usually pretty simple.

Burrows-Wheeler transform reorders the symbols according to the context they occur in. It's often but not always followed by run-length encoding. Because the symbols are reordered by context, BWT-based compressors work best when there are many copies of each repeated substring.

Unlike LZ, the BWT does not compress the data on its own; it relies entirely on the statistical compression that follows. Because the symbols are ordered by context, the BWT is an order-n model for every n simultaneously. Hence you get something similar to PPM simply by partitioning the BWT output into blocks and encoding each block independently with an order-0 encoder.


Not necessarily. After the transform, there are more runs of characters, but it could be that this comes at the expense of recognizable patterns (like common words) in the input. It likely helps simple compression algorithms more than sophisticated ones.


I clicked this thing on the front page just to leave this precise remark (minus reference to the 100 prisoners). As you already did that, it seems my only sensible option is to voice my agreement.


Same here. m2^2 ;) I've been toying around with these kinds of indices to detect/find repetitions for years already.

Btw: what's the best source for these kinds of one-of-a-kind, counter-intuitive algorithms and data structures? Be it online, be it books...

BWT can't be the only one, right?


The output string implies extra information: its sorted version, i.e. two adjacent columns of the rotation matrix, i.e. how to chain the letters back together. Knowing the BWT was applied matters!

For the 100 prisoners problem, the intuition corresponds to n-nested verb tenses, which don’t exist in any language AFAIK.


I don't understand why there is a difference between following any algorithm for picking drawers and picking them at random. Isn't the chaining algorithm they describe just another form of random order (with the warden being the random number generator)? Is there a simple explanation?


If you randomize a list of numbers there will always be cycles as described (similar to how there will always be cycles if you drew lots for Secret Santa and then revealed who drew whom), and these cycles have a certain length. It might be one cycle of length 100, but more commonly there will be lots of shorter cycles. If you start with your own number, you are guaranteed to be in a cycle that will return back to you; you just don't know how long that cycle will be (could be 1, could be 100, more likely something in between). And apparently you can calculate that in ~31% of cases all cycles will have a length of 50 or less. This is just a neat way to exploit a kind of structure that will always exist within the randomness.
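
The ~31% comes from a standard fact about random permutations: a cycle of length k > 50 occurs with probability exactly 1/k (there can be at most one such cycle), so the chance that every cycle is 50 or shorter is 1 - (1/51 + ... + 1/100). A quick check as a small Go program:

  package main

  import "fmt"

  func main() {
      // A permutation of 100 elements can contain at most one cycle longer
      // than 50, and a cycle of length exactly k occurs with probability 1/k.
      p := 1.0
      for k := 51; k <= 100; k++ {
          p -= 1.0 / float64(k)
      }
      fmt.Printf("P(all cycles <= 50) = %.4f\n", p) // prints 0.3118
  }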


The key insight is that this strategy causes there to be a strong correlation between the success of different prisoners.

As an example, let's say prisoner #10 opens his box and finds #21, then finds #5 in that box, then #84, then #51, then finally succeeds and finds his number 10 in box #51. These boxes form a cycle 10-21-5-84-51 (and then back to 10). Anyone who opens any of these boxes will eventually see the same set of numbers that #10 did, so we know that prisoners #5, #10, #21, #51 and #84 will all succeed in finding their number by starting at their own box.

Compare that to the situation where they just randomly look in boxes and each independently have only a 50% chance to succeed - then the odds that those 5 prisoners all succeed would be (1/2)^5 = 1/32, or only about 3%. In every cycle containing n prisoners, instead of their success rate being (1/2)^n, now they all succeed together or all fail together.

Now that the success of every prisoner in the same cycle is perfectly correlated with each other, the prisoners' overall chance of success depends only on whether the random permutation created by the warden has any cycles >50 in length. If so, then everyone in that cycle will fail. If not, then all the cycles across the 100 boxes are short enough that all 100 prisoners will succeed.


It doesn't affect the chance of any one prisoner finding their number - that's still 50%. But the joint strategy massively reduces the independence of the individual prisoners' successes, and hence changes the distribution of the number of successful prisoners.

Rather than a normal distribution (sum of independent outcomes) with a big peak around 50% of the prisoners finding their number and a vanishingly small chance of them all (or none) finding the right number, you end up with a much more complex pattern - an approximately uniform distribution over the 0-50 successes range carrying ~70% of the probability, and a huge peak at 100 successes with the remaining ~30%.

https://www.r-bloggers.com/2014/08/update-100-prisoners-100-... has a bunch of distribution plots showing this.


No. That rule means you are definitely on the right track.

Think of it like having to search for someone in a forest with only enough time to search half the forest.

Would you rather search random spots, or be dropped somewhere along the trail that leads to the person?

You might yet run out of time, but on average it'll be much more efficient. Surprisingly so in the 100 boxes context.

Veritasium has an excellent video on it but I'm on mobile and can't dig out a link. I'd recommend it though.


This is used in both bwa [1] and bowtie [2], two of the most popular DNA sequence aligners.

[1] https://github.com/lh3/bwa

[2] https://github.com/BenLangmead/bowtie


The full-text search idea is also called the FM-index:

https://en.wikipedia.org/wiki/FM-index

The BWT is one of those almost magical tools for compression. But using it for speedy string search is a whole other amazing invention too.
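
To give a flavor of how the search works: FM-index "backward search" narrows an interval of rows of the sorted rotation matrix one pattern character at a time, using only counts over the BWT string. A minimal sketch in Go (assuming the BWT was built over the text plus a unique smallest sentinel; a real index would precompute rank tables instead of the naive scans here):

  // countMatches returns the number of occurrences of pattern in the text
  // whose BWT (over text + sentinel) is bwt, via FM-index backward search.
  func countMatches(bwt, pattern []byte) int {
      // c0[c] = number of bytes in bwt strictly smaller than c.
      var count, c0 [256]int
      for _, b := range bwt {
          count[b]++
      }
      sum := 0
      for c := 0; c < 256; c++ {
          c0[c] = sum
          sum += count[c]
      }
      // occ(c, i) = occurrences of c in bwt[:i] (naive scan for clarity).
      occ := func(c byte, i int) int {
          k := 0
          for _, b := range bwt[:i] {
              if b == c {
                  k++
              }
          }
          return k
      }
      // [lo, hi) is the range of sorted rotations prefixed by the part of
      // the pattern processed so far; shrink it right to left.
      lo, hi := 0, len(bwt)
      for i := len(pattern) - 1; i >= 0 && lo < hi; i-- {
          c := pattern[i]
          lo = c0[c] + occ(c, lo)
          hi = c0[c] + occ(c, hi)
      }
      return hi - lo
  }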


Also bzip2! Which is just data compression, not sequence alignment.


This is a fun video from a series on compression that explains it well, and features Mike Burrows:

https://youtu.be/4WRANhDiSHM

He shares the origin of the algorithm as well as a story about how it was first published.

The Compressor Head video series is the best introduction to compression that I’ve found.


That was great. It's one of the greatest things about YouTube that you can find such videos on it.


I still remember when I first read the first C implementation of this, which was IIRC by Mike Burrows.

Apparently, back in the days, Mike had some sort of philosophical opposition to indenting his C code.

I was impressed he managed to write such a complex piece of code without ever indenting anything.


> Apparently, back in the days, Mike had some sort of philosophical opposition to indenting his C code.

That explains his aversion to Python back then: https://archive.is/OFmNT

imo, Mike Burrows' impact on the industry rivals that of the Turing awardees. Ex A from 2008: https://archive.is/ZPd4f


Hm, it seems obvious that Guido and Python have had more impact on the industry. (I'm aware of Burrows' antipathy toward Python, which was well known at Google; as far as I remember it was due to debugging a syntax error caused by indentation in SWIG/Python.)

JPL and Google were early Python users. Google's crawler and the home page started out in Python (trivia: the ancient asyncore stdlib module and its author played some role, as far as I remember)

Pretty much all AI and self driving cars use Python, BitTorrent was written in Python, Python is embedded in GDB, embedded in huge commercial applications like Maya for visual effects, etc. I think Ethereum had an early Python implementation

Python's design achieved Guy Steele's "growing a language" vision in his famous talk, i.e. using operator overloading to allow scientists to create their own lingua franca -- https://news.ycombinator.com/item?id=29171519 (i.e. the talk was about adding operator overloading to Java, as well as "generics" and value types)

Mike Burrows is a great (even legendary) programmer and computer scientist, but if you're talking about "impact on the industry", there are levels :) I have to say I'm a big Alta Vista fan though


And a really nice person to work with


> I was impressed he managed to write such a complex piece of code without ever indenting anything.

I wonder if he'd enforce a 's/^\s*//g' pre-commit hook if you worked with him...


It's absolutely fascinating how many people in computing have really fringe ideas and still manage to get stuff done.


Related:

Burrows-Wheeler Transform [video] - https://news.ycombinator.com/item?id=10721401 - Dec 2015 (5 comments)

Compression with the Burrows-Wheeler Transform - https://news.ycombinator.com/item?id=1112845 - Feb 2010 (6 comments)

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


Can anyone explain to me why this is said to be O(n) when it heavily relies on sorting?


The wiki article mentions that a suffix array is used for performance: https://en.wikipedia.org/wiki/Suffix_array


Not only that: the suffix array [1] is built with counting sort, which is O(n), rather than a comparison-based O(n log n) sort (see the sketch below). That is because they assume an integer alphabet. For general alphabets they can only achieve O(n log n).

[1] https://arxiv.org/abs/1610.08305
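
To make the distinction concrete, here is the counting-sort idea on plain bytes (linear-time suffix array construction such as SA-IS is far more involved, but it is built out of counting/radix passes like this rather than comparisons):

  // countingSort sorts a byte slice in O(n + 256) time: tally every value,
  // then emit the values in order. No comparisons between elements happen.
  func countingSort(in []byte) []byte {
      var count [256]int
      for _, c := range in {
          count[c]++
      }
      out := make([]byte, 0, len(in))
      for c := 0; c < 256; c++ {
          for k := 0; k < count[c]; k++ {
              out = append(out, byte(c))
          }
      }
      return out
  }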


For biologists reading this, I recommend the following series from a mathematical biologist in Spain (maybe Toronto by now): http://blog.thegrandlocus.com/tag/burrows-wheeler-transform


Someone's been reading about string matching algorithms, or compression. For biology?


Sure, since biology is generating vast volumes of data in national genomics and digital pathology programs around the world. There's a bit of a bias on HN toward genomics, which is maybe 250GB per person, but you typically do that only once. Slide imagery, on the other hand, runs at petabytes per year per hospital, with lossless compression needs, not just JPEG.

If you're interested in string search algorithms, one of the cooler ones to come out of genomics needs is Bitap: its cost scales with alphabet size, but DNA has a small alphabet. https://en.wikipedia.org/wiki/Bitap_algorithm


It's fascinating to me that xz's algorithm[1] beats bzip2 (Burrows-Wheeler) in both time and space, but it's a much simpler algorithm.

1: https://en.m.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93M...


That is because of other weaknesses of bzip2, not because BWT is worse. BWT generally beats LZ-based algorithms. I would also dispute that LZMA is simpler, at least in compression. Good LZ parsing is not easy.


Can you provide examples of other BWT implementations, and perhaps explain some of the weaknesses of bzip2? Genuinely interested.


Check out http://mattmahoney.net/dc/text.html . The top BWT performer is #22. What it comes down to is the way the BWT output is transformed and entropy-coded afterwards, and various other tricks.

bzip2 only does a very basic MTF transformation followed by Huffman coding, which are not very bit-efficient.
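
For the curious, the MTF step itself is tiny. A rough sketch in Go (illustrative only, not bzip2's actual code):

  // mtfEncode replaces each byte with its index in a recency list and then
  // moves that byte to the front. After a BWT, runs of identical bytes turn
  // into runs of zeros, which the entropy coder can handle very cheaply.
  func mtfEncode(in []byte) []byte {
      var recency [256]byte
      for i := range recency {
          recency[i] = byte(i) // start from the identity ordering
      }
      out := make([]byte, len(in))
      for i, c := range in {
          j := 0
          for recency[j] != c { // find c's position in the recency list
              j++
          }
          out[i] = byte(j)
          copy(recency[1:j+1], recency[:j]) // shift the prefix down by one
          recency[0] = c                    // and move c to the front
      }
      return out
  }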


bzip2 often beats xz for highly regular data, like a big CSV or JSONL file, so I usually try both approaches on those sorts of files. For everything else my go-to is zstd or xz. xz usually wins by a little, but is a lot slower. zstd is great in practice.

Just yesterday I had a file where bzip2 won. I would love to see a modern version with the first few stages of bwt feeding into xz or zstd.


bzip2 beats xz for 1 GB of zeros as well.


Did anyone ever combine BWT with LZMA?


Fun fact: Burrows–Wheeler is one of the assignments in Coursera's algorithms course.


How is it that this doesn't violate the pigeonhole principle?


If you were applying a first pass of compression to integer data and then applying gzip, would adding BWT in the middle still be beneficial?


It is an incredibly elegant scheme for bringing Markov context outputs together without actually trying to figure out what those contexts are.


Is there any analogous method for images?


If you are asking if the algorithm can be generalized from 1D to 2D or even higher, then the answer is yes. For example, one could take the BWT of all rows and then all columns (using a different terminal symbol for each dimension), and that would be reversible.

The question is whether that is useful for compression.


So even the inventors don't quite understand how they came up with this marvel. (We are lost ^^)


Has anyone here found uses for it, perhaps for domain-specific string compression?


bzip's compression ratio is hard to match without spending significantly more compute.


Isn't bzip notoriously slow though?


Next to lz or zstd, sure, when you don't care about compression ratio.

But bzip2 -1 offers comparable compression to zstd -10, and bzip2 -9 is more like zstd -18, which is appreciably slower.


bzip2 decompression is slow. It's faster than xz for compression.



