Honestly, the inverse Burrows-Wheeler transform seems like some sort of Voodoo black magic to me.
It reminds me of the 100 prisoners problem [0]. Yes, I understand why it works. I can see how it works mathematically. But it still feels like it shouldn't work, that we're somehow getting something for free.
Together, that's full code for industrial-strength forward and inverse BWT. All that's missing is some method of avoiding degeneracy in the forward transform's string sort (e.g., adding noise to long runs in the input, or using a suffix array).
and you're done. Once you understand the forward transform, you understand that root is the LAST byte and next[root] is the FIRST byte.
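If anyone wants to poke at this without the root/next code being discussed (it isn't quoted here), the following is an independent minimal sketch using a '\0' sentinel; the forward side is a naive rotation sort, so the degenerate-sort caveat from the earlier comment still applies:

    def bwt_forward(s: str) -> str:
        # Naive sorted-rotations BWT; assumes '\0' never occurs in s.
        # The sort degenerates on long runs, which is the caveat above.
        s += "\0"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    def bwt_inverse(bwt: str) -> str:
        # Sorting (char, row) pairs stably gives the first column; inverting
        # that order yields the LF mapping used to walk the text back-to-front.
        n = len(bwt)
        order = sorted(range(n), key=lambda j: (bwt[j], j))
        lf = [0] * n
        for k, j in enumerate(order):
            lf[j] = k
        out = []
        j = 0                          # row 0 starts with the '\0' sentinel
        for _ in range(n - 1):
            out.append(bwt[j])
            j = lf[j]
        return "".join(reversed(out))

    assert bwt_inverse(bwt_forward("banana")) == "banana"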
All the fixes in BUGFIX-66 are trivial if you understand what the code is doing; that's part of the game concept. Also, if you submit broken code, it gives you a useful hint to direct your attention to the bug.
Similarly, the fix for the forward transform is to add
The splay tree also falls squarely into this category for me. The core bit:
> [...] splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern.
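To make the self-adjusting part concrete, here's a minimal bottom-up splay tree sketch (zig / zig-zig / zig-zag rotations only, no deletion or bookkeeping), a toy rather than a production implementation:

    class Node:
        __slots__ = ("key", "left", "right", "parent")
        def __init__(self, key):
            self.key = key
            self.left = self.right = self.parent = None

    class SplayTree:
        def __init__(self):
            self.root = None

        def _rotate(self, x):
            # Lift x above its parent, preserving BST order.
            p = x.parent
            g = p.parent
            if p.left is x:
                p.left = x.right
                if x.right is not None:
                    x.right.parent = p
                x.right = p
            else:
                p.right = x.left
                if x.left is not None:
                    x.left.parent = p
                x.left = p
            p.parent = x
            x.parent = g
            if g is None:
                self.root = x
            elif g.left is p:
                g.left = x
            else:
                g.right = x

        def _splay(self, x):
            # Classic zig / zig-zig / zig-zag cases, repeated until x is the root.
            while x.parent is not None:
                p = x.parent
                g = p.parent
                if g is None:
                    self._rotate(x)                    # zig
                elif (g.left is p) == (p.left is x):
                    self._rotate(p)                    # zig-zig
                    self._rotate(x)
                else:
                    self._rotate(x)                    # zig-zag
                    self._rotate(x)

        def insert(self, key):
            if self.root is None:
                self.root = Node(key)
                return
            cur = self.root
            while True:
                if key < cur.key:
                    if cur.left is None:
                        cur.left = Node(key)
                        cur.left.parent = cur
                        cur = cur.left
                        break
                    cur = cur.left
                elif key > cur.key:
                    if cur.right is None:
                        cur.right = Node(key)
                        cur.right.parent = cur
                        cur = cur.right
                        break
                    cur = cur.right
                else:
                    break                              # key already present
            self._splay(cur)

        def find(self, key):
            cur, last = self.root, None
            while cur is not None:
                last = cur
                if key < cur.key:
                    cur = cur.left
                elif key > cur.key:
                    cur = cur.right
                else:
                    self._splay(cur)                   # accessed node moves to the root
                    return True
            if last is not None:
                self._splay(last)                      # splay the last node touched on a miss
            return False

    t = SplayTree()
    for k in (5, 1, 9, 3, 7):
        t.insert(k)
    t.find(3)
    print(t.root.key)   # 3: recently accessed keys sit near the top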
I've been iterating a kind of append-only log + database system using this as a technique to optimize I/O patterns. Many forms of storage are very happy when you feed them sequential bytes and prefer reads from more-recently-written physical blocks.
Having read through the (very good!) Wikipedia explanations a few times, I'm left with a different troubling intuition: if this works to improve compression, then I feel like there must be something wrong with the compression algorithms!
General-purpose compression algorithms tend to be conceptually simple. They must be fast, so they can't try many different approaches to see what works best. And because they are general-purpose, they can't assume much about the data.
The typical data compression algorithm starts with a combinatorial transformation of the data. Then it creates a statistical model of the transformed data and encodes it according to the model. There are two successful families of combinatorial transformations: Lempel–Ziv and Burrows–Wheeler. Neither of them is superior to the other, but the compression performance depends on the properties of the data.
Lempel–Ziv parsing replaces copies of repeated substrings with references to an earlier copy. It works best when there are long repeated substrings in the data. The modeling/encoding parts of an LZ compressor are usually pretty simple.
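As a toy illustration of that kind of parse (made-up token format, naive O(n * window) match search, nothing like a real LZ77/LZMA encoder):

    def lz77_compress(data: bytes, window: int = 1 << 12, min_match: int = 3):
        # Greedy parse into ('lit', byte) and ('copy', offset, length) tokens.
        i, tokens = 0, []
        while i < len(data):
            best_len, best_off = 0, 0
            for j in range(max(0, i - window), i):
                length = 0
                while i + length < len(data) and data[j + length] == data[i + length]:
                    length += 1
                if length > best_len:
                    best_len, best_off = length, i - j
            if best_len >= min_match:
                tokens.append(("copy", best_off, best_len))
                i += best_len
            else:
                tokens.append(("lit", data[i]))
                i += 1
        return tokens

    def lz77_decompress(tokens) -> bytes:
        out = bytearray()
        for t in tokens:
            if t[0] == "lit":
                out.append(t[1])
            else:
                _, off, length = t
                for _ in range(length):        # byte-by-byte so overlapping copies work
                    out.append(out[-off])
        return bytes(out)

    data = b"abracadabra abracadabra abracadabra"
    assert lz77_decompress(lz77_compress(data)) == data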
The Burrows–Wheeler transform reorders the symbols according to the context they occur in. It's often but not always followed by run-length encoding. Because the symbols are reordered by context, BWT-based compressors work best when there are many copies of each repeated substring.
Unlike LZ, the BWT does not compress the data on its own; it relies entirely on statistical compression. Because the symbols are ordered by context, the BWT is an order-n model for every n simultaneously. Hence you get something similar to PPM by simply partitioning the BWT output into blocks and encoding each block independently with an order-0 encoder.
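A rough way to see that PPM-like effect, assuming nothing beyond a toy BWT and ideal order-0 code lengths (no real entropy coder here):

    import math
    from collections import Counter

    def naive_bwt(s: str) -> str:
        # Sorted-rotations BWT with a '\0' sentinel; fine for a demo.
        s += "\0"
        return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

    def order0_bits(block: str) -> float:
        # Ideal order-0 code length (Shannon entropy) of one block, in bits.
        n = len(block)
        return -sum(c * math.log2(c / n) for c in Counter(block).values())

    def blocked_cost(data: str, block_size: int = 64) -> float:
        # Total bits if each block is entropy-coded independently, order-0.
        return sum(order0_bits(data[i:i + block_size])
                   for i in range(0, len(data), block_size))

    text = "the quick brown fox jumps over the lazy dog " * 40
    print(blocked_cost(text), blocked_cost(naive_bwt(text)))
    # The BWT side should come out far smaller: each block now holds symbols
    # from similar contexts, so its order-0 model is much sharper.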
Not necessarily. After the transform there are more runs of characters, but it could be that this comes at the expense of recognizable patterns (like common words) in the input. It likely helps simple compression algorithms more than sophisticated ones.
I clicked this thing on the front page just to leave this precise remark (minus reference to the 100 prisoners). As you already did that, it seems my only sensible option is to voice my agreement.
The output string implies extra information: the sorted string, i.e., two adjacent rotations, i.e., how to chain the letters back together. Knowing that BWT was applied matters!
For the 100 prisoners problem, the intuition corresponds to n-nested verb tenses, which don’t exist in any language AFAIK.
I don't understand why there is a difference between following any algorithm for picking drawers and picking them at random. Isn't the chaining algorithm they describe just another form of random order (with the warden being the random number generator)? Is there a simple explanation?
If you randomize a list of numbers there will always be cycles like the ones described (similar to how there will always be cycles if you drew lots for Secret Santa and then revealed who drew whom), and these cycles have a certain length. It might be one cycle of length 100, but more commonly there will be lots of shorter cycles. If you start with your own number, you are guaranteed to be in a cycle that will return back to you; you just don't know how long that cycle will be (could be 1, could be 100, more likely something in between). And apparently you can calculate that in ~31% of cases all cycles will have a length of 50 or less. This is just a neat way to exploit a kind of structure that will always exist within the randomness.
The key insight is that this strategy causes there to be a strong correlation between the success of different prisoners.
As an example, let's say prisoner #10 opens his box and finds #21, then finds #5 in that box, then #84, then #51, then finally succeeds and finds his number 10 in box #51. These boxes form a cycle 10-21-5-84-51 (and then back to 10). Anyone who opens any of these boxes will eventually see the same set of numbers that #10 did, so we know that prisoners #5, #10, #21, #51 and #84 will all succeed in finding their number by starting at their own box.
Compare that to the situation where they just randomly look in boxes and each independently has only a 50% chance of succeeding - then the odds that those 5 prisoners all succeed would be (1/2)^5 = 1/32, or only about 3%. For every cycle containing n prisoners, instead of their success rate being (1/2)^n, they all succeed together or all fail together.
Now that the successes of all prisoners in the same cycle are perfectly correlated, the prisoners' overall chance of success depends only on whether the random permutation created by the warden has any cycle longer than 50. If it does, everyone in that cycle will fail. If not, all the cycles across the 100 boxes are short enough that all 100 prisoners will succeed.
It doesn't affect the chance of any one prisoner finding their number - that's still 50%. But the joint strategy massively reduces the independence of the individual prisoners' success, and hence the distribution of the number of successful prisoners.
Rather than a normal distribution (sum of independent outcomes) with a big peak around 50% of the prisoners finding their number and a vanishingly small chance of all (or none) of them finding the right number, you end up with a much stranger shape - an approximately uniform distribution over the 0-50 successes range carrying ~70% of the probability, and a huge peak at exactly 100 successes with the remaining ~30%.
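If you want to convince yourself numerically, a quick Monte Carlo of the cycle-following strategy (a sketch, with an arbitrary trial count) lands right around that ~31% figure:

    import random

    def prisoners_succeed(n=100, tries=50):
        # One trial of the cycle-following strategy: the prisoners win exactly
        # when every cycle of the random permutation has length <= tries.
        boxes = list(range(n))
        random.shuffle(boxes)          # boxes[i] = slip hidden in box i
        seen = [False] * n
        for start in range(n):
            if seen[start]:
                continue
            length, j = 0, start
            while not seen[j]:         # walk the cycle containing `start`
                seen[j] = True
                j = boxes[j]
                length += 1
            if length > tries:
                return False
        return True

    trials = 20_000
    wins = sum(prisoners_succeed() for _ in range(trials))
    print(wins / trials)               # hovers around 0.31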
Hm, it seems obvious that Guido and Python have had more impact on the industry. (I'm aware of Burrows' antipathy toward Python, which was well known at Google; as far as I remember it came from debugging a syntax error caused by indentation in SWIG/Python.)
JPL and Google were early Python users. Google's crawler and the home page started out in Python (trivia: the ancient asyncore stdlib module and its author played some role, as far as I remember)
Pretty much all AI and self-driving car work uses Python, BitTorrent was written in Python, Python is embedded in GDB and in huge commercial applications like Maya for visual effects, etc. I think Ethereum had an early Python implementation.
Python's design achieved Guy Steele's "growing a language" vision from his famous talk, i.e. using operator overloading to allow scientists to create their own lingua franca -- https://news.ycombinator.com/item?id=29171519 (the talk was about adding operator overloading to Java, as well as "generics" and value types)
Mike Burrows is a great (even legendary) programmer and computer scientist, but if you're talking about "impact on the industry", there are levels :) I have to say I'm a big Alta Vista fan though
Not only that, the suffix array [1] uses an O(n) counting sort instead of a comparison-based O(n log n) sort. That's because they assume integer alphabets; for general alphabets they can only achieve O(n log n).
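For reference, once you have the suffix array the BWT is just "the character before each sorted suffix". A tiny sketch, with plain sorted() standing in for the counting-sort-based linear-time construction (SA-IS and friends):

    def bwt_via_suffix_array(s: str) -> str:
        # Assumes s ends with a unique, smallest sentinel '\0', so sorting
        # suffixes is equivalent to sorting rotations. A real linear-time
        # construction builds sa with counting sort over an integer alphabet.
        sa = sorted(range(len(s)), key=lambda i: s[i:])
        return "".join(s[i - 1] for i in sa)   # the character preceding each suffix

    print(repr(bwt_via_suffix_array("banana\0")))   # 'annb\x00aa'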
Sure, since biology is generating vast volumes of data in national genomics and digital pathology programs around the world. There's a bit of a bias on HN toward genomics, which is maybe 250 GB for a person, but you typically only do that once. Slide imagery, by contrast, runs to petabytes a year per hospital, with a need for lossless compression, not just JPEG.
If you're interested in string search algorithms, one of the cooler ones to come out of genomics needs is Bitap; its tables scale with alphabet size, which stays cheap for DNA's small alphabet: https://en.wikipedia.org/wiki/Bitap_algorithm
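For the curious, the exact-match version of bitap (shift-and) is only a few lines; the per-symbol bitmask table is where the alphabet-size dependence shows up, and it stays tiny for DNA:

    def bitap_search(text: str, pattern: str) -> int:
        # Exact-match bitap (shift-and). Returns index of the first match, or -1.
        # The mask table has one bitmask per alphabet symbol, which is why a
        # small alphabet like DNA's {A, C, G, T} keeps it cheap.
        m = len(pattern)
        if m == 0:
            return 0
        mask = {}
        for j, c in enumerate(pattern):
            mask[c] = mask.get(c, 0) | (1 << j)
        state = 0
        for i, c in enumerate(text):
            state = ((state << 1) | 1) & mask.get(c, 0)
            if state & (1 << (m - 1)):
                return i - m + 1
        return -1

    print(bitap_search("GATTACA", "TAC"))   # 3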
That is because of other weaknesses of bzip2, not because BWT is worse. BWT generally beats LZ-based algorithms. I would also dispute that LZMA is simpler, at least in compression. Good LZ parsing is not easy.
Check out http://mattmahoney.net/dc/text.html . The top BWT performer is #22. What it comes down to is the way the BWT output is transformed and entropy-coded afterwards, and various other tricks.
bzip2 only does a very basic MTF transformation and Huffman coding, which are not very bit-efficient.
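For anyone who hasn't seen it, move-to-front is a handful of lines; it turns the long runs in BWT output into streams of small integers (mostly zeros), which is what the Huffman stage then exploits. A rough sketch:

    def mtf_encode(data: bytes):
        # Move-to-front: each byte becomes its index in a recency list,
        # then moves to the front of that list.
        table = list(range(256))
        out = []
        for b in data:
            i = table.index(b)
            out.append(i)
            table.pop(i)
            table.insert(0, b)
        return out

    def mtf_decode(indices) -> bytes:
        table = list(range(256))
        out = bytearray()
        for i in indices:
            b = table.pop(i)
            out.append(b)
            table.insert(0, b)
        return bytes(out)

    # Runs in BWT output become runs of zeros, which the entropy coder loves.
    print(mtf_encode(b"aaabbbaaa"))   # [97, 0, 0, 98, 0, 0, 1, 0, 0]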
bzip2 often beats xz for highly regular data, like a big CSV or JSONL file, so I usually try both approaches on those sorts of files. For everything else my go-to is zstd or xz. xz usually wins by a little but is a lot slower; zstd is great in practice.
Just yesterday I had a file where bzip2 won. I would love to see a modern version with the first few stages of BWT feeding into xz or zstd.
If you are asking if the algorithm can be generalized from 1D to 2D or even higher, then the answer is yes. For example, one could take the BWT of all rows and then all columns (using a different terminal symbol for each dimension), and that would be reversible.
[0]: https://en.wikipedia.org/wiki/100_prisoners_problem