The author, Marcus Hutter, created the "Hutter Prize" [1], which offers a €50,000 prize to anyone who can compress 100MB of a Wikipedia file to under 15MB.
From the prize site:
> This compression contest is motivated by the fact that being able to compress well is closely related to acting intelligently, thus reducing the slippery concept of intelligence to hard file size numbers.
Hmm, the prize is defined to be at most €50,000. The formula is €50,000 × (1 − S/L), where S is the compressed size plus the code of your program and L is the size of the current best entry. So if you produce a compression 5% smaller than the current record holder, you get €2,500. The full €50,000 would require S = 0, i.e. an infinite compression ratio.
I might be missing something here. Where does intelligence come in? On any given deterministic computer platform, there is a finite number of 15MB executables. And there is a finite number of 100MB text files. But the second number is much larger than the first, and thus not every 100MB text file has a 15MB executable that outputs it.
Of course, maybe we can find a 15MB executable that outputs the specific Wikipedia file he’s asking for, and there might be clever ways to search for or construct that executable on a specific computer platform, but it doesn’t strike me as some particularly generalizable sense of compression requiring intelligence.
He has a book on "Universal Artificial Intelligence" that defines what he considers AI and then solves it. The only catch is that it relies on Kolmogorov complexity, which is incomputable.
The Kolmogorov complexity of a string is (loosely) the size of the smallest possible Turing machine that will output that string when run. So a string with 1000x "a" has a lower complexity than a random string of length 1000.
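Kolmogorov complexity itself is incomputable, but an ordinary compressor gives the flavour. A minimal sketch in Python (my own toy, nothing from Hutter's book):

    import os
    import zlib

    repetitive = b"a" * 1000          # highly structured: "print 'a' 1000 times"
    random_ish = os.urandom(1000)     # no structure to exploit

    print(len(zlib.compress(repetitive, 9)))   # a tiny fraction of 1000 bytes
    print(len(zlib.compress(random_ish, 9)))   # about 1000 bytes, usually a little more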
He sees AGI as a system that, given all its input up to now (from sensors and the like) and some value function that defines the goal we're trying to reach, takes the action with the highest expected value of that value function. It's hard to give a better definition of intelligence in a completely general sense: given all input, choose the most likely best action.
That requires some guessing about the future: given all the input so far, what is the likelihood of each possible input in the next time step?
And his answer is: the likelihood is tied (inversely) to the Kolmogorov complexity of the sequence formed by all inputs so far plus the possible future input we're currently considering. The idea is that there is information in the inputs so far, and the continuation that best uses that information has the lowest Kolmogorov complexity and is the most likely to occur. If we've seen 999 "a"s, we expect another one. If we've seen a car-shaped object slide left to right in previous camera images, we expect it to continue its movement. If the input so far is completely random, then every possible future is equally likely. Et cetera.
Given that, it becomes possible to calculate the value function for every combination of possible action and possible next sensor input, and the AI can know what to do.
Except of course for the detail that this is (wildly) incomputable.
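If you pretend someone hands you the description lengths, the decision rule itself is simple. A toy sketch in Python (entirely my own, with made-up numbers; summing over all programs is exactly the incomputable part):

    # Hypotheses about the next input, each with a made-up description length in bits.
    # Shorter description = higher weight, per a 2^-length (Solomonoff-style) prior.
    hypotheses = {"a": 5, "b": 20}   # "yet another 'a'" is much simpler than a sudden 'b'

    total = sum(2.0 ** -bits for bits in hypotheses.values())
    prior = {sym: (2.0 ** -bits) / total for sym, bits in hypotheses.items()}

    def value(action, next_input):
        # Hypothetical value function: reward 1 for predicting the next symbol correctly.
        return 1.0 if action == next_input else 0.0

    best = max(["a", "b"], key=lambda a: sum(prior[s] * value(a, s) for s in prior))
    print(best, prior)   # picks "a": almost all the weight sits on the simpler hypothesis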
But what is somewhat correlated to Kolmogorov complexity? Compressibility.
So that is why Hutter is interested in better compression algorithms.
All this is very loosely stated from memory, I was interested in this stuff over a decade ago...
I got a wife and kids so have no free time anymore, and nowadays I feel there are far more important problems than AI in the world (climate change, huge loss of biodiversity, cheap very effective propaganda).
I don’t dislike this challenge, and I don’t dislike code golf. What I don’t understand is the notion that compressing a 100MB Wikipedia file somehow requires an “intelligent” compression algorithm. If the challenge was to be able to compress any English text down to a certain ratio, then perhaps making the algorithm behave “intelligently” would be a requirement. But as far as I can tell, this challenge is asking for a lossless reproduction of one specific chunk of data, so it seems to me that we just have to hope that, on a given platform, there exists a 15MB executable that outputs that exact chunk of data. One might exist, but one might not. And you probably won’t even be able to prove the latter, because of the halting problem (technically it doesn’t apply with a computer with finite memory, but in practice no one will wait long enough).
Not just any chunk though. It (presumably) contains a lot of structure that encodes information about [English representations of] the real world.
It's not random, nor is it simple, nor is it an image of a fractal (maybe?) etc.
A system that is able to compress it well makes good predictions about what the data contain, which indicates "intelligence" about the structure of the (encoded) world.
But would you expect that program to itself be useful for anything else? It literally just takes no input and outputs that one specific Wikipedia dump. It’s not like I can tweak it slightly and have it output new predictions of things that sound like Wikipedia articles.
Unless the idea is that whatever methodology is required to build this particular program could then be adapted to do other things that seem intelligent. That seems possible, but not at all likely given my reading of the challenge requirements.
See the links in my other comment in the thread for all the gory details.
> But would you expect that program to itself be useful for anything else?
That exact program, no, but the process of creating that program would be educational, both the compressor that makes it and the process of making the compressor.
Another way of putting it is in terms of the "Twenty Questions" game. Which pre-selected twenty questions allow you to "cover" the most of the Universe of phenomena? If one set of 20 allows you to distinguish more of the Universe than some other set, then the first set is somehow a better model of the Universe. Does that help?
> If one set of 20 allows you to distinguish more of the Universe than some other set, then the first set is somehow a better model of the Universe.
Is Akinator's 20Q database available as a dataset for non-commercial research purposes? It'd be neat to attempt to compress it in a way where training on one half predicts the other half at the smallest trained model size.
The point of the challenge is to actually create a model that fits in 15MB together with its compressed data. Not just to prove its existence.
One of the best contenders is the open source cmix which, among other tricks, uses LSTM neural networks to predict the next character/bit: http://www.byronknoll.com/cmix.html
I had to search out the size of the decompressor on a different page, but it proves you right. The compressed file is ~400KB smaller than the record holder and the cmix decompressor is only ~200KB.
I wonder what the memory/speed/efficiency tradeoffs look like for that kind of design.
It's related to intelligence in the sense that being able to predict well requires intelligence and helps compression. If you have an intelligent model that predicts new words with high confidence, then you can use fewer bits to encode them. If I were to be even more handwavy, I would say having the ability to summarize effectively is a sign of intelligence.
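The bookkeeping behind that claim: an arithmetic coder spends about -log2(p) bits on a symbol it assigned probability p, so confident correct predictions are nearly free. A quick illustration (my own made-up probabilities):

    from math import log2

    # Probability each model assigned to the character that actually came next.
    confident = [0.9, 0.8, 0.95, 0.9]   # a model that usually "knows" what comes next
    clueless  = [1 / 27] * 4            # uniform over 26 letters plus space

    cost = lambda probs: sum(-log2(p) for p in probs)
    print(cost(confident))   # ~0.7 bits for these 4 characters
    print(cost(clueless))    # ~19 bits for the same 4 characters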
There is a much smaller set of useful 100MB text files than of all possible ones. If there are only 15 million such useful 100MB text files, we should be able to map them to 15MB executables.
All something like GPT-2 is, is a text compressor, a model for an arithmetic encoder specifically. It predicts the probability of the next token conditional on a history. You can directly measure the 'bits per character'/'perplexity' as a measure of (compression) performance, and that is typically how these language models are evaluated: e.g. http://nlpprogress.com/english/language_modeling.html and https://paperswithcode.com/task/language-modelling
The Hutter dataset is included, incidentally, and the top is a Transformer-XL (277M parameters) at 0.94 BPC. Does all of that seem 'very much not AGI'?
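For scale, a quick back-of-the-envelope (my arithmetic, not a number from the leaderboard):

    0.94 bits/char * 1e8 chars = 9.4e7 bits ≈ 11.75 MB of compressed output

and that is before counting the 277M-parameter model itself, which is exactly what Hutter-style rules charge you for.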
A decompressor with a high upfront cost (say a pretrained model) and otherwise higher compression ratio would be more interesting than tweaking yet another PAQ8 variant, but I'm not sure this contest's rules make that very viable.
Language models trained in a predictive fashion achieve ever closer to human compression rates (humans would only get ~0.7-0.8 BPC on Hutter), and are responsible for not just the creepiest & most realistic text generation to date but also setting SOTAs across all language-related AI benchmarks like SuperGLUE, designed to test language understanding and real-world reasoning, or chatbots. If that has nothing whatsoever to do with progress towards AGI, we'll just have to agree to disagree.
If you think of intelligence as distilling the essence of the universe around you and building on top of that, at least the former starts to sound like really really really sophisticated compression.
Imagine a highly efficient compressor watching Toy Story in 4K. If it were really intelligent, it could abstract the 3D models, motion, physics, lighting, and textures, and store the movie (or something perceptually indistinguishable from it) in a highly compressed representation. Or even write down just the script and direct a movie so similar that it can't be distinguished from the original.
Being able to discern large-scale themes and patterns in data, and then using those findings to distill the original input into minimally redundant messages, is similar to the goal of an AGI that can comprehend any input data.
From an AGI perspective it should be more relevant to come up with the ideas/concepts described in those Wikipedia articles - not the exact words humans used to describe them.
If you studied the text and learned everything in it, you'd be able to generate a new text from scratch, from your internal representation of the information. It wouldn't be word-for-word identical, but that's actually evidence that there is a compressed representation in your head. If you got a good sense of Wikipedia's style out of it, your reconstruction would be even better. Now imagine that we let you take notes - little reminders of unusual turns of phrase, hard to remember dates, small outlines etc. You could probably do quite well indeed, perhaps even well enough that your output could be diffed with the original and result in a comparatively small file. I think it would be fair to count that as a win for you, since the computer program is deterministic and can store the diff along with its output. True, we can't measure the size of the data in your head, but compression is definitely occurring because you would certainly fail miserably if you tried to reconstruct 100 megabytes of random binary data in a similar way.
Still, to achieve that last little bit of the way to truly lossless reconstruction without the diff-storing trick will present you with a challenge, I agree. Lossless reproduction inherently requires encoding a lot of noise, to which no pattern can be matched. It's still a valid test because, although there is a hard floor on how much the data can be compressed, a better encoder will get closer and closer to that floor. And there's room for interesting optimizations all the way to the bottom - for example, with the help of a language parser and a thesaurus one could probably encode the difference between your lossy output and the original text rather better than 'diff'...
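A minimal sketch of the 'lossy reconstruction plus stored delta' idea, using Python's difflib and zlib (my own toy; a real entry would encode the delta far more cleverly):

    import difflib
    import zlib

    original = ["Because trees tend to have a low albedo,",
                "removing forests would tend to increase albedo",
                "and thereby cool the planet."]
    # Hypothetical lossy reconstruction, produced "from memory".
    lossy    = ["Because trees have a low albedo,",
                "removing forests would increase albedo",
                "and thereby cool the planet."]

    # Ship the lossy version plus a compressed delta that restores the exact original.
    delta = list(difflib.ndiff(lossy, original))
    patch = zlib.compress("\n".join(delta).encode(), 9)

    restored = list(difflib.restore(delta, 2))   # sequence 2 = the original
    assert restored == original
    print(len(patch), "bytes of patch on top of the lossy text")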
That's a complete misunderstanding of the problem. The compressor should be compared to the machine that runs the AGI, not the AGI itself. Your brain is compressing data all the time but it is happening unconsciously. There are other flaws with the idea of compression being a path to AGI but this is not one of them.
Whenever new compression algorithms come up, I can't help but think they're cheating with their dictionaries by pre-defining the most common words and character sequences, such as HTML tags. I wonder whether the ideal for human language is simply to have the best dictionary - perhaps even a negotiation on both ends to decide what that dictionary is. If browsers shipped with an efficient dictionary of all English words and phrases, the compressed output would seem tiny.
In this competition, and in similar competitions, the size of the binary used to decompress is taken into account. If you wanted to use a dictionary, you would need to pay for it in binary size. In this competition, the file must be self-decompressing.
Dictionaries are powerful tools when compressing small data. But once the data is large enough they stop mattering so much. See the dictionary compression section of https://engineering.fb.com/core-data/zstandard/.
I have a concept I've explored a bit, but it has yet to yield a positive result (I haven't experimented long enough to come to any conclusion):
Can you add text to a string of text to make the compressed string shorter?
I suppose you could call it “compressor hinting” and it would be specific to the compressing algorithm. The added text would be tagged through an escape sequence, so it could be removed after the decompressing stage.
My naive approach is to add randomly generated hints at randomly chosen locations and then gzip/gunzip. I haven't had success yet. I think the potential is limited by the expressiveness of the compressor's "instruction set" - i.e., can it understand generalized hints?
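A rough sketch of that experiment in Python (my own; zlib standing in for gzip, an arbitrary escape byte, and random hints, so all the specifics are assumptions):

    import random
    import zlib

    ESC = b"\x00"   # assumed escape byte that never occurs in the input text
    data = b"the quick brown fox jumps over the lazy dog " * 200   # stand-in for real text

    def size(blob: bytes) -> int:
        return len(zlib.compress(blob, 9))

    baseline = size(data)
    best = baseline
    for _ in range(1000):
        hint = bytes(random.randrange(256) for _ in range(8))   # random "hint"
        pos = random.randrange(len(data))
        best = min(best, size(data[:pos] + ESC + hint + ESC + data[pos:]))

    # Random hints almost certainly just add entropy, so expect best == baseline.
    print(baseline, best)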
There are some compressors (zstd is the one I'm thinking of) that accept "dictionaries", which are meant to be produced from datasets similar to what you're going to compress; I would guess they contain something resembling frequency tables. https://github.com/facebook/zstd has some description but doesn't explain precisely what the dictionary contains.
1. Predefined statistics based on the training data for literals (bytes we couldn't find matches for), literal lengths, match lengths, and offset codes. These allow us to use tuned statistics without the cost of putting the tables in the headers, which saves us 100-200 bytes.
2. Content. Unstructured excerpts from the training data that are very common. This gets "prefixed" to the data before compression and decompression, to seed the compressor with some common history.
Dictionaries are very powerful tools for small data, but they stop being effective once you get to 100KB or more.
1. Pre-distributed dictionaries for commonly transmitted data types. I.e., web browsers could ship with a shared dictionary that is generally helpful for JavaScript; servers could then negotiate and send a specially compressed version to clients that have the dictionary.
2. Streaming buffered data: e.g. before sending video, send a dictionary. This is useful because we are interested in compressing each chunk well, not just the overall file. Relatedly - a compression scheme where you need all the bytes before you can decompress any of them is rather useless here.
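For the curious, the python-zstandard bindings expose this workflow. A minimal sketch (my own example; the exact API names and whether training succeeds on such toy samples are assumptions):

    import zstandard as zstd

    # Train a small dictionary from samples resembling the data you will compress.
    samples = [('{"user": %d, "name": "user%d", "action": "click"}' % (i, i)).encode()
               for i in range(5000)]
    dictionary = zstd.train_dictionary(4096, samples)

    cctx = zstd.ZstdCompressor(dict_data=dictionary)
    dctx = zstd.ZstdDecompressor(dict_data=dictionary)

    msg = b'{"user": 31337, "name": "user31337", "action": "click"}'
    blob = cctx.compress(msg)
    assert dctx.decompress(blob) == msg
    # Small, repetitive messages compress far better with the shared dictionary,
    # but both sides must already hold the same dictionary.
    print(len(msg), len(blob))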
I'll have to give this a go. I'll probably fail miserably but it sounds like fun. Related: [1] Here is my (sophomoric) attempt at a compression algorithm, and [2] here is the blog post I wrote about it.
I'm not sure. Currently the compression variables are somewhat inefficient. I've compressed similar files down to about 17MB using hardcoded variables, but I've yet to stumble across a set that will reliably compress a variety of data. I was hoping to test it enough to come across patterns that would let me develop heuristics, changing the variables according to different input file characteristics. This is simply extremely time consuming. I have a non-open version of a listener for server applications that I've mostly moved my testing over to since I created the client compressor/extractor above.
And reading the compressor description, it seems to be a hand-written version of the GPT-2 approach mentioned somewhere above, with a hard-coded English language model as one of the heuristics.
A few years ago this was trending in hacker circles [1]. The repo is obviously a joke, but I always wondered if it could be taken more seriously and made into something cooler. Assuming the digits of pi are uniformly distributed (i.e. pi is a normal number), every potential 1KB chunk of a file would appear somewhere in it too.
I tried programming this idea very briefly a couple years ago but I got pulled away for something else.
I imagine the BigInt index that points to the starting digit of your data would take more space than the data itself. Here is an example in Julia:
julia> setprecision(BigFloat, 10_000) do
           findfirst("123", "$(BigFloat(pi))")
       end
1926:1928
I.e. the string "123" starts at index 1926.
On the other hand, this type of "encoding" always fun to think about. I would suggest reading "Permutation City" by Greg Egan if you are amused by this.
> I imagine the BigInt index that points to the starting digit for your data would take more space than the data itself
Assuming optimal conditions (i.e. optimal randomness), you can expect roughly as many pieces of data to have pointers smaller than themselves as to have pointers bigger; you can't win that way in general, only in some cases.
But maybe if you can find two 'opposite' algorithms, such that for any data where the pointer is larger with the first, it's smaller in the second...
(From what I know of information theory, the extra bit you have to use to specify which algorithm to use will outweigh all savings, but it's still fun to think about.)
I doubt that... You are comparing a random variable (the size of the data) to a sum of random variables (the index of the data). Thus I would expect the index to be greater than the size... But I am fairly uncertain.
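A quick simulation of the counting argument (my own sketch, using a pseudorandom digit stream as a stand-in for pi):

    import random

    random.seed(0)
    stream = "".join(random.choice("0123456789") for _ in range(2_000_000))

    # Where does a random 4-digit "file" first appear in the stream?
    targets = ["%04d" % random.randrange(10_000) for _ in range(200)]
    positions = [stream.find(t) for t in targets]

    # There are 10^4 possible 4-digit strings, so the first hit lands around index 10^4
    # on average - and writing an index near 10^4 already takes about 4 digits.
    print(sum(positions) / len(positions))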
Assuming optimal randomness, the shortest program to output the data will be in the random source. So record the random source, try all programs shorter than the data, and note the timestamp when you've found the optimal compression.
> As [Martin] Gardner tells the tale, on visiting Earth, an inquisitive alien wishes to collect all of human knowledge. Having no room for the Encyclopedia Britannica aboard an already-cramped spaceship (and lacking a CD-ROM player), the alien proposes a clever method for compressing the Encyclopedia's volumes. Assuming there are fewer than 1000 unique letters, digits, and other symbols in the text, our resourceful visitor assigns each symbol a three-digit code from 000 to 999, including leading 0s. The word "Snow," for example, might be encoded as 083110111119. Translating the entire Encyclopedia this way produces a giant integer that, with an imaginary decimal point to the left, is equivalent to the decimal fraction, A/B. To complete the data compression, the visitor places a mark on a small rod of otherworldly material that divides the bar into two lengths, A and B.
> After returning to Zenon (or wherever), the alien precisely measures the marked rod, obtains lengths A and B, and divides A/B to yield the original integer, which a computer decodes to print out the Encyclopedia.
More interesting would be a semantic compressor, one that didn't necessarily return the exact words in the corpus but that would turn a small amount of compressed data into a document with very similar meaning.
E.g., "Because trees tend to have a low albedo, removing forests would tend to increase albedo and thereby cool the planet" is a sentence from the file that would need to be compressed. It should be acceptable for the decompressed text to read, "Eliminating large groups of trees would improve surface reflectivity and provide support for a planetary cooling trend," or to employ any number of similar phrasings.
Any form of AGI arrived at through Hutter's exercise would be more akin to the intelligence shown by an eidetic person when they recite the telephone book, and computers are already good at that.
> Although humans cannot compress losslessly, they are very good at lossy compression: remembering that which is most important and discarding the rest. Lossy compression algorithms like JPEG and MP3 mimic the lossy behavior of the human perceptual system by discarding the same information that we do. For example, JPEG codes the color signal of an image at a lower resolution than brightness because the eye is less sensitive to high spatial frequencies in color. But we clearly have a long way to go. We can now compress speech to about 8000 bits per second with reasonably good quality. In theory, we should be able to compress speech to about 20 bits per second by transcribing it to text and using standard text compression programs like zip.
> Humans do poorly at reading text and recalling it verbatim, but do very well at recalling the important ideas and conveying them in different words. It would be a powerful demonstration of AI if a lossy text compressor could do the same thing. But there are two problems with this approach.
> First, just like JPEG and MP3, it would require human judges to subjectively evaluate the quality of the restored data.
> Second, there is much less noise in text than in images and sound, so the savings would be much smaller. If there are 1000 different ways to write a sentence expressing the same idea, then lossy compression would only save log2 1000 = about 10 bits. Even if the effect was large, requiring compressors to code the explicit representation of ideas would still be fair to all competitors.
Given enough time and compute, an AGI could enumerate all possible ways to say the sentence in your example, and then store a number representing which sentence is the exact same one as the input.
That number will be much smaller than the text of the sentence.
That number would have to specify one of all possible sentences, though, not just one from the subset in question. I'd assume it would be longer in that case.
I thought of a potential loophole. What about a non-deterministic extractor? Let's say you can make your program smaller by having only a 1% chance of extracting perfectly. Just submit the program an expected 100 times and claim your prize.
Through the magic of information theory, I can tell you that this particular trick will save you log2(100) = 6.6 bits. Probably not enough to make a difference.
Alexander Rhatushnyak -- the first, second, third and fourth winner of the Hutter prize -- is the main contributor to JPEG XL lossless mode. Perhaps we all will eventually get practical benefit from his ability to build amazing compression solutions.
Since it's about compressing one specific sequence of bytes, and compression time/cost doesn't matter - only decompression - wouldn't finding a set of [neural network inputs + weights + processing code + error correction] (I think in the direction of GANs or auto-encoders) be a way to have a high chance of finding improvements? Not 100% sure it would be in spirit of the contest, and if the cost of training would offset the reward...
A fully NN solution (as opposed to using a light sprinkling like all competitive solutions do) requires big binaries for the most part, while an arbitrary program can easily memorize exact sequences to find repetition while using a tiny NN for some finetuning of predictions. A pure NN solution like a Transformer-XL does turn in the record-setting performance on natural language datasets including WP... unfortunately, the required NN model sizes alone tend to be larger than the entire uncompressed WP corpus here, and so have zero chance of ever being the minimum of model+compression. (An observation which I think supports the idea that the Hutter Prize has long outlived its intended use for measuring progress towards AGI; it's now just sort of a 'demo scene' version of AI benchmarking, testing intelligence within extreme constraints of sample efficiency/compute, rather than a true useful benchmark.)
The fact that small compressors like gzip or zpaq do so well at small data/compute but then can't compete as you scale up to tens or hundreds of gigabytes of text (amortizing the cost of those fancy NN models) can be considered a version of the Bitter Lesson: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
The bitter lesson seems to be striving for the wrong goal. Replicating human performance will at best give you human performance. Humans are incredibly inefficient: they require multiple orders of magnitude more computational resources just to match a machine. Since we do not have an abundance of computational resources, trying to replicate humans will always give inferior results compared to simply running conventional algorithms.
I interpreted the rules as saying that only decompression needs to finish within 8 hrs, and that training could be arbitrarily long? That would also lead to the theoretical strategy of finding the sequence somewhere in pi and just storing the index and a pi generator...
This test maybe should not focus on the output being byte-identical to the input file, but to be structurally identical. It's an XML file, with another semi-structured language inside the text nodes. There's opportunity to utilize that fact.
On a more serious note, I think this can still be gamed a little bit.
For instance, DEFLATE has default variable-width encoding tables built in, based on letter frequencies of the English language as measured some time ago. If you use an algorithm with any kind of built-in defaults, those count against the cost, but it doesn't necessarily cost more to tune them for a solitary input.
DEFLATE compressed data consists of blocks. Each block can be either uncompressed, compressed with the pre-defined Huffman table (the variant mentioned in the parent comment), or compressed with a dynamic Huffman table. Many DEFLATE compressors consider all three options (at higher compression settings) and choose the best, so one does not need to tune DEFLATE to get the advantage of dynamic Huffman tables. And any kind of arithmetic coding (or the new hotness, Asymmetric Numeral Systems) would do a better job.
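You can see the effect with plain zlib from Python, which exposes the strategy knob. A small comparison (my own sketch; exact sizes depend on the input):

    import zlib

    data = (b"Pease porridge hot, pease porridge cold, "
            b"pease porridge in the pot, nine days old. ") * 200

    def deflate_size(blob: bytes, strategy: int) -> int:
        c = zlib.compressobj(9, zlib.DEFLATED, -15, 9, strategy)
        return len(c.compress(blob) + c.flush())

    # Z_FIXED forces the pre-defined Huffman tables; the default strategy lets the
    # encoder build dynamic tables per block whenever that wins.
    print(deflate_size(data, zlib.Z_FIXED))
    print(deflate_size(data, zlib.Z_DEFAULT_STRATEGY))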
For a competition like this, all simple approaches have probably been tried already. The current leader has been tuning their approach for 5 years: http://mattmahoney.net/dc/text.html#1159
Reminds me of the time a girl at school copied a shortcut to a DOC file onto a floppy disk and submitted it.
The teacher had a hard time explaining shortcuts to her. The girl kept insisting that the PC at school was broken, since the shortcut worked on her machine at home.
A guy I knew used to reencode his songs as 16 kbps mp3 files to save space. When I told him they sounded horrible, he told me "no problem, I can always reencode them back to 128 kbps!"
Page/block size is not inode size. While filesystems very often work in chunks of 4KB, the normal inode size is 256 bytes on ext filesystems and 1KB on NTFS. They get packed together.
[1] http://www.hutter1.net/prize/index.htm