The author, Marcus Hutter, created the "Hutter Prize" [1], which offers a €50,000 prize to anyone who can compress 100MB of a Wikipedia file to under 15MB.
From the prize site:
> This compression contest is motivated by the fact that being able to compress well is closely related to acting intelligently, thus reducing the slippery concept of intelligence to hard file size numbers.
Hmm, the prize is defined to be at most €50,000. The formula is €50,000 × (1 − S/L), where S is the compressed size plus the code of your program and L is the size of the current best entry. So if you produce a compression 5% smaller than the current record holder, you get €2,500. The full €50,000 would require S = 0, i.e. an infinite compression ratio.
I might be missing something here. Where does intelligence come in? On any given deterministic computer platform, there is a finite number of 15MB executables. And there is a finite number of 100MB text files. But the second number is much larger than the first, and thus not every 100MB text file has a 15MB executable that outputs it.
Of course, maybe we can find a 15MB executable that outputs the specific Wikipedia file he’s asking for, and there might be clever ways to search for or construct that executable on a specific computer platform, but it doesn’t strike me as some particularly generalizable sense of compression requiring intelligence.
He has a book on "Universal Artificial Intelligence" that defines what he considers AI and then solves it. The only catch is that it relies on Kolmogorov complexity, which is incomputable.
The Kolmogorov complexity of a string is (loosely) the size of the smallest possible Turing machine that will output that string when run. So a string with 1000x "a" has a lower complexity than a random string of length 1000.
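Kolmogorov complexity itself is incomputable, but an ordinary compressor gives the flavour. A minimal sketch in Python (my own toy, nothing from Hutter's book):

    import os
    import zlib

    repetitive = b"a" * 1000          # highly structured: "print 'a' 1000 times"
    random_ish = os.urandom(1000)     # no structure to exploit

    print(len(zlib.compress(repetitive, 9)))   # a tiny fraction of 1000 bytes
    print(len(zlib.compress(random_ish, 9)))   # about 1000 bytes, usually a little more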
He sees AGI as a system that, given all its input up to now (from sensors and the like) and some value function that defines the goal we're trying to reach, takes the action with the highest expected value of that value function. It's hard to give a better definition of intelligence in a completely general sense: given all input, choose the most likely best action.
That requires some guessing about the future: given all the input so far, what is the likelihood of each possible input in the next time step?
And his answer is: the likelihood is tied (inversely) to the Kolmogorov complexity of the sequence formed by all inputs so far plus the possible future input we're currently considering. The idea is that there is information in the inputs so far, and the continuation that best uses that information has the lowest Kolmogorov complexity and is the most likely to occur. If we've seen 999 "a"s, we expect another one. If we've seen a car-shaped object slide left to right in previous camera images, we expect it to continue its movement. If the input so far is completely random, then every possible future is equally likely. Et cetera.
Given that, it becomes possible to calculate the value function for every combination of possible action and possible next sensor input, and the AI can know what to do.
Except of course for the detail that this is (wildly) incomputable.
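If you pretend someone hands you the description lengths, the decision rule itself is simple. A toy sketch in Python (entirely my own, with made-up numbers; summing over all programs is exactly the incomputable part):

    # Hypotheses about the next input, each with a made-up description length in bits.
    # Shorter description = higher weight, per a 2^-length (Solomonoff-style) prior.
    hypotheses = {"a": 5, "b": 20}   # "yet another 'a'" is much simpler than a sudden 'b'

    total = sum(2.0 ** -bits for bits in hypotheses.values())
    prior = {sym: (2.0 ** -bits) / total for sym, bits in hypotheses.items()}

    def value(action, next_input):
        # Hypothetical value function: reward 1 for predicting the next symbol correctly.
        return 1.0 if action == next_input else 0.0

    best = max(["a", "b"], key=lambda a: sum(prior[s] * value(a, s) for s in prior))
    print(best, prior)   # picks "a": almost all the weight sits on the simpler hypothesis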
But what is somewhat correlated to Kolmogorov complexity? Compressibility.
So that is why Hutter is interested in better compression algorithms.
All this is very loosely stated from memory, I was interested in this stuff over a decade ago...
I got a wife and kids so have no free time anymore, and nowadays I feel there are far more important problems than AI in the world (climate change, huge loss of biodiversity, cheap very effective propaganda).
I don’t dislike this challenge, and I don’t dislike code golf. What I don’t understand is the notion that compressing a 100MB Wikipedia file somehow requires an “intelligent” compression algorithm. If the challenge was to be able to compress any English text down to a certain ratio, then perhaps making the algorithm behave “intelligently” would be a requirement. But as far as I can tell, this challenge is asking for a lossless reproduction of one specific chunk of data, so it seems to me that we just have to hope that, on a given platform, there exists a 15MB executable that outputs that exact chunk of data. One might exist, but one might not. And you probably won’t even be able to prove the latter, because of the halting problem (technically it doesn’t apply with a computer with finite memory, but in practice no one will wait long enough).
Not just any chunk though. It (presumably) contains a lot of structure that encodes information about [English representations of] the real world.
It's not random, nor is it simple, nor is it an image of a fractal (maybe?) etc.
A system that is able to compress it well makes good predictions about what the data contain, which indicates "intelligence" about the structure of the (encoded) world.
But would you expect that program to itself be useful for anything else? It literally just takes no input and outputs that one specific Wikipedia dump. It’s not like I can tweak it slightly and have it output new predictions of things that sound like Wikipedia articles.
Unless the idea is that whatever methodology is required to build this particular program could then be adapted to do other things that seem intelligent. That seems possible, but not at all likely given my reading of the challenge requirements.
See the links in my other comment in the thread for all the gory details.
> But would you expect that program to itself be useful for anything else?
That exact program, no, but the process of creating that program would be educational, both the compressor that makes it and the process of making the compressor.
Another way of putting it is in terms of the "Twenty Questions" game. Which pre-selected twenty questions allow you to "cover" the most of the Universe of phenomena? If one set of 20 allows you to distinguish more of the Universe than some other set, then the first set is somehow a better model of the Universe. Does that help?
> If one set of 20 allows you to distinguish more of the Universe than some other set, then the first set is somehow a better model of the Universe.
Is Akinator's 20Q database available as a dataset for non-commercial research purposes? It'd be neat to attempt to compress it in a way where training on one half predicts the other half at the smallest trained model size.
The point of the challenge is to actually create a model that fits in 15MB together with its compressed data. Not just to prove its existence.
One of the best contenders is the open source cmix which, among other tricks, uses LSTM neural networks to predict the next character/bit: http://www.byronknoll.com/cmix.html
I had to search out the size of the decompressor on a different page, but it proves you right. The compressed file is ~400KB smaller than the record holder and the cmix decompressor is only ~200KB.
I wonder what the memory/speed/efficiency tradeoffs look like for that kind of design.
It's related to intelligence in the sense that being able to predict well requires intelligence and helps compression. If you have an intelligent model that predicts new words with high confidence, then you can use fewer bits to encode them. If I were to be even more handwavy, I would say having the ability to summarize effectively is a sign of intelligence.
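The bookkeeping behind that claim: an arithmetic coder spends about -log2(p) bits on a symbol it assigned probability p, so confident correct predictions are nearly free. A quick illustration (my own made-up probabilities):

    from math import log2

    # Probability each model assigned to the character that actually came next.
    confident = [0.9, 0.8, 0.95, 0.9]   # a model that usually "knows" what comes next
    clueless  = [1 / 27] * 4            # uniform over 26 letters plus space

    cost = lambda probs: sum(-log2(p) for p in probs)
    print(cost(confident))   # ~0.7 bits for these 4 characters
    print(cost(clueless))    # ~19 bits for the same 4 characters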
There is a much smaller set of useful 100MB text files than of all possible ones. If there are only 15 million such useful 100MB text files, we should be able to map them to 15MB executables.
All something like GPT-2 is, is a text compressor, a model for an arithmetic encoder specifically. It predicts the probability of the next token conditional on a history. You can directly measure the 'bits per character'/'perplexity' as a measure of (compression) performance, and that is typically how these language models are evaluated: e.g. http://nlpprogress.com/english/language_modeling.html and https://paperswithcode.com/task/language-modelling
The Hutter dataset is included, incidentally, and the top is a Transformer-XL (277M parameters) at 0.94 BPC. Does all of that seem 'very much not AGI'?
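For scale, a quick back-of-the-envelope (my arithmetic, not a number from the leaderboard):

    0.94 bits/char * 1e8 chars = 9.4e7 bits ≈ 11.75 MB of compressed output

and that is before counting the 277M-parameter model itself, which is exactly what Hutter-style rules charge you for.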
A decompressor with a high upfront cost (say a pretrained model) and otherwise higher compression ratio would be more interesting than tweaking yet another PAQ8 variant, but I'm not sure this contest's rules make that very viable.
Language models trained in a predictive fashion achieve ever closer to human compression rates (humans would only get ~0.7-0.8 BPC on Hutter), and are responsible for not just the creepiest & most realistic text generation to date but also setting SOTAs across all language-related AI benchmarks like SuperGLUE, designed to test language understanding and real-world reasoning, or chatbots. If that has nothing whatsoever to do with progress towards AGI, we'll just have to agree to disagree.
If you think of intelligence as distilling the essence of the universe around you and building on top of that, at least the former starts to sound like really really really sophisticated compression.
Imagine a highly efficient compressor watching Toy Story in 4K. If it were really intelligent, it could abstract the 3D models, motion, physics, lighting, and textures, and store the movie (or something perceptually indistinguishable from it) in a highly compressed representation. Or even write down just the script and direct a movie so similar that it can't be distinguished from the original.
Being able to discern large-scale themes and patterns in data, and then using those findings to distill the original input into minimally redundant messages, is similar to the goal of an AGI that can comprehend any input data.
From an AGI perspective it should be more relevant to come up with the ideas/concepts described in those Wikipedia articles - not the exact words humans used to describe them.
If you studied the text and learned everything in it, you'd be able to generate a new text from scratch, from your internal representation of the information. It wouldn't be word-for-word identical, but that's actually evidence that there is a compressed representation in your head. If you got a good sense of Wikipedia's style out of it, your reconstruction would be even better. Now imagine that we let you take notes - little reminders of unusual turns of phrase, hard to remember dates, small outlines etc. You could probably do quite well indeed, perhaps even well enough that your output could be diffed with the original and result in a comparatively small file. I think it would be fair to count that as a win for you, since the computer program is deterministic and can store the diff along with its output. True, we can't measure the size of the data in your head, but compression is definitely occurring because you would certainly fail miserably if you tried to reconstruct 100 megabytes of random binary data in a similar way.
Still, to achieve that last little bit of the way to truly lossless reconstruction without the diff-storing trick will present you with a challenge, I agree. Lossless reproduction inherently requires encoding a lot of noise, to which no pattern can be matched. It's still a valid test because, although there is a hard floor on how much the data can be compressed, a better encoder will get closer and closer to that floor. And there's room for interesting optimizations all the way to the bottom - for example, with the help of a language parser and a thesaurus one could probably encode the difference between your lossy output and the original text rather better than 'diff'...
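A minimal sketch of the 'lossy reconstruction plus stored delta' idea, using Python's difflib and zlib (my own toy; a real entry would encode the delta far more cleverly):

    import difflib
    import zlib

    original = ["Because trees tend to have a low albedo,",
                "removing forests would tend to increase albedo",
                "and thereby cool the planet."]
    # Hypothetical lossy reconstruction, produced "from memory".
    lossy    = ["Because trees have a low albedo,",
                "removing forests would increase albedo",
                "and thereby cool the planet."]

    # Ship the lossy version plus a compressed delta that restores the exact original.
    delta = list(difflib.ndiff(lossy, original))
    patch = zlib.compress("\n".join(delta).encode(), 9)

    restored = list(difflib.restore(delta, 2))   # sequence 2 = the original
    assert restored == original
    print(len(patch), "bytes of patch on top of the lossy text")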
That's a complete misunderstanding of the problem. The compressor should be compared to the machine that runs the AGI, not the AGI itself. Your brain is compressing data all the time but it is happening unconsciously. There are other flaws with the idea of compression being a path to AGI but this is not one of them.
Whenever new compression algorithms come up, I can't help but think they're cheating with their dictionaries by pre-defining the most common words and character sequences, such as HTML tags. I wonder whether the ideal for human language is simply to have the best dictionary - perhaps even a negotiation on both ends to decide what that dictionary is. If browsers shipped with an efficient dictionary of all English words and phrases, the compressed output would seem tiny.
In this competition, and in similar competitions, the size of the binary used to decompress is taken into account. If you wanted to use a dictionary, you would need to pay for it in binary size. In this competition, the file must be self-decompressing.
Dictionaries are powerful tools when compressing small data. But once the data is large enough they stop mattering so much. See the dictionary compression section of https://engineering.fb.com/core-data/zstandard/.
I have a concept I've explored a bit, but it has yet to yield a positive result (I haven't experimented long enough to come to any conclusion):
Can you add text to a string of text to make the compressed string shorter?
I suppose you could call it “compressor hinting” and it would be specific to the compressing algorithm. The added text would be tagged through an escape sequence, so it could be removed after the decompressing stage.
My naive approach is to add randomly generated hints at randomly chosen locations and then gzip/gunzip. I haven't had success yet. I think the potential is limited by the expressiveness of the compressor's "instruction set" - i.e., can it understand generalized hints?
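A rough sketch of that experiment in Python (my own; zlib standing in for gzip, an arbitrary escape byte, and random hints, so all the specifics are assumptions):

    import random
    import zlib

    ESC = b"\x00"   # assumed escape byte that never occurs in the input text
    data = b"the quick brown fox jumps over the lazy dog " * 200   # stand-in for real text

    def size(blob: bytes) -> int:
        return len(zlib.compress(blob, 9))

    baseline = size(data)
    best = baseline
    for _ in range(1000):
        hint = bytes(random.randrange(256) for _ in range(8))   # random "hint"
        pos = random.randrange(len(data))
        best = min(best, size(data[:pos] + ESC + hint + ESC + data[pos:]))

    # Random hints almost certainly just add entropy, so expect best == baseline.
    print(baseline, best)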
There are some compressors (zstd is the one I'm thinking of) that accept "dictionaries", which are meant to be produced from datasets similar to what you're going to compress; I would guess they contain something resembling frequency tables. https://github.com/facebook/zstd has some description but doesn't explain precisely what the dictionary contains.
1. Predefined statistics based on the training data for literals (bytes we couldn't find matches for), literal lengths, match lengths, and offset codes. These allow us to use tuned statistics without the cost of putting the tables in the headers, which saves us 100-200 bytes.
2. Content. Unstructured excerpts from the training data that are very common. This gets "prefixed" to the data before compression and decompression, to seed the compressor with some common history.
Dictionaries are very powerful tools for small data, but they stop being effective once you get to 100KB or more.
1. Pre-distributed dictionaries for commonly transmitted data types. I.e., web browsers could ship with a shared dictionary that is generally helpful for JavaScript; servers could then negotiate and send a specially compressed version to clients that have the dictionary.
2. Streaming buffered data: e.g. before sending video, send a dictionary. This is useful because we are interested in compressing each chunk well, not just the overall file. Relatedly - a compression scheme where you need all the bytes before you can decompress any of them is rather useless here.
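For the curious, the python-zstandard bindings expose this workflow. A minimal sketch (my own example; the exact API names and whether training succeeds on such toy samples are assumptions):

    import zstandard as zstd

    # Train a small dictionary from samples resembling the data you will compress.
    samples = [('{"user": %d, "name": "user%d", "action": "click"}' % (i, i)).encode()
               for i in range(5000)]
    dictionary = zstd.train_dictionary(4096, samples)

    cctx = zstd.ZstdCompressor(dict_data=dictionary)
    dctx = zstd.ZstdDecompressor(dict_data=dictionary)

    msg = b'{"user": 31337, "name": "user31337", "action": "click"}'
    blob = cctx.compress(msg)
    assert dctx.decompress(blob) == msg
    # Small, repetitive messages compress far better with the shared dictionary,
    # but both sides must already hold the same dictionary.
    print(len(msg), len(blob))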
I'll have to give this a go. I'll probably fail miserably but it sounds like fun. Related: [1] Here is my (sophomoric) attempt at a compression algorithm, and [2] here is the blog post I wrote about it.
I'm not sure. Currently the compression variables are somewhat inefficient. I've compressed similar files down to about 17MB using hardcoded variables, but I've yet to stumble across a set that will reliably compress a variety of data. I was hoping to test it enough to come across patterns that would let me develop heuristics, changing the variables according to different input file characteristics. This is simply extremely time consuming. I have a non-open version of a listener for server applications that I've mostly moved my testing over to since I created the client compressor/extractor above.
And reading the compressor description, it seems to be a hand-written version of the GPT-2 approach mentioned somewhere above, with a hard-coded English language model as one of the heuristics.
A few years ago this was trending in hacker circles [1]. The repo is obviously a joke, but I always wondered if it could be taken more seriously and made into something cooler. Assuming the digits of pi are uniformly distributed (i.e. pi is a normal number), every potential 1KB chunk of a file would appear somewhere in it too.
I tried programming this idea very briefly a couple years ago but I got pulled away for something else.
I imagine the BigInt index that points to the starting digit of your data would take more space than the data itself. Here is an example in Julia:
julia> setprecision(BigFloat, 10_000) do
           findfirst("123", "$(BigFloat(pi))")
       end
1926:1928
I.e. the string "123" starts at index 1926.
On the other hand, this type of "encoding" always fun to think about. I would suggest reading "Permutation City" by Greg Egan if you are amused by this.
> I imagine the BigInt index that points to the starting digit for your data would take more space than the data itself
Assuming optimal conditions (i.e. optimal randomness), you can expect roughly as many pieces of data to have pointers smaller than themselves as to have pointers bigger; you can't win that way in general, only in some cases.
But maybe if you can find two 'opposite' algorithms, such that for any data where the pointer is larger with the first, it's smaller in the second...
(From what I know of information theory, the extra bit you have to use to specify which algorithm to use will outweigh all savings, but it's still fun to think about.)
I doubt that... You are comparing a random variable (the size of the data) to a sum of random variables (the index of the data). Thus I would expect the index to be greater than the size... But I am fairly uncertain.
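A quick simulation of the counting argument (my own sketch, using a pseudorandom digit stream as a stand-in for pi):

    import random

    random.seed(0)
    stream = "".join(random.choice("0123456789") for _ in range(2_000_000))

    # Where does a random 4-digit "file" first appear in the stream?
    targets = ["%04d" % random.randrange(10_000) for _ in range(200)]
    positions = [stream.find(t) for t in targets]

    # There are 10^4 possible 4-digit strings, so the first hit lands around index 10^4
    # on average - and writing an index near 10^4 already takes about 4 digits.
    print(sum(positions) / len(positions))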
Assuming optimal randomness, the shortest program to output the data will be in the random source. So record the random source, try all programs shorter than the data, and note the timestamp when you've found the optimal compression.
> As [Martin] Gardner tells the tale, on visiting Earth, an inquisitive alien wishes to collect all of human knowledge. Having no room for the Encyclopedia Britannica aboard an already-cramped spaceship (and lacking a CD-ROM player), the alien proposes a clever method for compressing the Encyclopedia's volumes. Assuming there are fewer than 1000 unique letters, digits, and other symbols in the text, our resourceful visitor assigns each symbol a three-digit code from 000 to 999, including leading 0s. The word "Snow," for example, might be encoded as 083110111119. Translating the entire Encyclopedia this way produces a giant integer that, with an imaginary decimal point to the left, is equivalent to the decimal fraction, A/B. To complete the data compression, the visitor places a mark on a small rod of otherworldly material that divides the bar into two lengths, A and B.
> After returning to Zenon (or wherever), the alien precisely measures the marked rod, obtains lengths A and B, and divides A/B to yield the original integer, which a computer decodes to print out the Encyclopedia.
More interesting would be a semantic compressor, one that didn't necessarily return the exact words in the corpus but that would turn a small amount of compressed data into a document with very similar meaning.
E.g., "Because trees tend to have a low albedo, removing forests would tend to increase albedo and thereby cool the planet" is a sentence from the file that would need to be compressed. It should be acceptable for the decompressed text to read, "Eliminating large groups of trees would improve surface reflectivity and provide support for a planetary cooling trend," or to employ any number of similar phrasings.
Any form of AGI arrived at through Hutter's exercise would be more akin to the intelligence shown by an eidetic person when they recite the telephone book, and computers are already good at that.
> Although humans cannot compress losslessly, they are very good at lossy compression: remembering that which is most important and discarding the rest. Lossy compression algorithms like JPEG and MP3 mimic the lossy behavior of the human perceptual system by discarding the same information that we do. For example, JPEG codes the color signal of an image at a lower resolution than brightness because the eye is less sensitive to high spatial frequencies in color. But we clearly have a long way to go. We can now compress speech to about 8000 bits per second with reasonably good quality. In theory, we should be able to compress speech to about 20 bits per second by transcribing it to text and using standard text compression programs like zip.
> Humans do poorly at reading text and recalling it verbatim, but do very well at recalling the important ideas and conveying them in different words. It would be a powerful demonstration of AI if a lossy text compressor could do the same thing. But there are two problems with this approach.
> First, just like JPEG and MP3, it would require human judges to subjectively evaluate the quality of the restored data.
> Second, there is much less noise in text than in images and sound, so the savings would be much smaller. If there are 1000 different ways to write a sentence expressing the same idea, then lossy compression would only save log2 1000 = about 10 bits. Even if the effect was large, requiring compressors to code the explicit representation of ideas would still be fair to all competitors.
Given enough time and compute, an AGI could enumerate all possible ways to say the sentence in your example, and then store a number representing which sentence is the exact same one as the input.
That number will be much smaller than the text of the sentence.
That number would have to specify one of all possible sentences, though, not just one from the subset in question. I'd assume it would be longer in that case.
I thought of a potential loophole. What about a non-deterministic extractor? Let's say you can make your program smaller by having only a 1% chance of extracting perfectly. Just submit the program an expected 100 times and claim your prize.
Through the magic of information theory, I can tell you that this particular trick will save you log2(100) = 6.6 bits. Probably not enough to make a difference.
Alexander Rhatushnyak -- the first, second, third and fourth winner of the Hutter prize -- is the main contributor to JPEG XL lossless mode. Perhaps we all will eventually get practical benefit from his ability to build amazing compression solutions.
Since it's about compressing one specific sequence of bytes, and compression time/cost doesn't matter - only decompression - wouldn't finding a set of [neural network inputs + weights + processing code + error correction] (I think in the direction of GANs or auto-encoders) be a way to have a high chance of finding improvements? Not 100% sure it would be in spirit of the contest, and if the cost of training would offset the reward...
A fully NN solution (as opposed to using a light sprinkling like all competitive solutions do) requires big binaries for the most part, while an arbitrary program can easily memorize exact sequences to find repetition while using a tiny NN for some finetuning of predictions. A pure NN solution like a Transformer-XL does turn in the record-setting performance on natural language datasets including WP... unfortunately, the required NN model sizes alone tend to be larger than the entire uncompressed WP corpus here, and so have zero chance of ever being the minimum of model+compression. (An observation which I think supports the idea that the Hutter Prize has long outlived its intended use for measuring progress towards AGI; it's now just sort of a 'demo scene' version of AI benchmarking, testing intelligence within extreme constraints of sample efficiency/compute, rather than a true useful benchmark.)
The fact that small compressors like gzip or zpaq do so well at small data/compute but then can't compete as you scale up to tens or hundreds of gigabytes of text (amortizing the cost of those fancy NN models) can be considered a version of the Bitter Lesson: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
The bitter lesson seems to be striving for the wrong goal. Replicating human performance will at best give you human performance. Humans are incredibly inefficient: they require multiple orders of magnitude more computational resources just to match a machine. Since we do not have an abundance of computational resources, trying to replicate humans will always give inferior results compared to simply running conventional algorithms.
I interpreted the rules as saying that only decompression needs to finish within 8 hrs, and that training could be arbitrarily long? That would also lead to the theoretical strategy of finding the sequence somewhere in pi and just storing the index and a pi generator...
This test maybe should not focus on the output being byte-identical to the input file, but to be structurally identical. It's an XML file, with another semi-structured language inside the text nodes. There's opportunity to utilize that fact.
On a more serious note, I think this can still be gamed a little bit.
For instance, DEFLATE has default variable-width encoding tables built in, based on letter frequencies of the English language as measured some time ago. If you use an algorithm with any kind of built-in defaults, those count against the cost, but it doesn't necessarily cost more to tune them for a solitary input.
DEFLATE compressed data consists of blocks. Each block can be either uncompressed, compressed with the pre-defined Huffman table (the variant mentioned in the parent comment), or compressed with a dynamic Huffman table. Many DEFLATE compressors consider all three options (at higher compression settings) and choose the best, so one does not need to tune DEFLATE to get the advantage of dynamic Huffman tables. And any kind of arithmetic coding (or the new hotness, Asymmetric Numeral Systems) would do a better job.
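You can see the effect with plain zlib from Python, which exposes the strategy knob. A small comparison (my own sketch; exact sizes depend on the input):

    import zlib

    data = (b"Pease porridge hot, pease porridge cold, "
            b"pease porridge in the pot, nine days old. ") * 200

    def deflate_size(blob: bytes, strategy: int) -> int:
        c = zlib.compressobj(9, zlib.DEFLATED, -15, 9, strategy)
        return len(c.compress(blob) + c.flush())

    # Z_FIXED forces the pre-defined Huffman tables; the default strategy lets the
    # encoder build dynamic tables per block whenever that wins.
    print(deflate_size(data, zlib.Z_FIXED))
    print(deflate_size(data, zlib.Z_DEFAULT_STRATEGY))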
For a competition like this, all simple approaches have probably been tried already. The current leader has been tuning their approach for 5 years: http://mattmahoney.net/dc/text.html#1159
Reminds me of the time a girl at school copied a shortcut to a DOC file onto a floppy disk and submitted it.
The teacher had a hard time explaining shortcuts to her. The girl kept insisting that the PC at school was broken, since the shortcut worked on her machine at home.
A guy I knew used to reencode his songs as 16 kbps mp3 files to save space. When I told him they sounded horrible, he told me "no problem, I can always reencode them back to 128 kbps!"
Page/block size is not inode size. While filesystems very often work in chunks of 4KB, the normal inode size is 256 bytes on ext filesystems and 1KB on NTFS. They get packed together.
[1] http://www.hutter1.net/prize/index.htm