Kolmogorov Complexity: Extensions and Applications (uber.space)
107 points by sean_pedersen on May 4, 2021 | 33 comments



Really love the use of Kolmogorov Complexity as Occam’s razor in Solomonoff Induction


came here to say that as well and recommend this video [1]

[1]: https://www.youtube.com/watch?v=yXkEpzexYIU

/edit: and this one [2]

[2]: https://www.youtube.com/watch?v=DCjeM3Sei-s


Good input, added it to the TODO list.


I’m a bit confused about the second sentence in this snippet: “We search for the shortest program K(X, T=<42) that produces X within a predefined time / computational resource limit. Start with the shortest program and complexify it iteratively spanning a tree of all possible programs.” If we use the time cutoff to make the search for the Kolmogorov complexity practical, then we’re done once we have found the shortest program that produces X.

If instead the author means that we can start with the shortest valid program that doesn’t necessarily produce X and then complexify it, then I believe that the logic is incorrect (a shorter valid program that produces X may arise from an otherwise invalid sequence that is not part of the tree that originated from the shortest valid program). If the statement is about finding K(X, T), i.e. taking the execution time into account, then similarly the optimal program might not result from an iteration of a previously valid shorter program (with the caveat that I’m not fully clear what the author means by complexify and iterate on a tree).

Perhaps the suggestion is to search the space of all possible programs sorted by length; if we include invalid programs, we can think of trees of character strings starting from an empty string and spanning all possible strings that are allowed in our programming language. In any case, if we plan the search according to what I think the author means, then once we find a first hit that produces X in less than 42 seconds we are done. Or am I misunderstanding this part completely?


You seem to have got it; the suggestion is to search all possible programs sorted by length.

"Start with the shortest program", a null instruction, which does not meet the criteria, "and complexity it iteratively" by appending new instruction in a breadth first manner "spanning a tree of all possible program."

You're correct that the first program that this search finds will satisfy K(X, T) but may not satisfy KT or kST.


"once we find a first hit that produces X in less than 42 seconds we are done" yes this is what I tried to convey. Start with the shortest program e.g. print(""). Complexify it (span the tree of all valid programs), discard all programs running longer than 42 secs. First program to produce the string signals that we are done.


The article proposes a computable approximation of Kolmogorov complexity which minimises the length of a program multiplied by its running time.

There's actually a much nicer alternative, sometimes called Levin Complexity, which minimises the length of a program (in bits) plus the logarithm of its running time (base 2). What's really cool about Levin Complexity is that the time required to calculate it only differs from the running time of the minimal program by a constant factor.

For example, let's say we have some data X and we want to find its Levin Complexity, using some pre-defined language L (e.g. the tape of a universal Turing machine, or binary lambda calculus). For simplicity, we'll assume L is 'prefix-free' (e.g. it has an EOF (end of file) symbol):

    def levinComplexity(x):
      # Start by trying complexity 1
      currentComplexity = 1

      # Keep trying larger complexities until we return
      while True:
        # Try different lengths, to trade-off length and time
        for length in range(1, currentComplexity + 1):
          for program in enumeratePrograms(length):
            result = run(program, steps = 2**(currentComplexity - length))
            if result == x:
              # Found a program which outputs x
              return (currentComplexity, program)

        # x wasn't found, try a larger complexity
        currentComplexity += 1

This will start by enumerating all data of complexity 1 (the output of programs of length 1 bit, run for 2^0 = 1 step), checking if any is equal to the input x. If not, the 'while' loop iterates and it checks all data of complexity 2 (outputs of programs of length 1 bit, run for 2^1 = 2 steps; and programs of length 2 bits, run for 2^0 = 1 step); the next iteration checks complexity 3 (length 1 bit, run for 2^2 = 4 steps; length 2 bits, run for 2^1 = 2 steps; and length 3 bits, run for 2^0 = 1 step); and so on until we find x and return.

The iteration which checks complexity N needs to re-run all the same programs as for N-1, but for twice as long; hence doubling the amount of time needed. We also need to run all programs of length N for one step; since there are 2^N programs of length N, this also doubles the time taken. This makes each iteration of the 'while' loop take 4 times as long as the previous, or about O(2^N) time. Hence the overall algorithm is around O(2^C), where C is the input's Levin Complexity (this grows so quickly that the final iteration dominates the running time).

If we focus on any particular program P we find something interesting: our search will eventually start executing P; once currentComplexity = length(P), it will be run for 1 step. The next iteration will run P again, but for 2 steps; the next iteration will run P for 4 steps; then 8 steps and so on, doubling each time. Hence iteration N will run program P for around O(2^N) steps. Since iteration N takes around O(2^N) time, we're executing O(2^N) steps of P per O(2^N) steps of levinComplexity. In other words, this algorithm is running every program at O(1) speed!

What's the catch? There are two things going on:

- Restarting at exponentially-spaced intervals lets us switch between programs without having to save/restore any state, and surprisingly it costs at most a factor of 3 (i.e. at most two-thirds of the total time is restart overhead). It's easiest to work this out backwards: the first time we execute a program's Tth step, we must have taken T steps plus all of the previous executions which got restarted. Any previous execution must have restarted before the Tth step, i.e. took fewer than T steps; any execution before that must have taken fewer than T/2 steps; any before that fewer than T/4 steps; and so on. Hence we will reach the Tth step after at most T + T + T/2 + T/4 + T/8 + ... = T(2 + 1/2 + 1/4 + 1/8 + ...) = T(2 + 1) = 3T steps (the short check after this list verifies this bound numerically).

- Each program is getting a constant fraction of the execution time, but longer programs get exponentially smaller fractions. Programs of length N+1 get half as much execution time as those of length N. The overall execution time is divided up between program lengths like 1/2 + 1/4 + 1/8 + ... Since we're using a fixed language there are a constant number of programs of each length, so each gets a constant fraction. (Note: The prefix-free property is useful for making this work)
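
Here is the quick numeric check mentioned in the first bullet: it simulates the restart schedule (run lengths 1, 2, 4, ... steps) and confirms that the total work spent before a program's Tth step first executes never exceeds 3T.

    def steps_to_reach(T):
      # Run lengths double at each restart: 1, 2, 4, ... until a run reaches step T
      total, run = 0, 1
      while run < T:
        total += run    # this run restarted before reaching step T
        run *= 2
      return total + T  # the final run executes T steps before step T first appears

    # The bound holds; the ratio approaches 3 when T is just above a power of two
    assert all(steps_to_reach(T) <= 3 * T for T in range(1, 100000))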

These tiny fractions of time are the main reason this is an impractical algorithm. Although they're exponentially small, big-O hides them: it only cares about functions of the input size, whilst these exponentials depend only on the program lengths.

Note that our search finishes when we find the minimal program for x. Since that program (whatever it is) is being run in linear time, our search will finish once that program has output x.


> This makes each iteration of the 'while' loop take 4 times as long as the previous, or about O(2^N) time.

That would be O(4^N) time, which is certainly not O(2^N). But in fact each iteration does NOT take 4x as long.

Thanks to the prefix-free property, each iteration only takes twice as long. Consider programs as half-open subintervals of the unit [0,1) interval. The whole interval corresponds to the empty string "" (which by prefix freeness is not a program), and the left and right halves of a subinterval corresponding to string s correspond to strings s0 and s1. Prefix freeness means that all programs are disjoint. In the Levin search at complexity c, a program p of length |p| contributes 2^{-|p|} interval space and 2^{c-|p|} steps. By induction, any set of programs of total interval length x contributes at most x*2^c steps to the Levin search at complexity c, which means that it takes at most 2^c steps altogether for the entire unit interval.
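
A toy check of that bound, using a hand-picked prefix-free set of codewords rather than a real language:

    # No codeword is a prefix of another, so their intervals are disjoint
    programs = ["0", "10", "110", "1110", "1111"]

    def total_steps(c):
      # At complexity c, a program of length l is run for 2^(c - l) steps
      return sum(2 ** (c - len(p)) for p in programs if len(p) <= c)

    # Kraft's inequality (sum of 2^-|p| <= 1) caps each iteration at 2^c steps
    assert all(total_steps(c) <= 2 ** c for c in range(1, 20))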


Gah, that's what happens when I try to re-derive half-remembered proofs late at night :( Thanks for the clarifications.

> That would be O(4^N) time, which is certainly not O(2^N)

Oops, I wanted it in twos and misplaced my parentheses. Should have been 4^N = (2×2)^N, but I mistakenly did 4^N = 2×(2^N), then simplified to 2^(N+1) ~= 2^N, which is nonsense.

> But in fact each iteration does NOT take 4x as long.

> Thanks to the prefix-free property, each iteration only takes twice as long.

Yes, I knew being prefix-free meant the time fractions sum to one; but I wasn't sure what the total slowdown was (or rather, I was sure it was O(2^N), but got myself in a muddle trying to get there!).

I like to think of the programs as being leaves of a tree, and their prefix-free encoding is the path from the root. We allocate 100% of the time to the root (whose path is the empty string), and nodes divide their allocated time uniformly between their sub-trees. The time allocated to a leaf tells us how long to run its corresponding program.

That way, each iteration of the algorithm just doubles the number of steps allocated to the root, which "trickle down" the tree. We run all of the programs which get allocated at least one step (i.e. we don't follow subtrees which get 1/2 a step).

It's probably worth pointing out that this scheme is a general search algorithm, which isn't specific to programs (it's like a fancier alternative to grid search). For example, if we have a long-running algorithm like a theorem prover, but we're unsure what inputs to give it, we can explore all the different input combinations by arranging them in a tree and using Levin Search, running the prover for longer and longer on more and more complicated inputs until it finds our desired output (e.g. a proof/disproof of a specified conjecture).

Of course, as soon as we start to look at theorem proving, we soon stumble upon Hutter search, AKA "the fastest and shortest algorithm for all well-defined problems" ;)

Lots of great links at http://www.scholarpedia.org/article/Universal_search


The article contains a misleading description of NP-hardness:

> Computing the Kolmogorov complexity for a sequence X of N bits length is an NP-hard problem, since the search space of possible programs producing any sequence of length N is growing exponentially with respect to the input size N

NP-hardness would actually mean that any NP problem, e.g. a traveling salesman problem, could be "reduced" within polynomial time to calculating the Kolmogorov Complexity of a given string.


I think NP-hard is correct? You can reduce the halting problem and Kolmogorov complexity computation to each other [1], so the question is whether TSP can be reduced to the halting problem. Well, you can reduce SAT to the halting problem (they explain how here [2]) and you can reduce TSP to SAT given they're both NP-complete so... we're done?

[1] https://www.nearly42.org/cstheory/halting_to_kolmogorov/

[2] https://en.wikipedia.org/wiki/NP-hardness#Examples


The reductions in your linked paper [1] are not polynomial-time reductions.


Construct a TM which enumerates all possible variable assignments, checks whether each one satisfies the SAT instance, and halts as soon as it finds one that does. You can construct this TM in polynomial time, and it halts exactly when the SAT instance is satisfiable. So this is a polynomial reduction from SAT to the halting problem.
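
A sketch of that construction in Python rather than as a literal Turing machine; the CNF encoding (positive/negative integers as literals) is just an illustrative choice:

    from itertools import product

    def sat_to_halting(clauses, num_vars):
      # Building this program is polynomial in the formula size; the program
      # itself halts if and only if some assignment satisfies every clause
      def program():
        for assignment in product([False, True], repeat=num_vars):
          satisfied = all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
                          for clause in clauses)
          if satisfied:
            return   # satisfying assignment found: halt
        while True:  # no satisfying assignment: run forever
          pass
      return program

For example, sat_to_halting([[1, 2], [-1]], 2)() halts, because setting variable 2 to true satisfies both clauses.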


I do not dispute this. My comment was about the linked paper [1] regarding equivalence of the halting problem and Kolmogorov complexity, not the SAT problem.


Yes I am not aware of any proof on that one. Do you think it will suffice to generalize the statement to just "NP problem" instead?


I'm pretty sure determining Kolmogorov complexity is not in class NP, and in fact isn't computable.

(You can't just try all possible programs, because some of them never terminate and some others run for a very long time and you can't tell in advance which are which.)


I think you are right, since it entails the Halting problem it is not computable and thus not even in NP.


No, it is uncomputable, not an NP problem.


> The most obvious upper bound for K(X) is obviously the length N of X itself.

This seems false as you'd have to account for the header of the program. Otherwise how can you tell programs and texts apart?


In Kolmogorov complexity you typically do not care about constant offsets. You assume you are free to choose the language to implement the program in, and in the worst case, if someone else chooses another language, you just have to prefix all your programs with an interpreter of your language written in the other language. The size of that interpreter might be huge, but it's a constant length for all your programs, so you don't care.

Same for program vs literal: just include one prefix symbol in your language to mark your program as either literal output or real program. That makes all your programs one symbol longer, but as you only care about relative program size inside your language, you do not care.
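
As a toy illustration of that prefix-symbol trick (a sketch only: Python's exec stands in for the "real" base language, and the output-variable convention is an assumption made up for this example):

    def interpret(program: bytes) -> bytes:
      flag, body = program[:1], program[1:]
      if flag == b"0":
        return body                   # literal: echo the rest verbatim
      namespace = {}
      exec(body.decode(), namespace)  # real program: expected to define `output`
      return namespace["output"]

    interpret(b"0hello")                  # => b"hello": any N-byte string costs N + 1
    interpret(b'1output = b"ab" * 1000')  # => 2000 bytes from a much shorter program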


Yes, you need some overhead for specifying string literals. For a binary string x of length n, you have KS(x) <= n + O(1) for simple Kolmogorov complexity (where one program can be a prefix of another), and KP(x) <= n + KP(n) + O(1), for the prefix Kolmogorov complexity, in which programs must be prefix-free. In this paper [1] of mine, you can find explicit definitions of KS and KP, as well as the theorems above, with specific constants instead of O(1).

[1] https://tromp.github.io/cl/LC.pdf


> how can you tell programs and texts apart?

By opening them either with a text editor or a compiler/interpreter.

Homoiconic languages have no problem being edited and run.


What about the print or return statement?


Expression-oriented languages don't need explicit returns. However, your point is absolutely correct: there must be a constant overhead, even if it's just a single bit; otherwise there would be no way to tell what should be treated as a program and what should be treated as a literal string/value (even if it just-so-happens to be a valid program).

Since this overhead is constant, it becomes negligible as the sequence length increases. Yet we should take it into account, e.g. by saying the upper bound is 'N + O(1)'.


This article is surprisingly short on actual applications.

For some interesting applications one might look at the Wikipedia article for the normalized compression distance: https://en.wikipedia.org/wiki/Normalized_compression_distanc...
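
As a concrete example, here is a minimal sketch of the normalized compression distance with zlib's compressed size standing in for the (uncomputable) Kolmogorov complexity:

    import zlib

    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    # where C is approximated by the compressed length
    def ncd(x: bytes, y: bytes) -> float:
      def c(s):
        return len(zlib.compress(s, 9))
      cx, cy, cxy = c(x), c(y), c(x + y)
      return (cxy - min(cx, cy)) / max(cx, cy)

Similar inputs give values near 0 and unrelated inputs values near 1, though only roughly, since zlib is far from an ideal compressor.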

This paper here shows some interesting applications - it is not the first one, but the first non-paywalled one that I found: https://arxiv.org/pdf/cs/0312044.pdf

There is also a book by Paul Vitanyi - it has been a while since I looked at it, but if I remember correctly it also discusses some of these applications: https://www.amazon.com/Introduction-Kolmogorov-Complexity-Ap...


My favorite application is Marcus Hutter's AIXI[1], a provably optimal theoretic intelligent agent for any environment, though uncomputable.

And then Shane Legg's use of a Monte Carlo AIXI approximation to define a universal measure of intelligence[2].

[1]https://en.wikipedia.org/wiki/AIXI

[2]http://www.vetta.org/documents/Machine_Super_Intelligence.pd...


Cilibrasi's "Statistical Inference through Data Compression" is pretty cool. One nice application was using compression to infer the mutual-information between DNA from different animals, and using that to construct evolutionary trees.

Edit: I see your arxiv link is to the same thing (although I have a hard-copy of the book based on their thesis, so didn't spot it immediately)


Thanks, I had not seen that dissertation yet, it has much more information than that arxiv link - here is a pdf of that dissertation that Google gave me: https://www.illc.uva.nl/Research/Publications/Dissertations/...

Although strictly speaking the idea with the evolutionary trees is a bit older, as far as I know it is from https://arxiv.org/pdf/cs/0111054.pdf


There are no useful applications of Kolmogorov complexity, because it's useless (apart from its thought-provoking qualities): http://forwardscattering.org/post/7


This problem is solved by choosing a standard language, which requires some way to tell whether one language is "less silly" than another. There's no 'correct' way to judge this, but a reasonable, objective benchmark is how many bits it takes to write a self-interpreter (in a similar spirit to "compiler optimizations should pay for themselves" http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.... )

Binary Lambda Calculus is pretty good for this. I personally prefer Binary Combinatory Logic, but its self-interpreter is a bit longer.

There's probably a way to game this benchmark with a "silly" language, but I imagine anything more complicated than hard-coded literals would end up larger than BLC.


I think there is a mistake in that article: one has to fix a specific language beforehand and then determine the Kolmogorov complexity of an input - one is not allowed to take a separate language for each input, unless one also encodes how that language is calculated from an input, which will take up some space. This is a bit like saying that SAT is in O(1): one can create a separate program for each instance that answers that instance in constant time.

And the useful applications that I know of (apart from using it as a tool in proving theoretical things) come from replacing the Kolmogorov complexity with a real compressor that just approximates the Kolmogorov complexity - this produces some empirical results.


> one is not allowed to take a separate Language for each input

They're not defining a different language per input, they're defining a single language, but it's "silly" because it's tailored to favour one specific string. Their point is that we can always construct a universal language which encodes any particular sequence as 1 bit: the language simply checks the first bit of its input, and if it's 0 it returns that sequence; otherwise it ignores that bit and interprets the rest of the input as some "non-silly" universal language (Ruby, in their case).

Kolmogorov Complexity normally tries to ignore the particular choice of language (since, in principle, any turing-complete language can emulate any other with a constant (but probably large) overhead). Their point is that language choice matters a lot, since there's nothing to prevent us choosing a universal language which just-so-happens to favour some outputs much more than others.

I get what they're saying, but there are solutions to this, in particular minimal languages like Binary Lambda Calculus, Binary Combinatory Logic, Wolfram's 2/3 turing machine, the rule 110 cellular automaton, Brainfuck, etc.

These are akin to "nothing up my sleeve numbers" used in cryptography (where the choice of parameter is pretty much arbitrary, but technically there are certain "silly" choices that would undermine the security)


Does anyone recommend a book about Solomonoff induction?



