The argument of this very long and technical paper seems to boil down to this:
Solomonoff induction is (in the limit) as good as any computable prediction method, which sounds pretty "universal". But it is not itself a computable prediction method (it's just semi-computable). And if you are satisfied with semi-computability in a predictor, then you should demand instead that a "universal" predictor always be as good as any semi-computable predictor, and it can't meet that more demanding requirement. And so on in infinite regress.
I find this kind of underwhelming. Solomonoff induction represents, in practice, an unattainable ideal that prediction systems can aspire to approximate. This argument shows that if you somehow could reach the ideal, there is an infinite regress of super-ideals beyond it. So what?
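(For concreteness: the "as good as any computable prediction method" claim in the summary above is standardly cashed out as a dominance property, which I'm fairly sure is the one at stake here. For every lower-semicomputable semimeasure \mu there is a constant c_\mu > 0 such that

    M(x) \ge c_\mu \, \mu(x) \quad \text{for all finite strings } x, \qquad \text{with } c_\mu = 2^{-K(\mu)},

and dominance in turn forces M's conditional predictions to converge to \mu's on \mu-almost-all infinite sequences.)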
I think you did a better job than all other commenters here (who seem to be mostly free-associating on the title and abstract), but your comment still doesn't quite capture the contribution of the paper or the reasonable takeaways from it.
The key is section 4.3.3, which draws a distinction between measures and predictors. The Solomonoff lower-semicomputable semimeasure is universal for all lower-semicomputable semimeasures, not just for some narrower class. But if we forget the idea of lower-semicomputable semimeasures and treat Solomonoff as a member of a wider class - limit-computable predictors - it turns out not to be universal for that class. In fact, moving from measures to predictors is so lossy that no predictor can be universal for its own class, by a simple diagonal argument. To me that means Solomonoff's approach with measures is more nuanced and informative than the approach with predictors.
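If it helps, the diagonal argument is (roughly, as I read it) the familiar one. Given any limit-computable predictor P, define a sequence by always extending with the bit P considers less likely:

    x_{n+1} = 0 \ \text{if}\ P(1 \mid x_1 \dots x_n) \ge 1/2, \quad \text{else}\ 1.

Then P assigns probability at most 1/2 to every bit of x, so it never learns x; but x is (modulo some care with ties) itself limit-computable, since it's computed in the limit from P's own approximations. So the deterministic predictor that just follows x sits in P's own class and beats P on it.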
In my admittedly very cursory skim of the paper, I gave it the benefit of the doubt on this point, because this seemed convincing:
"for the purpose of prediction, we are, of course, not so much interested in the probabilities issued by the measure functions, as in the conditional probabilities that give the corresponding predictors’ outputs."
Do you disagree?
(Also, my original summary is wrong at least in that I should have said limit-computable where I said semi-computable!)
I guess the question is whether we want universality over all limit-computable predictors, or only those derived from lower-semicomputable semimeasures. Since the latter class contains its own universal element and also every computable predictor, maybe it's okay? But yeah, I can see how you can have a different takeaway. Sorry for being a bit harsh in the previous comment.
I'm honestly surprised there's not more research / funding on universal prediction. It seems like a top-down approach to true AI (start from theory and work backwards to something feasible/practical), as opposed to the current engineering mindset (start with something that works, i.e., neural networks, and try to do more with them).
The ideas related to universal prediction were initially developed in the 1960s by Chaitin, Solomonoff, and Kolmogorov, but little has been done with the concept since, partly because computers weren't very powerful back then and partly because many of their theorems ran up against barriers of uncomputability.
Marcus Hutter and Shane Legg come to mind as people who have done some work in this area, but I'm not sure if Shane has continued investigating this approach separately from his work with neural networks at DeepMind.
I think there are 2 main problems with working on universal prediction:
1. Getting all the information to the program in a usable, computational format.
2. Defining the requirements (an old problem, but a good one): how would you actually design an algorithm to predict something completely unknown? ML works well here because it is fairly narrow and 'easily' definable.
It certainly is the most interesting problem though, and I'd love to see more research attempted.
How would you make an algorithm to predict something unknown?
This is essentially the field of sequential decision making, which encompasses everything from reinforcement learning to multi-armed bandits. [1]
As an example, let’s pretend we have the task of predicting labels from a two-letter alphabet {0, 1}. These labels come to us as a stream; so in contrast to classical ML, we are not given access to the full “dataset” beforehand to train our predictor. Instead, a label is revealed each round, and in between rounds we can only predict the next label from the set of labels previously revealed.
Note that we have no model of how the labels are generated — this is what we mean by “predict something unknown”. In general, we can assume two cases: it’s either adversarial or predictable (shaky definitions, I know).
In the adversarial case, we assume the labels are generated so as to make us incur a heavy loss: say, if we predict 0, the adversary outputs 1, and vice versa. It turns out that with deterministic strategies and a certain adversarial model (the adversary knows our strategy, though not our coin flips), we can always be made to incur loss linear in the number of rounds; this goes a bit into game theory. To counter this, we can employ a randomized strategy where we flip a fair coin to decide what to predict. This is somewhat analogous to breaking a tie in rock-paper-scissors by choosing each of {rock, paper, scissors} with probability 1/3.
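Here's a toy Python simulation of that gap (the function names and the exact adversary are mine; the adversary sees our strategy's distribution for the round, but not the realized coin flip):

    import random

    def run(strategy, n=1000, seed=0):
        # strategy(history) -> probability of predicting 1 this round.
        # The adversary knows that probability (it knows our strategy)
        # but not the coin flip, and reveals the label we are more
        # likely to get wrong.
        rng = random.Random(seed)
        history, loss = [], 0
        for _ in range(n):
            p1 = strategy(history)
            label = 0 if p1 > 0.5 else 1
            guess = 1 if rng.random() < p1 else 0
            loss += int(guess != label)
            history.append(label)
        return loss

    always_zero = lambda history: 0.0  # deterministic: always predict 0
    fair_coin   = lambda history: 0.5  # randomized: flip a fair coin

    print(run(always_zero))  # 1000: every round is a mistake
    print(run(fair_coin))    # ~500: expected loss only n/2, whatever the adversary does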
But what if the labels weren’t generated adversarially? Remember, we don’t have a model of how the labels were generated. Say the sequence is 000000...: predicting with a fair coin would be a bad strategy then, and it’d be better to simply predict 0. A better strategy here is something like follow-the-leader, where you predict the most frequent label seen so far.
But recall, we don’t know how the labels are generated — it could very well alternate between periods of predictability and unpredictability.
What we need is a mixed strategy which does well in both cases. One algorithm that turns out to do this is called multiplicative weights, which assigns a “weight” to each possible prediction and multiplicatively increases or decreases the weights each round depending on what the label turns out to be (see the sketch below).
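A minimal sketch of the deterministic weighted-majority version of this, with just two “experts” (always predict 0, always predict 1); the full algorithm randomizes its prediction in proportion to the weights, which is what gives the adversarial guarantees:

    def weighted_majority(labels, eta=0.1):
        # One weight per expert: expert 0 always predicts 0,
        # expert 1 always predicts 1.
        w = [1.0, 1.0]
        mistakes = 0
        for label in labels:
            guess = 0 if w[0] >= w[1] else 1   # follow the weighted majority
            mistakes += int(guess != label)
            w[1 - label] *= (1 - eta)          # shrink the expert that was wrong
        return mistakes

    print(weighted_majority([0] * 100))    # 0 mistakes: locks on immediately
    print(weighted_majority([1] * 100))    # 1 mistake, then it switches experts
    print(weighted_majority([0, 1] * 50))  # 50: an alternating stream keeps it guessing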
There’s a whole lot of other ground that can be covered but I’d recommend reading a book instead of reading it here. One resource is “Prediction, Learning, and Games”—but feel free to find another book since this one assumes a background in probability.
[1] Actually the first section of the linked pdf above.
Huh. My impression was always that "universal prediction" got replaced by various branches of statistics, ML theory, and probability theory, each considering more specific questions that used to fall under a single umbrella?
Perhaps the best comparison is "cybernetics", which doesn't get funding these days except, of course, for the billions upon billions of dollars thrown at robotics, self-driving, factory automation, and so on.
In this dissertation I investigate the theoretical possibility of a universal method of prediction. A prediction method is universal if it is always able to learn what there is to learn from data: if it is always able to extrapolate given data about past observations to maximally successful predictions about future observations. The context of this investigation is the broader philosophical question of the possibility of a formal specification of inductive or scientific reasoning, a question that also touches on modern-day speculation about a fully automated data-driven science.
I investigate, in particular, a specific mathematical definition of a universal prediction method, that goes back to the early days of artificial intelligence and that has a direct line to modern developments in machine learning. This definition essentially aims to combine all possible prediction algorithms. An alternative interpretation is that this definition formalizes the idea that learning from data is equivalent to compressing data. In this guise, the definition is often presented as an implementation and even as a justification of Occam's razor, the principle that we should look for simple explanations.
The conclusions of my investigation are negative. I show that the proposed definition cannot be interpreted as a universal prediction method, as is exposed by a mathematical argument of the very kind it was intended to overcome. Moreover, I show that the suggested justification of Occam's razor does not work, and I argue that the relevant notion of simplicity as compressibility is itself problematic.
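For anyone wondering what "combine all possible prediction algorithms" means formally: the construction under discussion (not spelled out in the abstract) is, as I understand it, Solomonoff's universal mixture

    M(x) = \sum_{\nu} 2^{-K(\nu)} \, \nu(x),

a weighted sum over an effective enumeration of all lower-semicomputable semimeasures \nu, where K(\nu) is the length of the shortest program for \nu, so simpler (more compressible) hypotheses get exponentially more weight. This is exactly where the compression reading and the claimed link to Occam's razor come from.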
It's amazing to see a paper referring to Solomonoff on HN. It does seem that if ever there was a time for his work it is now.
I find his writing obscure, and in person I could almost never understand anything he said. He really never fit in with the world of symbolic AI (called GOFAI these days).
It's possible to have universal prediction for some definitions of "universal" and "prediction".
The problem is that any universe one starts predicting could look like an indefinitely nested set of conditional distributions.
So your system could be going along learning and predicting when suddenly things changed, and it would have to start all over. It could have some mechanism for pivoting at that point, but of course that would be less than perfect. So there are inherent limits if you're talking about an arbitrary universe.
Interesting. Here's a possible way out. Suppose the universe is always changing according to some state-transition function that gets you from time t to time t+1. We have a machine existing in this universe which is trying to predict the state at t+1. You point out this is impossible because the transition function might suddenly change for some particular t+1. But if the change is massive enough, it will also destroy the internal workings of the prediction machine. Conditional on the universe retaining enough regularity for the prediction machine to survive, maybe it would be possible to make "universal" predictions.
The universe (U) itself would be the prediction machine.
To be fair, there would be no definition of "circular" without the universe.
The reality that we (or inhabitants of other universes or other dimensions) can measure seems to be some kind of "fractal collapsing" at the dimension/universe where the "measurer" is located.
---
Kind of related:
Dostoevsky, in his book Notes from Underground, proposes an interesting problem.
I paraphrase it below:
What would you do if you received, from God himself, a manuscript of the actions (down to the minimum possible dt) that would maximize your "satisfaction function" over your whole life?
---
Rick and Morty also explored this using another analogy:
Psychologists are like horses on a carousel trying to understand its mechanism. They'll do their best, but in the end they won't help much and will feel terrified.
Isn't being able to reliably and precisely predict anything just a matter of building a sufficiently precise model of the system? So, if we could build a sufficiently precise model of the world we live in, wouldn't it also "automate science"? In fact, some people say our universe and we ourselves actually are such a model already.
And although it may be impossible to model the whole universe with model scientists in it, I'm rather certain it will at some point be possible to build an artificial brain that models an extremely smart human, enhanced with a direct interface to huge databases and computational power. Being able to figure out almost anything swiftly, it would make an extremely resourceful automatic scientist.
I only read some of the first two parts of this dissertation.
I wonder whether there are any good classes of learnable patterns. In, say, continued fractions or decimal expansions, there is a fine but essential line between patterns of coefficients with repetends and those which never repeat.
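There are classically clean classes on the repeating side of that line: an eventually periodic decimal expansion is exactly a rational number, and an eventually periodic continued fraction is exactly a quadratic irrational (Lagrange). A small sketch of a learner for that class, hunting for a repetend in an observed prefix (the function name and interface are mine):

    def find_repetend(seq, min_evidence=2):
        # Return (preperiod, period) for the shortest repetend consistent
        # with the observed prefix, requiring the periodic tail to span at
        # least `min_evidence` full periods; None if nothing fits yet.
        n = len(seq)
        for p in range(1, n // min_evidence + 1):
            for start in range(n - min_evidence * p + 1):
                if all(seq[i] == seq[i + p] for i in range(start, n - p)):
                    return start, p
        return None

    # 1/6 = 0.1666...: digits have preperiod 1, repetend "6"
    print(find_repetend([1, 6, 6, 6, 6, 6]))  # (1, 1)
    # sqrt(2) = [1; 2, 2, 2, ...] as a continued fraction
    print(find_repetend([1, 2, 2, 2, 2, 2]))  # (1, 1)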
>> I show that the suggested justification of Occam's razor does not work, and I argue that the relevant notion of simplicity as compressibility is problematic itself