Thanks, that's an interesting perspective. It also highlights one of the weak points in the concept, I think, which is that this is only a tool for updating beliefs to the extent that the underlying probability space ("ontology" in this analogy) can actually "model" the phenomenon correctly!
It doesn't seem to shed much light on when or how you could update the underlying probability space itself (or when to change your ontology in the belief setting).
This kind of thinking will lead you to ideas like algorithmic probability, where distributions are defined using universal Turing machines that could model anything.
I think what you're getting at is the construction of the sample space - the space of outcomes over which we define the probability measure (e.g. {H,T} for a coin, or {1,2,3,4,5,6} for a die).
Let's consider two possibilities:
1. Our sample space is "incomplete"
2. Our sample space is too "coarse"
Let's discuss 1 first. Imagine I have a special die that has a hidden binary state which I can control, which forces the die to come up either even or odd. If your sample space is only which side faces up, and I randomize the hidden state appropriately, it appears like a normal die. If your sample space is enlarged to include the hidden state, the entropy of each roll is reduced by one bit. You will not be able to distinguish between a truly random coin and a coin with a hidden state if your sample space is incomplete. Is this the point you were making?
On 2: Now let's imagine I can only observe whether the die comes up even or odd. This is a coarse-graining of the sample space (we get strictly less information - or, we only get some "macro" information). Of course, a coarse-grained sample space is necessarily an incomplete one! We can imagine comparing the outcomes from a normal die, to one which with equal probability rolls an even or odd number, except it cycles through the microstates deterministically e.g. equal chance of {odd, even}, but given that outcome, always goes to next in sequence {(1->3->5), (2->4->6)}.
Incomplete or coarse sample spaces can indeed prevent us from inferring the underlying dynamics. Many processes can have the same apparent entropy on our sample space from radically different underlying processes.
Right, this is exactly what I'm getting at - learning a distribution over a fixed sample space can be done with Bayesian methods, or entropy-based methods like the OP suggested, but I'm wondering if there are methods that can automatically adjust the sample space as well.
For well-defined mathematical problems like dice rolling and fixed classical mechanics scenarios and such, you don't need this I guess, but for any real-world problem I imagine half the problem is figuring out a good sample space to begin with. This kind of thing must have been studied already, I just don't know what to look for!
There are some analogies to algorithms like NEAT, which automatically evolves a neural network architecture while training. But that's obviously a very different context.
We could discuss completeness of the sample space, and we can also discuss completeness of the hypothesis space.
In Solomonoff Induction, which purports to be a theory of universal inductive inference, the "complete hypothesis space" consists of all computable programs (note that all current physical theories are computable, so this hypothesis space is very general). Then induction is performed by keeping all programs consistent with the observations, weighted by 2 terms: the programs prior likelihood, and the probability that program assigns to the observations (the programs can be deterministic and assign probability 1).
The "prior likelihood" in Solomonoff Induction is the program's complexity (well, 2^(-Complexity), where the complexity is the length of the shortest representation of that program.
Altogether, the procedure looks like: maintain a belief which is a mixture of all programs consistent with the observations, weighted by their complexity and the likelihood they assign to the data. Of course, this procedure is still limited by the sample/observation space!
That's our best formal theory of induction in a nutshell.
Someone else pointed me to Solomonoff induction too, which looks really cool as an "idealised" theory of induction and it definitely solves my question in abstract. But there are obvious difficulties with that in practice, like the fact that it's probably uncomputable, right?
I mean I think even the "Complexity" coefficient should be uncomputable in general, since you could probably use a program which computes it to upper bound "Complexity", and if there was such an upper bound you could use it to solve the halting problem etc. Haven't worked out the details though!
Would be interesting if there are practical algorithms for this. Either direct approximations to SI or maybe something else entirely that approaches SI in the limit, like a recursive neural-net training scheme? I'll do some digging, thanks!
It doesn't seem to shed much light on when or how you could update the underlying probability space itself (or when to change your ontology in the belief setting).