I'm very excited that we're figuring out how to make deep learning work with so few data points!
I'm curious about the focus on information compression, though. The classical view of inference as compression is beautiful and deserves more communication, but I think the real novelty here is in how the explicitly "information-constrained" code z participates in the forward pass.
About their overall method, they write:
> It isn’t obvious why such a method is performing compression. You’ll see later how we derived it from trying to compress ARC-AGI.
I must be learning something in my PhD, because the relation with compression _did_ seem obvious! Viewing the prediction loss and the KL divergence of the latent distribution p(z) from its prior as the "information costs" of an implicit compression scheme is very classical, and I think a lot of people would feel the same. However, while they explained that an L2 regularization over model weights can be viewed (up to a constant) as an approximation of the bits needed to encode the model parameters theta, they later say (of regularization w.r.t. theta):
> We don’t use it. Maybe it matters, but we don’t know. Regularization measures the complexity of f in our problem formulation, and is native to our derivation of CompressARC. It is somewhat reckless for us to exclude it in our implementation.
So, in principle, the compression/description-length-minimization point of view isn't an explanation for this success any more than it explains the success of VAEs or of empirical risk minimization in general. (From what I understand, this model can be viewed as a VAE whose encoding layer has a constant input.) That's no surprise! As I see it, our lack of an adequate notion of "description length" for a network's learned parameters is at the heart of our most basic confusions in deep learning.
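To make that correspondence concrete, here is a minimal sketch of such a "VAE with a constant encoder input", using a toy PyTorch setup of my own (the class and dimensions are made up, not taken from CompressARC): q(z) is a freely parameterized Gaussian, and the training objective is the classical two-part description length, a reconstruction cost plus the KL to the prior. The commented-out line is where the "bits for theta" term they omit would go.

```python
# Minimal sketch (not CompressARC itself): a "VAE whose encoder sees a constant
# input". Since the dataset is a single puzzle, q(z) needs no encoder network;
# its mean and log-variance are free parameters trained jointly with the decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstantInputVAE(nn.Module):
    def __init__(self, z_dim=32, out_dim=900):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(z_dim))
        self.logvar = nn.Parameter(torch.zeros(z_dim))
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, out_dim)
        )

    def loss(self, target):
        # Reparameterized sample z ~ q(z) = N(mu, diag(sigma^2))
        z = self.mu + torch.exp(0.5 * self.logvar) * torch.randn_like(self.mu)
        recon = self.decoder(z)
        # "Bits to transmit the data given z": the reconstruction/prediction term
        rec_cost = F.mse_loss(recon, target, reduction="sum")
        # "Bits to transmit z": KL(q(z) || N(0, I)) in closed form
        kl_cost = 0.5 * torch.sum(
            torch.exp(self.logvar) + self.mu ** 2 - 1.0 - self.logvar
        )
        # A weight penalty here would be the "bits to transmit theta" term the
        # authors describe but leave out:
        # theta_cost = weight_decay * sum(p.pow(2).sum() for p in self.decoder.parameters())
        return rec_cost + kl_cost  # total description length, up to constants
```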
Now, let's think about the distribution p(z) over the network's input z. In a classical VAE, the decoder needs to rely on z to know what kind of data point to produce, and "absorbing" information about the nature of a particular kind of data point into the decoder's weights is actually what's expected. If I trained a VAE on exactly two images, I'd expect the latent z to carry at most one bit of information. If CompressARC were allowed to "absorb" details of the problem instance into its weights in this way, I'd expect p(z) to degenerate to the prior N(0, 1), that is, to carry no information. The model could, for example, replace z with a constant at the very first layer and overfit the data in any way it wanted.
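To spell out the failure mode I have in mind (again a toy example of my own, not anything from the post): a generic decoder can zero out its dependence on z and stash a memorized answer in its bias, at which point q(z) is free to match the prior at zero KL cost.

```python
# Toy illustration of the collapse: the decoder ignores z entirely and
# memorizes the answer in its bias, so z carries zero information and the
# KL term can be driven all the way to zero.
import torch
import torch.nn as nn

memorized_answer = torch.randn(900)      # stands in for the overfit solution grid
decoder = nn.Linear(32, 900)
with torch.no_grad():
    decoder.weight.zero_()               # the first layer kills z...
    decoder.bias.copy_(memorized_answer) # ...and the output is baked into theta

z = torch.randn(32)                      # any z now gives the same output
assert torch.allclose(decoder(z), memorized_answer)
```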
Why doesn't this happen? In the section on the "decoding layer" (responsible for generating z), the authors write:
> Specifically, it forces CompressARC to spend more bits on the KL whenever it uses z to break a symmetry, and the larger the symmetry group broken, the more bits it spends.
As they emphasize throughout this post, this model is _very_ equivariant and can't "break symmetries" without using the latent input z. For example, if the model wants to do something like produce all-green images, the tensors constituting the "multitensor" z can't all be constant w.r.t. the color channel---at least one of them needs to break the symmetry.
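Here's a toy version of that constraint, using a made-up DeepSets-style layer of my own rather than anything from their architecture: weights shared across the color axis plus a permutation-invariant pooling term. A z that is constant along the color axis necessarily produces an output that treats every color identically, so singling out green requires z itself to deviate along that axis.

```python
# Toy color-permutation-equivariant layer: shared weights plus a mean-pool over
# the color axis. Permuting colors in z permutes the output the same way, so a
# z that is constant across colors cannot single out any particular color.
import torch

def color_equivariant_layer(z):
    # z: (num_colors, d) slice of the multitensor, indexed along the color axis.
    pooled = z.mean(dim=0, keepdim=True)
    return 1.5 * z - 0.5 * pooled

symmetric_z = torch.ones(10, 4)                     # constant along the color axis
out = color_equivariant_layer(symmetric_z)
assert torch.allclose(out, out[0].expand_as(out))   # every color treated identically

broken_z = symmetric_z.clone()
broken_z[3] += 1.0                                  # spend bits to mark one color (say, green)
out = color_equivariant_layer(broken_z)
assert not torch.allclose(out[3], out[0])           # symmetry broken: that color now differs
```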
The reason the equivariant network learns a "good algorithm" (low description length, etc.) is unexplained, as usual in deep learning. The interesting result is that explicitly penalizing the entropy of the parameters responsible for breaking symmetry seems to give the network the right conditions to learn a good algorithm. If we took away equivariance and restricted our loss to prediction loss plus an L2 "regularization" of the network parameters, we could still motivate this from the point of view of "compression," but I strongly suspect the network would just learn to memorize the problem instances and solutions.
Thanks for this insightful comment. One of my first questions about this was how it avoids some kind of latent space collapse with such a tiny dataset.
Do you think it's accurate to describe equivariance as both a strength and a weakness here? As in, it allows the model to learn a useful compression, but you have to pick your set of equivariant layers up front, and there's little the model can do to "fix" bad choices.
Yeah, I think it's really important to understand how to coax non-equivariant models into being equivariant when needed. I don't think purely equivariant architectures are the way forward.
One example that comes to mind (though I don't know much about it and haven't thought it through) is how AlphaFold apparently dropped rotational equivariance of the model in favor of what amounts to data augmentation---opting to "hammer in" the symmetry rather than using these fancy equivariant-by-design architectures. Apparently it's a common finding that hard-coded equivariance can hurt performance in practice when you have enough data.
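For concreteness, the "hammer it in" recipe is roughly the following (a sketch under my own assumptions, not AlphaFold's actual pipeline): rotate each training example by a random rotation so that an otherwise unconstrained network has to learn the symmetry from data.

```python
# Rough sketch of symmetry-by-augmentation (not AlphaFold's actual code):
# rotate each training example by a random 3D rotation so an otherwise
# unconstrained model is pushed toward rotation-robust predictions.
import torch

def random_rotation() -> torch.Tensor:
    # Draw a rotation matrix via QR of a Gaussian matrix (approximately Haar).
    q, r = torch.linalg.qr(torch.randn(3, 3))
    q = q * torch.sign(torch.diagonal(r))   # fix column signs for uniformity
    if torch.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]                  # ensure a proper rotation, det = +1
    return q

def augment(coords: torch.Tensor) -> torch.Tensor:
    # coords: (n_points, 3) point cloud; labels tied to geometry rotate with it.
    return coords @ random_rotation().T
```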