Interesting to read this. I was a little confused at first because the information bottleneck paper wasn't new, but as I read on I realized they acknowledged that. It's interesting to see follow-up and new research coming out about it, because it struck me as promising when I first read about it.
The information bottleneck idea is basically the same as how I've always thought about DL models, and statistical models more generally. The hidden variables at each layer are essentially digitized codes, and there's a compression at each layer, which is equivalent to learning/inference in an algorithmic complexity/MDL sense.
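For anyone who hasn't seen it, the standard IB objective (my gloss, not a quote from the article) makes that compression framing explicit: treat a layer's representation T as a stochastic code of the input X that throws away as many bits about X as possible while keeping the bits that predict the label Y:

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

where the Lagrange multiplier β trades off compression against preserved predictive information. That's essentially an MDL-style tradeoff applied per layer.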
What was surprising to me was the relationship with renormalization groups, which I wasn't familiar with at all.
The quote from Lake was also interesting; I'd forgotten about that Bayesian Program Learning paper. My guess is that BPL and DL are not really all that different at some level.
> certain very large deep neural networks don’t seem to need a drawn-out compression phase in order to generalize well. Instead, researchers program in something called early stopping, which cuts training short to prevent the network from encoding too many correlations in the first place.
This would explain the success of attention mechanisms quite well. If you are good at selecting the right input up front, you don't need to discard a lot later.
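To make that concrete, here's a minimal sketch (mine, not from the article; plain numpy, single head, no learned projections) of scaled dot-product attention. The softmax weights go to roughly zero for irrelevant inputs, so the "discarding" happens at selection time rather than in a later compression phase:

```python
import numpy as np

def attention(Q, K, V):
    """Toy single-head attention. Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: most mass on the relevant keys
    return weights @ V, weights                        # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 8))    # one query
K = rng.normal(size=(5, 8))    # five candidate inputs
V = rng.normal(size=(5, 4))
out, w = attention(Q, K, V)
print(w.round(2))              # weights near 0 mark inputs that are effectively dropped up front
```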