I was intrigued by this a while back. I think of training a NN as generating a function, an equation, from a training set which, given a specific input, outputs a prediction. If you can come up with an equation-input pair that, when executed, a) accurately enough approximates some data, and b) requires less space than the original file, you have achieved (most likely lossy) compression.