It seems absolutely wild to me that there are cases where you can prune 90% of neurons and still get a functional network out the other side. I'm definitely interested in experimenting with some of these techniques myself soon!
"A man with an unusually tiny brain manages to live an entirely normal life despite his condition, which was caused by a fluid build-up in his skull.
Scans of the 44-year-old man's brain showed that a huge fluid-filled chamber called a ventricle took up most of the room in his skull, leaving little more than a thin sheet of actual brain tissue"
The article goes on to say that he tested at an IQ of 75, below normal but not disabled; he was employed as a civil servant and was married with children.
If you zoom in until you can distinguish individual pixels, it's clear that they aren't actually solid colors. It's just that all the variability is at the pixel level and there are no large-scale structures. (There couldn't possibly be, because the ordering of rows and columns is arbitrary: you could shuffle them around and get an equivalent model. The most you could expect is some banding where values in the same row or column are related.)
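For concreteness, here's a tiny numpy sketch of that claim (the layer sizes are made up): permuting the hidden units of a two-layer MLP, i.e. shuffling the rows of the first weight matrix together with the corresponding columns of the second, leaves the function the network computes unchanged.

```python
# Minimal sketch: reordering hidden units (rows of W1 plus the matching
# columns of W2) produces an equivalent network with identical outputs.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4

W1 = rng.normal(size=(d_hidden, d_in))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))
b2 = rng.normal(size=d_out)

def mlp(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)   # ReLU hidden layer
    return W2 @ h + b2

perm = rng.permutation(d_hidden)       # arbitrary reordering of hidden units
x = rng.normal(size=d_in)

out_original = mlp(x, W1, b1, W2, b2)
out_permuted = mlp(x, W1[perm], b1[perm], W2[:, perm], b2)

assert np.allclose(out_original, out_permuted)
```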
Always thought of the opposite: DL model as a compression method.
Like taking an initial frame of a movie, training a DL model to reproduce the following frames, and then transmitting that frame plus the parameters needed to reproduce the movie.
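A toy sketch of that idea (not a real codec, and the network sizes here are arbitrary): overfit a small coordinate MLP to map (frame index, pixel position) to pixel values, and treat the trained weights as the "compressed" clip that gets transmitted.

```python
# Toy sketch: f(t, x, y) -> pixel value, deliberately overfit on one clip.
# The trained weights then act as a lossy compressed representation.
import torch
import torch.nn as nn

T, H, W = 16, 32, 32                       # tiny synthetic "video"
video = torch.rand(T, H, W)                # stand-in for real frames

# All (t, x, y) coordinates, normalised to [0, 1]
coords = torch.stack(torch.meshgrid(
    torch.linspace(0, 1, T),
    torch.linspace(0, 1, H),
    torch.linspace(0, 1, W),
    indexing="ij"), dim=-1).reshape(-1, 3)
targets = video.reshape(-1, 1)

model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):                   # overfit on purpose
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(coords), targets)
    loss.backward()
    opt.step()

# "Transmit" the weights; the receiver reconstructs frames by
# re-evaluating the network on the same coordinate grid.
with torch.no_grad():
    reconstruction = model(coords).reshape(T, H, W)
```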
It is not video, but I worked on image compression using neural networks for my master's thesis.
It was an improvement over an existing method, so there is quite a bit of research going on in this area.
Briefly, the idea was to use an auto-encoder to transform the image, then quantize and encode the transform coefficients. So you can actually "teach" the network to be resilient to the quantization operation. Very similar to what the author describes.
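Roughly along those lines (this is a generic sketch, not the exact method from the thesis): one common way to make an auto-encoder aware of quantization during training is to round the latent coefficients in the forward pass and use a straight-through estimator so the rounding doesn't block gradients.

```python
# Rough sketch: latent coefficients are rounded during training, with a
# straight-through estimator so gradients flow past the rounding step.
import torch
import torch.nn as nn

class QuantizedAutoencoder(nn.Module):
    def __init__(self, dim=784, latent=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                     nn.Linear(256, dim))

    def forward(self, x):
        z = self.encoder(x)
        # Forward pass uses round(z); backward pass treats it as identity.
        z_q = z + (torch.round(z) - z).detach()
        return self.decoder(z_q)

model = QuantizedAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)                    # dummy batch of flattened images
opt.zero_grad()
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
opt.step()
```

In a real pipeline the rounded coefficients would then be entropy-coded; the point of the sketch is just that the decoder learns to cope with the quantization error.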
Deep learning for video compression seems like it'll have a natural advantage just because video codecs have never been particularly large (maybe 100 KBs?) but everyone expects an ML model to be enormous. That's a lot of space to store common data in.
Any dimensionality reduction method (DL-based or otherwise) can be viewed as a form of compression. But the end goal of dimensionality reduction is usually quite different from that of data compression.
Either way, this is very different from model compression/distillation, which is compression of the parametrized functional mapping itself. As a silly example, imagine you fit a 100-degree polynomial to noisy linear data using some proper regularization. You would find that you can distill/compress your 100-degree polynomial model into a 1-degree model with comparable accuracy.
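Here's a quick numpy sketch of that silly example (the noise level and regularization strength are made up): fit a ridge-regularized degree-100 polynomial to noisy linear data, then "distill" it by fitting a degree-1 model to the big model's own predictions.

```python
# Sketch: a heavily regularised 100-degree polynomial "teacher" distilled
# into a 1-degree "student" with comparable accuracy on linear data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=x.shape)  # noisy linear data

# Teacher: degree-100 polynomial fit with ridge regularisation.
X = np.vander(x, 101, increasing=True)          # columns 1, x, x^2, ..., x^100
lam = 1e-3
teacher_coef = np.linalg.solve(X.T @ X + lam * np.eye(101), X.T @ y)
teacher_pred = X @ teacher_coef

# Student: degree-1 model fit to the teacher's predictions, not the data.
student_coef = np.polyfit(x, teacher_pred, deg=1)
student_pred = np.polyval(student_coef, x)

print("teacher MSE vs data:", np.mean((teacher_pred - y) ** 2))
print("student MSE vs data:", np.mean((student_pred - y) ** 2))
```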
Yeah, I've thought about something along those lines, maybe applied on a per-layer/architectural-element basis. I'm sure someone's done it, because it just seems like another optimization problem.
Because unless you have a certain (high) level of sparsity, sparse formats are in fact inefficient for storage. There are cases where sparse formats take more memory than storing dense tensors.
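You can see the break-even point with a quick scipy sketch: a CSR matrix stores data, indices, and indptr arrays, so once enough entries are non-zero it costs more than the equivalent dense array (the exact crossover depends on dtype and index width).

```python
# Quick check: CSR memory (data + indices + indptr) vs a dense float32 array
# at several densities. At high density the sparse format is larger.
import numpy as np
from scipy import sparse

shape = (1000, 1000)
dense_bytes = np.zeros(shape, dtype=np.float32).nbytes

for density in (0.01, 0.1, 0.3, 0.5, 0.9):
    mat = sparse.random(shape[0], shape[1], density=density,
                        format="csr", dtype=np.float32, random_state=0)
    csr_bytes = mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes
    print(f"density {density:.2f}: dense {dense_bytes} B, CSR {csr_bytes} B")
```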