> even during inference, the size of the parameters is completely dwarfed by the intermediate tensor representations

What makes you say this?




It's especially true for models that do some kind of weight sharing, which is very common (CNNs, RNNs, transformers, etc). For a concrete example, consider a layer from an image convolutional network that maps from a 3-dim colorspace to a 128-dim feature space. Assuming a 5x5 kernel, that's 5*5*3*128 ≈ 10k parameters. However, after applying this layer, you go from a (B,H,W,3) tensor to a (B,H-4,W-4,128) tensor, where H,W are the height and width of the image and B is the number of images in the batch. If you're working with even moderately high resolution images, the memory required for these intermediate tensors at each layer is much larger than the memory for the parameters.
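A rough back-of-the-envelope sketch of that example in PyTorch (the batch size and image resolution are assumed for illustration; note PyTorch uses channels-first (B,C,H,W) layout rather than the (B,H,W,C) layout written above):

    import torch
    import torch.nn as nn

    # 3 -> 128 channels, 5x5 kernel, no padding (so H and W shrink by 4)
    conv = nn.Conv2d(in_channels=3, out_channels=128, kernel_size=5)

    n_params = sum(p.numel() for p in conv.parameters())
    print(n_params)  # 5*5*3*128 weights + 128 biases = 9728, ~40 KB in float32

    # A modest batch of 512x512 RGB images (sizes assumed for illustration)
    x = torch.randn(8, 3, 512, 512)
    y = conv(x)
    print(y.shape)            # (8, 128, 508, 508)
    print(x.numel(), y.numel())  # ~6.3M input values, ~264M output values

    # At 4 bytes per float32, the output activations of this single layer
    # take roughly 1 GB, versus ~40 KB for the layer's parameters.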

Something similar applies to RNNs (the same weights applied at each element of a sequence) and to GNNs and transformers (the same weights applied at each pair of elements).
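For the transformer case, a minimal sketch along the same lines (layer width, batch size, and sequence length are assumed for illustration): a single self-attention layer has a fixed parameter count, but its attention-score tensor grows with the square of the sequence length.

    import torch
    import torch.nn as nn

    d_model, n_heads = 512, 8
    attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads,
                                 batch_first=True)

    n_params = sum(p.numel() for p in attn.parameters())
    print(n_params)  # ~1.05M parameters, independent of sequence length

    B, T = 4, 2048  # batch size and sequence length (assumed)
    x = torch.randn(B, T, d_model)
    out, weights = attn(x, x, x, need_weights=True, average_attn_weights=False)
    print(weights.shape)    # (4, 8, 2048, 2048)
    print(weights.numel())  # ~134M attention scores (~0.5 GB in float32)
                            # for this one layer, dwarfing its ~1M parameters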



