
Is there a good reference as to what a "parameter" is in this context? I've looked a few times, but the explanations don't make any sense to me.



It's a degree of freedom of the learnable model. For example, a "vanilla" neural network layer (an MLP layer) that maps from M to N feature dimensions contains an MxN matrix of learnable parameters modelling the connections between the M inputs and the N outputs. Every time the model is updated during backpropagation, the loss gradient that has to be computed has the same dimensionality as the number of parameters. Also, more parameters generally means more operations in the forward pass, so a model with more parameters will in general require more FLOPs per iteration of training. The main point of this paper is that, with a fixed FLOP budget, you can actually do better by training a smaller model for longer rather than a bigger model for less time.
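A minimal sketch of the parameter-counting part in PyTorch (the framework and the 512/256 sizes are my choices for illustration, not from the paper):

    import torch.nn as nn

    # A "vanilla" layer mapping M = 512 input features to N = 256 output features.
    layer = nn.Linear(512, 256)

    # Weight matrix: 512 * 256 = 131072 parameters, plus a bias vector of 256.
    n_params = sum(p.numel() for p in layer.parameters())
    print(n_params)  # 131328

    # During backprop, each parameter gets a gradient of the same shape,
    # so gradient storage grows with the parameter count as well.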


The other thing with more parameters is that it gives the NN more ability to overfit. That means that instead of, say, learning what a dog is, it instead memorises all the sentences containing "dog" that it has ever seen.


You can think of a parameter as a number you can tweak while training. This network has 70B such numbers.


And if every parameter is one byte, the minimum, it will take at least 70 GB to save or share this model. So it's still way too big to package directly in an app.


From the paper, they are using bfloat16, so I guess two bytes per parameter (quick size math below). But distributing and "packaging into an app" aren't really of practical interest for these kinds of models. You (a consumer) would interact via some API service, with the model running on a hardware-accelerated compute cloud.

In any case, during training (where the model is run in possibly large batches), and even during inference, the size of the parameters is completely dwarfed by the intermediate tensor representations.
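Back-of-the-envelope on raw parameter storage (just 70B times bytes per element; the dtype list is mine, only bfloat16 is from the paper):

    n_params = 70e9

    for dtype, bytes_per_param in [("int8", 1), ("bfloat16", 2), ("float32", 4)]:
        print(f"{dtype}: ~{n_params * bytes_per_param / 1e9:.0f} GB")

    # int8:     ~70 GB
    # bfloat16: ~140 GB
    # float32:  ~280 GB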


> even during inference, the size of the parameters is completely dwarfed by the intermediate tensor representations

What makes you say this?


It's especially true for models that do some kind of weight sharing, which is very common (CNNs, RNNs, transformers, etc). For a concrete example, consider a layer from an image convolutional network which maps from a 3-dim colorspace to a 128-dim feature space. Assuming a 5x5 kernel, that's about 10k parameters. However, after applying this layer you go from having a (B,H,W,3) tensor to a (B,H-4,W-4,128) tensor, where H,W are the height and width of the image and B is the number of images in the batch. If you're working with even moderately high-resolution images, the memory required for these intermediate tensors at each layer is much larger than the parameters (rough numbers sketched below).

Something similar applies for RNNs (same weights applied at each element of a sequence), GNNs and transformers (same weights applied at each pair of data).
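Here's that conv example sketched in PyTorch (channel-first layout, float32, and a batch of eight 512x512 images are my assumptions for illustration):

    import torch
    import torch.nn as nn

    # 3 input channels -> 128 output channels, 5x5 kernel, no padding.
    conv = nn.Conv2d(in_channels=3, out_channels=128, kernel_size=5)
    n_params = sum(p.numel() for p in conv.parameters())  # 3*128*5*5 + 128 = 9728

    x = torch.randn(8, 3, 512, 512)   # (B, C, H, W) -- PyTorch is channel-first
    y = conv(x)                        # shape (8, 128, 508, 508)

    bytes_per_elem = 4                 # float32
    print(f"parameters:  {n_params * bytes_per_elem / 1e6:.2f} MB")   # ~0.04 MB
    print(f"activations: {y.numel() * bytes_per_elem / 1e6:.0f} MB")  # ~1057 MB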


Have you seen modern games?


I doubt they load that amount of data in memory


I'm thinking about upgrading from 64GB to 128GB so I can use all my Cities: Skylines assets in the same map


Right, they usually stream assets as they are requested. Large models do the same.
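If you want a feel for what "streaming" weights looks like, here's a hedged sketch using numpy's memory mapping; the file name, shard size, and dtype are made up, and real serving stacks do far more (sharding across GPUs, quantization, etc.):

    import numpy as np

    SHARD = "model_shard_00.bin"   # hypothetical file name
    N = 100_000_000                # 100M float16 params ~= 200 MB per shard

    # Write a dummy shard once (stand-in for a real checkpoint shard on disk).
    np.memmap(SHARD, dtype=np.float16, mode="w+", shape=(N,)).flush()

    # Memory-map it read-only: the OS pages in only the chunks that get touched,
    # rather than loading the whole file up front -- like a game streaming assets.
    weights = np.memmap(SHARD, dtype=np.float16, mode="r", shape=(N,))

    # Only the pages backing this slice are actually read from disk.
    first_block = np.array(weights[: 4096 * 4096])
    print(first_block.nbytes / 1e6, "MB touched of", weights.nbytes / 1e6, "MB on disk")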



