
Is there a good reference as to what a "parameter" is in this context? I've looked a few times, but the explanations don't make any sense to me.



It's a degree of freedom of the learnable model. For example, a "vanilla" neural network layer (an MLP layer) that maps from M to N feature dimensions contains an MxN matrix of learnable parameters modelling the connections between the M inputs and the N outputs. Every time the model is updated during backpropagation, the loss gradient that has to be computed has the same dimensionality as the number of parameters. Also, more parameters generally means more operations in the forward pass, so a model with more parameters will in general require more FLOPs per iteration of training. The main point of this paper is that, with a fixed FLOP budget, you can actually do better by training a smaller model for longer rather than a bigger model for less time.
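A minimal sketch of the parameter-counting part in PyTorch (the framework and the 512/256 sizes are my choices for illustration, not from the paper):

    import torch.nn as nn

    # A "vanilla" layer mapping M = 512 input features to N = 256 output features.
    layer = nn.Linear(512, 256)

    # Weight matrix: 512 * 256 = 131072 parameters, plus a bias vector of 256.
    n_params = sum(p.numel() for p in layer.parameters())
    print(n_params)  # 131328

    # During backprop, each parameter gets a gradient of the same shape,
    # so gradient storage grows with the parameter count as well.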


The other thing with more parameters is that it gives the NN more ability to overfit. That means that instead of, say, learning what a dog is, it instead memorises all the sentences containing "dog" that it has ever seen.


You can think of a parameter as a number you can tweak while training. This network has 70B such numbers.


And if every parameter is one byte, the minimum, it will take at least 70 GB to save or share this model. So it's still way too big to package directly in an app.


From the paper, they are using bfloat16, so I guess two bytes per parameter (quick size math below). But distributing and "packaging into an app" aren't really of practical interest for these kinds of models. You (a consumer) would interact via some API service, with the model running on a hardware-accelerated compute cloud.

In any case, during training (where the model is run in possibly large batches), and even during inference, the size of the parameters is completely dwarfed by the intermediate tensor representations.
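Back-of-the-envelope on raw parameter storage (just 70B times bytes per element; the dtype list is mine, only bfloat16 is from the paper):

    n_params = 70e9

    for dtype, bytes_per_param in [("int8", 1), ("bfloat16", 2), ("float32", 4)]:
        print(f"{dtype}: ~{n_params * bytes_per_param / 1e9:.0f} GB")

    # int8:     ~70 GB
    # bfloat16: ~140 GB
    # float32:  ~280 GB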


> even during inference, the size of the parameters is completely dwarfed by the intermediate tensor representations

What makes you say this?


It's especially true for models that do some kind of weight sharing, which is very common (CNNs, RNNs, transformers, etc). For a concrete example, consider a layer from an image convolutional network which maps from a 3-dim colorspace to a 128-dim feature space. Assuming a 5x5 kernel, that's about 10k parameters. However, after applying this layer you go from having a (B,H,W,3) tensor to a (B,H-4,W-4,128) tensor, where H,W are the height and width of the image and B is the number of images in the batch. If you're working with even moderately high-resolution images, the memory required for these intermediate tensors at each layer is much larger than the parameters (rough numbers sketched below).

Something similar applies for RNNs (same weights applied at each element of a sequence), GNNs and transformers (same weights applied at each pair of data).
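Here's that conv example sketched in PyTorch (channel-first layout, float32, and a batch of eight 512x512 images are my assumptions for illustration):

    import torch
    import torch.nn as nn

    # 3 input channels -> 128 output channels, 5x5 kernel, no padding.
    conv = nn.Conv2d(in_channels=3, out_channels=128, kernel_size=5)
    n_params = sum(p.numel() for p in conv.parameters())  # 3*128*5*5 + 128 = 9728

    x = torch.randn(8, 3, 512, 512)   # (B, C, H, W) -- PyTorch is channel-first
    y = conv(x)                        # shape (8, 128, 508, 508)

    bytes_per_elem = 4                 # float32
    print(f"parameters:  {n_params * bytes_per_elem / 1e6:.2f} MB")   # ~0.04 MB
    print(f"activations: {y.numel() * bytes_per_elem / 1e6:.0f} MB")  # ~1057 MB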


Have you seen modern games?


I doubt they load that amount of data in memory


I'm thinking about upgrading from 64GB to 128GB so I can use all my Cities: Skylines assets in the same map


Right, they usually stream assets as they are requested. Large models do the same.
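If you want a feel for what "streaming" weights looks like, here's a hedged sketch using numpy's memory mapping; the file name, shard size, and dtype are made up, and real serving stacks do far more (sharding across GPUs, quantization, etc.):

    import numpy as np

    SHARD = "model_shard_00.bin"   # hypothetical file name
    N = 100_000_000                # 100M float16 params ~= 200 MB per shard

    # Write a dummy shard once (stand-in for a real checkpoint shard on disk).
    np.memmap(SHARD, dtype=np.float16, mode="w+", shape=(N,)).flush()

    # Memory-map it read-only: the OS pages in only the chunks that get touched,
    # rather than loading the whole file up front -- like a game streaming assets.
    weights = np.memmap(SHARD, dtype=np.float16, mode="r", shape=(N,))

    # Only the pages backing this slice are actually read from disk.
    first_block = np.array(weights[: 4096 * 4096])
    print(first_block.nbytes / 1e6, "MB touched of", weights.nbytes / 1e6, "MB on disk")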



