It's a degree of freedom of the learnable model. For example, a "vanilla" neural network layer (an MLP layer) that maps from M input features to N output features contains an MxN matrix of learnable parameters modelling the connections between the M inputs and the N outputs. Every time the model is updated during backpropagation, the loss gradient that has to be computed has one entry per parameter. Also, more parameters generally means more operations in the forward pass. So a model with more parameters will, in general, require more FLOPs per iteration of training. The main point of this paper is that, given a fixed FLOP budget, you can actually do better by training a smaller model for longer rather than a bigger model for less time.
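To make that concrete, here's a rough back-of-the-envelope sketch (plain Python, with made-up layer widths) of how the parameter count, gradient size, and per-example FLOPs scale together for one dense layer:

    # One dense (MLP) layer mapping M input features to N output features.
    # The layer widths below are hypothetical, just for illustration.
    M, N = 1024, 4096

    weight_params = M * N      # one learnable weight per input-output connection
    bias_params = N            # optional bias, one per output feature
    total_params = weight_params + bias_params

    # The gradient computed during backprop has one entry per parameter.
    grad_entries = total_params

    # A forward pass through this layer costs roughly one multiply-add per
    # weight, i.e. ~2*M*N FLOPs per example in the batch.
    flops_per_example = 2 * M * N

    print(total_params, grad_entries, flops_per_example)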
The other thing about more parameters is that they give the NN more capacity to overfit. That means that instead of, say, learning what a dog is, it memorises all the sentences containing "dog" that it has ever seen.
And if every parameter is one byte, the minimum, it will take at least 70 GB to save or share this model. So it's still way too big to package directly into an app.
From the paper, they are using bfloat16, so I guess two bytes per parameter. But distributing and "packaging into an app" are not really of practical interest for these kinds of models. You (a consumer) would interact via some API service, with the model running on a hardware-accelerated compute cloud.
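For concreteness, the arithmetic (the ~70B parameter count is just what "one byte per parameter gives ~70 GB" above implies):

    # Rough model-size arithmetic, assuming ~70B parameters.
    num_params = 70e9

    bytes_1 = num_params * 1      # 1 byte/param (e.g. 8-bit quantized)
    bytes_bf16 = num_params * 2   # 2 bytes/param (bfloat16)

    print(f"{bytes_1 / 1e9:.0f} GB at 1 byte per parameter")
    print(f"{bytes_bf16 / 1e9:.0f} GB at 2 bytes per parameter")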
In any case, during training (where the model is run on possibly large batches), and even during inference, the memory taken by the parameters is completely dwarfed by the intermediate tensor representations.
It's especially true for models that do some kind of weight sharing, which is very common (CNNs, RNNs, transformers, etc.). For a concrete example, consider a layer from a convolutional image network that maps from a 3-dimensional colour space to a 128-dimensional feature space. Assuming a 5x5 kernel, that's about 10k parameters (5 x 5 x 3 x 128 = 9,600). However, after applying this layer you go from a (B, H, W, 3) tensor to a (B, H-4, W-4, 128) tensor, where H and W are the height and width of the image and B is the number of images in the batch. If you're working with even moderately high-resolution images, the memory required for these intermediate tensors at each layer is much larger than that required for the parameters.
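A quick sketch of that comparison (the layer shape matches the example above; the image resolution and batch size are made up):

    # Conv layer from the example: 5x5 kernel, 3 input channels, 128 output
    # channels, no padding. Batch size and image size are hypothetical.
    kh = kw = 5
    c_in, c_out = 3, 128
    B, H, W = 32, 512, 512

    conv_params = kh * kw * c_in * c_out             # ~9.6k weights (ignoring biases)

    input_elems = B * H * W * c_in                   # (B, H, W, 3)
    output_elems = B * (H - 4) * (W - 4) * c_out     # (B, H-4, W-4, 128)

    # At 4 bytes per float32 element:
    print(f"parameters:         {conv_params * 4 / 1e6:.2f} MB")
    print(f"input activations:  {input_elems * 4 / 1e9:.2f} GB")
    print(f"output activations: {output_elems * 4 / 1e9:.2f} GB")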
Something similar applies to RNNs (the same weights applied at each element of a sequence), GNNs, and transformers (the same weights applied across every pair of elements), as in the sketch below.
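For instance (a minimal sketch with made-up sizes), an RNN cell's parameter count is fixed regardless of sequence length, while the activations kept around for backprop grow linearly with it:

    # Weight sharing in a simple (Elman-style) RNN: the same cell weights are
    # reused at every timestep. All sizes below are hypothetical.
    d_in, d_hidden = 256, 1024
    B, T = 64, 2048                   # batch size, sequence length

    # Input-to-hidden weights, hidden-to-hidden weights, and a bias.
    cell_params = d_in * d_hidden + d_hidden * d_hidden + d_hidden

    # Activations stored for backprop: one hidden state per timestep per example.
    activation_elems = B * T * d_hidden

    print(cell_params)        # fixed, independent of T
    print(activation_elems)   # grows linearly with T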