
Why don't gradients vanish in large-scale LLMs?



Not easy to give a concise answer here, but let me try:

The problem mainly occurs in networks with recurrent connections or in very deep architectures. In recurrent networks it was largely solved by LSTMs, whose gates and cell state let the error signal flow across many time steps without being repeatedly squashed. In very deep feed-forward networks, e.g. ResNet, it was solved via residual connections, i.e. skip connections over layers. There were other advances too, such as replacing saturating sigmoid activations with ReLU, whose derivative is exactly 1 for positive inputs.
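
To make that concrete, here's a toy numpy sketch (purely illustrative, all weights and sizes invented): it pushes a gradient backwards through a deep stack of sigmoid layers, once without and once with a skip connection at each layer.

    import numpy as np

    rng = np.random.default_rng(0)
    depth, width = 50, 64
    # random weights, roughly unit-scale; purely illustrative values
    Ws = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width)) for _ in range(depth)]

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_grad_norm(use_skip):
        x = rng.normal(size=width)
        layers = []
        for W in Ws:                      # forward pass, record activations
            h = sigmoid(W @ x)
            layers.append((W, h))
            x = x + h if use_skip else h  # skip connection adds an identity path
        g = np.ones(width)                # pretend dLoss/dOutput = 1 everywhere
        for W, h in reversed(layers):     # backward pass
            g_nonlin = (g * h * (1.0 - h)) @ W   # gradient through sigmoid(W @ x)
            g = g + g_nonlin if use_skip else g_nonlin
        return np.linalg.norm(g)

    print("plain sigmoid stack  :", backprop_grad_norm(False))  # tiny, effectively vanished
    print("with skip connections:", backprop_grad_norm(True))   # stays far from zero

The skip connection adds an identity path, so at worst the gradient passes through a layer unchanged instead of being multiplied by yet another sigmoid derivative (which is at most 0.25). That's essentially the same trick as the LSTM cell state and the ResNet shortcut.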

Transformers, the main architecture behind modern LLMs, are highly parallel and have no recurrence: at every layer you have access to all input tokens at once, whereas an RNN has to process one token at a time and chain gradients across the whole sequence. To address the remaining problem of depth, they also use skip connections around every sub-layer.
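
As a rough sketch (PyTorch, pre-norm style; module names and sizes are just placeholders, not any particular LLM's code), the two "x = x + ..." lines below are the skip connections, giving the gradient an identity path from the last block all the way back to the input:

    import torch
    import torch.nn as nn

    class Block(nn.Module):
        # One pre-norm transformer block: attention and MLP each wrapped
        # in a skip connection. Sizes below are arbitrary examples.
        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x):
            # x: (batch, seq_len, d_model) -- every layer sees all tokens at once
            h = self.ln1(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out               # skip connection 1
            x = x + self.mlp(self.ln2(x))  # skip connection 2
            return x

    x = torch.randn(2, 16, 256, requires_grad=True)
    blocks = nn.Sequential(*[Block() for _ in range(12)])
    blocks(x).sum().backward()
    print(x.grad.abs().mean())  # gradient reaches the input at a healthy magnitude

Because each sub-layer only adds a correction onto x, the derivative of every block with respect to its input contains an identity term; even when an attention or MLP sub-layer contributes almost nothing, the gradient still flows straight through.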



