They rely heavily on what we call residual or skip connections: each layer computes something like x = x + f(x), adding its output to its input rather than replacing it. This helps training a lot, because the identity path lets gradients flow cleanly through the whole network.
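To make that concrete, here is a minimal sketch of a residual wrapper in PyTorch. The class name, layer sizes, and the choice of a small feed-forward sub-layer are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps any sub-layer f so the block computes x + f(x)."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        # The identity path (x) lets gradients flow straight through,
        # even if the sub-layer's contribution is small or poorly conditioned.
        return x + self.sublayer(x)

# Hypothetical example: a residual-wrapped feed-forward layer with width 64.
block = ResidualBlock(nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))
out = block(torch.randn(8, 64))  # output shape (8, 64), same as the input
```

Note that the sub-layer must preserve the input's shape, since its output is added elementwise to x.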
This trick is central to ResNets (residual networks) in computer vision, where it is what allows training much deeper convolutional networks, and transformers use exactly the same trick.