There are some good explanations here of the self-attention mechanism that makes the Transformer architecture unique.
However, most people gloss over other aspects of the "Attention is all you need" paper, which is in a sense mis-titled.
For example, Andrej Karpathy pointed out that the paper had another significant improvement hidden in it: the residual (skip) connections let the gradients take a "shortcut" during training, so the bottom layers receive a useful training signal much sooner than in typical deep architectures. That is what makes it practical to train very large and deep models in a reasonable time. Without this trick, the huge LLMs we see today would not have been possible!
Andrej talks about it here: https://youtu.be/9uw3F6rndnA?t=238
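To make the "shortcut" concrete, here is a rough sketch (PyTorch assumed; a pre-norm variant for simplicity, whereas the paper itself applies LayerNorm after the addition). Each sub-layer computes x + sublayer(x), and that "+" gives the gradient a direct additive path that bypasses the sub-layer and flows straight down to the lower layers:

    # Minimal sketch of one residual attention sub-layer.
    # The key point: the output is x + attn(...), so d(out)/dx = I + d(attn)/dx.
    # The identity term carries the gradient to lower layers even when the
    # attention term contributes little early in training.
    import torch
    import torch.nn as nn

    class ResidualAttentionBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            h = self.norm(x)                      # pre-norm variant
            attn_out, _ = self.attn(h, h, h)      # self-attention sub-layer
            return x + attn_out                   # the residual "shortcut"

    x = torch.randn(1, 16, 512)                   # (batch, seq_len, d_model)
    y = ResidualAttentionBlock()(x)
    print(y.shape)                                # torch.Size([1, 16, 512])

Stack a few dozen of these blocks and the same additive path runs from the loss all the way to the embedding layer, which is why the bottom layers don't starve for gradient the way they do in a plain deep stack.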