There are some good explanations here of the self-attention mechanism that makes the Transformer architecture unique.
However, most people gloss over other aspects of the "Attention is all you need" paper, which is in a sense mis-titled.
For example, Andrej Karpathy pointed out that the paper had another significant improvement hidden in it: the residual (skip) connections let the gradients take a "shortcut" during training, so the bottom layers receive a useful training signal much sooner than in typical deep architectures. That is what makes it practical to train very large and deep models in a reasonable time. Without this trick, the huge LLMs we see today would not have been possible!
Andrej talks about it here: https://youtu.be/9uw3F6rndnA?t=238
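To make the "shortcut" concrete, here is a rough sketch (PyTorch assumed; a pre-norm variant for simplicity, whereas the paper itself applies LayerNorm after the addition). Each sub-layer computes x + sublayer(x), and that "+" gives the gradient a direct additive path that bypasses the sub-layer and flows straight down to the lower layers:

    # Minimal sketch of one residual attention sub-layer.
    # The key point: the output is x + attn(...), so d(out)/dx = I + d(attn)/dx.
    # The identity term carries the gradient to lower layers even when the
    # attention term contributes little early in training.
    import torch
    import torch.nn as nn

    class ResidualAttentionBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            h = self.norm(x)                      # pre-norm variant
            attn_out, _ = self.attn(h, h, h)      # self-attention sub-layer
            return x + attn_out                   # the residual "shortcut"

    x = torch.randn(1, 16, 512)                   # (batch, seq_len, d_model)
    y = ResidualAttentionBlock()(x)
    print(y.shape)                                # torch.Size([1, 16, 512])

Stack a few dozen of these blocks and the same additive path runs from the loss all the way to the embedding layer, which is why the bottom layers don't starve for gradient the way they do in a plain deep stack.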