
There are some good explanations here of the self-attention architecture that makes Transformers unique.

However, most people gloss over other aspects of the "Attention is all you need" paper, which is in a sense mis-titled.

For example, Andrej Karpathy pointed out that the paper had another significant improvement hidden in it: during training the gradients can take a "shortcut" through the residual connections, flowing directly from the loss back to the bottom layers, so those layers train much faster than in typical deep stacks without such connections. This makes it feasible to train very large and deep models in a reasonable time. Without this trick, the huge LLMs we see these days would not have been possible!

Andrej talks about it here: https://youtu.be/9uw3F6rndnA?t=238
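As a rough illustration (my own sketch, not from the talk, and using a GPT-style pre-norm layout rather than the paper's exact post-norm arrangement), here is a minimal PyTorch Transformer block. The two "x + ..." lines are the residual shortcut: the derivative of each block's output with respect to its input always contains an identity term, so gradients reach the earliest layers directly no matter how deep the stack is.

    # Illustrative sketch only: a pre-norm Transformer block showing where the
    # residual additions sit and how gradients reach the input unimpeded.
    import torch
    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x):
            # The "x + ..." additions are the residual connections: even if the
            # attention and MLP branches pass back little gradient, the identity
            # path still carries gradient straight through this block.
            h = self.ln1(x)
            a, _ = self.attn(h, h, h)
            x = x + a
            x = x + self.mlp(self.ln2(x))
            return x

    x = torch.randn(1, 16, 512, requires_grad=True)
    Block()(x).sum().backward()
    print(x.grad.shape)  # gradient arrives at the input via the residual path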



