> Looks interesting, but I'm skeptical that a book can feasibly stay up to date with the speed of development.
The basic structure of the base models has not really changed since the first GPT launched in 2018. You will still have to understand gradient descent, tokenization, embeddings, self-attention, MLPs, supervised fine-tuning, RLHF, etc. for the foreseeable future.
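To make that concrete: the core pretraining loop is still tokenize, embed, predict the next token, and take a gradient step. A minimal sketch in PyTorch, with a toy model and made-up sizes purely for illustration (nothing here is from any particular codebase):

```python
# Sketch of the unchanged core loop: tokenize -> embed -> predict next token -> gradient descent.
# Model name, vocab size, and dimensions are placeholders for illustration only.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.head = nn.Linear(d_model, vocab_size)       # next-token logits

    def forward(self, tokens):
        return self.head(self.embed(tokens))

model = TinyLM()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)       # plain gradient descent

tokens = torch.randint(0, vocab_size, (4, 16))           # stand-in for a tokenized batch
logits = model(tokens[:, :-1])                           # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()
opt.step()
```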
Adding RL-based CoT training would be a relatively straightforward addendum to a new edition, and it's an application of long-established methods like PPO.
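The piece of PPO that everything else hangs on is the clipped surrogate objective, which has been around since 2017. A rough sketch, with placeholder tensors standing in for real rollout data:

```python
# Sketch of PPO's clipped surrogate objective, the long-established piece that
# RL-based CoT training builds on. All tensors below are illustrative placeholders.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping keeps a single update from moving the policy too far.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with per-token log-probs and advantages (e.g. derived from a reward model).
logp_new = torch.randn(8, requires_grad=True)
logp_old = torch.randn(8)
adv = torch.randn(8)
loss = ppo_clip_loss(logp_new, logp_old, adv)
loss.backward()
```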
All "generations" of models are presented as revolutionary -- and results-wise they maybe are -- but technically they are usually quite incremental "tweaks" to the previous architecture.
Even more "radical" departures like state space models are closely related to the same basic techniques and architectures.
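As a rough illustration (a toy linear state-space recurrence, not the parameterization of any particular published model), an SSM layer essentially swaps the sequence mixer while the surrounding block structure stays familiar:

```python
# Toy state-space layer: a linear recurrence replaces self-attention as the
# sequence mixer, but it slots into the same kind of residual block.
# A, B, C are random placeholders, not a trained or stable parameterization.
import torch

def ssm_scan(x, A, B, C):
    # x: (seq_len, d_in). The hidden state evolves as h_t = A @ h_{t-1} + B @ x_t,
    # and the output is y_t = C @ h_t -- the basic linear state-space recurrence.
    d_state = A.shape[0]
    h = torch.zeros(d_state)
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return torch.stack(ys)

x = torch.randn(16, 8)            # a short sequence of 8-dim inputs
A = torch.randn(4, 4) * 0.1       # state transition (scaled down for stability)
B = torch.randn(4, 8)
C = torch.randn(8, 4)
y = ssm_scan(x, A, B, C)          # (16, 8): one output per timestep
```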
Transformer encoders are not really popular anymore, and all the top LLMs are decoder-only architectures, but encoder models like BERT are still used for some tasks (e.g. classification and retrieval).
In any case, self-attention plus an MLP is the crux of a Transformer block, be it in the decoder or the encoder.
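Something like this minimal pre-norm block (sizes and layer choices are illustrative, not taken from any specific model):

```python
# Minimal pre-norm Transformer block: self-attention followed by an MLP, each
# wrapped in a residual connection. The same core serves encoders and decoders;
# decoder-only models just add a causal mask.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, causal=True):
        T = x.shape[1]
        # A causal mask makes this a decoder block; drop it and it's an encoder block.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1) if causal else None
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                    # residual around self-attention
        x = x + self.mlp(self.ln2(x))       # residual around the MLP
        return x

x = torch.randn(2, 10, 64)                 # (batch, seq, d_model)
y = Block()(x)                             # same shape out
```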