> Looks interesting, but I'm skeptical that a book can feasibly stay up to date with the speed of development.
The basic structure of the base models has not really changed since the first GPT launched in 2018. You will still have to understand gradient descent, tokenization, embeddings, self-attention, MLPs, supervised fine-tuning, RLHF, etc. for the foreseeable future.
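To make that concrete: the core pretraining loop is still tokenize, embed, predict the next token, and take a gradient step. A minimal sketch in PyTorch, with a toy model and made-up sizes purely for illustration (nothing here is from any particular codebase):

```python
# Sketch of the unchanged core loop: tokenize -> embed -> predict next token -> gradient descent.
# Model name, vocab size, and dimensions are placeholders for illustration only.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.head = nn.Linear(d_model, vocab_size)       # next-token logits

    def forward(self, tokens):
        return self.head(self.embed(tokens))

model = TinyLM()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)       # plain gradient descent

tokens = torch.randint(0, vocab_size, (4, 16))           # stand-in for a tokenized batch
logits = model(tokens[:, :-1])                           # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()
opt.step()
```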
Adding RL-based CoT training would be a relatively straightforward addendum to a new edition, and it's an application of long-established methods like PPO.
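The piece of PPO that everything else hangs on is the clipped surrogate objective, which has been around since 2017. A rough sketch, with placeholder tensors standing in for real rollout data:

```python
# Sketch of PPO's clipped surrogate objective, the long-established piece that
# RL-based CoT training builds on. All tensors below are illustrative placeholders.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping keeps a single update from moving the policy too far.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with per-token log-probs and advantages (e.g. derived from a reward model).
logp_new = torch.randn(8, requires_grad=True)
logp_old = torch.randn(8)
adv = torch.randn(8)
loss = ppo_clip_loss(logp_new, logp_old, adv)
loss.backward()
```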
All "generations" of models are presented as revolutionary -- and results-wise they maybe are -- but technically they are usually quite incremental "tweaks" to the previous architecture.
Even more "radical" departures like state space models are closely related to the same basic techniques and architectures.
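As a rough illustration (a toy linear state-space recurrence, not the parameterization of any particular published model), an SSM layer essentially swaps the sequence mixer while the surrounding block structure stays familiar:

```python
# Toy state-space layer: a linear recurrence replaces self-attention as the
# sequence mixer, but it slots into the same kind of residual block.
# A, B, C are random placeholders, not a trained or stable parameterization.
import torch

def ssm_scan(x, A, B, C):
    # x: (seq_len, d_in). The hidden state evolves as h_t = A @ h_{t-1} + B @ x_t,
    # and the output is y_t = C @ h_t -- the basic linear state-space recurrence.
    d_state = A.shape[0]
    h = torch.zeros(d_state)
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return torch.stack(ys)

x = torch.randn(16, 8)            # a short sequence of 8-dim inputs
A = torch.randn(4, 4) * 0.1       # state transition (scaled down for stability)
B = torch.randn(4, 8)
C = torch.randn(8, 4)
y = ssm_scan(x, A, B, C)          # (16, 8): one output per timestep
```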
Transformer encoders are not really popular anymore, and all the top LLMs are decoder-only architectures, but encoder models like BERT are still used for some tasks (e.g. classification and retrieval).
In any case, self-attention plus an MLP is the crux of a Transformer block, be it in the decoder or the encoder.
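Something like this minimal pre-norm block (sizes and layer choices are illustrative, not taken from any specific model):

```python
# Minimal pre-norm Transformer block: self-attention followed by an MLP, each
# wrapped in a residual connection. The same core serves encoders and decoders;
# decoder-only models just add a causal mask.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, causal=True):
        T = x.shape[1]
        # A causal mask makes this a decoder block; drop it and it's an encoder block.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1) if causal else None
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                    # residual around self-attention
        x = x + self.mlp(self.ln2(x))       # residual around the MLP
        return x

x = torch.randn(2, 10, 64)                 # (batch, seq, d_model)
y = Block()(x)                             # same shape out
```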