> Secondly, his description of each layer's function as adding information to the original vector misses the mark IMO--it is more like the original input is convolved with the weights of the transformer into the output. I am probably missing the mark a bit here as well.
> Lastly, his statement that the embedding vector of the final token output needs all the info for the next token is plainly incorrect. The final decoder layer, when predicting the next token, uses all the information from the previous layer's hidden layer, which is the size of the hidden units times the number of tokens so far.
I think the author is correct. Information is only moved between tokens in the attention layers, not in the MLP layers or in the final linear layer before the softmax. You can see how it’s implemented in nanoGPT:
https://github.com/karpathy/nanoGPT/blob/f08abb45bd2285627d1...
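To make that concrete, here's a rough sketch of the two sublayers (not nanoGPT's actual code; the class names and hyperparameters are my own). The masked `att @ v` product is the only place where position t reads from earlier positions; the MLP's Linear layers act on the embedding dimension only, so they never mix information across tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    # The only sublayer where information moves between token positions.
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        # causal mask: position t may only attend to positions <= t
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v  # <- cross-token mixing happens here, and only here
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

class MLP(nn.Module):
    # Applied to each position independently: the Linear layers act on the
    # last (embedding) dimension, so no information crosses token positions.
    def __init__(self, n_embd):
        super().__init__()
        self.fc = nn.Linear(n_embd, 4 * n_embd)
        self.proj = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x):
        return self.proj(F.gelu(self.fc(x)))
```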
At training time, probabilities for the next token are computed at every position, so feeding in a sequence of n tokens effectively gives n training examples, one per position. At inference time we only need the prediction at the last position, since the preceding tokens have already been output.
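Roughly like this, assuming a generic `model(tokens) -> (B, T, vocab_size)` interface rather than nanoGPT's exact signature:

```python
import torch
import torch.nn.functional as F

def training_loss(model, tokens):
    # tokens: (B, T) integer ids; logits: (B, T-1, vocab_size)
    logits = model(tokens[:, :-1])            # predict token t+1 from tokens <= t
    targets = tokens[:, 1:]                   # shifted targets
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # every position is a training example
        targets.reshape(-1),
    )

@torch.no_grad()
def generate_next(model, tokens):
    logits = model(tokens)                    # (B, T, vocab_size)
    # at inference only the logits at the final position matter
    return torch.argmax(logits[:, -1, :], dim=-1)
```

(nanoGPT does the equivalent slicing inside the model's forward when no targets are passed, so it doesn't even compute the other positions' logits at inference.)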
I think his description is basically correct given how the residual stream works: the output of each sublayer is added onto its input. See https://transformer-circuits.pub/2021/framework/index.html
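i.e. a pre-LayerNorm transformer block looks roughly like this (a sketch; the attention and MLP modules are whatever implementation you plug in, e.g. the ones sketched above):

```python
import torch.nn as nn

class Block(nn.Module):
    # Each sublayer's output is *added onto* the residual stream
    # rather than replacing it.
    def __init__(self, n_embd, attn, mlp):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = attn   # e.g. the CausalSelfAttention sketch above
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = mlp     # e.g. the per-position MLP sketch above

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # attention writes into the residual stream
        x = x + self.mlp(self.ln2(x))   # MLP writes into the residual stream
        return x
```

So the residual stream carries the original embedding forward, and each layer only adds a delta to it, which is why "each layer adds information" is a fair gloss.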