
This is wrong.

The term for models that look only at previous tokens in the sequence is auto-regressive.

Encoder vs. decoder has nothing to do with this.
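For concreteness, here's a minimal numpy sketch of the causal mask that makes attention autoregressive (function name and shapes are mine, not from any particular library):

    import numpy as np

    def causal_attention(q, k, v):
        # q, k, v: (seq_len, d) arrays for a single head
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)                # (seq_len, seq_len)
        # mask out future positions: position i may only attend to j <= i
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
        # softmax over the allowed (past and current) positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

Dropping the mask gives you bidirectional (encoder-style) attention over the same q, k, v; the autoregressive property lives entirely in that mask.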




Aren't a lot of transformers built in a way where attention is only applied to previous tokens in the sequence, even though it's fully possible to apply it both ways?


That's the autoregressive aspect. The decoder aspect is that the last layer converts representations into output tokens (and generation happens autoregressively, one token at a time). In contrast, at its last layer an encoder outputs a representation/embedding, while being able to attend to the entire sequence.
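Roughly, as a sketch under assumed names (`decoder`, `encoder`, and `lm_head` are hypothetical callables/weights, not any real API):

    import numpy as np

    def generate(decoder, lm_head, prompt_ids, n_new):
        # decoder: hypothetical fn mapping token ids -> hidden states (seq_len, d)
        # lm_head: (d, vocab_size) projection -- the "decoder aspect"
        ids = list(prompt_ids)
        for _ in range(n_new):                  # the autoregressive aspect:
            h = decoder(ids)                    # causal attention over ids so far
            logits = h[-1] @ lm_head            # last position -> next-token scores
            ids.append(int(np.argmax(logits)))  # greedy pick, one token at a time
        return ids

    def encode(encoder, ids):
        # an encoder just returns the hidden states themselves,
        # each position having attended to the full sequence
        return encoder(ids)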




