The key to understanding the difference is that transformers are attention models, where each token can "attend" to other tokens.
Encoder models allow all tokens to attend to every other token. This increases the number of connections and makes it easier for the model to reason, but requires all tokens at once to produce any output. These models generally can't generate text.
Decoder models only allow tokens to attend to previous tokens in the sequence. This reduces the number of attention connections, but allows the model to be run incrementally, one token at a time. This incremental processing is key to allowing the models to generate text.
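To make the difference concrete, here is a minimal NumPy sketch of scaled dot-product attention where a single `causal` flag switches between the encoder-style (all-to-all) and decoder-style (previous-tokens-only) patterns described above. The shapes and names are illustrative, not taken from any particular library.

```python
import numpy as np

def attention(q, k, v, causal=False):
    # Scaled dot-product attention over a sequence of length T.
    # q, k, v: arrays of shape (T, d).
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                  # (T, T) pairwise scores
    if causal:
        # Decoder-style mask: position i may only attend to positions <= i.
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # Softmax over each row turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # (T, d) mixed representations

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
encoder_style = attention(x, x, x, causal=False)   # every token sees every token
decoder_style = attention(x, x, x, causal=True)    # each token sees only its past
```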
Aren't a lot of transformers built in a way where attention is only applied to previous tokens in the sequence, even though it's fully possible to apply it both ways?
That's the autoregressive aspect. The decoder aspect is that the final layer converts the internal representations into output tokens (and generation happens autoregressively, one token at a time). By contrast, an encoder's final layer outputs a representation/embedding, while each token is able to attend to the entire sequence.
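As a rough illustration of that autoregressive loop, here is a sketch in which the decoder is re-run on the sequence so far and only the last position's output is used to pick the next token. The `toy_model` stand-in and the greedy argmax choice are assumptions made for the example, not a real trained model or a standard sampling strategy.

```python
import numpy as np

VOCAB_SIZE = 50

def toy_model(ids):
    # Stand-in for a trained decoder: returns one row of logits per position.
    rng = np.random.default_rng(len(ids))
    return rng.standard_normal((len(ids), VOCAB_SIZE))

def generate(model, prompt_ids, max_new_tokens, eos_id=None):
    # Autoregressive decoding: run the model on the sequence so far, take the
    # logits at the *last* position, pick the next token, append, and repeat.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                    # (len(ids), VOCAB_SIZE)
        next_id = int(np.argmax(logits[-1]))   # greedy choice from last position
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids

print(generate(toy_model, [1, 2, 3], max_new_tokens=5))
```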