The key to understanding the difference is that transformers are attention models, where each token can "attend" to other tokens.
Encoder models allow all tokens to attend to every other token. This increases the number of connections and makes it easier for the model to reason, but requires all tokens at once to produce any output. These models generally can't generate text.
Decoder models only allow tokens to attend to previous tokens in the sequence. This reduces the number of attention connections, but allows the model to be run incrementally, one token at a time. This incremental processing is key to allowing the models to generate text.
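To make the difference concrete, here is a minimal NumPy sketch of scaled dot-product attention where a single `causal` flag switches between the encoder-style (all-to-all) and decoder-style (previous-tokens-only) patterns described above. The shapes and names are illustrative, not taken from any particular library.

```python
import numpy as np

def attention(q, k, v, causal=False):
    # Scaled dot-product attention over a sequence of length T.
    # q, k, v: arrays of shape (T, d).
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                  # (T, T) pairwise scores
    if causal:
        # Decoder-style mask: position i may only attend to positions <= i.
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # Softmax over each row turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # (T, d) mixed representations

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
encoder_style = attention(x, x, x, causal=False)   # every token sees every token
decoder_style = attention(x, x, x, causal=True)    # each token sees only its past
```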
Aren't a lot of transformers built in a way where attention is only applied to previous tokens in the sequence, even though it's fully possible to apply it both ways?
That's the autoregressive aspect. The decoder aspect is that the final layer converts the internal representations into output tokens (and generation happens autoregressively, one token at a time). By contrast, an encoder's final layer outputs a representation/embedding, while each token is able to attend to the entire sequence.
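As a rough illustration of that autoregressive loop, here is a sketch in which the decoder is re-run on the sequence so far and only the last position's output is used to pick the next token. The `toy_model` stand-in and the greedy argmax choice are assumptions made for the example, not a real trained model or a standard sampling strategy.

```python
import numpy as np

VOCAB_SIZE = 50

def toy_model(ids):
    # Stand-in for a trained decoder: returns one row of logits per position.
    rng = np.random.default_rng(len(ids))
    return rng.standard_normal((len(ids), VOCAB_SIZE))

def generate(model, prompt_ids, max_new_tokens, eos_id=None):
    # Autoregressive decoding: run the model on the sequence so far, take the
    # logits at the *last* position, pick the next token, append, and repeat.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                    # (len(ids), VOCAB_SIZE)
        next_id = int(np.argmax(logits[-1]))   # greedy choice from last position
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids

print(generate(toy_model, [1, 2, 3], max_new_tokens=5))
```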