Is this because the final representation in BERT-style models is more globally focused, rather than being optimized for next-token prediction?