
Is an encoder-style architecture better than a causal LM for classification tasks at a given compute budget?

Is this because the final representation in BERT-style models is more globally focused, rather than being optimized for next-token prediction?




They are 100% better for classification at a given compute budget. Because attention is bidirectional, they can account for information both before and after a token (e.g. for token classification) and use that full context to classify. A sketch of the difference is below.
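
To make the point concrete, here is a minimal PyTorch sketch (not from the thread; shapes and values are illustrative) contrasting the causal mask a decoder LM applies with the full attention an encoder uses:

    import torch

    seq_len = 5
    scores = torch.randn(seq_len, seq_len)  # raw attention scores (query x key)

    # Causal LM: token i may only attend to tokens j <= i, so its
    # representation never sees the right-hand context.
    causal_mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
    )
    causal_attn = torch.softmax(
        scores.masked_fill(causal_mask, float("-inf")), dim=-1
    )

    # Encoder (BERT-style): every token attends to the full sequence, so a
    # token-classification head sees both left and right context.
    bidirectional_attn = torch.softmax(scores, dim=-1)

    print(causal_attn[2])         # zero weight on positions 3 and 4 (future masked)
    print(bidirectional_attn[2])  # nonzero weight everywhere

For a middle token, the causal rows put zero weight on everything to its right, which is exactly the information a token classifier often needs.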



