
Yes, in fact it has to. If you have zero context to attend to in a transformer and you try to predict the first token, you are effectively multiplying a zero vector by the attention head, making every token equally likely in the final softmax (unless the lm_head has a bias, which at least in GPT it does not).
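A minimal sketch of that argument (toy numpy shapes, not GPT's actual weights): with nothing to attend to, the attention output is the zero vector, and an unbiased lm_head maps it to identical logits, i.e. a uniform softmax.

    import numpy as np

    d_model, vocab = 8, 10                       # toy sizes, not GPT's
    attn_output = np.zeros(d_model)              # nothing to attend to -> zero vector
    W_lm_head = np.random.randn(vocab, d_model)  # unembedding with no bias, as in GPT

    logits = W_lm_head @ attn_output             # all zeros
    probs = np.exp(logits) / np.exp(logits).sum()
    print(probs)                                 # uniform: every token equally likely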

So the <|beginning of text|> token, with no context before it, learns to predict the first-token-in-a-document distribution. That's not quite the same as predicting nothing at all.
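You can inspect that learned distribution directly in GPT-2, where <|endoftext|> doubles as the start-of-document token. A sketch assuming the Hugging Face transformers package and the "gpt2" checkpoint as an example:

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    # Condition on the start token alone: the prediction at this position
    # is the model's learned first-token-in-a-document distribution.
    bos = torch.tensor([[tok.eos_token_id]])     # GPT-2 reuses <|endoftext|> as BOS
    with torch.no_grad():
        logits = model(bos).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, 5)
    print([tok.decode([int(i)]) for i in top.indices])  # common document openers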




