
Yes, in fact it has to. If you have zero context to attend to in a transformer and you try to predict the first token, you are effectively multiplying a zero vector by the attention head, making every token equally likely in the final softmax (unless the lm_head has a bias, which at least in GPT it does not).
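A minimal sketch of that argument (toy numpy shapes, not GPT's actual weights): with nothing to attend to, the attention output is the zero vector, and an unbiased lm_head maps it to identical logits, i.e. a uniform softmax.

    import numpy as np

    d_model, vocab = 8, 10                       # toy sizes, not GPT's
    attn_output = np.zeros(d_model)              # nothing to attend to -> zero vector
    W_lm_head = np.random.randn(vocab, d_model)  # unembedding with no bias, as in GPT

    logits = W_lm_head @ attn_output             # all zeros
    probs = np.exp(logits) / np.exp(logits).sum()
    print(probs)                                 # uniform: every token equally likely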

So the <|beginning of text|> token, with no context before it, learns to predict the first-token-in-a-document distribution. That's not quite the same as predicting nothing at all.
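You can inspect that learned distribution directly in GPT-2, where <|endoftext|> doubles as the start-of-document token. A sketch assuming the Hugging Face transformers package and the "gpt2" checkpoint as an example:

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    # Condition on the start token alone: the prediction at this position
    # is the model's learned first-token-in-a-document distribution.
    bos = torch.tensor([[tok.eos_token_id]])     # GPT-2 reuses <|endoftext|> as BOS
    with torch.no_grad():
        logits = model(bos).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, 5)
    print([tok.decode([int(i)]) for i in top.indices])  # common document openers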




