
I think it's because you want the model to be able to predict the next token from anywhere between 1 token and the whole context window (and any size in between). So you end up getting n different losses for each text snippet (where n is the size of the context window).

If I'm wrong, can someone correct me here? It would be useful to know.
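In case it helps, here's a minimal PyTorch sketch of that loss computation (my own toy code, not from any particular codebase; the random logits stand in for the output of a causally masked model). One sequence of length n gives you a next-token loss at every prefix position in a single forward pass:

    import torch
    import torch.nn.functional as F

    vocab_size, n = 100, 8
    tokens = torch.randint(0, vocab_size, (1, n))   # one text snippet
    logits = torch.randn(1, n, vocab_size)          # stand-in for model output

    # Position t predicts token t+1 from the prefix tokens[0..t],
    # so one length-n sequence yields n-1 per-prefix losses.
    losses = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),     # predictions from each prefix
        tokens[:, 1:].reshape(-1),                  # targets shifted left by one
        reduction="none",
    )
    print(losses.shape)  # torch.Size([7]), one loss per prefix length

Strictly it's n-1 losses rather than n, since the first token has no prefix to predict it from (or n, if a BOS token is prepended).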




Why would you train the model on a shorter context than you can provide? Why not provide all the context you have? Sure, the model has to learn to handle short contexts, but that occurs naturally at the beginning of a document.

Anyway, this still involves only left-side masking. Why mask future tokens when a sliding window can do that (without wasting a single token of context)?
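For concreteness, here's a sketch of the two masks being compared (toy code of mine, illustrative only): the plain causal mask blocks future tokens, and a sliding-window mask additionally cuts off context more than w tokens back.

    import torch

    n = 6
    # Causal ("left-side") mask: row t may attend only to positions <= t.
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))

    # Sliding-window variant: also drop positions more than w-1 steps back.
    w = 3
    dist = torch.arange(n)[:, None] - torch.arange(n)[None, :]
    sliding = causal & (dist < w)

    print(causal.int())
    print(sliding.int())

Note the sliding window is still causal; it just adds a second constraint on how far back attention can reach, which is a separate question from masking the future.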



