
Why would you train the model on a shorter context than you can provide? Why not provide all the context you have? Sure, the model has to learn to handle short contexts, but that happens naturally at the beginning of each document.

Anyway, this still involves only left-side masking. Why mask future tokens when a sliding window can do that (without wasting a single token of context)?
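
To make the point concrete, here's a minimal sketch (assuming a decoder-style attention setup; the function name and framework choice are illustrative, not from any particular model): a sliding-window mask that only lets each query see the last `window` positions already excludes future tokens by construction, so no separate causal mask is needed on top of it.

  import torch

  def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
      """Boolean mask where True marks key positions a query may attend to.

      Query position i attends only to keys in [i - window + 1, i]; the
      upper bound j <= i is part of the window definition itself, so
      future tokens are never visible and no extra causal mask is needed.
      """
      i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
      j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
      return (j <= i) & (j > i - window)

  # Example: 6 tokens, window of 3 -- each row shows what that token can see.
  print(sliding_window_mask(6, 3).int())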



