
I think it's because you want the model to be able to predict the next token from anywhere between 1 token and the whole context window (and any size in between). So you end up getting n different losses for each text snippet (where n is the size of the context window).

If I'm wrong, can someone correct me here? It would be useful to know.
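In case it helps, here's a minimal PyTorch sketch of that loss computation (my own toy code, not from any particular codebase; the random logits stand in for the output of a causally masked model). One sequence of length n gives you a next-token loss at every prefix position in a single forward pass:

    import torch
    import torch.nn.functional as F

    vocab_size, n = 100, 8
    tokens = torch.randint(0, vocab_size, (1, n))   # one text snippet
    logits = torch.randn(1, n, vocab_size)          # stand-in for model output

    # Position t predicts token t+1 from the prefix tokens[0..t],
    # so one length-n sequence yields n-1 per-prefix losses.
    losses = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),     # predictions from each prefix
        tokens[:, 1:].reshape(-1),                  # targets shifted left by one
        reduction="none",
    )
    print(losses.shape)  # torch.Size([7]), one loss per prefix length

Strictly it's n-1 losses rather than n, since the first token has no prefix to predict it from (or n, if a BOS token is prepended).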




Why would you train the model on a shorter context than you can provide? Why not provide all the context you have? Sure, the model has to learn to handle short contexts, but that occurs naturally at the beginning of a document.

Anyway, this still involves only left-side masking. Why mask future tokens when a sliding window can do that (without wasting a single token of context)?
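For concreteness, here's a sketch of the two masks being compared (toy code of mine, illustrative only): the plain causal mask blocks future tokens, and a sliding-window mask additionally cuts off context more than w tokens back.

    import torch

    n = 6
    # Causal ("left-side") mask: row t may attend only to positions <= t.
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))

    # Sliding-window variant: also drop positions more than w-1 steps back.
    w = 3
    dist = torch.arange(n)[:, None] - torch.arange(n)[None, :]
    sliding = causal & (dist < w)

    print(causal.int())
    print(sliding.int())

Note the sliding window is still causal; it just adds a second constraint on how far back attention can reach, which is a separate question from masking the future.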



