In the text they say you need to cram all the information needed to predict the next token into a single 6 KB word embedding, but isn't that wrong?

Rather, isn't the autoregressively predicted next token a combination (weighted by attention) of all the 6 KB token embeddings in the attention window?

So the amount of memory that all the information for next-token prediction has to be "crammed into" is more like window_size * 6 KB, right?
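
Here's a minimal sketch of what I mean (not from the article; just single-head scaled dot-product attention with identity projections, and an illustrative d_model of 1536 floats in fp32, so each embedding is ~6 KB). The row that feeds next-token prediction is a weighted mix of every visible embedding in the window, so it draws on window_size * 6 KB of state, not one embedding:

    # Minimal sketch: causal single-head attention over a small window.
    # Sizes are illustrative: 1536 floats * 4 bytes = 6144 bytes ~ 6 KB per token.
    import numpy as np

    window_size, d_model = 8, 1536

    rng = np.random.default_rng(0)
    X = rng.standard_normal((window_size, d_model)).astype(np.float32)

    # For simplicity, use identity projections (W_q = W_k = W_v = I).
    Q, K, V = X, X, X

    scores = Q @ K.T / np.sqrt(d_model)   # (window_size, window_size)

    # Causal mask: position t may only attend to positions <= t.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf

    # Row-wise softmax over the visible positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = weights @ V                     # each row mixes all visible value vectors

    # The last row is what next-token prediction sees: a convex combination
    # of all window_size embeddings, i.e. window_size * ~6 KB of input state.
    print(out[-1].shape)                  # (1536,)
    print(X.nbytes)                       # 8 * 1536 * 4 = 49152 bytes ~ 48 KB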




This is probably what the author meant to say but elided. I can see why it looks off though.



