Rather, isn't the autoregressively predicted single next token a combination (based on attention) of all the 6KB word tokens in the attention window?
So the size of the memory into which all the information for next-token prediction needs to be "crammed" is more like window_size * 6KB, right?
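To make the point concrete, here is a minimal single-head attention sketch (my own illustration, not from the article; the window size and hidden width are assumptions, with d chosen so each fp32 token vector is roughly 6KB). It shows that the vector used to predict the next token is a weighted sum over every token representation in the window, not a single token's worth of state:

```python
import numpy as np

# Assumed sizes: window of T tokens, each a d-dimensional fp32 vector
# (1536 floats * 4 bytes ~= 6KB per token).
T, d = 1024, 1536
rng = np.random.default_rng(0)
H = rng.standard_normal((T, d)).astype(np.float32)   # per-token hidden states

# Hypothetical projection matrices for query, key, value.
Wq, Wk, Wv = (rng.standard_normal((d, d)).astype(np.float32) * d**-0.5
              for _ in range(3))

q = H[-1] @ Wq          # query from the last position (the one predicting next)
K, V = H @ Wk, H @ Wv   # keys and values for every token in the window

scores = K @ q / np.sqrt(d)            # one score per token in the window
weights = np.exp(scores - scores.max())
weights /= weights.sum()               # softmax over all T positions

context = weights @ V                  # prediction reads a blend of all T tokens
print(context.shape)                   # (d,)
```

So the information feeding the next-token prediction scales with the whole window (T tokens of ~6KB each), not with the size of a single token's representation.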