Author here: (1) We didn't remove the stddev term. (2) We use a token-level loss (every token has the same weight), which is very similar to what Dr. GRPO does. However, we compute the mean gradient per token, while Dr. GRPO computes the sum; normally these are equivalent. But because we also do gradient accumulation over micro-batches to reduce memory usage during training, our implementation had a bug: it gives more weight to tokens in short sequences than to those in long sequences.
Interestingly, this is the same bug that most open-source LLM training frameworks (such as HF Trainer) had and only recently fixed.
In short, I'm working on a quick fix; after that, using sum or mean should yield equivalent results. P.S. Fixed!
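To make the issue concrete, here is a toy numpy sketch (illustrative numbers, not our actual training code) of how averaging per micro-batch skews the token weights, and how accumulating sums and token counts fixes it:

    import numpy as np

    rng = np.random.default_rng(0)

    # Two micro-batches with very different token counts (per-token losses).
    micro_batches = [rng.exponential(size=4), rng.exponential(size=64)]
    all_tokens = np.concatenate(micro_batches)

    # Reference: full-batch token-level mean (every token weighted equally).
    full_batch = all_tokens.mean()

    # Buggy accumulation: mean per micro-batch, then mean over micro-batches.
    # A token in the 4-token micro-batch gets weight 1/8; a token in the
    # 64-token micro-batch gets weight 1/128.
    buggy = np.mean([mb.mean() for mb in micro_batches])

    # Fix: accumulate the sum of losses and the token count, divide once.
    fixed = sum(mb.sum() for mb in micro_batches) / sum(mb.size for mb in micro_batches)

    print(full_batch, buggy, fixed)  # "fixed" matches "full_batch"; "buggy" does not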
I find this work much more exciting. They're not just teaching a model to hallucinate given WASD input. They're generating durable, persistent point clouds. It looks so similar to Genie2 yet they're worlds apart.
For context, the author is Steven Johnson, one of the key people behind Google's latest hit, NotebookLM.
For those who are curious: how can we technically support really long context windows (in the millions or even billions of tokens)? The short answer is simple: we can just use more GPUs. The long answer is detailed in my recent note here: https://neuralblog.github.io/scaling-up-self-attention-infer...
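As a toy illustration of why "just use more GPUs" works (my own sketch here, not code from the note): the KV cache can be sharded across devices, each device runs attention over its own chunk, and the partial results are merged exactly using the standard log-sum-exp correction.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, n_shards = 64, 4096, 4            # head dim, context length, "GPUs"

    q = rng.normal(size=d)                   # one query (decoding a single token)
    K = rng.normal(size=(n, d))
    V = rng.normal(size=(n, d))

    def full_attention(q, K, V):
        s = K @ q / np.sqrt(d)
        w = np.exp(s - s.max())
        return (w / w.sum()) @ V

    def sharded_attention(q, K, V, n_shards):
        # Each "device" returns a partial output plus the statistics
        # (local max and local softmax mass) needed for an exact merge.
        outs, maxes, sums = [], [], []
        for Ks, Vs in zip(np.array_split(K, n_shards), np.array_split(V, n_shards)):
            s = Ks @ q / np.sqrt(d)
            m = s.max()
            w = np.exp(s - m)
            outs.append(w @ Vs)
            maxes.append(m)
            sums.append(w.sum())
        m_global = max(maxes)
        scales = [np.exp(m - m_global) for m in maxes]
        denom = sum(sc * z for sc, z in zip(scales, sums))
        return sum(sc * o for sc, o in zip(scales, outs)) / denom

    print(np.allclose(full_attention(q, K, V),
                      sharded_attention(q, K, V, n_shards)))  # True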
I'd rephrase that as "the author is the author, Steven Johnson, who is also one of the key people..."
I've read many of his books over the last 20-some years (and even watched a PBS documentary series he hosted). I was aware via his Substack that he was collaborating somehow with the NotebookLM team, but I was rather startled when he demoed NotebookLM at a Google all-hands meeting a few weeks ago! Apparently he's a full-time product manager now.
He plays an essential role as the model for NotebookLM. As Raiza Martin, the PM for NotebookLM, mentioned on a recent podcast, Steven is the product: the team essentially emulated his workflow, how he conducts research and compiles information on a particular topic.
During inference, it is not a matrix x matrix operation, but rather a weight matrix x input vector operation, as we are generating one token at a time. The bottleneck now is how fast we can load the weight matrix from memory to tensor cores, hence the need for weight quantization.
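A back-of-envelope sketch of that bottleneck (the numbers below are assumed, purely for illustration): a matrix-vector product does only ~2 FLOPs per weight loaded, far below any modern GPU's compute-to-bandwidth ratio, so the decode speed ceiling is roughly memory bandwidth divided by model size in bytes, which is exactly what quantization shrinks.

    # Roofline-style ceiling for single-stream decoding: every generated token
    # streams (roughly) all the weights from HBM once.
    params = 8e9            # ~8B-parameter model (assumed)
    bandwidth = 1.0e12      # ~1 TB/s of HBM bandwidth (assumed)

    for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        model_bytes = params * bytes_per_param
        ceiling = bandwidth / model_bytes
        print(f"{name}: <= {ceiling:.0f} tokens/s (memory-bound ceiling)")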
Look at it from a different point of view: this is a feature, not a bug. Without the fix, every example has equal weight; with the fix, every token has equal weight.
That makes it sound like it’s a choice, which it isn’t really. The way to look at it is from a probabilistic perspective: with the fix, you maximise the probability of the data. Without the fix, you fairly arbitrarily raise some probabilities to a power greater than one, and some to a power less than one.
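A toy numeric illustration of those exponents (my own numbers, just for illustration): averaging per sequence scales each token's log-probability by 1/(num_sequences * sequence_length), which is the same as raising that token's probability to that power, so tokens in short sequences end up with a larger exponent than the uniform per-token weight and tokens in long sequences with a smaller one.

    # Toy batch: one 2-token sequence and one 8-token sequence (10 tokens total).
    lengths = [2, 8]
    n_seqs, n_tokens = len(lengths), sum(lengths)

    # Per-sequence mean, then mean over sequences: a token in a length-L sequence
    # gets weight (exponent) 1 / (n_seqs * L).
    per_example = {L: 1 / (n_seqs * L) for L in lengths}

    # Per-token mean: every token gets weight 1 / n_tokens.
    per_token = 1 / n_tokens

    print(per_example)  # {2: 0.25, 8: 0.0625}
    print(per_token)    # 0.1 -> short-sequence tokens are over-weighted 2.5x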
Although there may be uses for such a modified loss, based on the tone of the writeup it feels like this was an unintended bug in their training code. Training LLMs with variable max sequence lengths on different GPUs is a recipe for inefficient training anyway, so careful optimization of MFU at scale, or a fixed max sequence length per batch, would have avoided this "bug".
Yeah, one way to fix it is to use fixed sequence lengths, but it'll still be a tad off. Packing, say, 1000 small sequences to fill one long sequence length still incurs the same issue, since the denominator will be off by 1000, but yes, the problem is much less pronounced.
Not sure what you meant here; of course one still needs to correctly estimate the cross-entropy loss in the end (to keep one's sanity, or to compare against runs with a different total batch size), but each mini-batch term has the same relative contribution to the entropy.
Edit: oh, I guess you probably meant that the code was previously not averaging correctly even in the case of equal total mini-batch lengths... I haven't looked at this code.
Yes, so making the sequences all the same length, i.e. by packing them into one, still introduces issues, since the packed sequence has an unpadded token at the end due to the autoregressive nature of the training process.
Not sure what you mean. There is always an end to a batch. It doesn't have to be the end of a document entry, otherwise the model might get lazy and learn something related to the position of the text (i.e. look into the position encoding and call it a day).
Yes, you're correct, but in normal full-batch training without gradient accumulation, all tokens are weighted equally. Standard grad accum does not weight them equally, so the "fix" finally makes grad accum and full-batch training mathematically equivalent.
On a related note: I recently released a visualization of all the MLP neurons inside the Llama3-8B model. Here is an example "derivative" neuron, which is triggered when the text talks about the concept of a derivative.
There is a simple web page where you can explore the neurons and discover interesting features yourself! With more than 400k neurons in the Llama3-8B model, there are plenty of fascinating neurons for you to uncover. Check it out and explore all the neurons at https://neuralblog.github.io/llama3-neurons/neuron_viewer.ht...
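If you'd like to poke at the activations yourself, here is roughly the recipe using forward hooks in HF transformers. This is a simplified sketch, not the exact code behind the viewer, and it assumes you have access to the gated meta-llama/Meta-Llama-3-8B weights (plus accelerate installed for device_map="auto").

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Meta-Llama-3-8B"  # gated; requires HF access
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto")

    acts = {}

    def hook(layer_idx):
        # In the Llama MLP, the input to down_proj is act_fn(gate_proj(x)) * up_proj(x),
        # i.e. one activation per hidden "neuron" per token.
        def fn(module, inputs):
            acts[layer_idx] = inputs[0].detach().float().cpu()
        return fn

    handles = [layer.mlp.down_proj.register_forward_pre_hook(hook(i))
               for i, layer in enumerate(model.model.layers)]

    with torch.no_grad():
        ids = tok("The derivative of x^2 with respect to x is 2x.",
                  return_tensors="pt").to(model.device)
        model(**ids)

    for h in handles:
        h.remove()

    # acts[layer] has shape (1, seq_len, intermediate_size); a neuron index that
    # lights up consistently across derivative-related prompts is the kind of
    # thing the viewer surfaces.
    print(acts[10].shape, acts[10].abs().max())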