Author here: (1) We didn't remove the stddev term. (2) We use a token-level loss (every token has the same weight), which is very similar to what Dr. GRPO does. However, we compute the mean gradient per token, while Dr. GRPO computes the sum; normally these are equivalent. But because we also do gradient accumulation over micro-batches to reduce memory usage during training, our implementation had a bug: it gives more weight to tokens in short sequences than to those in long sequences.
Interestingly, this is the same bug that most open-source LLM training frameworks (such as HF Trainer) had and only recently fixed.
In short, I'm working on a quick fix; after that, using sum or mean should yield equivalent results. P.S. Fixed!
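To make the issue concrete, here is a toy numpy sketch (illustrative numbers, not our actual training code) of how averaging per micro-batch skews the token weights, and how accumulating sums and token counts fixes it:

    import numpy as np

    rng = np.random.default_rng(0)

    # Two micro-batches with very different token counts (per-token losses).
    micro_batches = [rng.exponential(size=4), rng.exponential(size=64)]
    all_tokens = np.concatenate(micro_batches)

    # Reference: full-batch token-level mean (every token weighted equally).
    full_batch = all_tokens.mean()

    # Buggy accumulation: mean per micro-batch, then mean over micro-batches.
    # A token in the 4-token micro-batch gets weight 1/8; a token in the
    # 64-token micro-batch gets weight 1/128.
    buggy = np.mean([mb.mean() for mb in micro_batches])

    # Fix: accumulate the sum of losses and the token count, divide once.
    fixed = sum(mb.sum() for mb in micro_batches) / sum(mb.size for mb in micro_batches)

    print(full_batch, buggy, fixed)  # "fixed" matches "full_batch"; "buggy" does not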
I find this work much more exciting. They're not just teaching a model to hallucinate given WASD input. They're generating durable, persistent point clouds. It looks so similar to Genie2 yet they're worlds apart.
For context, the author is Steven Johnson, one of the key people behind Google's latest hit, NotebookLM.
For those who are curious: how can we technically support really long context windows (in the millions or even billions of tokens)? The short answer is simple: we can just use more GPUs. The long answer is detailed in my recent note here: https://neuralblog.github.io/scaling-up-self-attention-infer...
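As a toy illustration of why "just use more GPUs" works (my own sketch here, not code from the note): the KV cache can be sharded across devices, each device runs attention over its own chunk, and the partial results are merged exactly using the standard log-sum-exp correction.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, n_shards = 64, 4096, 4            # head dim, context length, "GPUs"

    q = rng.normal(size=d)                   # one query (decoding a single token)
    K = rng.normal(size=(n, d))
    V = rng.normal(size=(n, d))

    def full_attention(q, K, V):
        s = K @ q / np.sqrt(d)
        w = np.exp(s - s.max())
        return (w / w.sum()) @ V

    def sharded_attention(q, K, V, n_shards):
        # Each "device" returns a partial output plus the statistics
        # (local max and local softmax mass) needed for an exact merge.
        outs, maxes, sums = [], [], []
        for Ks, Vs in zip(np.array_split(K, n_shards), np.array_split(V, n_shards)):
            s = Ks @ q / np.sqrt(d)
            m = s.max()
            w = np.exp(s - m)
            outs.append(w @ Vs)
            maxes.append(m)
            sums.append(w.sum())
        m_global = max(maxes)
        scales = [np.exp(m - m_global) for m in maxes]
        denom = sum(sc * z for sc, z in zip(scales, sums))
        return sum(sc * o for sc, o in zip(scales, outs)) / denom

    print(np.allclose(full_attention(q, K, V),
                      sharded_attention(q, K, V, n_shards)))  # True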
I'd rephrase that as "the author is the author, Steven Johnson, who is also one of the key people..."
I've read many of his books over the last 20-some years (and even watched a PBS documentary series he hosted). I was aware via his Substack that he was collaborating somehow with the NotebookLM team, but I was rather startled when he demoed NotebookLM at a Google all-hands meeting a few weeks ago! Apparently he's a full-time product manager now.
He plays an essential role as the model for NotebookLM. As Raiza Martin, the PM for NotebookLM, mentioned on a recent podcast, Steven is the product: the team essentially emulated his workflow, how he conducts research and compiles information on a particular topic.
During inference, it is not a matrix x matrix operation, but rather a weight matrix x input vector operation, as we are generating one token at a time. The bottleneck now is how fast we can load the weight matrix from memory to tensor cores, hence the need for weight quantization.
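A back-of-envelope sketch of that bottleneck (the numbers below are assumed, purely for illustration): a matrix-vector product does only ~2 FLOPs per weight loaded, far below any modern GPU's compute-to-bandwidth ratio, so the decode speed ceiling is roughly memory bandwidth divided by model size in bytes, which is exactly what quantization shrinks.

    # Roofline-style ceiling for single-stream decoding: every generated token
    # streams (roughly) all the weights from HBM once.
    params = 8e9            # ~8B-parameter model (assumed)
    bandwidth = 1.0e12      # ~1 TB/s of HBM bandwidth (assumed)

    for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        model_bytes = params * bytes_per_param
        ceiling = bandwidth / model_bytes
        print(f"{name}: <= {ceiling:.0f} tokens/s (memory-bound ceiling)")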
Look at it from a different point of view: this is a feature, not a bug. Without the fix, every example has equal weight; with the fix, every token has equal weight.
That makes it sound like it’s a choice, which it isn’t really. The way to look at it is from a probabilistic perspective: with the fix, you maximise the probability of the data. Without the fix, you fairly arbitrarily raise some probabilities to a power greater than one, and some to a power less than one.
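A toy numeric illustration of those exponents (my own numbers, just for illustration): averaging per sequence scales each token's log-probability by 1/(num_sequences * sequence_length), which is the same as raising that token's probability to that power, so tokens in short sequences end up with a larger exponent than the uniform per-token weight and tokens in long sequences with a smaller one.

    # Toy batch: one 2-token sequence and one 8-token sequence (10 tokens total).
    lengths = [2, 8]
    n_seqs, n_tokens = len(lengths), sum(lengths)

    # Per-sequence mean, then mean over sequences: a token in a length-L sequence
    # gets weight (exponent) 1 / (n_seqs * L).
    per_example = {L: 1 / (n_seqs * L) for L in lengths}

    # Per-token mean: every token gets weight 1 / n_tokens.
    per_token = 1 / n_tokens

    print(per_example)  # {2: 0.25, 8: 0.0625}
    print(per_token)    # 0.1 -> short-sequence tokens are over-weighted 2.5x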
Although there may be uses for such a modified loss, based on the tone of the writeup it feels like this was an unintended bug in their training code. Training LLMs with variable max sequence lengths on different GPUs is a recipe for inefficient training anyway, so careful optimization of MFU at scale, or a fixed max sequence length per batch, would have avoided this "bug".
Yeah, one way to fix it is to use fixed sequence lengths, but it'll still be a tad off. Packing, say, 1000 small sequences to fill one long sequence length still incurs the same issue, since the denominator will be off by 1000, but yes, the problem is much less pronounced.
Not sure what you meant here; of course one still needs to correctly estimate the cross-entropy loss in the end (to keep one's sanity, or to compare against runs with a different total batch size), but each mini-batch term has the same relative contribution to the entropy.
Edit: oh, I guess you probably meant that the code was previously not averaging correctly even in the case of equal total mini-batch lengths... I haven't looked at this code.
Yes, so making the sequences all the same length, i.e. by packing them into one, still introduces issues, since the packed sequence has an unpadded token at the end due to the autoregressive nature of the training process.
Not sure what you mean. There is always an end to a batch. It doesn't have to be the end of a document entry, otherwise the model might get lazy and learn something related to the position of the text (i.e. look into the position encoding and call it a day).
Yes, you're correct, but in normal full-batch training without gradient accumulation, all tokens are weighted equally. Standard grad accum does not weight them equally, so the "fix" finally makes grad accum and full-batch training mathematically equivalent.
On a related note: I recently released a visualization of all the MLP neurons inside the Llama3-8B model. Here is an example "derivative" neuron, which is triggered when the text talks about the concept of a derivative.
There is a simple web page where you can explore the neurons and discover interesting features yourself! With more than 400k neurons in the Llama3-8B model, there are plenty of fascinating neurons for you to uncover. Check it out and explore all the neurons at https://neuralblog.github.io/llama3-neurons/neuron_viewer.ht...
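If you'd like to poke at the activations yourself, here is roughly the recipe using forward hooks in HF transformers. This is a simplified sketch, not the exact code behind the viewer, and it assumes you have access to the gated meta-llama/Meta-Llama-3-8B weights (plus accelerate installed for device_map="auto").

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Meta-Llama-3-8B"  # gated; requires HF access
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto")

    acts = {}

    def hook(layer_idx):
        # In the Llama MLP, the input to down_proj is act_fn(gate_proj(x)) * up_proj(x),
        # i.e. one activation per hidden "neuron" per token.
        def fn(module, inputs):
            acts[layer_idx] = inputs[0].detach().float().cpu()
        return fn

    handles = [layer.mlp.down_proj.register_forward_pre_hook(hook(i))
               for i, layer in enumerate(model.model.layers)]

    with torch.no_grad():
        ids = tok("The derivative of x^2 with respect to x is 2x.",
                  return_tensors="pt").to(model.device)
        model(**ids)

    for h in handles:
        h.remove()

    # acts[layer] has shape (1, seq_len, intermediate_size); a neuron index that
    # lights up consistently across derivative-related prompts is the kind of
    # thing the viewer surfaces.
    print(acts[10].shape, acts[10].abs().max())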