So to my understanding, this work reproduces DeepSeek R1's reinforcement learning mechanism in a very small language model.
The AI gets "rewards" (like points) for doing two things correctly:
Accuracy: Getting the right answer. For example, math answers must be in a specific format (e.g., inside a box) so a computer can easily check them. For coding problems, test cases verify if the code works.
Format: Using the <think> and <answer> tags properly. This forces the AI to organize its responses clearly.
So in this case, the training program can extract the model's answer by parsing the <answer> tag. We can then evaluate whether that answer is correct: if it is, give a reward; if not, no reward.
Sample N such answers from a single question and you get an array of N rewards. That is enough signal for the RL algorithm to guide the model toward being smarter.
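A minimal sketch of what such a rule-based reward function could look like (the tag structure follows the description above; the exact weights and string-matching rules are illustrative assumptions, not the repo's actual code):

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    // formatRe checks that the completion follows the expected
    // <think>...</think><answer>...</answer> structure.
    var formatRe = regexp.MustCompile(`(?s)^\s*<think>.*</think>\s*<answer>.*</answer>\s*$`)

    // answerRe extracts whatever is inside the <answer> tag.
    var answerRe = regexp.MustCompile(`(?s)<answer>(.*?)</answer>`)

    // reward returns a scalar reward for one sampled completion.
    // The weights (0.1 for format, 1.0 for accuracy) are made up for illustration.
    func reward(completion, groundTruth string) float64 {
        r := 0.0
        if formatRe.MatchString(completion) {
            r += 0.1 // format reward: tags used properly
        }
        if m := answerRe.FindStringSubmatch(completion); m != nil {
            if strings.TrimSpace(m[1]) == strings.TrimSpace(groundTruth) {
                r += 1.0 // accuracy reward: extracted answer matches the ground truth
            }
        }
        return r
    }

    func main() {
        completion := "<think>2+2 is 4</think><answer>4</answer>"
        // In training you would sample N completions per question; N=1 here for brevity.
        fmt.Println(reward(completion, "4")) // 1.1
    }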
I've been trying to follow the literature on PPO/GRPO as applied to LLMs. From what I understand, since reward is only given once the entire COT sequence is sampled, traditional RL techniques would require some form of credit-assignment to distribute that reward amongst individual tokens – which is where the critic/value network comes in, right?
Instead DeepSeek (with GRPO) seems to just omit that value function entirely and use only sparse rewards. How does this end up being more efficient, since I thought the sparse nature of rewards makes it harder to converge to the optimal policy?
I don't think it's only using sparse rewards, because of the format rewards. The training recipe is pretty comprehensive and involves multiple stages.[1] The paper mentions that when only using the RL technique, the output is often not suitable for reading (language mixing, etc.). That feels like an AlphaZero moment for LLMs?
The R1 paper says that they didn't use "process reward modeling". And the paper that introduced GRPO says that it can be used either with "outcome supervision" or "process supervision", with outcome supervision "only provid[ing] a reward at the end of each output". Put together, doesn't that imply R1 uses sparse rewards provided only at the end of each CoT sequence?
Ah sorry, you might be right. I meant "sparse reward" as a reward system that is mostly 0 but occasionally 1. Your "sparse reward" means only providing reward at the end of each output.
I think the reward is relative to other sampled answers for the same question. This way the signal is strong right at the margin of what is possible with a given model, and there is less noise from questions that are impossible or too easy.
There is some confusion - because they do compute that simple reward, but then they convert it to a relative value and call it advantage. And I think they use that advantage to update the model - not the base reward.
Yes, you're right; in their paper I think they say the process of sampling multiple traces and then taking relative rewards is supposed to Monte Carlo approximate the value network? I don't really have the intuition for that, but it does make sense that rather than simply nudging probabilities in the direction of the trace with the highest absolute reward, you want to favor the trace which had the best reward relative to the current state. E.g., for quick intuition, if absolute rewards for traces were {0, 0, 0, 0.01}, then using absolute rewards would only give a weak signal for the last trace (nudge weights proportional to 0.01 * logprob), but using relative rewards (z-scores) the last trace gets a much stronger nudge of about 1.5 * logprob.
Not only that - if you have {0, 0, 0, 0.01} - then the probability that you would get any reward in one shot would be very low. And I also have the intuition that giving the rewards to traces at the edge is more efficient - because the model needs only a small perturbation to get it right. If you gave negative rewards to traces that are very far from being right - then the model might be steered in the wrong direction.
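For concreteness, a tiny sketch of the group-relative advantage computation being discussed, i.e. normalizing each sampled reward by the mean and standard deviation of its group (the sample-standard-deviation choice is my assumption; it reproduces the 1.5 figure above):

    package main

    import (
        "fmt"
        "math"
    )

    // advantages converts the raw rewards of N completions sampled for the
    // same question into group-relative advantages: (r - mean) / std.
    func advantages(rewards []float64) []float64 {
        n := float64(len(rewards))
        mean := 0.0
        for _, r := range rewards {
            mean += r
        }
        mean /= n

        variance := 0.0
        for _, r := range rewards {
            variance += (r - mean) * (r - mean)
        }
        std := math.Sqrt(variance/(n-1)) + 1e-8 // epsilon avoids division by zero

        adv := make([]float64, len(rewards))
        for i, r := range rewards {
            adv[i] = (r - mean) / std
        }
        return adv
    }

    func main() {
        // The example from the thread: three zero-reward traces and one tiny reward.
        fmt.Println(advantages([]float64{0, 0, 0, 0.01}))
        // -> roughly [-0.5 -0.5 -0.5 1.5]: the lucky trace gets a strong positive
        //    signal and the others a mild negative one, regardless of absolute scale.
    }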
The part I found strange: these RL formulations give no reward for incorrect solutions, so unless there are training examples that are easy enough for the base model to solve, the RL process won’t do anything.
So is the actual magic that the base models are good enough to sometimes generate successful CoT output in their unmodified state? Or did I miss something in the R1 paper and the code here?
I think this is where the relative rewards come into play - they sample many thinking traces and reward those that are correct. This works at the current 'cutting edge' for the model - exactly where it could be improved.
I was wondering the same thing. I feel there is too large a gap between a raw base model and a model that produces fully correct answers and follows a specific format. My guess is their rule-based reward system is more nuanced than just correctness and format.
Yeah I find this part not clearly expressed as well. My best guess is that it's not simply binary "correct/incorrect" but rather the reward is made up of multiple parts (e.g. format + correctness) and structured in a way such that "close enough" answers still get some reward. From there I would expect that a base model might at least be able to "autocomplete" the format/style, at which point RL machinery would kick in to tune it to properly obey the format, and once that's mastered eventually correctness.
They did mention something about tuning an un-SFT'd base model being much slower than 'warming it up' with some existing reasoning traces first.
We're still in the process of thinking through and fleshing out full details for remote MCP connections. This is definitely a good idea to include in the mix!
PSA: You can also use singleflight[1] to solve this; it prevents the thundering herd problem. Pocache is an interesting alternative way to solve thundering herd indeed!
I'm confused by the decision in DoChan to return a channel (instead of accepting one supplied by the caller) and then, given that, also not to close that channel (is something else going to be sent to the channel in the future?). Both seem like strange/unnecessary design decisions.
Returning a channel avoids questions of what happens if sending to a caller-supplied channel blocks. DoChan returns a channel with a single-element buffer, so a single send to the channel will always succeed without blocking, even if the caller has lost interest in the result and discarded the channel.
DoChan doesn't close the channel because there isn't any reason to do so.
A non-blocking send would work just as well for that issue, is a standard part of the language, and would support user-supplied channels, but it would still be at risk of panicking when sending to a closed channel. I think there ought to be a safe way to send to a closed channel, but the language authors disagree, so that's not really on the library authors (though they could still recover from the panic).
However, not closing the channel you specifically chose to control all sending to is just lazy/rude. Even though the caller should receive from the channel once and then forget about it, closing the channel after sending would prevent incorrect subsequent receives from hanging forever.
All this having been said, contributing to these libraries seems better than complaining about them, but I don't know how the golang.org/x stuff is maintained; looks like this one is here: https://github.com/golang/sync
Closing the channel is pointless. I don't understand why people get obsessive about closing channels.
It's not needed by the garbage collector, it's not good practice. It's explicitly called out in the official go guide as unnecessary most of the time. [0]
If you have a channel that is only used a single time and then discarded, closing it is literally just wasting CPU cycles.
And definitely not "lazy/rude".
I illustrated why closing the channel is beneficial: the consumer of the channel may not be using it properly. Reading the unclosed channel more than once will hang. A stuck goroutine is rarely desirable. The cost of closing a channel is similar to the cost of bounds checking; it may not be free, but it's usually worth it. Agreed that this has no benefit to the garbage collector. I also think this is a pretty clear example of when you should close a channel, as pointed out by the Tour: to inform the consumer that no more values will ever be forthcoming.
A non-blocking send doesn't work in this case. Consider: User provides DoChan an unbuffered channel, and then reads a value from it. If the send is nonblocking and occurs before the user reads from the channel, the value is lost.
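A runnable sketch of that race (doChanNonBlocking here is a hypothetical caller-supplied-channel variant, not the real singleflight API; doChanBuffered mirrors what DoChan actually does with its one-element buffer):

    package main

    import (
        "fmt"
        "time"
    )

    // doChanNonBlocking is the proposed alternative: accept a caller-supplied
    // channel and use a non-blocking send.
    func doChanNonBlocking(ch chan<- string, work func() string) {
        go func() {
            v := work()
            select {
            case ch <- v:
                // delivered
            default:
                // caller wasn't ready yet: the value is silently dropped
            }
        }()
    }

    // doChanBuffered is the approach the library takes: return a channel with a
    // one-element buffer, so the single send always succeeds without blocking.
    func doChanBuffered(work func() string) <-chan string {
        ch := make(chan string, 1)
        go func() { ch <- work() }()
        return ch
    }

    func main() {
        work := func() string { return "result" }

        // Non-blocking send to an unbuffered caller-supplied channel: if the send
        // fires before we get around to receiving, the value is lost and this
        // receive would hang forever (guarded with a timeout here).
        ch := make(chan string)
        doChanNonBlocking(ch, work)
        time.Sleep(10 * time.Millisecond) // let the send lose the race on purpose
        select {
        case v := <-ch:
            fmt.Println("non-blocking send delivered:", v)
        case <-time.After(100 * time.Millisecond):
            fmt.Println("non-blocking send lost the value")
        }

        // Buffered channel: the value waits in the buffer until we read it.
        fmt.Println("buffered channel delivered:", <-doChanBuffered(work))
    }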
Thank you for the recommendation, it was a good read as well. I could even use it to replace how I'm handling the call suppression/debounce mechanism. Though I think Pocache does one extra thing, which is to keep the cache updated before it expires, i.e. for keys which are frequently fetched it would always serve up-to-date data from the cache. If we only relied on call suppression, then the concurrent requests would just have to wait during the update stage, or the read-through mechanism would keep hitting the main database.
Actually, llama.cpp running on Apple silicon uses the GPU (Metal compute shaders) to run LLM inference. Token generation is also very memory-bandwidth bottlenecked. On high-end Apple silicon it's about 400GB/s to 800GB/s, comparable to the NVIDIA RTX 4090, which has a memory bandwidth of about 1000GB/s. Not to mention that Apple silicon has a unified memory architecture and comes in high-memory configurations (128GB, up to 192GB), which is necessary to run large LLMs like Llama 3 70B, which takes roughly 40~75GB of RAM to work reasonably.
The number of people running llama3 70b on NVidia gaming GPUs is absolutely tiny. You're going to need at least two of the highest end 24 GB VRAM GPUs and even then you are still reliant on 4 bit quantization with almost nothing left for your context window.
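Rough back-of-the-envelope numbers behind that, as a sketch that assumes a dense 70B model and ignores the KV cache and activation overhead:

    package main

    import "fmt"

    func main() {
        const params = 70e9 // Llama 3 70B

        // Approximate weight memory at different quantization levels.
        for _, c := range []struct {
            name          string
            bytesPerParam float64
        }{
            {"fp16", 2.0},
            {"8-bit", 1.0},
            {"4-bit", 0.5},
        } {
            gb := params * c.bytesPerParam / 1e9
            fmt.Printf("%-5s ~%.0f GB of weights\n", c.name, gb)
        }
        // 4-bit -> ~35 GB: already more than one 24 GB card, and that's before
        // the KV cache / context window, hence needing at least two such GPUs.

        // Token generation is memory-bandwidth bound: every new token has to
        // stream all the weights once, so an upper bound on generation speed
        // is roughly bandwidth / model size.
        fmt.Printf("800 GB/s over 35 GB of weights -> at most ~%.0f tokens/s\n", 800.0/35.0)
    }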
To a first approximation, Kompute[1] is that. It doesn't seem to be catching on, I'm seeing more buzz around WebGPU solutions, including wonnx[2] and more hand-rolled approaches, and IREE[3], the latter of which has a Vulkan back-end.
That's why it's not parallelized along the time axis but rather along the embedding dimension.
You split the big matrices into smaller matrices to dispatch the workload. But this means you have to add some communication overhead (roughly one synchronisation point per layer, so nb_layers sequential sync points per token). In the official LLaMA implementation this is done transparently using RowParallelLinear, ColumnParallelLinear and ParallelEmbedding, see https://github.com/facebookresearch/llama/blob/main/llama/mo...
Transformers have multiple attention heads that can be computed independently and then summed together to produce the output of the layer. This allows splitting the parameter space among machines without having to transfer the weights at each iteration.
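A toy sketch of that idea, with goroutines standing in for separate machines: each one holds one head's output and its own slice of the output projection, and the only per-layer communication is summing the partial results (all shapes and numbers are made up for illustration):

    package main

    import (
        "fmt"
        "sync"
    )

    // matVec computes y = W * x for a dense matrix W (rows x cols).
    func matVec(W [][]float64, x []float64) []float64 {
        y := make([]float64, len(W))
        for i, row := range W {
            for j, w := range row {
                y[i] += w * x[j]
            }
        }
        return y
    }

    func main() {
        // Output projection of a 2-head toy attention layer: each "machine"
        // keeps its head's output and the matching slice of the projection
        // matrix, so the weights never move between machines.
        headOut := [][]float64{{1, 2}, {3, 4}} // per-head outputs (dim 2 each)
        wSlices := [][][]float64{              // per-head slices of W_O (3x2 each)
            {{1, 0}, {0, 1}, {1, 1}},
            {{2, 0}, {0, 2}, {1, 0}},
        }

        partials := make([][]float64, len(headOut))
        var wg sync.WaitGroup
        for h := range headOut {
            wg.Add(1)
            go func(h int) { // each head is computed independently
                defer wg.Done()
                partials[h] = matVec(wSlices[h], headOut[h])
            }(h)
        }
        wg.Wait()

        // The only communication per layer: sum (all-reduce) the partial outputs.
        out := make([]float64, len(partials[0]))
        for _, p := range partials {
            for i, v := range p {
                out[i] += v
            }
        }
        fmt.Println(out) // [1 2 3] + [6 8 3] = [7 10 6]
    }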
I'm really curious how Meta, DeepMind and OpenAI make the big models work. The biggest A100 you can buy is just 80GB. And I assume the big companies use single precision floating point during training. Are they actually partitioning the big model across multiple GPU instances? If one had the hardware, how many GPUs does the biggest LLAMA take? These are systems issues and I have not read papers or blog posts on how this works. To me, this infra is very non-trivial.
The "standard" machine for these things has 8x80GB = 640GB memory (p4de instances here: https://aws.amazon.com/ec2/instance-types/p4/), with _very_ fast connections between GPUs. This fits even a large model comfortably. Nowadays probably most training use half precision ("bf16", not exactly float16, but still 2 bytes per parameter). However during training you easily get a 10-20x factor between the number of parameters and the bytes of memory needed, due to additional things you have to store in memory (activations, gradients, etc.). So in practice the largest models (70-175B parameters) can't be trained even on one of these beefy machines. And even if you could, it would be awfully slow.
In practice, they typically use clusters of these machines, up to about 1000 GPUs in total (so around 80TB of memory, give or take). This allows even the biggest models to be trained on large batches of several hundred, or even thousands, of elements (the total memory usage is _not_ proportional to the product of the number of parameters and the batch size, but it does increase as a function of both, with one term being indeed the product of the two). It makes for some very tricky engineering choices to get just the right data traveling across connections, trying as much as possible to avoid syncing large amounts of data between different machines (so "chunking" things to stay in the 640GB range), with strategies such as ZeRO being published every now and then. Plus of course the practical effort to make physical connections as fast as possible...
To get an idea of how hard these things are, take a look at how long the list of names in the published paper about BLOOM language model is :-)
That's absolutely nuts. That's basically the entire capital cost of an 8x A100 hyperplane from LambdaLabs [1] plus power for a year plus administration! What's the point of cloud hardware if you're paying to reserve everything anyway?
Roughly the same setup costs $12/hour at Lambda if you're lucky enough to snag one so it looks like demand for 8x A100 is so high that you basically have to pay AWS for an entire pod to get access to one, unless you want to pay $40 per hour (!!!)
Very insightful!! A 175B parameter model with 2 bytes per weight, and say 2 bytes per gradient (not sure if single-precision gradients make more sense?), comes in at 700GB, which is beyond a single 8x80GB beefy machine!! I recall reading that with tech such as RDMA you can communicate really fast between machines... I assume if you add a switch in there, you are toast (from a latency perspective). Perhaps using 2 such beefy machines in a pair would do the trick... after all, model weights aren't the only thing that needs to be on the GPU.
I saw a reference that said GPT-3, with 96 decoder layers, was trained on a 400 GPU cluster, so that seems like the ballpark for a 175B parameter model. That's 50 of the hypothetical machines we talked about (well .. really 100 for GPT-3 since back in those days, max was 40 or 48 GB per GPU).
I also wonder why NVIDIA (or Cerebras) isn't beefing up GPU memory. If someone sold a 1TB GPU, they could charge 100 grand easy. As I understood it, NVIDIA's GPU memory is just HBM (HBM2e on the A100)... so they'd make a profit?
Looking here: https://huggingface.co/docs/transformers/perf_train_gpu_one#...
It looks like the most standard optimizer (AdamW) uses a whopping 18 bytes per parameter during training. Using bf16 should reduce that somewhat, but it wasn't really considered in that section; I'm not sure if that part of the guide is a bit outdated (before the A10 / A100 this wasn't an option) or if it still has some instability issues ("normal" float16 can't be used for training because, multiplying gradients through hundreds of layers, you'd get 0 or infinity values that would kill your learning). You can switch to different optimizers (Adafactor) and modify a few other things, but that typically comes at the cost of either lower accuracy or slower training, or both.
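Putting rough numbers on that 18 bytes/parameter figure (4-byte fp32 master weights + 4-byte gradients + 8 bytes of AdamW state + a 2-byte half-precision copy; activations excluded), as a sketch:

    package main

    import "fmt"

    func main() {
        const bytesPerParam = 18.0 // fp32 weights 4 + grads 4 + AdamW states 8 + fp16 copy 2
        const machineGB = 8 * 80.0 // one 8x A100-80GB node = 640 GB

        for _, p := range []struct {
            name   string
            params float64
        }{
            {"7B", 7e9},
            {"70B", 70e9},
            {"175B (GPT-3)", 175e9},
        } {
            gb := p.params * bytesPerParam / 1e9
            fmt.Printf("%-14s ~%6.0f GB for weights/grads/optimizer -> ~%.1f such nodes, before activations\n",
                p.name, gb, gb/machineGB)
        }
    }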
For multiple GPUs there are quite a few ways to improve memory footprint and speed: https://huggingface.co/docs/transformers/perf_train_gpu_many
Although I'm not sure if the implementations in HuggingFace are really on par with the SOTA methods (they shouldn't be far off in any case). I'd guess they are at least on par with, if not better than, whatever OpenAI used for GPT-3 back then, things evolve so quickly in this realm...
On the last point, I can only assume there are some hard thresholds which are difficult to overcome in order to add more memory, otherwise they would. An 80GB GPU was something unthinkable a dozen years ago; before the deep learning explosion, around 2GB was the norm. A couple of years ago, when 16GB or 32GB was the best you'd get from Nvidia, AMD did come out with consumer-grade GPUs having significantly larger memory (maybe 48GB back then? I can't remember), which could have stirred the market a bit I guess, but it didn't pick up for deep learning (I suspect mostly due to the lack of an equivalent to cudnn / cuda, which makes it possible to "easily" build deep learning frameworks on top of the GPUs).
My take on this is: if there's a competitor who fights hard to regain market share and bets big on offering more memory, and still the best it comes up with is just a couple of times more than what the others have, it must not be as easy as "let's stick another bank of memory here and sell it", or they would have...?
GPU memory is also useful to load large detailed scenes for rendering (.usd). It is a bit surprising that 80GB is the limit. It was obvious for years that GPU compute is ahead of GPU memory size by 10x-100x. And loading larger models and scenes into memory was always a struggle. This must be a hardware or yields issue.
The AI gets "rewards" (like points) for doing two things correctly:
Accuracy : Getting the right answer. For example, math answers must be in a specific format (e.g., inside a box) so a computer can easily check them. For coding problems, test cases verify if the code works.
Format : Using the <think> and <answer> tags properly. This forces the AI to organize its responses clearly.
So in this case, the training program can extract the model's answer by parsing <answer> tag. We can eval the answer and evaluate if it's correct or not. If it's correct give reward, else: no reward.
Create N such answers from a single question, create N reward array. This is enough for the RL algorithm to guide the model to be more smart.