Took me a while to understand what their "hot" and "cold" neurons meant, since in most of the ML I do there is no such notion, and their paper doesn't directly define it (or I missed it).
After some thought, it does make sense for ReLU, because half of the function is constant: you can call a neuron "cold" if its ReLU-ed output is often 0. So I checked whether ReLU was common in LLMs; the original LLaMA doesn't use it. After (re-)reading the GitHub page, it turns out this only works on ReLU models. There is a group of people "fine-tuning" (I would rather call it re-training, since you start by breaking the model?) models to use ReLU to allow for that sparsity: https://huggingface.co/SparseLLM
So this is sadly not applicable to just any model you can find on the internet, but it sounds like great progress anyway. Possibly this shifts the trade-offs back toward bigger models with "less ideal" activations. I'm also curious about the legal impact (since the US and EU regulations refer to a model's FLOPs/number of parameters... how do you compute that with sparsity? Do you average?)
I think a possible avenue for future research in that area is keeping the original activation (like LLaMA keeping SwiGLU), but using quantization to define "hot" and "cold" neurons via saturation regions (for example, saying that below -1 at 8 bits this activation function is effectively -infinity, and thus the neuron is cold).
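Out of curiosity, here is a minimal sketch of what that kind of classification could look like, assuming you can record post-activation outputs over a calibration set; the function name, the 10% hot cutoff, and the threshold are illustrative, not PowerInfer's actual predictor:

```python
import numpy as np

def classify_neurons(activations, fire_threshold=0.0, hot_fraction=0.1):
    """Split neurons into "hot" and "cold" from recorded activations.

    activations: (num_samples, num_neurons), e.g. post-activation outputs
    of one FFN layer over a calibration set. A neuron "fires" on a sample
    if its output exceeds fire_threshold (exactly 0 for ReLU; a saturation
    cutoff could stand in for non-ReLU activations like SwiGLU).
    """
    firing_rate = (activations > fire_threshold).mean(axis=0)
    num_hot = max(1, int(hot_fraction * activations.shape[1]))
    hot_idx = np.argsort(-firing_rate)[:num_hot]          # most frequently firing
    cold_idx = np.setdiff1d(np.arange(activations.shape[1]), hot_idx)
    return hot_idx, cold_idx

# Fake data: 4096 neurons with ReLU-style sparse activations.
acts = np.maximum(np.random.randn(1024, 4096) - 1.0, 0.0)
hot, cold = classify_neurons(acts)
print(len(hot), "hot /", len(cold), "cold")
```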
> I'm also curious about the legal impact (since the US and EU regulations refer to a model's FLOPs/number of parameters... how do you compute that with sparsity? Do you average?)
How/when did these types of regulations come about? This feels like an insane thing to have to keep in mind while developing.
The EU messed up with the GDPR - they should have implemented it at least a decade earlier and ignored the lobbying that led to the cookie banner instead of an outright ban on tracking for all but a tiny number of purposes. Such a ban would have had a negligible financial impact on the tech industry but would have had huge privacy rewards.
They're trying to get in early on AI so as not to make the same mistake again. Which might result in them making the opposite mistake.
Making the world a worse place? If you look carefully you’ll realize most of the harms and negative effects of technology are due to it being primarily funded by advertising and trying to maximize ad revenue.
I see this non-argument again and again on HN. Yes, getting robbed but not killed is a better outcome than getting killed, but that doesn't make robbery good by any measure.
But what if you make the punishment for robbery harsher than for murder? Maybe people start killing you after robbing you to get the lesser sentence. It happens in some parts of the world: if someone accidentally hits you with their car, they'll run over you again to finish the job, because if you sue or go after them it'll be really bad for them.
Point is we have to be careful about how we regulate things or we can shift things in an even worse direction.
Mobile games are only harmful to a relatively tiny group of addicted gamers, while internet ads have very serious consequences for society as a whole.
I don't think mobile gaming companies have the potential to destroy the free press, or to negatively affect the mental health of a wide population of teenagers, or to invade the privacy of billions of people. They simply don't have the scale for any of that.
Ads are harmful, no doubt, but I do not think they are more harmful than the normalization of gambling in our society.
'I watched an ad, and then [my entire life was destroyed]' is quite hard to imagine, unless it's an ad for an MLM, crypto, entrepreneurship scam, or gambling.
On the other hand, I absolutely know people who started out in soft gambling who then proceeded to throw their life (and sometimes families) away trying to catch the next high with higher and higher stakes gambling until they lost everything, and then some.
We also don't really know what impact gambling is going to have in the near future. Loot boxes, online gambling, internet-celebrity gambling, etc. only became popular around 2010 or later, and the kids who have grown up with low-risk gambling as a daily accessible thing on their iPads haven't reached adulthood yet.
> Mobile games are only harmful to a relatively tiny group of addicted gamers, while internet ads have very serious consequences acting on society as a whole
It is still unethical to even play "free"-to-play games. You are entertained at the expense of a small group of addicts who are often spending more money than they can afford, and, at least in many games, just being logged in helps create a nicer environment that lures those people in. If you are not there to be a whale, you are there to be a lure for them. Playing might not harm you, but you are being harmful to the addicts.
All those mobile games typically require advertising in the first place to reach their customers/victims. We should definitely ban a lot of the dark patterns, which would coincidentally improve AAA games that use similar patterns (e.g. padding out gameplay duration with grinding mechanics).
And the largest benefit of modern technology comes from the fact that so much of it is "free" (ad-supported). Without ads, much of that benefit simply wouldn't exist at all.
Wikipedia, Stack Overflow, and forums like Reddit, plus chat and the like, are the biggest benefits of the internet, and they are very cheap to run; you could fund them with donations. Reddit is more expensive than it has to be because it is trying to pivot to more ads and media, but a text forum is very cheap.
The biggest benefits of ad-supported tech are search and video; the rest would be better without ads. Reddit would be a better place if it didn't chase ad revenue; in cases like that, chasing revenue makes the user experience worse instead of better.
I don't have the study at hand, but this was shown to be false: the impact was negligible (a few percentage points), because the fundamentals are extremely good for the big platforms. Take FB and Google: they already have extremely strong (and legitimate) profiles of users without following you around the web.
> How/when did these types of regulations come about?
I can't say much about the US. As I see it, the EU pretty much copied the US on that part. There was nothing related to computation in the drafts of the EU's AI Act until a few months ago; it was purely about "what kind of data processing are you allowed to do?"
"Until such technical conditions are defined, the Secretary shall require compliance with these reporting requirements for:
(i) any model that was trained using a quantity of computing power greater than 1026 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 1023 integer or floating-point operations[...]"
I probably did this wrong, but my rough estimate is that 300K H100s would complete that in about a month. At least they chose something fairly large, it seems. Not sure how LoRA or other incremental training is handled.
Depends on which spec you used, since the law doesn't specify the floating point width. If you used FP8 ops on the H100 SXM then a single GPU would hit the limit in 25265285497.72612 seconds. 300,000 GPUs would pass 10^26 FP8 ops in 23 hours.
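For reference, a back-of-the-envelope check of those numbers, assuming the spec-sheet peak of roughly 3958 TFLOPS FP8 (with sparsity) per H100 SXM, which appears to be the figure used above:

```python
# Time to cross the 10^26-operation threshold, assuming ~3.958e15 FP8 ops/s
# per H100 SXM (the "with sparsity" spec-sheet peak, run flat out).
threshold_ops = 1e26
ops_per_gpu_per_second = 3.958e15

single_gpu_seconds = threshold_ops / ops_per_gpu_per_second
print(single_gpu_seconds)                    # ~2.53e10 s, roughly 800 years
print(single_gpu_seconds / 300_000 / 3600)   # ~23 hours for 300,000 GPUs
```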
Are they trying to bring back SIMD-within-a-register? Though that only gives you ~one order of magnitude doing packed 4-bit stuff with 64-bit GPRs. And perhaps fixed-point, sign-exponent and posits are unregulated.
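For anyone who hasn't seen the SIMD-within-a-register trick: the classic way to add sixteen packed 4-bit lanes held in one 64-bit register without carries leaking across lanes, sketched in Python for readability (the same bit manipulation works on 64-bit GPRs in any language):

```python
LANE_LOW = 0x7777777777777777   # low 3 bits of each 4-bit lane
LANE_TOP = 0x8888888888888888   # top bit of each 4-bit lane

def swar_add_u4(a: int, b: int) -> int:
    """Per-lane (a + b) mod 16 for sixteen 4-bit lanes packed in 64 bits."""
    low = (a & LANE_LOW) + (b & LANE_LOW)   # lane sums of the low 3 bits; no cross-lane carry
    return low ^ ((a ^ b) & LANE_TOP)       # add the top bits back in, modulo 2

# 0xF + 0x1 wraps to 0x0 in its own lane; the neighbouring lane is untouched.
print(hex(swar_add_u4(0x9F, 0x81)))  # 0x10
```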
Probably because the parent comment didn't contain much of substance. "Oh, I'd love to see this with [insert my favorite model here]" doesn't really add a lot to the discussion.
For example, the parent commenter could have talked about the specific attributes of that model that make it superior. I personally am aware that Mixtral is one of the best performing models right now, but is everyone else? Also, does Mixtral need to be uncensored? I've used vanilla Mistral for some...interesting...prompts and had no issues with it moralizing at me.
Yeah, so they demo a bigger model on an RTX 4090 with 24 GB VRAM. Granted, implementing sparse activations for a Mixture of Experts could be non-trivial, but I think it's a brilliant move that could potentially allow for, e.g., CPU-only processing and/or much cheaper GPU processing... Mixtral technically already has neural-network-controlled sparse activation, but as the Inception meme says: we must go deeper...
It's more about what's practical to build. A dual 4090 or 3090 setup is possible without hassle. Beyond that, not really, because it'd exceed a home power socket's rating, wouldn't fit on the board or in the case, etc.
It's true you can also build a dual A6000 setup with 48+48 = 96 GB VRAM, but that's $10k+ just for the GPUs, on the previous generation.
Since they mentioned they’re working on Mistral-7B, I’d like to note that my GPU-only implementation of Mistral uses slightly over 5GB of VRAM: https://github.com/Const-me/Cgml
Runs pretty well on most consumer-grade GPUs, but so far it only supports Windows.
The performance on integrated GPUs is not stellar, but it should work.
On my AMD Ryzen 5 5600u with dual-channel DDR4, I’m getting 2 tokens/second. My friend with Intel Core i3 and single-channel memory was getting 1 token/second.
For all the love llama.cpp gets, its method of dGPU offloading (prompt processing on the GPU and then just splitting the model down the middle) is relatively simple. But it's interesting that there even is this much "activation sparsity" to take advantage of. The traditional thinking in ML is that memory access is very random.
Hopefully the "cold" neurons eventually get offloaded to the IGP instead?
Also, it's curious that they are considering a Metal kernel. I thought the performance advantage came from the hybrid memory pool... it seems like that would only help old AMD Macs, unless I'm missing something?
The only thing I can think of on the question of Apple Silicon and Metal is that they could still split the cold neurons out to the CPU/Accelerate and keep the hot ones on the GPU, using both. The speedup is likely smaller when there is already no copying of data between GPU and CPU thanks to unified memory. Still, it would be great if you could use even more of the chip's capabilities simultaneously. To avoid thermal throttling they should use only the efficiency cores (I think this is what Game Mode does).
That doesn't make much sense to me. The GPU's energy per task is so much lower than even the E cores', and AFAIK the GPU's compute isn't even fully utilized for local inference.
From my understanding, this implementation needs some knowledge of the model itself to determine which parts to place in system memory vs. GPU memory. Can this ideally be computed automatically, or will future models expose some sort of interface so that placement algorithms like this can be automated? If the algorithm needs to be adapted for each model architecture, maintaining this project is going to be a lot of work.
That sounds about right. They provide a script to combine their "Predictor" weights with the original models, but I don't see anything obvious on the front page of the GitHub repo about how to create those weights.
A 10x speed improvement is really impressive. If this kind of improvement is reproducible across other models, then identifying hot and cold neurons for inference optimization should presumably become a normal part of the model development process.
>"This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers."
Everyone compares against llama.cpp because it's easy mode. Llama.cpp is slow! Everyone should know this. They should compare against exllamav2 or other optimized implementations.
In this case they're comparing against llama.cpp because the code is literally a modification of llama.cpp. I'm not talking about using the ggml lib for matrix calculations, it's literally using the llama.cpp main.cpp and other normal llama.cpp code. It's a fork. It is directly comparable.
Oh I see, they are running a 40B model unquantized, whereas exllamav2 would have to use 4-bit quantization to fit. Given the quality of 4-bit quantization these days and the speed boost it provides I question the utility of running unquantized for serving purposes.
I see they have a 4-bit benchmark lower down in the page. That's where they ought to compare against exllamav2.
I used pyinstaller. It was difficult because Python makes these things difficult. But it works. It does require an Nvidia GPU. MLC-LLM is another option that might be easier to package and potentially able to run on AMD.
Yeah if you want maximum performance on multiple platforms you'll probably have to package multiple frameworks. Llama.cpp might be a decently fast option on Apple Silicon, I'm not sure of the state of the art there.
This will be really cool once there's the ability to generate the sparse predictor files for arbitrary models rather than just the 4 they've done it with. Looking through the page and code it doesn't seem like the tools to do that step are included. Guess I'll wait on this one a bit. Hopefully these features will be merged back into llama.cpp as options eventually since this is based on the normal llama.cpp code (ie, not just using the ggml matrix lib).
All the "consumer grade GPUs" terminology makes it seem like you could run it on a variety of models, but like so many of these posts, is this a 4090 exclusive?
I can't think of anything that is a 4090 exclusive. What usually matters is VRAM, so if something needs 24 GB, then a 3090 is also an option, or dual 12 GB cards.
Anyway, the technique they describe is a general one that should broadly improve the ability to run larger models on smaller GPUs by drastically improving the performance of CPU offloading. They demonstrate it both on a 4090 running the largest models at fp16 and on a 2080 Ti running the same models 4-bit quantized, and they still get a ~3x speedup for LLaMA. So this very much sounds like it'll make 33B models the new default on desktops, while people with even a single 3090 or 4090 will be able to run 70B at real-time chat speeds.
But if you want to run a model that requires more VRAM than you have, the current approach is to use llama.cpp and specify n_gpu_layers. That works, but is slower than GPU-only.
OP claims to be 10x as fast as llama.cpp in the case when you can't fit the whole model in VRAM.
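For comparison, the existing layer-level offload described above is just a layer count; for example via the llama-cpp-python bindings (the model path and layer count below are placeholders to tune for your VRAM):

```python
from llama_cpp import Llama

# Offload the first n_gpu_layers transformer layers to the GPU;
# the remaining layers run on the CPU. Placeholder path and values.
llm = Llama(model_path="./models/llama-2-70b.Q4_K_M.gguf", n_gpu_layers=40)

out = llm("Q: What does partial GPU offload do? A:", max_tokens=64)
print(out["choices"][0]["text"])
```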
This sounds like it uses the same techniques as the ones described in the "LLM in a Flash" paper posted yesterday? If so, cool to see an implementation of these techniques running models on non-Apple GPUs.
Scale-free network topology enables a crude but effective split of neurons into hot and cold classes—hot neurons at home on the GPU and larger numbers of cold neurons that benefit from more memory on the CPU. Clever!
CPU-only is impractical for most use cases, and this will only become more true over time as models become larger. The mediocre perf/$ and perf/watt make it not worth the effort.
Might be worth it in a datacenter, especially if it's operating other servers alongside (perhaps I/O bound web serving or something); perf/$ does matter, definitely, but the state of the art is moving quickly (getting faster/more efficient) and optimizing some models for CPU is still relevant IMO.
"Power*" made me think of Microsoft, so I was almost expecting this to be Windows-specific. (PowerShell, PowerPoint, Power BI, Power Apps, Power Automate... I'm probably forgetting some.)