Took me a while to understand what their "hot" and "cold" neurons meant, since in most of the ML I do there is no such notion, and their paper doesn't directly define it (or I missed it).
After some thought, it does make sense for ReLU, because half of the function is constant: you can call a neuron "cold" if its ReLU-ed output is often 0. So I checked whether ReLU was common in LLMs; the original LLaMA doesn't use it. After (re-)reading the GitHub page, it turns out this only works on ReLU models. There is a group of people "fine-tuning" (I would rather call it re-training, since you start by breaking the model?) models to use ReLU to allow for that sparsity: https://huggingface.co/SparseLLM
So this is sadly not applicable to just any model you can find on the internet, but it sounds like great progress anyway. Possibly this shifts the trade-offs back toward bigger models with "less ideal" activations. I'm also curious about the legal impact (since the US and EU regulations refer to a model's FLOPs/number of parameters... how do you compute that with sparsity? Do you average?)
I think a possible avenue for future research in that area is keeping the original activation (like LLaMA keeping SwiGLU), but using quantization to define "hot" and "cold" neurons via saturation regions (for example, saying that below -1 at 8 bits this activation function is effectively -infinity, and thus the neuron is cold).
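Out of curiosity, here is a minimal sketch of what that kind of classification could look like, assuming you can record post-activation outputs over a calibration set; the function name, the 10% hot cutoff, and the threshold are illustrative, not PowerInfer's actual predictor:

```python
import numpy as np

def classify_neurons(activations, fire_threshold=0.0, hot_fraction=0.1):
    """Split neurons into "hot" and "cold" from recorded activations.

    activations: (num_samples, num_neurons), e.g. post-activation outputs
    of one FFN layer over a calibration set. A neuron "fires" on a sample
    if its output exceeds fire_threshold (exactly 0 for ReLU; a saturation
    cutoff could stand in for non-ReLU activations like SwiGLU).
    """
    firing_rate = (activations > fire_threshold).mean(axis=0)
    num_hot = max(1, int(hot_fraction * activations.shape[1]))
    hot_idx = np.argsort(-firing_rate)[:num_hot]          # most frequently firing
    cold_idx = np.setdiff1d(np.arange(activations.shape[1]), hot_idx)
    return hot_idx, cold_idx

# Fake data: 4096 neurons with ReLU-style sparse activations.
acts = np.maximum(np.random.randn(1024, 4096) - 1.0, 0.0)
hot, cold = classify_neurons(acts)
print(len(hot), "hot /", len(cold), "cold")
```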
> I'm also curious about the legal impact (since the US and EU regulations refer to a model's FLOPs/number of parameters... how do you compute that with sparsity? Do you average?)
How/when did these types of regulations come about? This feels like an insane thing to have to keep in mind while developing.
The EU messed up with the GDPR - they should have implemented it at least a decade earlier and ignored the lobbying that led to the cookie banner instead of an outright ban on tracking for all but a tiny number of purposes. Such a ban would have had a negligible financial impact on the tech industry but would have had huge privacy rewards.
They're trying to get in early on AI so as not to make the same mistake again. Which might result in them making the opposite mistake.
Making the world a worse place? If you look carefully you’ll realize most of the harms and negative effects of technology are due to it being primarily funded by advertising and trying to maximize ad revenue.
I see this non-argument again and again on HN. Yes, getting robbed but not killed is a better outcome than getting killed, but that doesn't make robbery good by any measure.
But what if you make the punishment for robbery harsher than for murder? Maybe people start killing you after robbing you to get the lesser sentence. It happens in some parts of the world: if someone accidentally hits you with their car, they'll run over you again to finish the job, because if you sue or go after them it'll be really bad for them.
Point is we have to be careful about how we regulate things or we can shift things in an even worse direction.
Mobile games are only harmful to a relatively tiny group of addicted gamers, while internet ads have very serious consequences for society as a whole.
I don't think mobile gaming companies have the potential to destroy the free press, or to negatively affect the mental health of a wide population of teenagers, or to invade the privacy of billions of people. They simply don't have the scale for any of that.
Ads are harmful, no doubt, but I do not think they are more harmful than the normalization of gambling in our society.
'I watched an ad, and then [my entire life was destroyed]' is quite hard to imagine, unless it's an ad for an MLM, crypto, entrepreneurship scam, or gambling.
On the other hand, I absolutely know people who started out in soft gambling who then proceeded to throw their life (and sometimes families) away trying to catch the next high with higher and higher stakes gambling until they lost everything, and then some.
We also don't really know what impact gambling is going to have in the near future. Loot boxes, online gambling, internet-celebrity gambling, etc. only became popular around 2010 or later, and the kids who have grown up with low-risk gambling as a daily accessible thing on their iPads haven't reached adulthood yet.
> Mobile games are only harmful to a relatively tiny group of addicted gamers, while internet ads have very serious consequences acting on society as a whole
It is still unethical to even play "free"-to-play games. You are entertained at the expense of a small group of addicts who are often spending more money than they can afford, and, at least in many games, just being logged in helps create a nicer environment that lures those people in. If you are not there to be a whale, you are there to be a lure for them. Playing might not harm you, but you are being harmful to the addicts.
All those mobile games typically require advertising in the first place to reach their customers/victims. We should definitely ban a lot of the dark patterns, which would coincidentally improve AAA games that use similar patterns (e.g. padding out gameplay duration with grinding mechanics).
And the largest benefit of modern technology comes from the fact that so much of it is "free" (ad-supported). Without ads, much of that benefit simply wouldn't exist at all.
Wikipedia, Stack Overflow, and forums like Reddit, plus chat and the like, are the biggest benefits of the internet, and they are very cheap to run; you could fund them with donations. Reddit is more expensive than it has to be because it is trying to pivot to more ads and media, but a text forum is very cheap.
The biggest benefits of ad-supported tech are search and video; the rest would be better without ads. Reddit would be a better place if it didn't chase ad revenue; in cases like that, chasing revenue makes the user experience worse instead of better.
I don't have the study at hand, but this was shown to be false: the impact was negligible (a few percentage points), because the fundamentals are extremely good for the big platforms. Take FB and Google: they already have extremely strong (and legitimate) profiles of users without following you around the web.
> How/when did these types of regulations come about?
I can't say much about the US. As I see it, the EU pretty much copied the US on that part. There was nothing related to computation in the drafts of the EU's AI Act until a few months ago; it was purely about "what kind of data processing are you allowed to do?"
"Until such technical conditions are defined, the Secretary shall require compliance with these reporting requirements for:
(i) any model that was trained using a quantity of computing power greater than 1026 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 1023 integer or floating-point operations[...]"
I probably did this wrong, but my rough estimate is that 300K H100s would complete that in about a month. At least they chose something fairly large, it seems. Not sure how LoRA or other incremental training is handled.
Depends on which spec you used, since the law doesn't specify the floating point width. If you used FP8 ops on the H100 SXM then a single GPU would hit the limit in 25265285497.72612 seconds. 300,000 GPUs would pass 10^26 FP8 ops in 23 hours.
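For reference, a back-of-the-envelope check of those numbers, assuming the spec-sheet peak of roughly 3958 TFLOPS FP8 (with sparsity) per H100 SXM, which appears to be the figure used above:

```python
# Time to cross the 10^26-operation threshold, assuming ~3.958e15 FP8 ops/s
# per H100 SXM (the "with sparsity" spec-sheet peak, run flat out).
threshold_ops = 1e26
ops_per_gpu_per_second = 3.958e15

single_gpu_seconds = threshold_ops / ops_per_gpu_per_second
print(single_gpu_seconds)                    # ~2.53e10 s, roughly 800 years
print(single_gpu_seconds / 300_000 / 3600)   # ~23 hours for 300,000 GPUs
```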
Are they trying to bring back SIMD-within-a-register? Though that only gives you ~one order of magnitude doing packed 4-bit stuff with 64-bit GPRs. And perhaps fixed-point, sign-exponent and posits are unregulated.
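For anyone who hasn't seen the SIMD-within-a-register trick: the classic way to add sixteen packed 4-bit lanes held in one 64-bit register without carries leaking across lanes, sketched in Python for readability (the same bit manipulation works on 64-bit GPRs in any language):

```python
LANE_LOW = 0x7777777777777777   # low 3 bits of each 4-bit lane
LANE_TOP = 0x8888888888888888   # top bit of each 4-bit lane

def swar_add_u4(a: int, b: int) -> int:
    """Per-lane (a + b) mod 16 for sixteen 4-bit lanes packed in 64 bits."""
    low = (a & LANE_LOW) + (b & LANE_LOW)   # lane sums of the low 3 bits; no cross-lane carry
    return low ^ ((a ^ b) & LANE_TOP)       # add the top bits back in, modulo 2

# 0xF + 0x1 wraps to 0x0 in its own lane; the neighbouring lane is untouched.
print(hex(swar_add_u4(0x9F, 0x81)))  # 0x10
```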
Probably because the parent comment didn't contain much of substance. "Oh, I'd love to see this with [insert my favorite model here]" doesn't really add a lot to the discussion.
For example, the parent commenter could have talked about the specific attributes of that model that make it superior. I personally am aware that Mixtral is one of the best performing models right now, but is everyone else? Also, does Mixtral need to be uncensored? I've used vanilla Mistral for some...interesting...prompts and had no issues with it moralizing at me.
Yeah, so they demo a bigger model on an RTX 4090 with 24 GB VRAM. Granted, implementing sparse activations for a Mixture of Experts could be non-trivial, but I think it's a brilliant move that could potentially allow for, e.g., CPU-only processing and/or much cheaper GPU processing... Mixtral technically already has neural-network-controlled sparse activation, but as the Inception meme says: we must go deeper...
It's more about what's practical to build. A dual 4090 or 3090 setup is possible without hassle. Beyond that, not really, because it'd exceed a home power socket's rating, wouldn't fit on the board or in the case, etc.
It's true you can also build a dual A6000 setup with 48+48 = 96 GB VRAM, but that's $10k+ just for the GPUs, on the previous generation.
Since they mentioned they’re working on Mistral-7B, I’d like to note that my GPU-only implementation of Mistral uses slightly over 5GB of VRAM: https://github.com/Const-me/Cgml
Runs pretty well on most consumer-grade GPUs, but so far it only supports Windows.
The performance on integrated GPUs is not stellar, but it should work.
On my AMD Ryzen 5 5600u with dual-channel DDR4, I’m getting 2 tokens/second. My friend with Intel Core i3 and single-channel memory was getting 1 token/second.
For all the love llama.cpp gets, its method of dGPU offloading (prompt processing on the GPU and then just splitting the model down the middle) is relatively simple. But it's interesting that there even is this much "activation sparsity" to take advantage of. The traditional thinking in ML is that memory access is very random.
Hopefully the "cold" neurons eventually get offloaded to the IGP instead?
Also, it's curious that they are considering a Metal kernel. I thought the performance advantage came from the hybrid memory pool... it seems like that would only help old AMD Macs, unless I'm missing something?
The only thing I can think of on the question of Apple Silicon and Metal is that they could still split the cold neurons out to the CPU/Accelerate and keep the hot ones on the GPU, using both. The speedup is likely smaller when there is already no copying of data between GPU and CPU thanks to unified memory. Still, it would be great if you could use even more of the chip's capabilities simultaneously. To avoid thermal throttling they should use only the efficiency cores (I think this is what Game Mode does).
That doesn't make much sense to me. The GPU's energy per task is so much lower than even the E cores', and AFAIK the GPU's compute isn't even fully utilized for local inference.
From my understanding, this implementation needs some knowledge of the model itself to determine which parts to place in system memory vs. GPU memory. Can this ideally be computed automatically, or will future models expose some sort of interface so that placement algorithms like this can be automated? If the algorithm needs to be adapted for each model architecture, maintaining this project is going to be a lot of work.
That sounds about right. They provide a script to combine their "Predictor" weights with the original models, but I don't see anything obvious on the front page of the GitHub repo about how to create those weights.
A 10x speed improvement is really impressive. If this kind of improvement is reproducible across other models, then identifying hot and cold neurons for inference optimization should presumably become a normal part of the model development process.
>"This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers."
Everyone compares against llama.cpp because it's easy mode. Llama.cpp is slow! Everyone should know this. They should compare against exllamav2 or other optimized implementations.
In this case they're comparing against llama.cpp because the code is literally a modification of llama.cpp. I'm not talking about using the ggml lib for matrix calculations, it's literally using the llama.cpp main.cpp and other normal llama.cpp code. It's a fork. It is directly comparable.
Oh I see, they are running a 40B model unquantized, whereas exllamav2 would have to use 4-bit quantization to fit. Given the quality of 4-bit quantization these days and the speed boost it provides I question the utility of running unquantized for serving purposes.
I see they have a 4-bit benchmark lower down in the page. That's where they ought to compare against exllamav2.
I used pyinstaller. It was difficult because Python makes these things difficult. But it works. It does require an Nvidia GPU. MLC-LLM is another option that might be easier to package and potentially able to run on AMD.
Yeah if you want maximum performance on multiple platforms you'll probably have to package multiple frameworks. Llama.cpp might be a decently fast option on Apple Silicon, I'm not sure of the state of the art there.
This will be really cool once there's the ability to generate the sparse predictor files for arbitrary models rather than just the 4 they've done it with. Looking through the page and code it doesn't seem like the tools to do that step are included. Guess I'll wait on this one a bit. Hopefully these features will be merged back into llama.cpp as options eventually since this is based on the normal llama.cpp code (ie, not just using the ggml matrix lib).
All the "consumer grade GPUs" terminology makes it seem like you could run it on a variety of models, but like so many of these posts, is this a 4090 exclusive?
I can't think of anything that is a 4090 exclusive. What usually matters is VRAM, so if something needs 24 GB, then a 3090 is also an option, or dual 12 GB cards.
Anyway, the technique they describe is a general one that should broadly improve the ability to run larger models on smaller GPUs by drastically improving the performance of CPU offloading. They demonstrate it both on a 4090 running the largest models at fp16 and on a 2080 Ti running the same models 4-bit quantized, and they still get a ~3x speedup for LLaMA. So this very much sounds like it'll make 33B models the new default on desktops, while people with even a single 3090 or 4090 will be able to run 70B at real-time chat speeds.
But if you want to run a model that requires more VRAM than you have, the current approach is to use llama.cpp and specify n_gpu_layers. That works, but is slower than GPU-only.
OP claims to be 10x as fast as llama.cpp in the case when you can't fit the whole model in VRAM.
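For comparison, the existing layer-level offload described above is just a layer count; for example via the llama-cpp-python bindings (the model path and layer count below are placeholders to tune for your VRAM):

```python
from llama_cpp import Llama

# Offload the first n_gpu_layers transformer layers to the GPU;
# the remaining layers run on the CPU. Placeholder path and values.
llm = Llama(model_path="./models/llama-2-70b.Q4_K_M.gguf", n_gpu_layers=40)

out = llm("Q: What does partial GPU offload do? A:", max_tokens=64)
print(out["choices"][0]["text"])
```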
This sounds like it uses the same techniques as the ones described in the "LLM in a Flash" paper posted yesterday? If so, cool to see an implementation of these techniques running models on non-Apple GPUs.
Scale-free network topology enables a crude but effective split of neurons into hot and cold classes—hot neurons at home on the GPU and larger numbers of cold neurons that benefit from more memory on the CPU. Clever!
CPU-only is impractical for most use cases, and this will only become more true over time as models become larger. The mediocre perf/$ and perf/watt make it not worth the effort.
Might be worth it in a datacenter, especially if it's operating other servers alongside (perhaps I/O bound web serving or something); perf/$ does matter, definitely, but the state of the art is moving quickly (getting faster/more efficient) and optimizing some models for CPU is still relevant IMO.
"Power*" made me think of Microsoft, so I was almost expecting this to be Windows-specific. (PowerShell, PowerPoint, Power BI, Power Apps, Power Automate... I'm probably forgetting some.)