Accelerating Generative AI with PyTorch II: GPT, Fast (pytorch.org)
306 points by polyrand on Nov 30, 2023 | 69 comments



Hey, author of the blog post here. It's mentioned in the blog post, but one of the intentions of this repo is that it's more of a "tutorial" than it is a library/framework. My hope is that people will copy-paste and modify it for their own needs :)

Code can also be found here: https://github.com/pytorch-labs/gpt-fast

And a twitter thread summary here: https://twitter.com/cHHillee/status/1730293330213531844


Great work! Do you know if it's possible to port this over to PyTorch's Apple Silicon/MPS support?


Unfortunately it's a little bit tricky today. The main issue is that we rely heavily on torch.compile + Triton for performance in this repo, and there isn't an Apple Silicon backend either for torch.compile or Triton.

For example, there's an AMD backend for Triton (and it's also integrated into torch.compile), which is why we can mostly do the same optimizations on Nvidia and AMD GPUs.

Ideally, there'd be an Apple Silicon backend for Triton, and then this repo would mostly work out of the box :)
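
For context, the dependency in question is just the standard torch.compile entry point. A minimal sketch of the pattern (illustrative only, not the repo's actual code):

    import torch

    # A stand-in for a per-token decode step; gpt-fast compiles its real decode
    # function the same way (this sketch is illustrative, not the repo's code).
    def decode_step(w, x):
        return torch.nn.functional.silu(w @ x)

    # On CUDA/ROCm, torch.compile lowers this through TorchInductor to Triton
    # kernels; "reduce-overhead" additionally uses CUDA graphs, which matters for
    # the tiny-kernel BS=1 decoding workload. Neither backend exists for MPS today.
    compiled_step = torch.compile(decode_step, mode="reduce-overhead")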


What kind of workstation would you build/buy for local GPT development with a budget of $3000? Is remote dev a viable alternative to local workstations?


Local workstation is much cheaper in the long run.

Even ignoring that, most of development is running experiments. You're gonna be hesitant to run lots of experiments if they each cost money, whereas when you pay upfront for the hardware, you have an incentive to fully utilize it with lots of experiments.

I'd go with an RTX 4090 and deal with the memory limitation through software tricks. It's an underrated card that's as performant as cards that are an order of magnitude pricier. It's a great way to get started with that budget.


I agree with you, but right now RTX 4090 cards are pushing $2000, which doesn't leave much budget left. I'd suggest picking up a used 3090 from eBay, which currently goes for around $800. That still gives you 24GB of VRAM, same as the 4090.


I've seen some blog posts saying that if you buy a used 3090 that was used for bitcoin mining, there's a risk of thermal throttling, because the thermal paste on the VRAM isn't great, and it's worse if the card was run hot for a long time.

Any recommendations on how to buy one? E.g., the 24GB model, any particular model to run LLMs? What's the biggest, baddest LLM you can run on a single card?

I've been thinking about it but have stuck with cloud/Colab for experiments so far.


The good deals are gonna be in local ads: Facebook Marketplace in most of the US.


Craigslist and eBay have some great deals.


I remember videos (likely on YouTube) of thermal paste replacement being an upgrade over the stock card, so the average person should be able to do it; it'll only cost a few dollars for the paste. I would go with a local workstation, so you don't have to think much about cost while running Stable Diffusion. Plus, if it's a used card from eBay, prices can't go much lower, so you'll get something back at the end. Also, for image work the training dataset can be quite big for network transfers.


Strong endorse here. I pick up used RTX 3090s from Facebook Marketplace and eBay for $800 maximum. You can usually find them locally for $700-750, and typically you can test them too, which is nice (though I've had no issues yet).


Depending on what you're doing, 2x used 3090s are the same price and offer you more VRAM. That's what I'm planning on doing, in any case - being able to run 70B LLMs entirely on the GPU is more useful than being able to run 34B faster.


Yeah multiple 3090s is the best budget way to go for sure. Also older server boards with tons of PCIe lanes if you can swing rack mounted hardware and have some technical skills.


Agreed. I recently completed a new build with two 3090 GPUs and really appreciate being able to run 70b models.


Which CPU did you go with?


i7-14700K

Z790 chipset with a motherboard that supports x8/x8 bifurcation

96GB DDR5 @ 5600MHz


Not OP, but I asked myself that same question two years ago. Then I looked at the energy prices in Germany and knew I had no chance against cloud GPUs. Maybe you live in a country with lower energy prices, like Bermuda (or any other country on earth), in which case this may not be as important to you. A side benefit of going cloud is that you can pick and choose the right GPU for whatever project you're working on, and you're really only paying while you're running them. Also, no hardware or CUDA drivers to divert your attention.


I’d go with a remote dev solution. Training/finetuning of large models requires much more resources anyway, so the GPUs in the local machine would be unused most of the time.


I got a 13900K + 4090 workstation for ~$3500. But I hear what people are doing is getting 2x (or more) 3090s instead, because they're cheap used, and having more VRAM and VRAM bandwidth is the important thing at the moment, even if it's split between cards.

I'm happy with my 4090 though. Dealing with splitting between GPUs sounds like a chore and also I like the gaming abilities of the 4090.


I would do remote dev using vast.ai and other cheap cloud computing resources to ensure you want to do this and have utility for it, then build your own. 3090s are typically the most budget friendly, and if you have any IT chops (and tolerance for noise), then server rack-mounted hardware, PSUs, and riser cables tend to be the most efficient with tons of PCIe lanes (which is a hidden issue people have with consumer-grade gaming PCs as they scale).


I built a custom PC in May with a 4090 for playing around. It's good and fast, but for production workloads or serious training you want more RAM. It was less than ~$3k in components (~$2k after removing VAT and expensing/depreciating it through a limited company).

If you also want a Mac, consider an M3 MacBook with maxed-out RAM.


I'm using an AMD 6900 XT with ROCm and it's fast enough to be usable, for a fraction of the price of a 3090 or 4090.


Just make sure you’re buying something with a warranty. Hardware failures during training are quite common.


A regular gaming PC will do, about a grand, then slot in a 3090 or something off eBay. Congratulations, you're a grand under budget and already have a viable development machine.


Just wanted to share, the charts and gifs are exceptionally well done. Informative, concise, and easy to read.


Thanks! I've also written a couple other things along a similar vein you might like at https://horace.io/writing.html (particularly https://horace.io/brrr_intro.html) and also some of the things I've tweeted: https://twitter.com/cHHillee/highlights


Hi, thanks for open-sourcing the code! I was trying to reuse it, especially the per-channel dynamic quantization (int8 on GPU), but couldn't get it to work. I also checked out the torchao package, but it looks like it depends on the nightly channel, and SAM's dynamic implementation with Triton has other issues. Is there any clean implementation of int8 dynamic post-training quantization that you can point to?
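
For reference, here's a rough, unfused sketch of what I mean by per-channel int8 dynamic quantization of a linear layer (plain PyTorch on CPU, purely illustrative; not the gpt-fast or torchao implementation):

    import torch

    def quantize_rowwise_int8(t: torch.Tensor):
        # One scale per row: per output channel for weights, per token for activations.
        scales = t.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        t_int8 = torch.clamp(torch.round(t / scales), -128, 127).to(torch.int8)
        return t_int8, scales

    def int8_dynamic_linear(x, w_int8, w_scales):
        # "Dynamic": activations are quantized on the fly; weights were quantized
        # once, per channel. A real kernel would do an int8 matmul with int32
        # accumulation; here we just matmul the integer values in int64 on CPU.
        x_int8, x_scales = quantize_rowwise_int8(x)
        acc = (x_int8.to(torch.int64) @ w_int8.to(torch.int64).t()).to(torch.float32)
        return acc * x_scales * w_scales.t()

    w = torch.randn(4096, 4096)
    x = torch.randn(2, 4096)
    w_int8, w_scales = quantize_rowwise_int8(w)
    print((int8_dynamic_linear(x, w_int8, w_scales) - x @ w.t()).abs().max())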


What’s the issue with getting int8 dynamic quantization to work? As in, you’re unable to get it to quantize or to run with speedups?


What's the difference between, say, Karpathy's nanoGPT and your GPT implementation?

Is it just a difference in speed, or are there some new theory details to learn?

Thanks for sharing your work.


Just a difference in speed. This repo is primarily showing how you can get really good inference perf with just native PyTorch.


Great work and a really useful resource! Comprehensive guides on improving PyTorch performance are pretty hard to come by, and I learned a couple new tricks from this!


Fantastic article, great job. One small note: eke and eek are not the same word.


What GPU was used when testing this?

Is this faster than Hugging Face's Text Generation Inference container?


We used an A100-80GB GPU. We didn't compare explicitly to Hugging Face TGI, but I think you should be able to compare the tokens/s achieved.

One note is that this release is optimized for latency, while I think HF TGI might be more optimized for throughput.


"This may sound implausible to many of you, considering how hard it is to write efficient matrix multiplication/attention kernels, and how much manpower has been put into CuBLAS and FlashAttention. The key here, however, is that transformer decoding has very unusual computational properties. In particular, because of the KV-cache, for BS=1 every single matrix multiplication in a transformer is actually a matrix vector multiplication."

How did they unlock this key? In retrospect it seems so simple, but without the KV cache this possibility would not have emerged at all. Hats off!


BS=1 also means you can only decode one token at a time. This is not always what you want, and then you're back to a matrix-matrix multiply again. Most of the from-scratch llama implementations start with matrix-vector because it's simpler and easier. But eventually you will want to serve multiple users, or compute multiple beams together...
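
To make the shapes concrete (illustrative sizes, not from the post):

    import torch

    d_model = 4096
    W = torch.randn(d_model, d_model)   # any weight matrix in the transformer

    # Decoding with a KV cache at BS=1: only one new token's hidden state flows
    # through the weights each step, so every "matmul" is a matrix-vector product.
    x_single = torch.randn(1, d_model)
    y_single = x_single @ W.t()         # [1, 4096]: a matvec

    # Serving 32 users (or 32 beams) at once: back to a real matrix-matrix multiply.
    x_batch = torch.randn(32, d_model)
    y_batch = x_batch @ W.t()           # [32, 4096]: a matmul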


This is a great article. Regarding

> While these projects are performant, they often come with tradeoffs in ease of use, such as requiring model conversion to specific formats or building and shipping new dependencies.

I think it should be acknowledged that (at least IMO) PyTorch model formats are not very portable, and this is a big part of the problem. It would be nice to see the industry move towards a better format (GGUF?) that can easily be ported between frameworks and doesn't leave you stuck using torch to load it. Likewise, PyTorch is a massive dependency to include with a project, especially for simple inference, so while other projects have new dependencies, they can often be a lot lighter than those for a PyTorch model, again particularly for inference code.


Yeah, for sure. I think for deployment purposes, many times these model conversions are necessary (such as if you don't want to use Python).

However, I do think these model conversions are often a significant pain for users.

So, in some sense, the goal here is to show that the performance component and the "convert your model for deployment" component can be disentangled.

We also have work on allowing you to "export" an AOT-compiled version of your model with torch.compile, and that should allow you to deploy your models to run in other settings.
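
Roughly, the export flow looks like the sketch below; the exact API surface has been shifting between releases, so treat this as an assumption about the current entry point rather than the final interface:

    import torch

    class TinyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(4096, 4096)

        def forward(self, x):
            return torch.relu(self.linear(x))

    model = TinyModel().eval()
    example_inputs = (torch.randn(1, 4096),)

    # Capture an ahead-of-time graph of the model; the resulting ExportedProgram
    # can then be compiled and packaged to run outside a Python eager loop.
    exported = torch.export.export(model, example_inputs)
    print(exported)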


Thanks for the reply. "show that the performance component and the "convert your model for deployment" component can be disentangled" makes sense.

Also, I liked the part of the article about torch.compile producing faster matrix-vector multiplication than cublas. I've seen the same thing on CPU, that it's way faster to just write and manually optimize a loop over a bunch of dot products than it is to use BLAS routines because of how simple the "matmul" actually is. I don't know how widely known that is.
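
As a concrete GPU illustration of that point (a sketch, CUDA-only, not the gpt-fast code): writing the matvec as an elementwise multiply plus a reduction lets torch.compile emit its own Triton kernel instead of dispatching to cuBLAS.

    import torch

    w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    x = torch.randn(4096, device="cuda", dtype=torch.bfloat16)

    def matvec(w, x):
        # Written as multiply + sum rather than w @ x, so TorchInductor generates
        # a fused reduction kernel instead of calling the cuBLAS GEMV routine.
        return (w * x).sum(dim=-1)

    compiled_matvec = torch.compile(matvec)
    print((compiled_matvec(w, x) - w @ x).abs().max())  # small (bf16 rounding)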


I'm wondering if gpt-fast has a version that can be run from Windows Command Prompt or Powershell?

https://github.com/pytorch-labs/gpt-fast/issues/45


Hey, we are dealing with exactly this right now, everything related to the performance of running models from HF. Yesterday we ran vicuna-13b-q8-gguf using llama.cpp on a vast.ai A40 (45GB VRAM). It gave us a 4 tokens/s generation rate. This seems a bit slow for that GPU and a 13B model. Does anyone know where the problem could be: llama.cpp, the GPU, the model, something else?

Also… where do all the people like us who work on applications on top of HF models congregate?


Off-topic, but what software do they use to create the benchmark flame graphs? I've been using cProfile with snakeviz, but am curious to try alternatives.


Assuming you're talking about the one here: https://pytorch.org/blog/accelerating-generative-ai-2/#start...

it's just the PyTorch profiler + the Chrome trace viewer (chrome://tracing)
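
Not the exact script behind the post's figures, but the general recipe is the built-in profiler plus a Chrome trace export:

    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(4096, 4096).cuda().half()
    x = torch.randn(1, 4096, device="cuda", dtype=torch.half)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(10):
            model(x)

    # Load trace.json in chrome://tracing (or Perfetto) for the timeline view.
    prof.export_chrome_trace("trace.json")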



One of the notable tricks the various LLM serving frameworks provide is special approaches to batching, e.g. continuous, persistent, or in-flight batching, depending on the inference framework. At some level they each allow you to start a new generation while in the middle of one or more previous generations.

Is that possible with "just" pytorch? Could it be added to gpt-fast?


Yeah it's certainly possible, but it's not the focus of this implementation, which is more latency focused (so BS=1).


Yeah, I'm curious how this would stack up to vLLM in a batch setting.

Long-term it bodes well, as both methodologies should be combinable; just curious about the current point in time with respect to relevance for production use.


I wouldn't recommend using it for a batch serving setting today. One crucial optimization for batched serving (which you need if you have a large number of requests) is continuous batching, which this implementation doesn't have.


I'd love to see how this compares against llama.cpp on speed. Does anyone have benchmarks that compare torch.compile vs llama.cpp?


This person claims to have compared llama.cpp against gpt-fast on a 4090 and found gpt-fast to be about 20% faster.

https://twitter.com/zeuxcg/status/1730450360895242355


Awesome, thanks!


What are some of the better use cases of fast inference? From my experience using ChatGPT, I don't need it to generate faster than I can read, but waiting for code generation is painful because I'm waiting for the whole code block to format correctly, be available to copy or execute (in the case of code interpreter). Anything else fall under this pattern?


The main thing is that chat is just one application of LLMs. Other applications are much more latency-sensitive. Imagine, for instance, an LLM-powered realtime grammar checker in an editor.


Most LLM use shouldn't be "raw" but part of a smart, iterative pipeline. For example:

* reading: If you want it to do inference over a lot of context, you'll need to do multiple inferences. If each inference is faster, you can 'read' more in the same time on the same hardware

* thinking: a lot of analytical approaches essentially use writing as both memory & thinking. Imagine iterative summarization, or automatically iteratively refining code until it's right

For louie.ai sessions, that's meant a fascinating trade-off here when doing the above:

* We can use smarter models like gpt-4 to do fewer iterations...

* ... or a faster but dumber model to get more iterations in the same amount of time

It's not at all obvious which wins. For example, the HumanEval leaderboard has GPT-4 for code being beaten by GPT-3.5 when run by a LATS agent: https://paperswithcode.com/sota/code-generation-on-humaneval . This highlights that the agent framework is really responsible for final result quality, so its ability to run many iterations in the same time window matters.


Programmatic and multi-step use cases. If you need chain-of-thought or similar, tool use, etc. Generating data.

Most use cases outside of classic chat.

For example, I made an on-demand educational video project, and the slowest part was by far the content generation. RAG, TTS, Image generation, text rendering, and video processing were all a drop in the bucket, in comparison.

It would be an even wider gap now that TTS is super-realtime and image generation can be single-step.


Perhaps this is naive, but in my mind it can be useful for learning.

- Hook LLM to VMs

- Ask for code that [counts to 10]

- Run code on VM

- Ask different LLM to Evaluate Results.

- Repeat for sufficient volume.

- Train.

The faster it can generate results the faster those results can be tested against the real world, e.g. a VM, users on X, other models with known accuracies.


One obvious use case is that it makes per-token generation much cheaper.


That's not so much a use case, but I get what you're saying. It's nice that you can find optimizations to shift down the Pareto frontier across the cost and latency dimensions. The hard tradeoffs are for cases like inference batching, where it's cheaper and higher throughput but slower for the end consumer.

What's a good use case for an order of magnitude decrease in price per token? Web scale "analysis" or cleaning of unstructured data?


If you were to serve this on a datacenter server, is the client-to-server roundtrip networking the slowest part of inference? Curious whether it would be faster to run this on cloud GPUs with better hardware but more distant compute, or locally with worse hardware.


Surprisingly, no. And part of this is that text generation is really expensive. Unlike traditional ML inference (like with ResNets), you don't just pass your data through your model once. You need to pass it through over and over again (once for each token you generate).

So, in practice, a full "text completion request" can often take on the order of seconds, which dwarfs the client <-> server roundtrip.
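
Schematically, generation looks like the loop below (greedy decoding, KV cache omitted, purely illustrative), which is why a completion costs one full pass over the weights per generated token:

    import torch

    @torch.no_grad()
    def generate(model, prompt_ids, max_new_tokens):
        # Every new token requires another forward pass over all of the model's
        # weights, which is why a full completion takes seconds rather than one
        # network roundtrip's worth of time.
        tokens = prompt_ids
        for _ in range(max_new_tokens):
            logits = model(tokens)      # real code reuses a KV cache here
            next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
            tokens = torch.cat([tokens, next_token], dim=-1)
        return tokens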


Is this still the case for sliding-window attention/streaming LLMs, where you have a fixed-length attention window rather than infinitely passing in new tokens with quadratic scaling? You even get better performance due to purposely downsampling non-meaningful attention-sink tokens.


I cover it a bit in the blog post, but unless you have a really long context length (like 32k+), your primary computational cost doesn't come from attention but rather from loading your weights from VRAM into registers.

I mean, practically speaking, completions from say, ChatGPT or Claude take seconds to finish :)
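
To put rough numbers on "loading your weights from VRAM" (illustrative figures, not measurements from the post): every decoded token has to stream all of the weights from HBM once, so memory bandwidth alone caps tokens/s.

    # Rough roofline arithmetic (illustrative numbers, not measurements from the post).
    model_params = 7e9          # a 7B-parameter model
    bytes_per_param = 2         # fp16/bf16 weights
    hbm_bandwidth = 2.0e12      # ~2 TB/s on an A100-80GB

    weight_bytes = model_params * bytes_per_param
    max_tokens_per_s = hbm_bandwidth / weight_bytes  # all weights read once per token
    print(f"bandwidth ceiling: ~{max_tokens_per_s:.0f} tokens/s")  # ~143 tokens/s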


How does one select a good candidate for the draft model in speculative decoding? I imagine there's some better intuition than just selecting the next parameter count down (i.e. 70B -> 13B, 13B -> 7B).

Also how does that interact with MoE models? Do you have a mini version of the MoE, with smaller experts?


This is indeed a bit of a dark art. Essentially, you want a balance between "is significantly faster than base model" and "generates similar stuff to the base model".

Anecdotally, folks often seem to use say, 70B base + 7B as verifier. But I think there's a lot of room for experimentation and improvement here.

You could, say, take a 70B model and maybe just chop off the last 90% of layers and then fine-tune. Or perhaps you could use a model that's trained to generate 8 tokens at once. Or perhaps you could just use a statistical "n-gram" predictor.
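
For anyone curious what the mechanism looks like, here's a heavily simplified greedy-only sketch (real implementations sample from adjusted distributions, reuse KV caches, and grab a bonus token when every draft is accepted; draft_model and target_model are placeholders that return logits):

    import torch

    @torch.no_grad()
    def speculative_step(draft_model, target_model, tokens, k=8):
        # 1) The cheap draft model proposes k tokens autoregressively (greedy here).
        draft = tokens
        for _ in range(k):
            next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=-1)

        # 2) The expensive target model scores all k proposals in ONE forward pass.
        #    target_preds[:, i] is the token the target itself would emit at slot i+1.
        target_preds = target_model(draft)[:, :-1].argmax(-1)

        # 3) Accept the longest prefix where the draft agrees with the target, then
        #    take the target's own token at the first disagreement as a free correction.
        prompt_len = tokens.shape[1]
        proposed = draft[:, prompt_len:]              # the k drafted tokens
        agreement = target_preds[:, prompt_len - 1:]  # target's choices for those slots
        n_accept = 0
        while n_accept < k and proposed[0, n_accept] == agreement[0, n_accept]:
            n_accept += 1
        accepted = agreement[:, : n_accept + 1]       # accepted run + correction token
        return torch.cat([tokens, accepted], dim=-1)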


Great work. It's actually sort of a breakthrough, because it makes interesting things possible (if it's not too late...).


Holy hotdogs, this looks amazing. So, ahh, I'll jump right to it: where can I run this online without having to do a bunch of work setting it up? I have several Python projects that could take advantage of this! (;


240 tok/s is crazy


This is similar to exllamav2, and exllamav2's quantization is also excellent.


Can we get the same performance on Apple GPUs?



