
You can run a quantized version of the model to reduce the memory requirements, and you can do partial offload, where some of the model is on GPU and some is on CPU. If you are running a 70B Q4, that’s 40-ish GB including some context cache, and you can offload at least half onto a 3090, which will run its portion of the load very fast. It makes a huge difference even if you can’t fit every layer on the GPU.
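For a concrete picture, here is a minimal sketch of partial offload using llama.cpp's Python bindings (llama-cpp-python). The model path and layer count below are assumptions for illustration; the right n_gpu_layers depends on the quant you downloaded and how much VRAM is free.

    # Partial offload sketch with llama-cpp-python (assumes a local GGUF file).
    # n_gpu_layers controls how many transformer layers go to the GPU;
    # the remaining layers run on CPU.
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-70b.Q4_K_M.gguf",  # hypothetical path to a Q4 quant
        n_gpu_layers=40,                     # offload roughly half of an 80-layer 70B model
        n_ctx=4096,                          # context window; affects KV cache size
    )

    out = llm("Explain partial GPU offload in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

The same knob exists on the llama.cpp CLI as -ngl / --n-gpu-layers; raising it until you run out of VRAM is the usual way to find the sweet spot.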



So the more GPU memory we have, the faster it will be, and we don't have to run the model solely on CPU or GPU -- the work can be split between them. Very cool. I think that's how it's running now with my single 4090.



