
You can run a quantized version of the model to reduce the memory requirements, and you can do partial offload, where some of the model is on GPU and some is on CPU. If you are running a 70B Q4, that’s 40-ish GB including some context cache, and you can offload at least half onto a 3090, which will run its portion of the load very fast. It makes a huge difference even if you can’t fit every layer on the GPU.
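For a concrete picture, here is a minimal sketch of partial offload using llama.cpp's Python bindings (llama-cpp-python). The model path and layer count below are assumptions for illustration; the right n_gpu_layers depends on the quant you downloaded and how much VRAM is free.

    # Partial offload sketch with llama-cpp-python (assumes a local GGUF file).
    # n_gpu_layers controls how many transformer layers go to the GPU;
    # the remaining layers run on CPU.
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-70b.Q4_K_M.gguf",  # hypothetical path to a Q4 quant
        n_gpu_layers=40,                     # offload roughly half of an 80-layer 70B model
        n_ctx=4096,                          # context window; affects KV cache size
    )

    out = llm("Explain partial GPU offload in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

The same knob exists on the llama.cpp CLI as -ngl / --n-gpu-layers; raising it until you run out of VRAM is the usual way to find the sweet spot.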



So the more GPU memory we have, the faster it will be, and we don't have to run the model solely on CPU or GPU -- the work can be split between them. Very cool. I think that's how it's running now with my single 4090.



