
You can't fit the model into a 4090 without quantization; it's around 64 GB.

For home use, Gemma 27B QAT is king. It's almost as good as DeepSeek R1.



You don't really need it to fit entirely in VRAM, thanks to the efficient MoE architecture and llama.cpp's ability to offload the rest to system RAM.

The 120B runs at 20 tokens/sec on my 5060 Ti 16 GB with 64 GB of system RAM. Personally I find 20 tokens/sec quite usable, but for some it may not be enough.
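If you're driving llama.cpp directly instead of through a GUI, the usual approach is to keep the attention/dense weights and KV cache on the GPU and push the MoE expert weights to system RAM. A rough sketch of what that looks like (the filename is illustrative and the MoE-offload flags are newer additions, so check --help on your build):

  # Illustrative filename; flag names vary by llama.cpp version.
  # -ngl 99 offloads all layers to the GPU, then --n-cpu-moe moves the
  # MoE expert weights of the first N layers back to system RAM, so only
  # the attention/dense tensors and KV cache have to fit in 16 GB of VRAM.
  llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 24 -c 16384

Since only a few experts are active per token, streaming the expert weights from RAM hurts throughput far less than it would for a dense model of the same size.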


I have a similar setup but with 32 GB of RAM. Do you partially offload the model to RAM? Do you use LM Studio or something else to achieve this? Thanks!


Yes, LM Studio, and it does this automatically.


The 20B one fits.


Does it fit on a 5080 (16 GB)?


Haven't tried it myself, but it looks like it probably does. The weight files total 13.8 GB, which leaves a little room left over to hold your context.
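Rough back-of-envelope, assuming that 13.8 GB figure:

  16 GB - 13.8 GB ≈ 2.2 GB left for the KV cache, CUDA buffers, and whatever the desktop is already using

So a modest context should fit, but a very long context would probably need a quantized KV cache or some offloading.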


It fits on a 5070 Ti, so it should fit on a 5080 as well.




