You can't fit the model into a 4090 without quantization; it's like 64 gigs. For ho...

SirMaster · 2025-08-07T13:47:46 1754574466

You don't really need all of it to fit in VRAM, thanks to the efficient MoE architecture and llama.cpp.

The 120B is running at 20 tokens/sec on my 5060 Ti 16GB with 64GB of system RAM. Personally I find 20 tokens/sec quite usable, but for some it may not be enough.
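As a rough back-of-the-envelope illustration of why this works, here is a sketch of how the weights might split between VRAM and system RAM. It assumes the roughly 64 GB figure quoted upthread and a hypothetical 2 GB of VRAM reserved for context; `split_weights` is an illustrative helper, not anything llama.cpp or LM Studio exposes, and real runtimes place whole layers or experts rather than arbitrary gigabytes.

```python
def split_weights(model_gb: float, vram_gb: float, reserve_gb: float) -> tuple[float, float]:
    """Estimate how many GB of weights sit in VRAM vs. system RAM.

    reserve_gb is VRAM held back for context and buffers (an assumed value).
    """
    usable_vram = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(model_gb, usable_vram)
    on_cpu = model_gb - on_gpu
    return on_gpu, on_cpu


# Assumed inputs: ~64 GB of weights, a 16 GB card, 2 GB held back for context.
gpu_gb, cpu_gb = split_weights(model_gb=64.0, vram_gb=16.0, reserve_gb=2.0)
print(f"GPU: {gpu_gb:.0f} GB, system RAM: {cpu_gb:.0f} GB")
```

Under these assumptions the remainder that spills into system RAM is what the 64GB of RAM has to hold.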

dexterlagan · 2025-08-08T10:00:00 1754647200

I have a similar setup but with 32 GB of RAM. Do you partly offload the model to RAM? Do you use LM Studio or something else to achieve this? Thanks!

SirMaster · 2025-08-18T13:29:01 1755523741

Yes, LM Studio, and it does this automatically.

modeless · 2025-08-07T06:02:59 1754546579

The 20B one fits.

steinvakt2 · 2025-08-07T06:37:12 1754548632

Does it fit on a 5080 (16 GB)?

jwitthuhn · 2025-08-07T07:04:05 1754550245

Haven't tried it myself, but it looks like it probably does. The weight files total 13.8 GB, which leaves a little left over to hold your context.
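To put a number on "a little left over", here is a minimal sketch of the arithmetic, using the 13.8 GB weight total and the 16 GB card from the question. `vram_headroom_gb` is a hypothetical helper, and the result ignores whatever the driver, display, and runtime buffers take, so the real headroom will be somewhat smaller.

```python
def vram_headroom_gb(vram_gb: float, weights_gb: float) -> float:
    """Return the VRAM left over after loading the weights, in GB."""
    return vram_gb - weights_gb


# 16 GB card, 13.8 GB of weight files (figures from the thread).
headroom = vram_headroom_gb(16.0, 13.8)
print(f"{headroom:.1f} GB left for context")  # prints "2.2 GB left for context"
```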

northern-lights · 2025-08-07T07:18:28 1754551108

It fits on a 5070 Ti, so it should fit on a 5080 as well.