
I've been running the Unsloth 200GB dynamic quantisation with 8k context on my 64GB Ryzen 7 5800G. CPU and iGPU utilization were super low, because it basically has to read the entire model from disk. (Looks like it needs ~40GB of actual memory that it cannot easily mmap from disk.) With a Samsung 970 Evo Plus giving 2.5GB/s read speed, that came out at 0.15 tps. Not bad for completely underspecced hardware.

Given that the model has so few active parameters per token (~40B), just being able to hold it in memory would likely remove the largest bottleneck. I guess with a single consumer PCIe 4.0 x16 graphics card you could get at most ~1 tps just because of the PCIe transfer speed? Maybe CPU processing can be faster simply because DDR transfer is faster than transfer to the graphics card.
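
For a rough sanity check, here's a minimal Python sketch of those bandwidth-bound upper limits. The ~2.4 bits/param figure assumes the 200GB quant spreads evenly over DeepSeek R1's 671B weights; the real dynamic quant is mixed-precision, so treat this as a ballpark, not a measurement:

    # Upper-bound tps if every token has to stream the active weights
    # over the named link. Assumptions: ~40B active params per token,
    # ~2.4 bits/param average (200GB over 671B params).
    ACTIVE_PARAMS = 40e9
    BITS_PER_PARAM = 200e9 * 8 / 671e9                     # ~2.4
    bytes_per_token = ACTIVE_PARAMS * BITS_PER_PARAM / 8   # ~12 GB

    for name, gb_per_s in [
        ("NVMe (970 Evo Plus)", 2.5),
        ("PCIe 4.0 x16",        32.0),
        ("2ch DDR5-6000",       96.0),
    ]:
        print(f"{name}: ~{gb_per_s * 1e9 / bytes_per_token:.2f} tps")

The NVMe line comes out at ~0.21 tps, pleasingly close to the observed 0.15.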

To add another datapoint, I've been running the 131GB (140GB on disk) 1.58-bit dynamic quant from Unsloth with 4k context on my 32GB Ryzen 7 2700X (8 cores, 3.70 GHz), and achieved exactly the same speed: around 0.15 tps on average, sometimes dropping to 0.11 tps, occasionally going up to 0.16 tps. Roughly half your specs, roughly half the quant size, same tps.

I've had to disable the overload safeties in LM Studio and tweak some loader parameters to get the model to run mostly from disk (NVMe SSD), but once it did, it also used very little CPU!

I tried offloading to GPU, but my RTX 4070 Ti (12GB VRAM) can take at most 4 layers, and it turned out to make no difference in tps.

My RAM is DDR4; maybe switching to DDR5 would improve things? Testing that would require replacing everything but the GPU, though, as my motherboard is too old :/.


More channels > faster RAM.

Some math:

DDR5-6000 is 3000 MHz × 2 (double data rate) × 64 bits / 8 bits per byte = 48,000 MB/s = 48 GB/s per channel.

DDR3-1866 is 933 MHz × 2 × 64 / 8 = 14.93 GB/s per channel. If you have 4 channels, that is 4 × 14.93 = 59.72 GB/s.
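
The same arithmetic as a small Python helper (peak theoretical figures; sustained real-world bandwidth is lower):

    # Peak DRAM bandwidth = transfer rate (MT/s) x bus width (bytes) x channels.
    def dram_bandwidth_gbs(mt_per_s, channels=1, bus_bits=64):
        return mt_per_s * 1e6 * (bus_bits / 8) * channels / 1e9

    print(dram_bandwidth_gbs(6000))              # DDR5-6000, 1 ch -> 48.0
    print(dram_bandwidth_gbs(1866, channels=4))  # DDR3-1866, 4 ch -> ~59.7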


For a 131GB model, the biggest difference would be to fit it all in RAM, e.g. get 192GB of RAM. Sorry if this is too obvious, but it's pointless to run an LLM if it doesn't fit in RAM, even if it's an MoE model. And also obviously, it may take a server motherboard and CPU to fit that much RAM.

I wonder if one could just replicate the "Mac mini LLM cluster" setup over some form of Ethernet with 128GB of DDR4 per node. Used DDR4 RAM with likely dead bits is dirt cheap, but I would imagine there will be challenges linking the systems together.
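
On the linking question, the per-token traffic between layer-split nodes is surprisingly small. A sketch, assuming fp16 activations and DeepSeek-V3/R1's hidden size of 7168 (the per-hop payload is just one activation vector per token):

    # Cost of shipping one token's activations between pipeline stages.
    HIDDEN = 7168                       # DeepSeek-V3/R1 hidden size
    payload = HIDDEN * 2                # fp16 -> ~14 KB per hop
    gbe = 125e6                         # ~125 MB/s usable on 1GbE

    print(f"{payload} bytes/hop, ~{payload / gbe * 1e3:.2f} ms on 1GbE")

That's ~0.11 ms per hop, tiny next to the seconds-per-token the RAM/disk numbers above imply, so per-hop latency and software overhead would likely be the real challenge, not raw bandwidth.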

I wonder if the now-abandoned Intel Optane drives could help with this. They had very low latency, high IOPS, and decent throughput. They made RAM modules as well. A RAM disk made of them might be faster.

Intel PMem really shines for things you need to be non-volatile (preserved when the power goes out), like fast-changing rows in a database. As far as I understand it, "for when you need millions of TPS on a DB that can't fit in RAM" was/is the "killer app" of PMem.

Which suggests it wouldn't be quite the right fit here -- the precomputed constants in the model aren't changing, nor do they need to persist.

Still, interesting question, and I wonder if there's some other existing bit of tech that can be repurposed for this.

I wonder if/when this application (LLMs in general) will slow down and stabilize long enough for anything but general-purpose components to make sense. Like, we could totally shove model parameters into some sort of ROM and have hardware offload for a transformer, IF it weren't the case that 10 years from now we might be on to some other paradigm.


I get around 4-5 t/s with the Unsloth 1.58-bit quant on my home server, which has 2x 3090s and 192GB of DDR5 with a Ryzen 9; usable but slow.

What context size?

Just 4K. Because DeepSeek doesn't allow for the use of flash attention, you can't run a quantised KV cache.
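
To see why context is the pinch point, here's a back-of-the-envelope KV-cache size for a plain multi-head-attention model (illustrative only: DeepSeek's MLA compresses the cache well below this, so read it as an upper bound on why an unquantised cache hurts):

    # Naive fp16 KV-cache size: 2 (K and V) x ctx x layers x heads x head_dim.
    def kv_cache_gb(ctx, n_layers, n_heads, head_dim, bytes_per_el=2):
        return 2 * ctx * n_layers * n_heads * head_dim * bytes_per_el / 1e9

    # Illustrative big-model numbers, not DeepSeek's actual cache layout:
    print(kv_cache_gb(ctx=4096, n_layers=61, n_heads=128, head_dim=128))  # ~16.4 GB

Halving the bytes per element with a quantised cache directly halves that, which is why losing the option stings at larger contexts.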

I imagine you can get more by striping drives. Depending on what chipset you have, the CPU should handle at least 4. Sucks that no AM4 APU supports PCIe 4 while the platform otherwise does.
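
If striping scales reads linearly (optimistic for RAID 0, but a useful bound), the disk-bound tps from the earlier estimate scales the same way. A quick sketch reusing the assumed ~12 GB of active weights per token:

    # Optimistic RAID-0 scaling of the disk-bound token rate.
    single_drive_gbs = 2.5      # 970 Evo Plus, per the top comment
    bytes_per_token = 12e9      # assumed, from the estimate above

    for n in (1, 2, 4):
        print(f"{n} drive(s): ~{n * single_drive_gbs * 1e9 / bytes_per_token:.2f} tps")

So even four striped drives only get you to ~0.8 tps; holding the weights in RAM remains the bigger win.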
