
What kind of hardware do I need to run something small-scale (1 concurrent user) and get reasonable results? Are we talking about a Raspberry Pi, a Core i5, a GeForce 4090?



Anecdote: I can tell you that Vicuna-13B, which is a pretty decent model, runs at about 4 to 5 tokens/second on an Apple M1 with 16GB. It takes about 10GB of memory when loaded. A friend of mine with an RTX 2070 Super gets comparable results.

I'm upgrading my M1 laptop on Thursday to a newer model, an M2 Max with 96GB of memory. I'm totally going to try Falcon-40B on it, though I don't expect it to run that well. But I do expect it (the M2, I mean) will be snappier on the smaller models than my original M1 is.


I think that strongly depends on what you count as "reasonable": smaller models take less memory, so there's a trade-off between quality and speed depending on whether you can fit it all in GPU VRAM, system RAM, or virtual memory…

Just keep in mind the speed differences between the types of memory. If you've got a 170bn-parameter model quantised to 4 bits, and you're paging it through virtual memory over a 400 Mbps port, a naive calculation says it will take at best 28 minutes per token, unless your architecture lets you skip loading parts of the model. It might take longer if the network has an internal feedback loop in its structure.


Would you mind sharing the calculation that arrives at 28 minutes (!) per token?


If your model is 170bn 4-bit parameters, you have 85 GB that has to be loaded into the CPU or GPU; if those parameters are on the other side of a 400 Mbps port, that's 680 Gbit over the wire, and 680 Gbit / 400 Mbps = 1700 seconds = 28 minutes and 20 seconds.

If you don't have sufficient real RAM or VRAM, the entire model has to be re-loaded for each step of the process.

That assumes no looping (looping makes it longer) and no compartmentalisation in the network structure: if you could arrange things so that an entire fragment might not be activated, you'd have the possibility of skipping loading that section. I've not heard of any architecture that does this; it would be analogous to dark silicon, or to humans not using every brain cell at the same time (outside of seizures). A rough sketch of the arithmetic is below.
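
Here's a minimal sketch in plain Python, using the numbers assumed above (170bn parameters, 4 bits each, a 400 Mbps link), of the naive lower bound on time per token when the whole model has to be streamed over the link for every token:

    # Naive lower bound on time per token when the entire model must be
    # re-loaded over the link for each token generated.
    # Numbers are the assumptions from the example above.
    params = 170e9          # 170bn parameters
    bits_per_param = 4      # 4-bit quantisation
    link_bps = 400e6        # 400 Mbps link

    model_bits = params * bits_per_param        # 680e9 bits = 85 GB
    seconds_per_token = model_bits / link_bps   # 1700 seconds

    print(f"Model size: {model_bits / 8 / 1e9:.0f} GB")
    print(f"Time per token: {seconds_per_token:.0f} s "
          f"(~{seconds_per_token / 60:.0f} min)")

That prints 85 GB and 1700 s (~28 min), matching the figure above; any looping in the architecture multiplies it, and any sections you can skip loading reduce it.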



