I don't think there was a market for it before LLMs. Still might not be (especially if they don't want to cannibalize data center products). Also, they might have hardware constraints. I wouldn't be that surprised if we see some high-RAM consumer GPUs in the future, though.
It won't work out unless it becomes common to run LLMs locally. Kind of a chicken-and-egg problem so I hope they try it!
> I don't think there was a market for it before LLMs.
At $work CGI assets sometimes grow pretty big and throwing more VRAM at the problem would be easier than optimizing the scenes in the middle of the workflow. They can be optimized, but that often makes it less ergonomic to work with them.
Perhaps asset streaming (Nanite & co.) will make this less of an issue, but that's also fairly new.
Do LLM implementations already stream the weights layer by layer (or in whatever order they're doing the evaluation), or is PCIe bandwidth too limited for that?
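Back-of-envelope on the bandwidth question, with purely illustrative numbers (the model size, link speed, and VRAM-resident share below are all my own assumptions, not figures from the thread):

    # Illustrative only: assumed model size, PCIe speed, and VRAM-resident share.
    weights_gb = 35.0        # e.g. a ~70B-parameter model quantized to ~4 bits
    pcie_gb_per_s = 32.0     # roughly what PCIe 4.0 x16 delivers in practice
    vram_resident_gb = 8.0   # assume this much of the model stays on the GPU

    streamed_gb = weights_gb - vram_resident_gb
    seconds_per_token = streamed_gb / pcie_gb_per_s   # dense decode touches every weight per token
    print(f"~{seconds_per_token:.2f} s/token just for weight transfer")  # ~0.84

If that arithmetic is roughly right, the bus rather than the GPU becomes the per-token ceiling for any dense model that doesn't fit in VRAM.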
AMD had the Radeon Pro SSG, which let you attach 1 TB of PCIe 3.0 NVMe SSDs directly to the GPU, but no one bought them, and AFAIK they were basically unobtainable on the consumer market.
Also, asset streaming has been a thing in gaming for like 20 years now; it's not really new. Nanite's big thing is that it gets you perfect LODs without having to pre-create them and manually tweak them (e.g. how far away the LOD transition happens, what the lowest LOD is before the object disappears, etc.).
Loading assets JIT for the next frame from NVMe hasn't been a thing for 20 years, though. Different kinds of latency floors.
What I was asking is whether LLM inference can be structured so that only a fraction of the weights is needed at any one time, with the next ones loaded JIT as the processing pipeline advances.
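For illustration, a minimal PyTorch-flavoured sketch of that idea. This is toy code of mine, not how any particular runtime implements offloading; real implementations typically add pinned host memory and a prefetch stream so the next layer's upload overlaps the current layer's compute:

    # Sketch: keep the whole model in CPU RAM and copy only the layer currently
    # being evaluated to the GPU over PCIe, evicting it afterwards.
    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Stand-in for a big model: a stack of layers that together exceed VRAM.
    layers = [nn.Linear(4096, 4096) for _ in range(8)]  # all resident in host RAM

    x = torch.randn(1, 4096, device=device)

    with torch.no_grad():
        for layer in layers:
            layer.to(device)    # JIT upload of this layer's weights over PCIe
            x = layer(x)        # evaluate while only ~one layer lives in VRAM
            layer.to("cpu")     # evict so the next layer can take its place

So structurally it works; the catch is the transfer-time arithmetic above, which is why the useful versions keep as many layers as possible resident and only stream the overflow.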