I don't think there was a market for it before LLMs. Still might not be (especially if they don't want to cannibalize data center products). Also, they might have hardware constraints. I wouldn't be that surprised if we see some high-RAM consumer GPUs in the future, though.
It won't work out unless it becomes common to run LLMs locally. Kind of a chicken-and-egg problem so I hope they try it!
> I don't think there was a market for it before LLMs.
At $work CGI assets sometimes grow pretty big and throwing more VRAM at the problem would be easier than optimizing the scenes in the middle of the workflow. They can be optimized, but that often makes it less ergonomic to work with them.
Perhaps asset streaming (Nanite & co.) will make this less of an issue, but that's also fairly new.
Do LLM implementations already stream the weights layer by layer (or in whatever order they're doing the evaluation), or is PCIe bandwidth too limited for that?
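Back-of-envelope on the bandwidth question, with purely illustrative numbers (the model size, link speed, and VRAM-resident share below are all my own assumptions, not figures from the thread):

    # Illustrative only: assumed model size, PCIe speed, and VRAM-resident share.
    weights_gb = 35.0        # e.g. a ~70B-parameter model quantized to ~4 bits
    pcie_gb_per_s = 32.0     # roughly what PCIe 4.0 x16 delivers in practice
    vram_resident_gb = 8.0   # assume this much of the model stays on the GPU

    streamed_gb = weights_gb - vram_resident_gb
    seconds_per_token = streamed_gb / pcie_gb_per_s   # dense decode touches every weight per token
    print(f"~{seconds_per_token:.2f} s/token just for weight transfer")  # ~0.84

If that arithmetic is roughly right, the bus rather than the GPU becomes the per-token ceiling for any dense model that doesn't fit in VRAM.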
AMD had the Radeon Pro SSG, which let you attach 1 TB of PCIe 3.0 NVMe SSDs directly to the GPU, but no one bought them, and AFAIK they were basically unobtainable on the consumer market.
Also, asset streaming has been a thing in gaming for like 20 years now; it's not really new. Nanite's big thing is that it gets you perfect LODs without having to pre-create them and manually tweak them (e.g. how far away the LOD transition happens, what the lowest LOD is before the object disappears, etc.).
Loading assets JIT for the next frame from NVMe hasn't been a thing for 20 years, though. Different kinds of latency floors.
What I was asking is whether LLM inference can be structured so that only a fraction of the weights is needed at any one time, with the next ones loaded JIT as the processing pipeline advances.
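For illustration, a minimal PyTorch-flavoured sketch of that idea. This is toy code of mine, not how any particular runtime implements offloading; real implementations typically add pinned host memory and a prefetch stream so the next layer's upload overlaps the current layer's compute:

    # Sketch: keep the whole model in CPU RAM and copy only the layer currently
    # being evaluated to the GPU over PCIe, evicting it afterwards.
    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Stand-in for a big model: a stack of layers that together exceed VRAM.
    layers = [nn.Linear(4096, 4096) for _ in range(8)]  # all resident in host RAM

    x = torch.randn(1, 4096, device=device)

    with torch.no_grad():
        for layer in layers:
            layer.to(device)    # JIT upload of this layer's weights over PCIe
            x = layer(x)        # evaluate while only ~one layer lives in VRAM
            layer.to("cpu")     # evict so the next layer can take its place

So structurally it works; the catch is the transfer-time arithmetic above, which is why the useful versions keep as many layers as possible resident and only stream the overflow.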