"Sharing between users" doesn't make it cheaper. It makes it more expensive due ...

manmal · 2025-02-01T17:01:54 1738429314

No, batched inference can work very well. Depending on architecture, you can get 100x or even more tokens out of the system if you feed it multiple requests in parallel.

api · 2025-02-01T20:13:22 1738440802

Couldn't you do this locally just the same?

Of course that doesn't map well to an individual chatting with a chat bot. It does map well to something like "hey, laptop, summarize these 10,000 documents."

manmal · 2025-02-02T09:14:37 1738487677

Yes, and people do that. Some people get thousands of tokens per second that way, with affordable setups (eg 4x 3090). I was addressing GP who said there is no economies of scale to having multiple users.