> but self-hosting these wont be a viable alternative to using providers like OpenAI for business applications for a while.
Why not? While 3-4 tok/s is on the lower end, it's still usable for any task that doesn't require real-time back-and-forth with the model.
In other words, I don't mind waiting a minute for a good-enough response on a topic that would take me multiples of that to research and compile on my own. It's a clear net win.
If you have so little throughput that you don't need more than 3 tokens a second, you are processing so little data that your costs from the LLM providers won't even sniff the $2,000 you will spend on the hardware to self-host.
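A quick back-of-envelope sketch makes the point. This assumes an illustrative ~$10 per million output tokens (a placeholder figure, not any specific provider's price) and the $2,000 hardware cost from above:

```python
# Back-of-envelope: how long until API spend reaches the hardware cost,
# given a workload capped at 3 tok/s of sustained output?
# Assumptions: $2,000 hardware, ~$10 per 1M output tokens (illustrative only).

HARDWARE_COST_USD = 2_000
TOKENS_PER_SECOND = 3
API_PRICE_PER_MILLION_TOKENS_USD = 10  # hypothetical figure, not a quote

# Tokens generated running flat out, 24/7, for a year.
tokens_per_year = TOKENS_PER_SECOND * 60 * 60 * 24 * 365

# What that same volume would cost from a provider.
api_cost_per_year = tokens_per_year / 1_000_000 * API_PRICE_PER_MILLION_TOKENS_USD

print(f"Tokens per year at 3 tok/s: {tokens_per_year:,}")            # ~94.6M
print(f"API cost for the same volume: ${api_cost_per_year:,.0f}")     # ~$946
print(f"Years to break even on hardware: {HARDWARE_COST_USD / api_cost_per_year:.1f}")  # ~2.1
```

Even running nonstop at that rate, it takes on the order of two years of API usage before you'd spend the $2,000 on tokens alone, and most workloads sit idle far more than that.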