Right now DeepSeek's official hosting is cheaper than everyone else who manages to run the model, including DeepInfra. I haven't seen any good hypotheses as to why, other than their large batch sizes and speculative decoding.
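For anyone unfamiliar with the speculative-decoding part of that hypothesis, here's a toy sketch of the general technique (illustrative only, not DeepSeek's actual serving stack): a cheap draft model proposes a few tokens, and the expensive target model verifies them in one batched pass, so the big model does far fewer forward passes per generated token.

```python
# Toy sketch of why speculative decoding cuts serving cost (illustrative only,
# not DeepSeek's actual stack). A cheap draft model proposes k tokens; the big
# target model checks them in one batched pass and keeps the agreeing prefix.
import random

random.seed(0)
VOCAB = list("abcde")

def draft_model(prefix, k=4):
    # stand-in for a small, fast model: propose k cheap guesses
    return [random.choice(VOCAB) for _ in range(k)]

def target_model(prefix, proposed):
    # stand-in for the big model: one pass scores all proposed positions at once;
    # here we just simulate agreeing with each guess 80% of the time
    return [random.random() < 0.8 for _ in proposed]

def speculative_step(prefix, k=4):
    guesses = draft_model(prefix, k)
    accepted = target_model(prefix, guesses)
    kept = []
    for tok, ok in zip(guesses, accepted):
        if not ok:
            break
        kept.append(tok)
    # on rejection (or after all k are kept) the target model emits one token itself
    kept.append(random.choice(VOCAB))
    return prefix + kept

prefix, target_calls = [], 0
while len(prefix) < 64:
    prefix = speculative_step(prefix)
    target_calls += 1

print(f"generated {len(prefix)} tokens with {target_calls} target-model passes")
```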
DeepSeek-V2/V3/R1's model architecture is very different from what Fireworks/Together/... were used to.
That's also their "business" model (okay, they don't care that much about business for now, but still): you can't run it efficiently without doing the months of work we've already done, so even with all the weights open you can't compete with us.
Then a US compute provider should be able to launch a similarly-priced competitor (e.g. to capture the enterprise market concerned about the China associations) using the open-source version and drastically undercut OpenAI.
> Then a US compute provider should be able to launch a similarly-priced competitor
Right, you just need a few months to implement efficient inference for MLA + their strange-looking MoE scheme + ..., easy!
Oh wait, the inference scheme described in their tech report is pretty much an exact fit for H800s (the export-compliant H100 variant with cut-down interconnect bandwidth), so if you run their recipe on H100s you're wasting your H100s' potential. Otherwise, have fun coming up with your own variations on the serving architecture.
To be fair, we had a chance. If someone had decided to replicate the effort of serving their models back in May 2024 when DeepSeek-V2 came out, we'd have it now. But nobody was interested, since DS-V2 was pretty mediocre. They (and whoever else realized the potential) made a big bet and it's paying off.
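For anyone who hasn't read the DeepSeek-V2 paper, the MLA part really is a departure from standard multi-head attention: you cache a small per-token latent instead of full per-head K/V and up-project it at attention time, which is why off-the-shelf serving stacks didn't just work. A minimal sketch of the caching idea, with made-up dimensions and the RoPE/weight-absorption details omitted:

```python
# Minimal sketch of the Multi-head Latent Attention (MLA) caching trick from the
# DeepSeek-V2/V3 reports: cache one small latent vector per token instead of full
# per-head K/V, and up-project it at attention time. Dimensions are illustrative,
# not the real model config; RoPE and the matrix-absorption optimizations are omitted.
import numpy as np

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128  # toy sizes
rng = np.random.default_rng(0)

W_dkv = rng.normal(0, 0.02, (d_model, d_latent))            # down-projection
W_uk  = rng.normal(0, 0.02, (d_latent, n_heads * d_head))   # up-project to keys
W_uv  = rng.normal(0, 0.02, (d_latent, n_heads * d_head))   # up-project to values

def cache_token(h):
    # plain MHA would cache n_heads * d_head floats each for K and V (2 * 1024 here);
    # MLA caches only the d_latent-dim compressed vector (128 here)
    return h @ W_dkv

def attend(query, latent_cache):
    # reconstruct K/V from the latents, then do ordinary scaled dot-product attention
    K = (latent_cache @ W_uk).reshape(-1, n_heads, d_head)
    V = (latent_cache @ W_uv).reshape(-1, n_heads, d_head)
    q = query.reshape(n_heads, d_head)
    scores = np.einsum("hd,thd->ht", q, K) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("ht,thd->hd", weights, V).reshape(-1)

# cache 10 tokens, then attend with a new query
hidden = rng.normal(0, 1, (10, d_model))
latent_cache = np.stack([cache_token(h) for h in hidden])
out = attend(rng.normal(0, 1, d_model), latent_cache)
print(latent_cache.shape, out.shape)  # cache is (10, 128) vs (10, 2048) for K+V in plain MHA
```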
R1 is a mixture-of-experts (MoE) model with only 37B active parameters out of ~671B total. So while it's definitely expensive to train, it's rather light on compute during inference. What you really need is lots of memory.
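Rough back-of-envelope numbers (using the published ~671B total / 37B active figures; byte counts depend on the quantization you assume):

```python
# Back-of-envelope numbers for why an MoE like DeepSeek-V3/R1 is memory-bound:
# all ~671B parameters must sit in GPU memory, but only ~37B are active per token.
total_params    = 671e9   # rough published total for DeepSeek-V3/R1
active_params   = 37e9    # active per token
bytes_per_param = 1       # FP8 weights; use 2 for BF16

weight_memory_gb = total_params * bytes_per_param / 1e9
flops_per_token  = 2 * active_params   # ~2 FLOPs per active weight per token

print(f"weights to hold in memory: ~{weight_memory_gb:.0f} GB (plus KV cache)")
print(f"compute per generated token: ~{flops_per_token / 1e9:.0f} GFLOPs")
# For comparison, a dense ~70B model needs roughly 2x the per-token FLOPs of R1
# while fitting in about a tenth of the memory.
```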
https://api-docs.deepseek.com/quick_start/pricing/