
Oddly, they are charging only slightly more for their hosted version:

open-mistral-7b: $0.25 per million tokens
open-mistral-nemo-2407: $0.30 per million tokens

https://mistral.ai/technology/#pricing

They specifically call out FP8-aware training, and TensorRT-LLM is really good (efficient) at FP8 inference on the H100 and other Hopper cards. It's possible that they run the 7B natively in fp16, since smaller models suffer more from even "modest" quantization like this.
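A rough illustration of why quantization hurts: an E4M3-style FP8 format keeps only 3 mantissa bits, so every weight picks up as much as ~6% rounding error. This is my own simplified sketch of that rounding grid (clamping to the E4M3 max of 448 and ignoring subnormals), not how TensorRT-LLM actually performs quantization:

```python
import numpy as np

def quantize_e4m3(x):
    """Round values to an FP8 E4M3-like grid: 3 mantissa bits
    (8 steps per power of two), max normal value 448.
    Simplified: subnormals and NaN handling are ignored."""
    x = np.clip(np.asarray(x, dtype=np.float64), -448.0, 448.0)
    absx = np.abs(x)
    e = np.floor(np.log2(np.where(absx > 0, absx, 1.0)))  # exponent of each value
    step = 2.0 ** (e - 3)                                 # mantissa spacing = 2^e / 8
    return np.where(absx > 0, np.round(x / step) * step, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, 10_000)   # weight-like values
wq = quantize_e4m3(w)
rel_err = np.abs(wq - w) / np.maximum(np.abs(w), 1e-12)
print(f"mean relative rounding error: {rel_err.mean():.3%}")
```

With only 3 mantissa bits the worst-case relative error is about 1/16 per weight regardless of scale, which is why FP8-aware training (learning around that error) matters more for a 7B model than for a larger one.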


Possibly an NVIDIA subsidy: you run NeMo models, you get cheaper GPUs.


