They specifically call out FP8-aware training, and TensorRT-LLM is really good (efficient) at FP8 inference on H100 and other Hopper cards. It's possible that they run the 7B natively in FP16, since smaller models suffer more than larger ones from even "modest" quantization like this.
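
To make the trade-off concrete, here's a minimal sketch of what a per-tensor FP8 round trip looks like, using PyTorch's float8_e4m3fn dtype (the E4M3 format used for weights/activations on Hopper, max representable value 448). This is just a simulation of quantize-dequantize error, not TensorRT-LLM's actual calibration pipeline, which is more involved:

    import torch

    # E4M3's max normal value is 448, so tensors are scaled into
    # that range before casting and rescaled back afterwards.
    FP8_E4M3_MAX = 448.0

    def fp8_quant_dequant(x: torch.Tensor) -> torch.Tensor:
        """Simulate per-tensor FP8 quantization: scale, cast down, cast back, unscale."""
        # Compute the scale in fp32 to avoid fp16 underflow in the clamp.
        scale = x.float().abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
        x_fp8 = (x.float() / scale).to(torch.float8_e4m3fn)  # requires PyTorch >= 2.1
        return (x_fp8.to(torch.float32) * scale).to(x.dtype)

    # Smaller models have less redundancy to absorb this round-trip error,
    # which is one reason a 7B might be kept in FP16.
    w = torch.randn(4096, 4096, dtype=torch.float16)
    err = (fp8_quant_dequant(w) - w).abs().mean()
    print(f"mean abs round-trip error: {err:.5f}")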


