Hacker News

Yeah, CPU inference is incredibly slow, especially as the context grows. A 4-bit quantized model on an A6000 should in theory work.
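Rough memory math for why 4-bit quantization fits on a single 48 GiB A6000 (a sketch; the 70B parameter count is an assumed example, not from the comment):

```python
# Back-of-the-envelope VRAM for model weights at a given precision.
# 70B params is an illustrative assumption; real usage adds KV cache
# and activation overhead on top of this.
def weight_vram_gib(n_params: float, bits_per_weight: int) -> float:
    """GiB needed for the weights alone."""
    return n_params * bits_per_weight / 8 / 2**30

params = 70e9
fp16 = weight_vram_gib(params, 16)  # ~130 GiB: needs several GPUs
q4 = weight_vram_gib(params, 4)     # ~33 GiB: fits a 48 GiB A6000
print(f"fp16: {fp16:.1f} GiB, 4-bit: {q4:.1f} GiB")
```

The same arithmetic shows why two 24 GiB 4090s (48 GiB combined) are in the same ballpark, if you can actually link them.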

If those rent-seeking bastards at NVIDIA hadn't killed NVLink on the 4090, you could do it on two linked 4090s for only $4k. But we have to live under the thumb of monopolists until AMD 1. catches up on hardware and 2. fixes their software support.


