Yeah, CPU inference is incredibly slow, especially as the context grows. A 4-bit quantized model on an A6000 should in theory work.
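For reference, here's a quick back-of-envelope on why 4-bit on an A6000 is plausible. The 48 GB figure is the A6000's spec; the 70B parameter count and ~10% overhead allowance are my assumptions, not from the thread, and the KV cache (which grows with context) comes on top of this:

```python
# Rough VRAM check: does a 4-bit quantized model fit on a 48 GB A6000?
# Assumptions: 70B-class model, ~10% of VRAM reserved for quantization
# scales, activations, and fragmentation. KV cache not included.

def quantized_weight_gb(n_params: float, bits_per_param: float = 4.0) -> float:
    """Approximate weight memory in GB for a quantized model."""
    return n_params * bits_per_param / 8 / 1e9

weights = quantized_weight_gb(70e9)  # ~35 GB at 4 bits/param
budget = 48 * 0.9                    # A6000 VRAM minus ~10% overhead
print(f"weights: {weights:.1f} GB, fits: {weights < budget}")
# -> weights: 35.0 GB, fits: True
```

That leaves maybe 8 GB of headroom for the KV cache, which is why long contexts can still push you over the edge even when the weights fit.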
If those rent-seeking bastards at NVIDIA hadn't killed NVLink on the 4090, you could do it on two linked 4090s for only ~$4k, but we have to live under the thumb of monopolists until AMD (1) catches up on hardware and (2) fixes their software support.