Draw Things recently released an 8-bit quantized SD model whose output is comparable to the FP16 version. It uses a k-means-based LUT (lookup table) and separates the weights into blocks to minimize quantization error.
I was going to search the internet about it, but then I realized you are the author (and I don't think there is anything online about it). I imagine the activations are left in FP16 and the weights are converted back to FP16 during inference, right?
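To check my understanding, here is a rough numpy sketch of what I imagine block-wise, k-means-based 8-bit LUT quantization looks like; the block size, initialization, and function names are my guesses, not your implementation:

```python
import numpy as np

def kmeans_1d(values, k=256, iters=20):
    """Plain Lloyd's k-means on scalar weights; returns (LUT, 8-bit codes)."""
    # Spread the initial centroids across the value distribution.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            members = values[codes == c]
            if members.size:
                centroids[c] = members.mean()
    return centroids.astype(np.float16), codes.astype(np.uint8)

def quantize_blockwise(weights, block_size=4096, k=256):
    """Split a weight tensor into blocks, each with its own 256-entry LUT."""
    flat = weights.astype(np.float32).ravel()
    luts, codes = [], []
    for start in range(0, flat.size, block_size):
        lut, idx = kmeans_1d(flat[start:start + block_size], k=k)
        luts.append(lut)    # 256 FP16 entries per block
        codes.append(idx)   # one uint8 index per weight
    return luts, codes
```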
Yes, computation is carried out in FP16, so there are no compute-efficiency gains, though there could be latency reductions from the memory-bandwidth savings. Those savings are not realized yet because no custom kernels have been introduced yet.
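Conceptually the inference path is just a table lookup back to FP16 followed by an ordinary FP16 matmul, which is why the potential win is memory bandwidth rather than arithmetic. The numpy sketch below only illustrates that idea (the app itself uses Metal, and these names are made up):

```python
import numpy as np

def dequantize_blockwise(luts, codes):
    """Expand per-block 8-bit codes back to FP16 weights via each block's LUT."""
    return np.concatenate([lut[idx] for lut, idx in zip(luts, codes)])

def linear_fp16(x, luts, codes, out_features, in_features):
    """FP16 activations times dequantized FP16 weights -- the GEMM itself stays FP16."""
    w = dequantize_blockwise(luts, codes).reshape(out_features, in_features)
    return (x.astype(np.float16) @ w.T).astype(np.float16)

# Tiny standalone demo with one 64-weight block and an evenly spaced LUT.
lut = np.linspace(-1.0, 1.0, 256, dtype=np.float16)
idx = np.random.randint(0, 256, size=64, dtype=np.uint8)
x = np.random.randn(1, 8).astype(np.float16)
y = linear_fp16(x, [lut], [idx], out_features=8, in_features=8)
```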