Can someone a lot smarter than me give a basic explanation as to why something like this can run at a respectable speed on a CPU, whereas Stable Diffusion is next to useless there? (That is to say, 10-100x slower, whereas I have not seen GPU-based LLaMA go 10-100x faster than the demo here.) I had assumed similar algorithms were at play.
Quantization is the answer here. Running the large models on a CPU at 16-bit precision (which in practice means 32-bit, since most CPUs lack native FP16 support) would be really slow: inference is dominated by memory bandwidth, so shrinking the weights to 4-8 bits cuts both the bytes moved and the memory footprint by 4-8x.
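To make the idea concrete, here is a minimal NumPy sketch of symmetric per-tensor int8 quantization. Real CPU inference stacks (e.g. llama.cpp's GGML formats) use more elaborate schemes such as 4-bit weights with per-block scales, but the principle is the same: store a small integer plus a scale, and reconstruct an approximate float on the fly.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 plus one scale factor for the whole tensor."""
    scale = np.abs(w).max() / 127.0          # largest weight maps to +/-127
    q = np.round(w / scale).astype(np.int8)  # integer codes, 1 byte each
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the integer codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.nbytes)   # 16384 bytes as float32
print(q.nbytes)   # 4096 bytes as int8: a 4x reduction
print(np.max(np.abs(w - w_hat)))  # rounding error bounded by scale / 2
```

The memory saving is exactly the ratio of the element sizes (4x for int8 vs. float32, 8x for 4-bit codes), and on bandwidth-bound CPU inference that translates almost directly into throughput.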