You can optimize this a lot and fit it on small systems if you're willing to change your workflow a bit: instead of running 1 prompt -> 1 image _n_ times, do 1 prompt -> _n_ images once, then finish only _m_ of them.

For a given prompt, run it through the T5 model once and cache the embedding. Since you only need the embedding once, you can do that in CPU RAM if you have to; you don't need a GPU that can run T5-XXL natively.

Then sample a large batch from #2; 64px is enough to preview. Only once you've picked the ones you like do you run them through #3, and then from those through #4.

Your peak VRAM is then a single image in #2 or #4, and those models can be quantized or pruned down to something that will fit on many GPUs.
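The control flow above can be sketched schematically. This is a pure-Python toy, not a real inference script: `encode_prompt`, `stage2_preview`, `stage3_upscale`, and `stage4_upscale` are hypothetical stand-ins for the actual models, and the point is only the shape of the pipeline (T5 runs once, #2 runs as a cheap batch, #3/#4 run only on the picks):

```python
def encode_prompt(prompt):
    # Stand-in for T5-XXL: runs once per prompt, can live on CPU.
    return [hash(prompt) % 997]  # pretend this is the cached text embedding

def stage2_preview(embedding, n):
    # Stand-in for the 64px base model: one cheap batch of low-res candidates.
    return [f"preview_{i}" for i in range(n)]

def stage3_upscale(image):
    # Stand-in for the mid-res upscaler (#3).
    return image.replace("preview", "256px")

def stage4_upscale(image):
    # Stand-in for the final upscaler (#4).
    return image.replace("256px", "1024px")

def generate(prompt, n=64, keep=4):
    emb = encode_prompt(prompt)        # T5 runs exactly once, then is discarded
    previews = stage2_preview(emb, n)  # n cheap 64px previews in one batch
    picked = previews[:keep]           # the user picks m favorites (here: first m)
    # Only the picks pay the #3/#4 cost, one image at a time.
    return [stage4_upscale(stage3_upscale(p)) for p in picked]

print(generate("a castle at dusk", n=64, keep=4))
# -> ['1024px_0', '1024px_1', '1024px_2', '1024px_3']
```

With this shape, peak memory at any moment is one model plus one image (or one low-res batch), never the whole pipeline at once.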