I'm quite curious how much of the improvement in text rendering comes from the switch to pixel-space diffusion vs. the switch to a much larger pretrained text encoder. I'm leaning towards the latter, which raises the question: what happens if you train Stable Diffusion with T5-XXL-1.1 as the text encoder instead of CLIP? Does it gain the ability to render text well?
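For concreteness, here's roughly what that swap could look like with transformers/diffusers. This is an untested sketch: the model ids and the learned 4096 -> 768 projection are my assumptions, since T5-XXL's hidden size doesn't match SD 1.x's cross-attention dim.

    # Untested sketch: condition SD 1.x's UNet on T5 embeddings instead of CLIP.
    # T5-XXL's hidden size (4096) != SD's cross-attention dim (768), so a
    # learned projection is assumed here; it and the UNet would need fine-tuning.
    import torch
    from torch import nn
    from transformers import T5Tokenizer, T5EncoderModel
    from diffusers import UNet2DConditionModel

    tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
    enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
    unet = UNet2DConditionModel.from_pretrained(
        "runwayml/stable-diffusion-v1-5", subfolder="unet")

    proj = nn.Linear(enc.config.d_model, unet.config.cross_attention_dim)

    ids = tok("a sign that says HELLO", return_tensors="pt").input_ids
    with torch.no_grad():
        text_states = enc(input_ids=ids).last_hidden_state  # (1, seq, 4096)
    context = proj(text_states)                             # (1, seq, 768)

    # One denoising forward pass with the projected T5 context.
    latents = torch.randn(1, 4, 64, 64)
    noise_pred = unet(latents, torch.tensor([500]),
                      encoder_hidden_states=context).sample

You'd then fine-tune the UNet (and the projection) on image-text pairs; the open question is whether the spelling ability shows up without also moving to pixel-space diffusion.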
DeepFloyd IF is effectively the same architecture/text encoder as Imagen (https://imagen.research.google/), although that paper doesn't hypothesize about why text rendering works out so much better.
Right, I'm aware of the Imagen architecture, just curious to see further research determining which aspect of it is responsible for the improved text rendering.
EDIT: According to the figure in the Imagen paper that FL33TW00D's response pointed me to, text encoder size looks like the biggest factor in the improved model performance across the board.
The CLIP text encoder is trained to align with the pooled image embedding (a single vector), so the training signal reaches the per-token states only indirectly through that one pooled vector; that's why most of the individual text embeddings aren't very meaningful on their own (though they still convey the overall semantics of the text). T5's encoder, by contrast, is trained on span-corruption denoising, where the decoder attends to all of the encoder's token states, so every token embedding ends up carrying usable information.
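You can see the structural difference directly in transformers. A small sketch (I'm substituting t5-v1_1-small for XXL just to keep it loadable; the prompt is arbitrary):

    from transformers import (CLIPTokenizer, CLIPTextModel,
                              T5Tokenizer, T5EncoderModel)

    prompt = "a stop sign"

    clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
    c = clip_enc(**clip_tok(prompt, return_tensors="pt"))
    print(c.last_hidden_state.shape)  # (1, seq, 768): per-token states
    print(c.pooler_output.shape)      # (1, 768): only this pooled vector feeds
                                      # CLIP's contrastive loss (via a projection)

    t5_tok = T5Tokenizer.from_pretrained("google/t5-v1_1-small")
    t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-small")
    t = t5_enc(**t5_tok(prompt, return_tensors="pt"))
    print(t.last_hidden_state.shape)  # (1, seq, 512): the whole sequence of
                                      # token states is what the T5 decoder
                                      # attends to during pretraining

Stable Diffusion cross-attends to CLIP's per-token last_hidden_state, but those states were never directly optimized to be individually informative the way T5's were.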