I'm quite curious how much of the improvement in text rendering comes from the switch to pixel-space diffusion vs. the switch to a much larger pretrained text encoder. I'm leaning towards the latter, which raises the question: what happens if you train Stable Diffusion with T5-XXL-1.1 as the text encoder instead of CLIP? Does it gain the ability to render text well?
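For concreteness, here's roughly what that swap could look like with transformers/diffusers. This is an untested sketch: the model ids and the learned 4096 -> 768 projection are my assumptions, since T5-XXL's hidden size doesn't match SD 1.x's cross-attention dim.

    # Untested sketch: condition SD 1.x's UNet on T5 embeddings instead of CLIP.
    # T5-XXL's hidden size (4096) != SD's cross-attention dim (768), so a
    # learned projection is assumed here; it and the UNet would need fine-tuning.
    import torch
    from torch import nn
    from transformers import T5Tokenizer, T5EncoderModel
    from diffusers import UNet2DConditionModel

    tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
    enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
    unet = UNet2DConditionModel.from_pretrained(
        "runwayml/stable-diffusion-v1-5", subfolder="unet")

    proj = nn.Linear(enc.config.d_model, unet.config.cross_attention_dim)

    ids = tok("a sign that says HELLO", return_tensors="pt").input_ids
    with torch.no_grad():
        text_states = enc(input_ids=ids).last_hidden_state  # (1, seq, 4096)
    context = proj(text_states)                             # (1, seq, 768)

    # One denoising forward pass with the projected T5 context.
    latents = torch.randn(1, 4, 64, 64)
    noise_pred = unet(latents, torch.tensor([500]),
                      encoder_hidden_states=context).sample

You'd then fine-tune the UNet (and the projection) on image-text pairs; the open question is whether the spelling ability shows up without also moving to pixel-space diffusion.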
DeepFloyd IF is effectively the same architecture/text encoder as Imagen (https://imagen.research.google/), although that paper doesn't hypothesize about why text rendering works out so much better.
Right, I'm aware of the Imagen architecture, just curious to see further research determining which aspect of it is responsible for the improved text rendering.
EDIT: According to the figure in the Imagen paper that FL33TW00D's response pointed me to, text encoder size looks like the biggest factor in the improved model performance across the board.
The CLIP text encoder is trained to align with the pooled image embedding (a single vector), so the training signal reaches the per-token states only indirectly through that one pooled vector; that's why most of the individual text embeddings aren't very meaningful on their own (though they still convey the overall semantics of the text). T5's encoder, by contrast, is trained on span-corruption denoising, where the decoder attends to all of the encoder's token states, so every token embedding ends up carrying usable information.
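You can see the structural difference directly in transformers. A small sketch (I'm substituting t5-v1_1-small for XXL just to keep it loadable; the prompt is arbitrary):

    from transformers import (CLIPTokenizer, CLIPTextModel,
                              T5Tokenizer, T5EncoderModel)

    prompt = "a stop sign"

    clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
    c = clip_enc(**clip_tok(prompt, return_tensors="pt"))
    print(c.last_hidden_state.shape)  # (1, seq, 768): per-token states
    print(c.pooler_output.shape)      # (1, 768): only this pooled vector feeds
                                      # CLIP's contrastive loss (via a projection)

    t5_tok = T5Tokenizer.from_pretrained("google/t5-v1_1-small")
    t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-small")
    t = t5_enc(**t5_tok(prompt, return_tensors="pt"))
    print(t.last_hidden_state.shape)  # (1, seq, 512): the whole sequence of
                                      # token states is what the T5 decoder
                                      # attends to during pretraining

Stable Diffusion cross-attends to CLIP's per-token last_hidden_state, but those states were never directly optimized to be individually informative the way T5's were.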