
Wow this does so well on text! The original model struggled a lot, it's impressive to see how far they've come.



It's much better, but it's not perfect. Here's what I got for:

> a photograph of raccoon in the woods holding a sign that says "I will eat your trash"

https://twitter.com/simonw/status/1651994059781832704


This is not a problem. The sign was clearly made by the raccoon.


Agreed - to properly test it, we should try:

A photograph of an English professor in the woods holding a sign that says "I will eat your trash"


The slightly incorrect wording actually adds quite a nice charm.


I'm quite curious how much of the improvement on text rendering is from the switch to pixel-space diffusion vs. the switch to a much larger pretrained text encoder. I'm leaning towards the latter, which then raises the question of what happens when you try training Stable Diffusion with T5-XXL-1.1 as the text encoder instead of CLIP — does it gain the ability to do text well?
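
A rough sketch of what I mean, purely illustrative (the checkpoint name, shapes, and prompt are my own assumptions, and the actual hard part - retraining the U-Net's cross-attention layers against the new embedding size - isn't shown):

    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    # the T5 v1.1 XXL encoder used by Imagen / DeepFloyd IF (very large download)
    tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
    enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

    prompt = 'a sign that says "I will eat your trash"'
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        emb = enc(ids).last_hidden_state  # (1, seq_len, 4096) per-token embeddings

    # Stable Diffusion instead cross-attends to CLIP's (1, 77, 768) sequence,
    # so conditioning on `emb` would mean new cross-attention K/V projections
    # plus a lot of retraining.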


DeepFloyd IF is effectively the same architecture/text encoder as Imagen (https://imagen.research.google/), although that paper doesn't hypothesize why text rendering works out so much better.


Right, I'm aware of the Imagen architecture, just curious to see further research determining which aspect of it is responsible for the improved text rendering.

EDIT: According to the figure in the Imagen paper FL33TW00D's response referred me to, it looks like the text encoder size is the biggest factor in the improved model performance all-around.


The CLIP text encoder is only trained to align its pooled output (a single vector) with the image embedding, which is why the individual token embeddings are not very meaningful on their own (though they still convey the overall semantics of the text). With T5, every token embedding carries useful information.
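
You can see that split directly in the CLIP text model's outputs - a rough illustration only, nothing to do with the IF codebase:

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    # the text encoder Stable Diffusion 1.x conditions on
    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    batch = tok('"I will eat your trash"', padding="max_length",
                max_length=77, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)

    # (1, 768): the single vector CLIP's contrastive loss aligns (after projection)
    print(out.pooler_output.shape)
    # (1, 77, 768): the per-token states SD cross-attends to, trained only indirectly
    print(out.last_hidden_state.shape)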


Check out figure 4A from the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf

From that, I would strongly suspect the answer to your question is yes.


Ah, yes, this seems to be pretty strong evidence. Thanks for pointing that figure out to me!


It is most likely due to the text encoder - see "Character-Aware Models Improve Visual Text Rendering". https://arxiv.org/abs/2212.10562
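
The gist of that paper, roughly illustrated (the checkpoint names below are just examples I picked, and DeepFloyd IF still uses the standard, character-blind T5 tokenizer): a SentencePiece T5 tokenizer never sees individual characters, whereas a byte-level encoder like ByT5 does, which the authors argue is what spelling accuracy hinges on.

    from transformers import AutoTokenizer

    sp = AutoTokenizer.from_pretrained("google/t5-v1_1-base")  # character-blind
    byt = AutoTokenizer.from_pretrained("google/byt5-base")    # character-aware

    text = "I will eat your trash"
    print(sp.tokenize(text))   # whole subword pieces, e.g. ['▁I', '▁will', '▁eat', ...]
    print(byt.tokenize(text))  # one token per UTF-8 byte: ['I', ' ', 'w', 'i', ...]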


That alone is a huge milestone.



