I'm not seeing anything about the dataset; are they still using LAION? There's no mention of LAION in the paper, and the results look quite different from 1.5, so I'm guessing no.
> the model may encounter challenges when synthesizing intricate structures, such as human hands
I think there are two main reasons for poor hands/text:
- Humans care about certain areas of the image more than others, giving high saliency to faces, hands, and body shape, and lower saliency to backgrounds and textures. The way the UNet is trained, it cares about all areas of the image equally, so model capacity per area is uniform. That leads to capacity problems for objects with a large number of valid configurations that humans care more about (see the sketch below for one way the loss could be reweighted).
- The sampling procedure implicitly assumes a uniform amount of variance over the entire image. Text glyphs basically never change, which means we'd effectively want near-infinite CFG in the parts of the image that contain text.
I'm not sure there's much point in working on either of these, though, since both can be fixed by simply making a bigger model.
From hanging out around the LAION Discord server a bunch over the past few months, I've gathered that they're still using LAION-5B in some capacity, but they've done a bunch of filtering on it to remove low-quality samples. I believe Emad tweeted something to this effect at some point, too, but I can't find the tweet right now.
Are any of these txt2img models being partially trained on synthetic datasets? Automatically rendering tens of thousands of images with different textures, backgrounds, camera poses, etc. should be trivial given a handful of human models, or text rendered in different fonts.