Interesting, I do not get the results you do. What additional parameters are you using? Here is a link to some of my tests, with all default settings, some in v5 some in v4. https://twitter.com/eb_french/status/1651370091869786112
none, and as an experienced user you should know that it's not one shot and most of the time not even few shot... You can't compare cherry-picked press images with a few shots of a 5-second prompt. I don't know why you want to hype something up if you can't really compare it. It seems extremely attention-grifting.
Just look at their cherry picks in this discord... https://discord.com/invite/pxewcvSvNx .
It's overfitted on copyrighted images (the Afghan Girl) and doesn't show more "compositional power" at all, most of the time ignoring half of the prompt.
> as an experienced user you should know that it's not one shot
Being "not one shot" for most nontrivial prompts is a failure of current t2i models; it's what they all strive for, and it's what DF supposedly does a lot better. And while it's possible to spin things pretty hard when people can't bang on it themselves, I think the indication is that it is, in fact, a major leap forward from the best current consumer-available t2i models (it looks pretty comparable to Google Imagen, with slightly worse benchmark scores, which is unsurprising since it seems to be an implementation of exactly the architecture described in Google's Imagen paper).
> It’s overfitted on images with copyright (afghan girl)
It’s…not, though. Sure, a prompt suggestive of that image (down to even specifying the same film type) produces a picture with a vibe that, if you haven’t recently looked at the famous photo but are familiar with it, completely feels like a “cleaned up” version of it, so you might intuitively feel it comes from overfitting, that it is basically reproducing the original image with slight variations.
Look at the two pictures side by side, though, and there is basically nothing similar about them except exactly the things specified in the prompt, and pretty much every aspect of the way those elements of the prompt are interpreted in the DF image is unlike the other image.
It's not a major leap, not even a small one, because it's exactly like Imagen. It's Stability giving some Ukrainian refugees compute time to train "their" model for publicity. It's about the who and not the what, as it should be.
I "feel" nothing; I'm telling it how it is. Look at the Afghan Girl example again: close-up portrait, same clothing, same composition, expressive eyes... and most importantly, burn-in like every other overfitted image in diffusion networks.
You guys all want it to be something special, and I get it: new content, new shiny toy. But it's neither a good architecture nor a good implementation.
> It’s not a major leap, not even a small one, because it’s exactly like Imagen.
I would agree, if Imagen were a “consumer-available t2i model”. What’s available is a research paper with demo images from Google. The model itself is locked up inside Google, notionally because they haven’t solved filtering issues with it.
> Look at the Afghan Girl example again: close-up portrait, same clothing, same composition, expressive eyes…
You look at it again; literally none of those things are the same. It’s not the same clothing (the material and color of the head scarf are different, and the headscarf is the only visible clothing in the DF image, which is not the case in the famous image), the condition of the head scarf is different, the hair color is different, the hair style is different, the hair texture is different, the face shape is different, the individual facial features are different, the eye color is much more brown in the DF image, the facial expression is different, the DF image has lipstick and eyeshadow while the famous image has a dirty face and no makeup, the headscarf is worn differently in the two images, the background is different, the lighting is different, and the faces are framed differently.
The similarities are (1) it’s a close-up portrait, (2) a general ethnic similarity, (3) both subjects are wearing a red (though very different red) head scarf, and (4) both are looking straight into the camera. (2)–(4) are explicitly prompted, and (1) is strongly implied by a prompt addressing nothing that isn’t related to the face/head. This isn’t “overfitting on a copyrighted image”; it’s getting what you prompted, with no other similarity to the existing image.
> You guys all want it to be something special and I get it,
I’m actually kind of annoyed, because I’ve been collecting tooling, checkpoints, and other support for Stable Diffusion, and spending quite a bit of time getting proficient with its quirks. But that’s life.
> it’s neither a good architecture nor a good implementation.
I’d be interested in hearing your specific criticism of the architecture and implementation, but hopefully it’s more grounded in fact than your criticism of the one image...
0/16 images have a red cube on a green sphere.