
Example of how much better it can do compared to midjourney, on a complex prompt: https://twitter.com/eb_french/status/1623823175170805760

It is able to put people on the left/right and put the correct t-shirts and facial expressions on each one. This is compared to mj which just mixes together a soup of every word you use and plops it out into the image. Huge MJ fan of course, it's amazing, but having compositional power is another step up.



Midjourney always looks very aesthetically pleasing, I guess because of their RLHF tuning with Discord data... But it doesn't really follow prompts as well as DALL-E, for example.

But in the end, people want pretty pictures. So it's a complicated situation.


The tweet they shared is from February and uses an outdated version of MJ, this is what I got from V5: https://i.imgur.com/0uxtZDe.png

Midjourney does much better overall. Composition is neat, but MJ is so incredibly far ahead in terms of output quality that it honestly doesn't matter if you have to go and do composition manually (and with new AI-based tools, that's easier than ever too: do a bad cut-and-paste job, then infill your way back to a coherent image).
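
For the "bad cut-and-paste then infill" step, here is a minimal sketch of what that can look like with Stable Diffusion inpainting via the diffusers library (the comment doesn't name a tool; the file names and the runwayml/stable-diffusion-inpainting checkpoint are just illustrative assumptions):

    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    # Load a Stable Diffusion inpainting checkpoint (illustrative choice)
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    # The rough cut-and-paste composite, plus a white-on-black mask over the ugly seams
    init = Image.open("pasted_composite.png").convert("RGB").resize((512, 512))
    mask = Image.open("seam_mask.png").convert("RGB").resize((512, 512))

    # The model redraws only the masked regions, blending the pasted parts into one coherent image
    result = pipe(
        prompt="two men talking in an office, photo, natural lighting",
        image=init,
        mask_image=mask,
    ).images[0]
    result.save("infilled.png")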


But it didn't work? In yours there is no "Nexus", no smiling, no frowning, and the man on the right doesn't look Asian. Compared with the image in the tweet, MJ failed at this task.


That depends on what your goal was. If your goal was to get an AI model to generate copyrighted images and misunderstand the relationship between Indians and Asians, then sure, MJ failed (and I'm guessing that's the goal of the prompt).

But if I actually wanted a useful picture, I could work with what MJ gave me despite having minimal image-editing skills. The DeepFloyd result looks like it's 8-12 months behind what MJ gave and wouldn't be salvageable.


In what way does the man on the right not look like he could be from the absolutely enormous continent known as Asia?


Thing is, it's possible to just run experiments against your claim that "actually the bot was doing a good job"

https://twitter.com/eb_french/status/1651584746089218049

it wasn't.

Again, I don't know why everyone is so defensive. I love MJ. There's nothing wrong with admitting that other models might do certain things better. We all can use any model we want.


There's a certain irony in tweeting ad hominem attacks then claiming people are being defensive...


Yeah, perhaps I was not good at judging tone. To me it's a matter-of-fact thing that MJ isn't good at this. They'd admit it; it's not a big deal, and I'm a fan.

It's not ad hominem at all... MJ isn't as good at certain types of composition as others, and I don't get why people are pretending that isn't the case. I want everyone to have great models, and IF is part of that progress. Perhaps calling it "word soup" was offensive? This isn't your religion, though; it's just a model. Listening in on the MJ office hours, they're the farthest thing you can be from dogmatic or arrogant. They want to improve, as we all do. I personally am just really inspired that everyone can advance together!

Also see downthread - the first 32 images I generated attempting to reproduce the claim that "actually MJ can do this" all failed. The person who challenged me then ignored it. This isn't really up for debate until someone sends a seed where mj can do the cube + sphere thing well.


"There's a funny hn thread where people aren't good at arguing or experimentation" is an insult.


I didn't claim the bot was doing a good job. Who was this reply supposed to be aimed at?


A human interpreting the prompt would see "asian" as being in contrast to "indian" in the language of the prompt... Not a level of comprehension that can be expected of current models but maybe in a few years (months?).


I'm human. I interpreted Asian as from a non-specific part of Asia. I realise though that "Asian" has a very specific meaning in the US, but it's only the US that does this. For the rest of the world Asian means someone from Asia.


That's kind of what I meant... If a prompt specifies one "Indian" and one "Asian", that implies the writer of the prompt doesn't think of Indians as Asian, so they're probably from a US background.


There are more than two countries in Asia. Could be one Indian and one Sri Lankan, which is what the old dude looks like to me.


The thing is... IF is currently just a base model; it will need serious fine-tuning before it produces aesthetically pleasing images (like MJ certainly does).

It's interesting to see what IF can do in terms of composition, text rendering, etc. It's very promising if aesthetically pleasing images can be achieved via fine-tuning (the same happened with SD... current publicly fine-tuned models can achieve much higher levels of quality and cohesion than the base models; here's the prompt in an SD 2.1-based model: https://imgur.com/a/ELGMSmV ).

Of course, fine-tuning IF is likely more challenging, as both of the first two stages and the 4x SD upscaler might need to be fine-tuned...
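
For reference, here is a minimal sketch of those three stages as exposed in Hugging Face diffusers, assuming the public DeepFloyd/IF-I-XL-v1.0, DeepFloyd/IF-II-L-v1.0 and stabilityai/stable-diffusion-x4-upscaler checkpoints (the prompt and output file name are just placeholders):

    import torch
    from diffusers import DiffusionPipeline

    # Stage I: 64x64 pixel-space diffusion conditioned on T5 text embeddings
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
    )
    # Stage II: 64 -> 256 super-resolution, reusing Stage I's text encoder
    stage_2 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
    )
    # Stage III: the latent 4x SD upscaler, 256 -> 1024
    stage_3 = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
    )

    prompt = "a red cube on top of a green sphere"
    prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

    image = stage_1(prompt_embeds=prompt_embeds,
                    negative_prompt_embeds=negative_embeds, output_type="pt").images
    image = stage_2(image=image, prompt_embeds=prompt_embeds,
                    negative_prompt_embeds=negative_embeds, output_type="pt").images
    image = stage_3(prompt=prompt, image=image).images[0]
    image.save("if_sample.png")

Fine-tuning for aesthetics would in principle have to touch each of these checkpoints separately, which is why it's a bigger lift than fine-tuning a single SD checkpoint.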


Well... kind of. Photobashing with Midjourney doesn't guarantee you the same image, or even necessarily the objects in the same places, even if you increase the image weight value up to its maximum of two ('--iw 2').

Many times you'll have no other choice but to use a diffusion model with img2img.
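
A sketch of that img2img step with diffusers, assuming a rough photobashed composite as the starting point (the model choice, strength value and file names are illustrative):

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # A crude collage with the objects roughly where you want them
    init_image = Image.open("rough_composite.png").convert("RGB").resize((768, 512))

    result = pipe(
        prompt="two men standing face to face in an office, detailed photo",
        image=init_image,
        strength=0.45,        # low strength preserves the layout, high strength redraws more freely
        guidance_scale=7.5,
    ).images[0]
    result.save("cohesive.png")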

I agree with OP though: the market has spoken, and the vast majority of people use prompts hardly more nuanced than a 90s Mad Magazine book of Mad Libs.


I wasn't referring to photobashing; I meant firing up SD and ArtStudio for 10 minutes and getting something that looks amazing and has the desired composition.

Overall this feels like trying to get ChatGPT to do math: just let ChatGPT offload math to Wolfram.

Similarly, I'd rather just offload the composition. Now we even have SAM, which will happily pick out the parts of the image you want to compose.
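
For the SAM part, a minimal sketch with Meta's segment-anything package, assuming a downloaded ViT-H checkpoint and a hypothetical click coordinate:

    import cv2
    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    # Assumes the sam_vit_h_4b8939.pth checkpoint has been downloaded
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    image = cv2.cvtColor(cv2.imread("scene.png"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # One foreground click on the subject you want to lift out (coordinates are hypothetical)
    point = np.array([[320, 240]])
    label = np.array([1])
    masks, scores, _ = predictor.predict(
        point_coords=point, point_labels=label, multimask_output=True
    )

    # Keep the highest-scoring mask and cut the subject out for compositing elsewhere
    best = masks[np.argmax(scores)]
    cutout = image * best[..., None]
    cv2.imwrite("cutout.png", cv2.cvtColor(cutout.astype(np.uint8), cv2.COLOR_RGB2BGR))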


Spatial composition can be done easily if you stop bothering with pure text-to-image (SD has several tricks and UIs to place objects precisely; they are all janky, but they do work, and it's practically photobashing). Attribute separation is also easily done with tricks like token bucketing, so your Indian guy will look Indian and your East Asian guy will look East Asian. All of that is easy if you abandon ambiguous natural language and use higher-order guidance.

What's really required is semantic composition: making subjects meaningfully and predictably interact, or combining them. And also coherence of the overall stitched picture, so you don't end up with several different perspective planes.
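
The parent doesn't name a specific tool, but one widely used form of that kind of higher-order spatial guidance is ControlNet. Here is a minimal sketch with diffusers, assuming a hand-drawn scribble of the layout (the checkpoint names and file names are illustrative):

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # A scribble-conditioned ControlNet pins objects to the locations you drew
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    # A crude white-on-black sketch: a circle at the bottom, a square balanced on top
    layout = Image.open("cube_on_sphere_scribble.png").convert("RGB")

    image = pipe(
        "a red cube balanced on top of a green sphere, studio lighting",
        image=layout,
        num_inference_steps=30,
    ).images[0]
    image.save("composed.png")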


Can't wait to see how all of this is going to look in ten years. I know we are all nitpicking right now, but these results are totally mind-blowing already.


The thing is, all of this can go downhill as well. These tools are cool today, just like Google was a decade ago.


Even though we can run most of these locally?


He's probably referring to the _business_ of search. The B word taints a lot of things.


I wonder when we can start fuzzing brains. Wire you up to a machine that measures happiness or anxiety or anger or whatever, and keep re-generating results that home in on the given emotion.


Sometimes you have to ask yourself if that's a world you really want to live in.


So… heroin, cocaine, and methamphetamine, in memetic form?

I really hope that isn't actually possible.


Just because something can be done doesn't mean it must be done.


So it seems like Midjourney looks better and has more realistic faces, but this one is better at following the prompt and generating what you actually want.


Midjourney follows the "facing each other" better, but DF has readable text. This may be because DF also suffers from an effect that ISTR is common to MJ and SD: when you specify conflicting things it understands, it prioritizes the "visibility" one (the specific text, in this case) over the "compositional" one (in this case, "facing each other").


That's not how any of this works. What do you even mean by compositional power? Every model speaks a different "language"; comparing prompts like this has no merit and shows only that the person making the claim lacks understanding of the subject matter.


Compositional power might mean "the image more resembles the composition you want and describe"

i.e. if you say "a red cube on a green sphere" in DeepFloyd, you will get it. If you say that in MJ, you won't. That means you have more power to compose the image you want with this tool.


No, it just means you don't understand how to prompt MJ; you don't understand its language. You might like French more, but that doesn't mean it's a better language than English. MJ even says in their FAQ that their model doesn't understand language the way humans do...


The point of text-to-image models is for them to accept natural language (yes, in practice they all benefit from specialized prompting done with an understanding of model quirks, but that's not the goal).


How you prompt is a preference, like programming languages. Of course the layman might use the JavaScript of generative models because it's easier to start and there are a lot of tutorials, but some might prefer something more exotic that can produce the same or better quality. Whatever floats your boat, but don't try to compare them the way the guy in OP's tweets does.

MJ and stable also make clear that their models don't understand language like humans do.


> MJ and stable also make clear that their models don't understand language like humans do.

I believe the claim here is "and we would like them to".


For example, please help me understand how to help MJ do the prompt above!

https://twitter.com/eb_french/status/1651365078137200640

BTW I am a HUGE fan of MJ, and attend the office hours, and have done 35k+ images there. So you may have misinterpreted how much of a supporter of it I am.


First shot, but it's a starting point. With 35k images you should be able to do this yourself.

a single (green sphere), with a single (red cube), balancing ((on top))

https://imgur.com/a/QePgM6I


Interesting, I do not get the results you do. What additional parameters are you using? Here is a link to some of my tests, with all default settings, some in v5 some in v4. https://twitter.com/eb_french/status/1651370091869786112

0/16 images have a red cube on a green sphere.


None, and as an experienced user you should know that it's not one shot and most of the time not even few shot... You can't compare cherry-picked press images with a few shots of a 5-second prompt. I don't know why you want to hype something up if you can't really compare it. It seems like extreme attention grifting.

Just look at their cherry-picks in this Discord... https://discord.com/invite/pxewcvSvNx . It's overfitted on copyrighted images (the Afghan girl) and doesn't show more "compositional power" at all, most of the time ignoring half of the prompt.


> as an experienced user you should know that it's not one shot

Being "not one shot" for most nontrivial prompts is a failure of current t2i models; one-shot success is what they all strive for, and it's what DF supposedly does a lot better. And, while it's possible to spin things pretty hard when people can't bang on it themselves, I think the indication is that it is, in fact, a major leap forward from the best current consumer-available t2i models (it looks pretty comparable to Google Imagen – slightly worse benchmark scores – which is unsurprising since it seems to be an implementation of exactly the architecture described in Google's Imagen paper).

> It's overfitted on copyrighted images (the Afghan girl)

It's... not, though. Sure, the picture with a prompt which is suggestive of that (down to even specifying the same film type) gives off a vibe that completely feels, if you haven't recently looked at the famous picture but are familiar with it, like a "cleaned up" version of that picture, so you might intuitively feel it's from overfitting, that it is basically reproducing the original image with slight variations.

Look at the two pictures side by side and there is basically nothing similar about them except exactly the things specified in the prompt, and pretty much every aspect of the way those elements of the prompt are interpreted in the DF image is unlike the other image.


Are you associated with DeepFloyd?

It's not a major leap, not even a small one, because it's exactly like Imagen. It's Stability giving some Ukrainian refugees compute time to train "their" model for publicity. It's about the who and not the what, as it should be.

I "feel" nothing; I am telling it like it is. Look at the Afghan girl example again: close-up portrait, same clothing, same comp, expressive eyes... and most importantly, burn-in like every other overfitted image in diffusion networks.

You guys all want it to be something special and I get it: new content, new shiny toy. But it's neither a good architecture nor a good implementation.


> Are you associated with deepfloyd?

No, I'm not affiliated with StabilityAI.

> It's not a major leap, not even a small one, because it's exactly like Imagen.

I would agree, if Imagen were a "consumer-available t2i model". What's available is a research paper with demo images from Google. The model itself is locked up inside Google, notionally because they haven't solved filtering issues with it.

> Look at the Afghan girl example again: close-up portrait, same clothing, same comp, expressive eyes…

You look at it again, literally none of those things are the same: it's not the same clothing (the material and color of the head scarf are different, the headscarf is the only visible clothing in the DF image, whereas that is not the case in the famous image), the condition of the head scarf is different, the hair color is different, the hair style is different, the hair texture is different, the face shape is different, the individual facial features are different, the eye color is much more brown in the DF image, the facial expression is different, the DF image has lipstick and eyeshadow, the famous image has a dirty face and no makeup, the headscarf is worn differently in the two images, the background is different, the lighting is different, and the faces are framed differently.

The similarities are (1) it's a close-up portrait, (2) a general ethnic similarity, (3) they are both wearing a red (though very different red) head scarf, and (4) they are both looking straight into the camera. (2)-(4) are explicitly prompted, and (1) is strongly implied in a prompt addressing nothing that isn't related to the face/head. This isn't "overfitting on a copyrighted image"; it's getting what you prompted, with no other similarity to the existing image.

> You guys all want it to be something special and I get it,

I’m actually kind of annoyed, because I’ve been collecting tooling, checkpoints, and other support for, and spending quite a bit of time getting proficient in dealing with the quirks of, Stable Diffusion. But, that’s life.

> it’s neither a good architecture nor a good implementation.

I'd be interested in hearing your specific criticism of the architecture and implementation, but hopefully it's more grounded in fact than your criticism of the one image...


Here are 16 more images with set seeds this time. Could you provide a complete prompt & seed to generate an image where MJ does this well? https://twitter.com/eb_french/status/1651371581514579969


Please give me a prompt which would back up your claim that MJ can do this! I'd love to learn



