
Example of how much better it can do compared to midjourney, on a complex prompt: https://twitter.com/eb_french/status/1623823175170805760

It is able to put people on the left/right and put the correct t-shirts and facial expressions on each one. This is compared to mj which just mixes together a soup of every word you use and plops it out into the image. Huge MJ fan of course, it's amazing, but having compositional power is another step up.



Midjourney always looks very aesthetically pleasing, I guess because of their RLHF tuning with Discord data... But it doesn't really follow prompts as well as DALL-E, for example.

But in the end, people want pretty pictures. So it's a complicated situation.


The tweet they shared is from February and uses an outdated version of MJ, this is what I got from V5: https://i.imgur.com/0uxtZDe.png

Midjourney does much better overall. Composition is neat, but MJ is so incredibly far ahead in terms of output quality that it honestly doesn't matter if you have to go and do composition manually (and with new AI-based tools, that's easier than ever too: do a bad cut-and-paste job, then infill your way back to a coherent image).
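
For the "bad cut-and-paste then infill" step, here is a minimal sketch of what that can look like with Stable Diffusion inpainting via the diffusers library (the comment doesn't name a tool; the file names and the runwayml/stable-diffusion-inpainting checkpoint are just illustrative assumptions):

    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    # Load a Stable Diffusion inpainting checkpoint (illustrative choice)
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    # The rough cut-and-paste composite, plus a white-on-black mask over the ugly seams
    init = Image.open("pasted_composite.png").convert("RGB").resize((512, 512))
    mask = Image.open("seam_mask.png").convert("RGB").resize((512, 512))

    # The model redraws only the masked regions, blending the pasted parts into one coherent image
    result = pipe(
        prompt="two men talking in an office, photo, natural lighting",
        image=init,
        mask_image=mask,
    ).images[0]
    result.save("infilled.png")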


But it didn't work? In yours there is no "Nexus", no smiling, no frowning, and the man on the right doesn't look Asian. Compared with the image in the tweet, MJ failed at this task.


That depends on what your goal was. If your goal was to get an AI model to generate copyrighted images and misunderstand the relationship between Indians and Asians, then sure, MJ failed (and I'm guessing that's the goal of the prompt).

But if I actually wanted a useful picture, I could work with what MJ gave me despite having minimal image-editing skills. The DeepFloyd result looks like it's 8-12 months behind what MJ gave and wouldn't be salvageable.


In what way does the man on the right not look like he could be from the absolutely enormous continent known as Asia?


Thing is, it's possible to just run experiments against your claim that "actually the bot was doing a good job"

https://twitter.com/eb_french/status/1651584746089218049

it wasn't.

Again, I don't know why everyone is so defensive. I love MJ. There's nothing wrong with admitting that other models might do certain things better. We all can use any model we want.


There's a certain irony in tweeting ad hominem attacks then claiming people are being defensive...


Yeah, perhaps I was not good at judging tone. To me it's a matter-of-fact thing that MJ isn't good at this. They'd admit it; it's not a big deal, and I'm a fan.

It's not ad hominem at all... MJ isn't as good at certain types of composition as others, and I don't get why people are pretending that isn't the case. I want everyone to have great models, and IF is part of that progress. Perhaps calling it "word soup" was offensive? This isn't your religion, though; it's just a model. Listening in on the MJ office hours, they're the farthest thing you can be from dogmatic or arrogant. They want to improve, as we all do. I personally am just really inspired that everyone can advance together!

Also see downthread - the first 32 images I generated attempting to reproduce the claim that "actually MJ can do this" all failed. The person who challenged me then ignored it. This isn't really up for debate until someone sends a seed where mj can do the cube + sphere thing well.


"There's a funny hn thread where people aren't good at arguing or experimentation" is an insult.


I didn't claim the bot was doing a good job. Who was this reply supposed to be aimed at?


A human interpreting the prompt would see "asian" as being in contrast to "indian" in the language of the prompt... Not a level of comprehension that can be expected of current models but maybe in a few years (months?).


I'm human. I interpreted Asian as from a non-specific part of Asia. I realise though that "Asian" has a very specific meaning in the US, but it's only the US that does this. For the rest of the world Asian means someone from Asia.


That's kind of what I meant... If a prompt specifies one "Indian" and one "Asian", that implies the writer of the prompt doesn't think of Indians as Asian, so they're probably from a US background.


There are more than two countries in Asia. Could be one Indian and one Sri Lankan, which is what the old dude looks like to me.


The thing is... IF is currently just a base model; it will need serious fine-tuning before it produces aesthetically pleasing images (like MJ certainly does).

It's interesting to see what IF can do in terms of composition, text rendering, etc. It's very promising if aesthetically pleasing images can be achieved via fine-tuning (the same happened with SD... current publicly fine-tuned models can achieve much higher levels of quality and cohesion than the base models; here's the prompt in an SD 2.1-based model: https://imgur.com/a/ELGMSmV ).

Of course, fine-tuning IF is likely more challenging, as both of the first two stages and the 4x SD upscaler might need to be fine-tuned...
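
For reference, here is a minimal sketch of those three stages as exposed in Hugging Face diffusers, assuming the public DeepFloyd/IF-I-XL-v1.0, DeepFloyd/IF-II-L-v1.0 and stabilityai/stable-diffusion-x4-upscaler checkpoints (the prompt and output file name are just placeholders):

    import torch
    from diffusers import DiffusionPipeline

    # Stage I: 64x64 pixel-space diffusion conditioned on T5 text embeddings
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
    )
    # Stage II: 64 -> 256 super-resolution, reusing Stage I's text encoder
    stage_2 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
    )
    # Stage III: the latent 4x SD upscaler, 256 -> 1024
    stage_3 = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
    )

    prompt = "a red cube on top of a green sphere"
    prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

    image = stage_1(prompt_embeds=prompt_embeds,
                    negative_prompt_embeds=negative_embeds, output_type="pt").images
    image = stage_2(image=image, prompt_embeds=prompt_embeds,
                    negative_prompt_embeds=negative_embeds, output_type="pt").images
    image = stage_3(prompt=prompt, image=image).images[0]
    image.save("if_sample.png")

Fine-tuning for aesthetics would in principle have to touch each of these checkpoints separately, which is why it's a bigger lift than fine-tuning a single SD checkpoint.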


Well... kind of. Photobashing with Midjourney doesn't guarantee you the same image, or even necessarily the objects in the same places, even if you increase the image weight value up to its maximum of two ('--iw 2').

Many times you'll have no other choice but to use a diffusion model with img2img.
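
A sketch of that img2img step with diffusers, assuming a rough photobashed composite as the starting point (the model choice, strength value and file names are illustrative):

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # A crude collage with the objects roughly where you want them
    init_image = Image.open("rough_composite.png").convert("RGB").resize((768, 512))

    result = pipe(
        prompt="two men standing face to face in an office, detailed photo",
        image=init_image,
        strength=0.45,        # low strength preserves the layout, high strength redraws more freely
        guidance_scale=7.5,
    ).images[0]
    result.save("cohesive.png")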

I agree with OP though: the market has spoken, and the vast majority of people use prompts hardly more nuanced than a 90s Mad Magazine book of Mad Libs.


I wasn't referring to photobashing; I meant firing up SD and ArtStudio for 10 minutes and getting something that looks amazing and has the desired composition.

Overall this feels like trying to get ChatGPT to do math: just let ChatGPT offload math to Wolfram.

Similarly, I'd rather just offload the composition. Now we even have SAM, which will happily pick out the parts of the image you want to compose.
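
For the SAM part, a minimal sketch with Meta's segment-anything package, assuming a downloaded ViT-H checkpoint and a hypothetical click coordinate:

    import cv2
    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    # Assumes the sam_vit_h_4b8939.pth checkpoint has been downloaded
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    image = cv2.cvtColor(cv2.imread("scene.png"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # One foreground click on the subject you want to lift out (coordinates are hypothetical)
    point = np.array([[320, 240]])
    label = np.array([1])
    masks, scores, _ = predictor.predict(
        point_coords=point, point_labels=label, multimask_output=True
    )

    # Keep the highest-scoring mask and cut the subject out for compositing elsewhere
    best = masks[np.argmax(scores)]
    cutout = image * best[..., None]
    cv2.imwrite("cutout.png", cv2.cvtColor(cutout.astype(np.uint8), cv2.COLOR_RGB2BGR))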


Spatial composition can be done easily if you stop bothering with pure text-to-image (SD has several tricks and UIs to place objects precisely; they are all janky, but they do work, and it's practically photobashing). Attribute separation is also easily done with tricks like token bucketing, so your Indian guy will look Indian and your East Asian guy will look East Asian. All of that is easy if you abandon ambiguous natural language and use higher-order guidance.

What's really required is semantic composition: making subjects meaningfully and predictably interact, or combining them. And also coherence of the overall stitched picture, so you don't end up with several different perspective planes.
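
The parent doesn't name a specific tool, but one widely used form of that kind of higher-order spatial guidance is ControlNet. Here is a minimal sketch with diffusers, assuming a hand-drawn scribble of the layout (the checkpoint names and file names are illustrative):

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # A scribble-conditioned ControlNet pins objects to the locations you drew
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    # A crude white-on-black sketch: a circle at the bottom, a square balanced on top
    layout = Image.open("cube_on_sphere_scribble.png").convert("RGB")

    image = pipe(
        "a red cube balanced on top of a green sphere, studio lighting",
        image=layout,
        num_inference_steps=30,
    ).images[0]
    image.save("composed.png")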


Can't wait to see how all of this is going to look in ten years. I know we are all nitpicking right now, but these results are totally mind-blowing already.


The thing is, all of this can go downhill as well. These tools are cool today, just like Google was a decade ago.


Even though we can run most of these locally?


He's probably referring to the _business_ of search. The B word taints a lot of things.


I wonder when we can start fuzzing brains. Wire you up to a machine that measures happiness or anxiety or anger or whatever, and keep re-generating results that home in on the given emotion.


Sometimes you have to ask yourself if that's a world you really want to live in.


So… heroin, cocaine, and methamphetamine, in memetic form?

I really hope that isn't actually possible.


Just because something can be done doesn't mean it must be done.


So it seems like Midjourney looks better and has more realistic faces, but this one is better at following the prompt and generating what you actually want.


Midjourney follows the "facing each other" better, but DF has readable text. This may be because DF also suffers from an effect that ISTR is common to MJ and SD: when you specify conflicting things it understands, it prioritizes the "visibility" one (the specific text, in this case) over the "compositional" one (in this case, "facing each other").


That's not how any of this works. What do you even mean by compositional power? Every model speaks a different "language"; comparing prompts like this has no merit and shows only that the person making the claim lacks understanding of the subject matter.


Compositional power might mean "the image more resembles the composition you want and describe"

i.e. if you say "a red cube on a green sphere" in DeepFloyd, you will get it. If you say that in MJ, you won't. That means you have more power to compose the image you want with this tool.


No, it just means you don't understand how to prompt MJ; you don't understand its language. You might like French more, but that doesn't mean it's a better language than English. MJ even says in their FAQ that their model doesn't understand language the way humans do...


The point of text-to-image models is for them to accept natural language (yes, in practice they all benefit from specialized prompting done with an understanding of model quirks, but that's not the goal).


How you prompt is a preference, like programming languages. Of course the layman might use the JavaScript of generative models because it's easier to start and there are a lot of tutorials, but some might prefer something more exotic that can produce the same or better quality. Whatever floats your boat, but don't try to compare them the way the guy in OP's tweets does.

MJ and stable also make clear that their models don't understand language like humans do.


> MJ and stable also make clear that their models don't understand language like humans do.

I believe the claim here is "and we would like them to".


For example, please help me understand how to help MJ do the prompt above!

https://twitter.com/eb_french/status/1651365078137200640

BTW I am a HUGE fan of MJ, and attend the office hours, and have done 35k+ images there. So you may have misinterpreted how much of a supporter of it I am.


First shot, but it's a starting point. With 35k images you should be able to do this yourself.

a single (green sphere), with a single (red cube), balancing ((on top))

https://imgur.com/a/QePgM6I


Interesting, I do not get the results you do. What additional parameters are you using? Here is a link to some of my tests, with all default settings, some in v5 some in v4. https://twitter.com/eb_french/status/1651370091869786112

0/16 images have a red cube on a green sphere.


None, and as an experienced user you should know that it's not one shot and most of the time not even few shot... You can't compare cherry-picked press images with a few shots of a 5-second prompt. I don't know why you want to hype something up if you can't really compare it. It seems like extreme attention grifting.

Just look at their cherry-picks in this Discord... https://discord.com/invite/pxewcvSvNx . It's overfitted on copyrighted images (the Afghan girl) and doesn't show more "compositional power" at all, most of the time ignoring half of the prompt.


> as an experienced user you should know that it's not one shot

Being "not one shot" for most nontrivial prompts is a failure of current t2i models; one-shot success is what they all strive for, and it's what DF supposedly does a lot better. And, while it's possible to spin things pretty hard when people can't bang on it themselves, I think the indication is that it is, in fact, a major leap forward from the best current consumer-available t2i models (it looks pretty comparable to Google Imagen – slightly worse benchmark scores – which is unsurprising since it seems to be an implementation of exactly the architecture described in Google's Imagen paper).

> It's overfitted on copyrighted images (the Afghan girl)

It's... not, though. Sure, the picture with a prompt which is suggestive of that (down to even specifying the same film type) gives off a vibe that completely feels, if you haven't recently looked at the famous picture but are familiar with it, like a "cleaned up" version of that picture, so you might intuitively feel it's from overfitting, that it is basically reproducing the original image with slight variations.

Look at the two pictures side by side and there is basically nothing similar about them except exactly the things specified in the prompt, and pretty much every aspect of the way those elements of the prompt are interpreted in the DF image is unlike the other image.


Are you associated with DeepFloyd?

It's not a major leap, not even a small one, because it's exactly like Imagen. It's Stability giving some Ukrainian refugees compute time to train "their" model for publicity. It's about the who and not the what, as it should be.

I "feel" nothing; I am telling it like it is. Look at the Afghan girl example again: close-up portrait, same clothing, same comp, expressive eyes... and most importantly, burn-in like every other overfitted image in diffusion networks.

You guys all want it to be something special and I get it: new content, new shiny toy. But it's neither a good architecture nor a good implementation.


> Are you associated with deepfloyd?

No, I'm not affiliated with StabilityAI.

> It's not a major leap, not even a small one, because it's exactly like Imagen.

I would agree, if Imagen were a "consumer-available t2i model". What's available is a research paper with demo images from Google. The model itself is locked up inside Google, notionally because they haven't solved filtering issues with it.

> Look at the Afghan girl example again: close-up portrait, same clothing, same comp, expressive eyes…

You look at it again, literally none of those things are the same: it's not the same clothing (the material and color of the head scarf are different, the headscarf is the only visible clothing in the DF image, whereas that is not the case in the famous image), the condition of the head scarf is different, the hair color is different, the hair style is different, the hair texture is different, the face shape is different, the individual facial features are different, the eye color is much more brown in the DF image, the facial expression is different, the DF image has lipstick and eyeshadow, the famous image has a dirty face and no makeup, the headscarf is worn differently in the two images, the background is different, the lighting is different, and the faces are framed differently.

The similarities are (1) it's a close-up portrait, (2) a general ethnic similarity, (3) they are both wearing a red (though very different red) head scarf, and (4) they are both looking straight into the camera. (2)-(4) are explicitly prompted, and (1) is strongly implied in a prompt addressing nothing that isn't related to the face/head. This isn't "overfitting on a copyrighted image"; it's getting what you prompted, with no other similarity to the existing image.

> You guys all want it to be something special and I get it,

I’m actually kind of annoyed, because I’ve been collecting tooling, checkpoints, and other support for, and spending quite a bit of time getting proficient in dealing with the quirks of, Stable Diffusion. But, that’s life.

> it’s neither a good architecture nor a good implementation.

I'd be interested in hearing your specific criticism of the architecture and implementation, but hopefully it's more grounded in fact than your criticism of the one image...


Here are 16 more images with set seeds this time. Could you provide a complete prompt & seed to generate an image where MJ does this well? https://twitter.com/eb_french/status/1651371581514579969


Please give me a prompt which would back up your claim that MJ can do this! I'd love to learn



