DeepFloyd IF: open-source text-to-image model (github.com/deep-floyd)
259 points by ea016 on April 26, 2023 | 226 comments




GitHub: https://github.com/deep-floyd/IF

Colab Notebook for running the model based on the diffusers library: https://colab.research.google.com/github/huggingface/noteboo...

Hugging Face Space for testing the model: https://huggingface.co/spaces/DeepFloyd/IF

Note that the model is substantially more compute-intensive than Stable Diffusion, so it may be slower even though that space is running on an A100.


> Note that the model is substantially more compute-intensive than Stable Diffusion, so it may be slower even though that space is running on an A100.

Any more specifics on this? Sampling has gotten better too. My primary concern is the amount of memory necessary for a generation (batch size = 1, or I guess "2" using classifier free guidance).

Similar to OpenAI's cascaded diffusion models and GLIDE, you can presumably run the models in sequence, unloading earlier models from memory to make room for the 2nd and 3rd stage models.

Right now, I really only need 256px resolution. So, will I be able to fit the first stage (64px) model in memory on its own with my 12 GB RTX 3060? What about the 2nd stage (64px -> 256px)? 3rd stage?


They’ve said the whole pipeline (using the largest model at each step where there’s a choice, IIRC) can be run sequentially in 14GB of VRAM.


HF also wrote a blog post on how you can mess around with the model in a python notebook using their excellent Diffusers library: https://huggingface.co/blog/if
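
The short version looks roughly like this (a sketch from memory of the blog's example; treat the exact model IDs and arguments as approximate and defer to the post/docs):

    import torch
    from diffusers import DiffusionPipeline

    # Stage 1: text -> 64x64
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
    stage_1.enable_model_cpu_offload()

    # Stage 2: 64x64 -> 256x256
    stage_2 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16)
    stage_2.enable_model_cpu_offload()

    # Stage 3: 256x256 -> 1024x1024 (the stock SD x4 upscaler)
    stage_3 = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16)
    stage_3.enable_model_cpu_offload()

    prompt = 'a photo of a raccoon holding a sign that says "deep floyd"'
    prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

    image = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
                    output_type="pt").images
    image = stage_2(image=image, prompt_embeds=prompt_embeds,
                    negative_prompt_embeds=negative_embeds, output_type="pt").images
    image = stage_3(prompt=prompt, image=image).images[0]
    image.save("if_result.png")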


I knew the model would have difficulty fitting into a 16GB VRAM GPU, but "you need to load and unload parts of the model pipeline to/from the GPU" is not a workaround I expected.

At that point it's probably better to write a guide on how to easily set up a VM with an A100 instead of trying to fit it into a Colab GPU.


What about people with an RTX 4090 with 24GB, or even two of them? Does it run on those?


I tried the HF Space and it generates images of 64x64 resolution, which are basically useless.


It generates 64x64 for the first stage, but there's a button to upscale your favorite 64x64 image to usable resolution.


There is a big "Upscale" button.


Example of how much better it can do compared to midjourney, on a complex prompt: https://twitter.com/eb_french/status/1623823175170805760

It is able to put people on the left/right and put the correct t-shirts and facial expressions on each one. This is compared to mj which just mixes together a soup of every word you use and plops it out into the image. Huge MJ fan of course, it's amazing, but having compositional power is another step up.


Midjourney always looks very aesthetically pleasing, I guess because of their RLHF tuning with Discord data... But it doesn't really follow prompts as well as Dall-e, for example.

But in the end, people want pretty pictures. So it's a complicated situation.


The tweet they shared is from February and uses an outdated version of MJ, this is what I got from V5: https://i.imgur.com/0uxtZDe.png

Midjourney does much better overall. Composition is neat, but MJ is so incredibly far ahead in terms of quality of output, it honestly doesn't matter if you have to go and do composition manually (and with new AI based tools, that's easier than ever too. Do a bad cut and paste job then infill your way back to a coherent image)


But it didn't work? In yours there is no "Nexus", no smiling, no frowning, and the man on the right doesn't look Asian? Compared with the image in the tweet, MJ failed at this task.


That depends on what your goal was? If your goal was to get an AI model to generate copyrighted images, and misunderstand the relationship between Indians and Asians, then sure MJ failed (and I'm guessing that's the goal of the prompt).

But if I actually wanted a useful picture, I could work with what MJ gave me despite having minimal image editing skills. The DeepFloyd result looks like it's 8-12 months behind what MJ gave and wouldn't be salvageable.


In what way does the man on the right not look like he could be from the absolutely enormous continent known as Asia?


Thing is, it's possible to just run experiments against your claim that "actually the bot was doing a good job"

https://twitter.com/eb_french/status/1651584746089218049

it wasn't.

Again, I don't know why everyone is so defensive. I love MJ. There's nothing wrong with admitting that other models might do certain things better. We all can use any model we want.


There's a certain irony in tweeting ad hominem attacks then claiming people are being defensive...


Yeah perhaps I was not good at judging tone. To me it's a matter of fact thing that MJ isn't good at this. They'd admit it, it's not a big deal and I'm a fan.

It's not ad hominem at all... mj isn't as good at certain types of composition as others. I don't get why people are pretending that isn't the case. I want everyone to have great models and IF is part of that progress. Perhaps calling it "word soup" was offensive? This isn't your religion, though, it's just a model. Listening in on the MJ office hours they're the farthest thing you can be from dogmatic or arrogant. They want to improve as we all do. I personally am just really inspired that everyone can advance together!

Also see downthread - the first 32 images I generated attempting to reproduce the claim that "actually MJ can do this" all failed. The person who challenged me then ignored it. This isn't really up for debate until someone sends a seed where mj can do the cube + sphere thing well.


"There's a funny hn thread where people aren't good at arguing or experimentation" is an insult.


I didn't claim the bot was doing a good job. Who was this reply supposed to be aimed at?


A human interpreting the prompt would see "asian" as being in contrast to "indian" in the language of the prompt... Not a level of comprehension that can be expected of current models but maybe in a few years (months?).


I'm human. I interpreted Asian as from a non-specific part of Asia. I realise though that "Asian" has a very specific meaning in the US, but it's only the US that does this. For the rest of the world Asian means someone from Asia.


That's kind of what I meant... If a prompt specifies one "Indian", and one "Asian", that implies that the writer of the prompt doesn't think of Indian as Asian so probably from the US background.


There are more than two countries in Asia. Could be one Indian and one Sri Lankan, which is what the old dude looks like to me.


The thing is... IF is currently just a base model, it will need serious fine-tuning before it will produce aesthetically pleasing images (like MJ certainly does).

It's interesting to see what IF can do in terms of composition, text rendering etc, it's very promising if aesthetically pleasing images can be achieved via fine-tuning (the same happened with SD... current publicly fine-tuned models can achieve much higher levels of quality and cohesion than the base models, here's the prompt in an SD2.1 based model: https://imgur.com/a/ELGMSmV ).

Of course, fine-tuning IF is likely more challenging, as both of the first two stages and the 4x SD upscaler might need to be fine-tuned...


Well... Kind of, photobashing with midjourney doesn't guarantee you the same image or even necessarily the objects in the same places, even if you increase the image weight value up to its maximum of two. ('--iw 2')

Many times you'll have no other choice but to use a diffusion model with img2img.

I agree with OP though, the market has spoken and the vast majority of people use prompts hardly more nuanced than a 90s Mad Magazine book of Mad Libs.


I wasn't referring to photobashing, I meant firing up SD and ArtStudio for 10 minutes and getting something that looks amazing and has the desired composition.

Overall this feels like trying to get ChatGPT to do math: just let ChatGPT offload math to Wolfram.

Similarly I'd rather just offload the composition. Now we even have SAM which will happily pick out the parts of the image you want to compose


Spatial composition can be done easily, if you stop bothering with pure text-to-image (SD has several tricks and UIs to place objects precisely, they are all janky but they do work, that's practically photobashing). Attribute separation is also easily done with tricks like token bucketing, so your Indian guy will look Indian, and your East Asian guy will look East Asian. All of that is easy if you abandon the ambiguous natural language and use higher-order guidance.

What's really required is semantic composition. Making subjects meaningfully and predictably interact, or combining them together. And also the coherence of the overall stitched picture, so you don't end up with several different perspective planes.


Can't wait to see what all of this is going to look like in ten years. I know we are all nitpicking right now, but these results are totally mindblowing already.


The thing is all these things can go downhill as well. All these things are cool today just like Google was a decade ago.


Even though we can run most of these locally?


He's probably referring to the _business_ of search. The B word taints a lot of things.


I wonder when we can start fuzzing brains. Wire you up to a machine that measures happiness or anxiety or anger or whatever, and keep re-generating results that home in on the given emotion.


Sometimes you have to ask yourself if that's a world you really want to live in.


So… heroin, cocaine, and methamphetamine, in memetic form?

I really hope that isn't actually possible.


Just because something can be done doesn't mean it must be done.


So it seems like Midjourney looks better and has more realistic faces but this one is better at following the prompt and generating what you actually want


Midjourney follows the “facing each other” better, but DF has readable text. This may be due to DF also suffering from an effect that ISTR is common to MJ and SD: specifying conflicting things it understands leads to prioritizing the “visibility” one (the specific text, in this case) over the “compositional” one (in this case, “facing each other”.)


That's not how any of this works. What do you even mean by "compositional power"? Every model speaks a different "language"; comparing prompts like this has no merit and only shows that the person making the claim lacks understanding of the subject matter.


Compositional power might mean "the image more resembles the composition you want and describe"

i.e. if you say "a red cube on a green sphere" in DeepFloyd, you will get it. If you say that in MJ, you won't. That means you have more power to compose the image you want with this tool.


No, it just means you don't understand how to prompt MJ; you don't understand its language. You might like French more, but that doesn't mean it's a better language than English. MJ even says in their FAQs that their model doesn't understand language like humans do...


The point of text to image model is for them to accept natural language (yes, in practice, they all benefit from specialized prompting done with an understanding of model quirks, but that’s not the goal.)


The way you prompt is a preference, like programming languages. Of course the layman might use the JavaScript of generative models because it's easier to start and there are a lot of tutorials, but some might prefer something more exotic which can produce the same or better quality. Whatever floats your boat, but don't try to compare them like the guy in OP's tweets.

MJ and stable also make clear that their models don't understand language like humans do.


> MJ and stable also make clear that their models don't understand language like humans do.

I believe the claim here is "and we would like them to".


For example, please help me understand how to help MJ do the prompt above!

https://twitter.com/eb_french/status/1651365078137200640

BTW I am a HUGE fan of MJ, and attend the office hours, and have done 35k+ images there. So you may have misinterpreted how much of a supporter of it I am.


First shot, but it's a starting point. With 35k images you should be able to do this yourself.

a single (green sphere), with a single (red cube), balancing ((on top))

https://imgur.com/a/QePgM6I


Interesting, I do not get the results you do. What additional parameters are you using? Here is a link to some of my tests, with all default settings, some in v5 some in v4. https://twitter.com/eb_french/status/1651370091869786112

0/16 images have a red cube on a green sphere.


None, and as an experienced user you should know that it's not one shot, and most of the time not even few shot... You can't compare cherry-picked press images with a few shots of a 5-second prompt. I don't know why you want to hype something up if you can't really compare it. It seems extremely attention-grifting.

Just look at their cherry picks in this discord... https://discord.com/invite/pxewcvSvNx . It's overfitted on images with copyright (afghan girl) and doesn't show more "compositional power" at all, most of the time ignoring half of the prompt.


> as an experienced user you should know that it's not one shot

Being "not one shot" for most nontrivial prompts is a failure of current t2i models; it's what they all strive for and it's what DF supposedly does a lot better. And, while it's possible to spin things pretty hard when people can't bang on it themselves, I think the indication is that it is, in fact, a major leap forward from the best current consumer-available t2i models (it looks pretty comparable to Google Imagen – slightly worse benchmark scores – which is unsurprising, since it seems to be an implementation of exactly the architecture described in Google's Imagen paper).

> It’s overfitted on images with copyright (afghan girl)

It’s…not, though. Sure, the picture with a prompt which is suggestive of that (down to even specifying the same film type) gives off a vibe that completely feels, if you haven’t recently looked at the famous picture but are familiar with it, like a “cleaned up” version of that picture, so you might intuitively feel it’s from overfitting, that it is basically reproducing the original image with slight variations.

Look at the two pictures side-by-side and there is basically nothing similar about them except exactly the things specified in the prompt, and pretty much every aspect of the way that those elements of the prompt are interpreted in the DF image is unlike the other image.


Are you associated with deepfloyd?

It's not a major leap not even a small one because it's exactly like imagen. It's stability giving some Ukrainian refugees compute time to train "their" model for publicity. It's about the whom and not what as it should be.

I "feel" nothing, I am telling it how it is. Look at the afghan girl example again. Close up portrait, same clothing, same comp, expressive eyes... and most important, burn-in like every other overfitted image in diffusion networks.

You guys all want it to be something special and I get it, new content, new shiny toy but it's neither a good architecture nor a good implementation.


> Are you associated with deepfloyd?

No, I’m not affiliated with StabilityAI

> It’s not a major leap not even a small one because it’s exactly like imagen.

I would agree, if imagen was a “consumer-available t2i model”. What’s available is a research paper with demo images from Google. The model itself is locked up inside Google, notionally because they haven’t solved filtering issues with it.

> Look at the afghan girl example again. Close up portrait, same clothing, same comp, expressive eyes…

You look at it again, literally none of those things are the same: it’s not the same clothing (the material and color of the head scarf is different, the headscarf is the only visible clothing in the DF image, whereas that is not the case in the famous image), the condition of the head scarf is different, the hair color is different, the hair style is different, the hair texture is different, the face shape is different, the individual facial features are different, the eye color is much more brown in the DF image, the facial expression is different, the DF image has lipstick and eyeshadow, the famous image has a dirty face and no makeup, the headscarf is worn differently in the two images, the background is different, the lighting is different, and the faces are framed differently.

The similarities are (1) it’s a close up portrait, (2) a general ethnic similarity, and (3) they are both wearing a red (though very different red) head scarf, (4) and they are both looking straight into the camera. (2)-(4) are explicitly prompted, (1) is strongly implied in the prompt addressing nothing that isn’t related to the face/head. This isn’t “overfitting on a copyright image”, it’s getting what you prompt, with no other similarity to the existing image.

> You guys all want it to be something special and I get it,

I’m actually kind of annoyed, because I’ve been collecting tooling, checkpoints, and other support for, and spending quite a bit of time getting proficient in dealing with the quirks of, Stable Diffusion. But, that’s life.

> it’s neither a good architecture nor a good implementation.

I’d be interested in hearing your specific criticism of the architecture and implementation, but hopefully it’s more grounded in fact than your criticism of the one image...


Here are 16 more images with set seeds this time. Could you provide a complete prompt & seed to generate an image where MJ does this well? https://twitter.com/eb_french/status/1651371581514579969


Please give me a prompt which would back up your claim that MJ can do this! I'd love to learn


Has anyone tried the Scott Alexander AI bet prompts?

1. A stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth

2. An oil painting of a man in a factory looking at a cat wearing a top hat

3. A digital art picture of a child riding a llama with a bell on its tail through a desert

4. A 3D render of an astronaut in space holding a fox wearing lipstick

5. Pixel art of a farmer in a cathedral holding a red basketball


Yes, I tried them here on an earlier version of IF: https://twitter.com/eb_french/status/1618354180577714176


I thought it was pretty definitive at the time, but when you look really closely (as Scott's opponent is likely to do), it didn't seem like a clear win yet. But that was 3 months ago, and hopefully DF is even better now.


where are these prompts from?


Scott Alexander made a bet with those prompts here: https://astralcodexten.substack.com/p/a-guide-to-asking-robo...

And followed up with this article when he won the bet: https://astralcodexten.substack.com/p/i-won-my-three-year-ai...


New restriction in their License suggests the software can't be modified.

"2. All persons obtaining a copy or substantial portion of the Software, a modified version of the Software (or substantial portion thereof), or a derivative work based upon this Software (or substantial portion thereof) must not delete, remove, disable, diminish, or circumvent any inference filters or inference filter mechanisms in the Software, or any portion of the Software that implements any such filters or filter mechanisms."


As someone who's largely "OK" with morality clauses in otherwise liberal AI licenses, I think we should start calling these "weights-available" models to distinguish from capital-F Free Software[1] ones.

I'm starting to get irritated by all these 'non-commercial' licensed models, though, because there is no such thing as a non-commercial license. In copyright law, merely having the work in question is considered a commercial benefit. So you need to specify every single act you think is 'non-commercial', and users of the license have to read and understand that. Even Creative Commons' NC clause only specifies one; they say that filesharing is not commercial. So it's just a fancy covenant not to sue BitTorrent users.

And then there's LLaMA, whose model weights were only ever shared privately with other researchers. Everyone using LLaMA publicly is likely pirating it. Actual weights-available or Free models already exist, such as BLOOM, Dolly, StableLM[0], Pythia, GPT-J, GPT-NeoX, and CerebrasGPT.

[0] Untuned only; the instruction-tuned models are frustratingly CC-BY-NC-SA because apparently nobody made an open dataset for instruction tuning.

[1] Insamuch as an AI model trained on copyrighted data can even be considered Free.


Then by definition it isn't open source, violating points 3, 4, and 6 of the open source definition. https://opensource.org/osd/


Yep. It's getting really exhausting seeing projects falsely advertising themselves as "open source". Either be FOSS or don't be; don't pretend to be while using some nonsense like the BSL or whatever adhocery is in play here.


In the README they even call it "Modified MIT", the modification being where they turned it from a very permissive license into a fully proprietary one. Very cool model though.


Well, the Apache Software Foundation were rightly getting annoyed at 'modified Apache' licences...


I bet it's a new benchmark for the labs to get funding


>New restriction in their License suggests the software can't be modified.

To remove filters.

"Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:"


You can't remove the filters per the license, but the weights will be available soon and so anyone can just reimplement this code using the weights



There is a similar license clause for the weights[0] as well, so I'm not sure this would apply unless you write the code and train your model from scratch.

[0] https://github.com/deep-floyd/IF/blob/main/LICENSE-MODEL#L54


Or unless, as seems to be fairly widely expected but untested, model weights are not actually copyrightable, so model licenses are superfluous.


> New restriction in their License suggests the software can't be modified.

It can be modified. That just says it can't be modified to bypass their filters.


Neither the source code nor the weights are open source... This is actually worse than Stability AI's previous offering, in that regard.


They are technically open source. It's just that the model license prohibits commercial use and the code license prohibits bypassing the filters. So it's kind of worse than closed source in a way because it's like a tease. With no API apparently.

Theoretically large companies or rich people might be able to make a licensing agreement.


> They are technically open source. It's just that the model license prohibits commercial use and the code license prohibits bypassing the filters.

Your second sentence contradicts the first. Prohibiting commercial use and prohibiting modification are each, in and of themselves, mutually exclusive with being "technically open source" (let alone both at the same time).


I am a lawyer, and as flimsy and wishy-washy as the term "open-source" already is, I can't even fathom what is meant by "open source" here?

Are people suggesting that "look at the code but don't touch" actually fits what some people think of as open source?


NOT a lawyer here, but basically they mean: here's the thing, please don't sue me if you mess it up.


That's weird, because we do have extremely standard language to take care of that: warranty disclaimers.


> model license prohibits commercial use

I thought that at first, but I think it only prohibits commercial use that breaks regional copyright or privacy laws.


That's already prohibited by, you know, those very same copyright and privacy laws. Adding those same prohibitions to the license not only makes the software nonfree, but pointlessly does so.


It’s not pointless: it means the model licensor has a claim against you, as well as whoever would for violating the referenced laws; it also means, and this is probably more important, that in some jurisdictions, the model licensor has a better defense against liability for contributory infringement if the licensee infringes.

EDIT: That said, it’s unambiguously not open source.


> it means the model licensor has a claim against you

Right, but to what end? The only reason the licensor should care one way or another is the licensor being held liable for what folks do with the software, in which case...

> it also means, and this is probably more important, that in some jurisdictions, the model licensor has a better defense against liability for contributory infringement if the licensee infringes.

Do hardware stores need to demand "thou shalt not use this tool to kill people" to their customers to avoid liability for axe murders under such jurisdictions? Or car manufacturers needing to specify "you will not use this product to run over schoolchildren at crosswalks"?

Like, I'm sure such jurisdictions exist, but I somehow doubt license terms in an EULA nobody (except for us nerds) will ever read would be sufficient in such a kangaroo court.

(EDIT: also, I'm pretty sure the standard warranty disclaimer in your average FOSS license already covers this, without making the software nonfree in the process)


> Do hardware stores need to demand "thou shalt not use this tool to kill people" to their customers to avoid liability for axe murders under such jurisdictions?

Generally, not, because vicarious liability for battery and wrongful death doesn’t work like, e.g., contributory copyright infringement.

> I'm pretty sure the standard warranty disclaimer in your average FOSS license already covers this

No, warranty disclaimers don’t cover this, because (1) it’s not a warranty issue, and (2) disclaimers, if they have legal effect at all, affect liability the disclaiming party would otherwise have to the party accepting the disclaimer, not liability the disclaiming party would have to third parties.


> Generally, not, because vicarious liability for battery and wrongful death doesn’t work like, e.g., contributory copyright infringement.

Judging by youtube-dl, it seems like it does work that way, at least in my jurisdiction; I guess we'll see if the RIAA doubles down on trying to wipe it from the face of the Earth, but considering there hasn't been much noise, I wouldn't count on it. Also, to my adjacent point, I highly doubt the RIAA would've refrained from attempting to take down youtube-dl even if youtube-dl's license prohibited its users from circumventing DRM with it.


> The only reason the licensor should care one way or another is the licensor being held liable for what folks do with the software, in which case...

There are current open cases of people claiming “harm” for misinformation spouted by ChatGPT where it just makes up facts to satisfy user prompts.

There are current open cases where people are claiming “copyright violation” due to diffusion models satisfying user prompts.

AFAICT none of these cases are against the users who are prompting the models.


It prohibits both commercial use, whether or not you break regional laws; and it prohibits breaking certain laws. As another user said, encoding the law into a licence is pointless but makes it non-free.

There are also problematic restrictions on your ability to modify the software under clause 2(c). Nor do you have the right to sublicence; it's not clear to me what rights somebody has if you give them a copy.


Where does it prohibit commercial use? I’m not seeing that in the license.


The model license, 1(a) and 2(a)(i).


Ah I guess I read it right then I reread it wrong, my bad and thanks for the pointer! That's a shame but hopefully it's released in a more open way in the future. My interest is in building good collaborative interfaces (and games!) on top of these things.


That'll change when the full non-research release occurs... https://twitter.com/EMostaque/status/1651328161148174337


That tweet is vague. Besides, it says 'like SD', so I will be pleasantly shocked if the models are open source.


For anyone who doesn't know, DeepFloyd is a StableDiffusion style image model that more or less replaced CLIP with a full LLM (11b params). The result is that it is much better at responding to more complex prompts.

In theory, it is also smarter at learning from its training data.


>StableDiffusion style

Not really, it's a cascaded diffusion model conditioned on the T5 encoder, there is nothing really in common, unless you mean that using a diffusion model is "SD style".


It isn't like Stable Diffusion, it's more like Google's Imagen model.


> It isn’t like Stable Diffusion, it’s more like Google’s Imagen model.

Yeah, it looks exactly (architecturally) like Imagen.

Google would be running circles around everyone in Generative AI (maybe OpenAI would still have a better core LLM, maybe, but portfolio-wise) if they simply had the ability to cross the gap between building technologies and writing up research papers on them and actually releasing products.



It looks like the model on Hugging Face either hasn't been published yet or was withdrawn. I got this error in their Colab notebook:

OSError: DeepFloyd/IF-I-IF-v1.0 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models' If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.


You need to accept the license on the HuggingFace model card.



It's working now.


it doesn't seem like they have anything published https://huggingface.co/DeepFloyd


I swear I saw it a few minutes ago but I might be crazy.


Same got the weights on gdrive.


Everyone is waiting for you to share the weights if it's even true that you saved them.


could you link them ?


Did they just take down the whole model?


Pleaseeeeeeee


Wow this does so well on text! The original model struggled a lot, it's impressive to see how far they've come.


It's much better, but it's not perfect. Here's what I got for:

> a photograph of raccoon in the woods holding a sign that says "I will eat your trash"

https://twitter.com/simonw/status/1651994059781832704


This is not a problem. The sign was clearly made by the raccoon.


Agreed - to properly test it, we should try:

A photograph of an English professor in the woods holding a sign that says "I will eat your trash"


It actually adds quite a nice charm by not wording it correctly.


I'm quite curious how much of the improvement on text rendering is from the switch to pixel-space diffusion vs. the switch to a much larger pretrained text encoder. I'm leaning towards the latter, which then raises the question of what happens when you try training Stable Diffusion with T5-XXL-1.1 as the text encoder instead of CLIP — does it gain the ability to do text well?


DeepFloyd IF is effectively the same architecture/text encoder as Imagen (https://imagen.research.google/), although that paper doesn't hypothesize why text works out a lot better.


Right, I'm aware of the Imagen architecture, just curious to see further research determining which aspect of it is responsible for the improved text rendering.

EDIT: According to the figure in the Imagen paper FL33TW00D's response referred me to, it looks like the text encoder size is the biggest factor in the improved model performance all-around.


The CLIP text encoder is trained to align with the pooled image embedding (a single vector), which is why most text embeddings are not very meaningful on their own (but still convey the overall semantics of the text). With T5 every text embedding is important.
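
You can see the difference directly in transformers (a toy sketch; the checkpoints and shapes are illustrative, and T5-XXL is a very large download):

    from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

    prompt = ["a red cube on a green sphere"]

    # CLIP text encoder (SD 1.x style): trained so the *pooled* vector aligns with the image
    clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
    out = clip_enc(**clip_tok(prompt, padding="max_length", return_tensors="pt"))
    print(out.last_hidden_state.shape)  # per-token states, e.g. [1, 77, 768]
    print(out.pooler_output.shape)      # the single pooled vector, e.g. [1, 768]

    # T5 encoder (IF / Imagen style): every token embedding conditions the UNet
    t5_tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
    t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
    print(t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state.shape)  # e.g. [1, seq_len, 4096]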


Check out figure 4A from the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf

From that - I would strongly suspect the answer to your question to be yes.


Ah, yes, this seems to be pretty strong evidence. Thanks for pointing that figure out to me!


It is most likely due to the text encoder - see "Character-Aware Models Improve Visual Text Rendering". https://arxiv.org/abs/2212.10562


That alone is a huge milestone.


I think this model will result in a massive new wave of meme culture. AI's already seen success in memes up to this point, but the ability for readable text to be incorporated into images totally changes the game. Going to be an interesting next few months on the interwebz, that's for sure. Exciting times!


"Hi! I'm B-19-7, but to everyperson I'm called Floyd." -Planetfall (1983)

My first thought on seeing "Floyd" and "IF" together. It looks like a Pink Floyd reference from the About page on https://deepfloyd.ai/ though.


This could be super cool for logos. I've tried using Stable Diffusion to generate logos and it does pretty well at helping brainstorm, but the text is always gibberish, so you can use its ideas but you have to add your own text, which basically means creating a logo from scratch using its designs as inspiration.


Well, surely you don't expect to just take the generated stuff and use it right away? For logos you usually need something that some entity can hold the claim on, so you probably need a human touch in any case.


MidJourney's v5 model did a great job producing our logo https://summer.ai/static/nearby/android-chrome-192x192.png

Admittedly it was one of several hundred I had it spit out for me. But the design was completely original and caught me by surprise.

I tried making modifications but everyone kept telling me they prefer the original exactly as MidJourney made it!


Authorship is not a requirement for trademark, though having copyright as well is nice.


You can't use this to make logos for any commercial product, and it's not safe to use it for hobby projects either, based on the current model license.


> You can't use this to make logos for any commercial product

Yeah, good luck figuring out that a particular logo was generated with this particular model. And if someone does, good luck doing anything about it.

With this amount of fear one wouldn't dare to cross a road without three layers of bubble wrap, plus written authorisation from a lawyer plus a feasibility study from a traffic engineer.


You've never gone through an acquisition or due diligence, have you?


Clearly.

Explain to me what will happen with the generated logo during an acquisition or due diligence.


You'll be asked to produce documents verifying your ownership and/or compliance with any licenses for all intellectual property. That includes code and graphics (logos).


Sure. And what documents and paperwork do you expect if I, the owner of the company, had drawn the logo myself with a bit of crayon/inkscape/gimp/blender? Those exact documents will be produced.


I would expect a lawsuit if you gave them to investors, and jail time if it was part of an IPO. Keep dragging down that bar in tech, I guess.


You also cannot just ingest people's media to train a for-profit AI image generation service, and well, here we are.

PS: not an AI apologist, just pointing out the irony. Feels like those fan Sonic characters: "original content do not steal".



Try controlNet with stable diffusion, I’ve been having good results for text.


The examples on the README are extremely compelling; the state of the art has been raised yet again.


The current license makes this largely unusable for nearly any purpose. Really disappointing release from SAI.


I think this is just for the pre-release, and they will release it fully licensed for commercial use. It doesn't make sense to have a model like this that can do game-changing text and logos... but then not license it for commercial use. If they don't, that would be ridiculous.


> Text

> Hands

good god it solves the two biggest meme issues with image models in one go. Will this be the new state of the art every other model is compared to?


There's fundamental tradeoffs, as there always will be when you're compressing things into an image model.

So, this is going to have new different issues. Since it's similar to Imagen, it probably can't handle long complex prompts as well, since they developed Parti afterward.

Here's my question: are there any image models where, if you prompt "1+1", you get an image showing "3"?


> So, this is going to have new different issues.

Well, yeah, it’s a bigger set of models (particularly the language model) that takes more resources (both to train and for inference). That’s the tradeoff.

> Here’s my question: are there any image models where, if you prompt “1+1”, you get an image showing “3”?

You want a t2i model that does arithmetic in the prompt, translates it to “text displaying the number <result>”, but also does the arithmetic wrong?

Yeah, I don’t think that combination of features is in any existing model or, really, in any of the datasets used for evaluation, or otherwise on anyone’s roadmap.


Pretend I wrote 2, edit timeout closed.

"Actually thinking about your prompt" is a necessary part of being able to make the prompts natural language instead of a long list of fantasy google image search terms.

Useful example being "my bedroom but in a new color", but some things I've typed into Midjourney that don't work include "a really long guinea pig" (you get a regular size one), "world's best coffee" (the coffee cup gets a world on it), etc. It's just too literal.

And yes, preprocessing with an LLM could do this.


I don't think they're saying that's a goal, I think they're curious if it is the case. LLMs are bad at arithmetic, this uses a LLM to process the prompt, that class of result seems plausible.


We already knew those were going to be solved by scale (like using T5 instead of the really small, bad text encoder SD used), because they were solved by Imagen etc.


There are good reasons to believe that this will be the new state of the art by a comfortable margin. Hard to know until we can actually play with it.


I understand. I have two decade-old Nvidia 1080 cards; can we run inference and train IF on them?


16GB VRAM minimum is a bit steep. Sadly excludes my 3080 which is annoying because I'd like something better than Stable Diffusion locally.


There's a note which suggests you might be able to get by on lower. My 3060 struggles with SD on the defaults, but works fine with float16.

There are multiple ways to speed up the inference time and lower the memory consumption even more with diffusers. To do so, please have a look at the Diffusers docs:

        Optimizing for inference time [1]
        Optimizing for low memory during inference [2]

[1] https://huggingface.co/docs/diffusers/api/pipelines/if#optim...

[2] https://huggingface.co/docs/diffusers/api/pipelines/if#optim...
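
For reference, the knobs look roughly like this (a sketch; check those docs for what each pipeline actually supports):

    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)  # fp16 halves weight memory

    # Move whole sub-models (text encoder, UNet, ...) to the GPU only while they're running
    pipe.enable_model_cpu_offload()

    # Even more aggressive: stream weights to the GPU piece by piece (slowest, lowest VRAM)
    # pipe.enable_sequential_cpu_offload()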


If you don't mind the power consumption I noticed that older nvidia P6000's (24GB) are pretty cheap on ebay! My 16GB P5000 is pretty handy for this stuff.


An M40 24GB is less than $200, if you don't mind the trouble of getting its drivers installed, cooling it, etc. It's also important to note that your motherboard must support larger VRAM addressing; many older chipsets won't be able to boot with it (i.e. some, perhaps almost all, Zen 1-era boards).


Looks like the P6000 24GB goes for $800-$1200, while you can get a superior 3090 24GB for $800-$1000.


4090s are only $1600 or so now, for that matter.


oh! My mistake thanks for letting me know.


IIRC SD’s first party code had significantly higher requirements than is possible with some of the downstream optimizations most people are using, which have lower speed and more system RAM use but reduce peak VRAM use.

The architecture here looks different, but the code is licensed in a way which still makes downstream optimization and redistribution possible, so maybe there will be something there.


Once these are quantized (I assume they can be), they should be ~1/4th the size.

Can anyone explain why it needs so much ram in the first place though? 4.3B is only ~9GB at 16bit (I'm not as familiar with image models).

I'm really happy to see that fits under 24GB - that's what I consider the limit for being able to run on "consumer hardware".


They took down the blogpost, but from what I remember the model is composite and consists of a text encoder as well as 3 "stages":

1. (11B) T5-XXL text encoder [1]

2. (4.3B) Stage 1 UNet

3. (1.3B) Stage 2 upscaler (64x64 -> 256x256)

4. (?B) Stage 3 upscaler (256x256 -> 1024x1024)

Resolution numbers could be off though. Also the third stage can apparently use the existing stable diffusion x4, or a new upscaler that they aren't releasing yet (ever?).

> Once these are quantized (I assume they can be)

Based on the success of LLaMA 4bit quantization, I believe the text encoder could be. As for the other modules, I'm not sure.

edit: the text encoder is 11B, not 4.5B as I initially wrote.

[1]: https://huggingface.co/google/t5-v1_1-xxl
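
Back-of-the-envelope, assuming fp16 (2 bytes/param) and that only T5's encoder half is loaded (both assumptions of mine):

    sizes_b = {"t5_xxl_encoder": 4.8, "stage1_unet": 4.3, "stage2_unet": 1.3}  # params in billions (rough)
    for name, b in sizes_b.items():
        print(f"{name}: ~{b * 2:.1f} GB of weights in fp16")
    # t5_xxl_encoder: ~9.6 GB, stage1_unet: ~8.6 GB, stage2_unet: ~2.6 GB
    # ...plus activations, which is why loading everything at once blows past 16 GB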


You'll be able to optimize it a lot to make it fit on small systems if you are willing to modify your workflow a bit: instead of 1 prompt -> 1 image _n_ times, do 1 prompt -> _n_ images 1 time -> _m_ times... For a given prompt, run it through the T5 model and store; you can do that in CPU RAM if you have to because you only need the embedding once so you don't need a GPU which can run T5-XXL naively. Then you can get a large batch of samples from #2; 64px is enough to preview; only once you pick some do you run through #3, and then from those through #4. Your peak VRAM should be 1 image in #2 or #4 and that can be quantized or pruned down to something that will fit on many GPUs.
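
A sketch of that flow with diffusers (untested; the 8-bit text-encoder trick is the one from the HF blog post, and names/arguments may be slightly off):

    import gc
    import torch
    from diffusers import DiffusionPipeline
    from transformers import T5EncoderModel

    prompt = "a red cube on a green sphere"

    # 1) Encode the prompt once, with only the (8-bit) text encoder loaded, and store the result
    text_encoder = T5EncoderModel.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder",
        device_map="auto", load_in_8bit=True, variant="8bit")
    pipe = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", text_encoder=text_encoder, unet=None)
    prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
    torch.save({"p": prompt_embeds, "n": negative_embeds}, "embeds.pt")

    del pipe, text_encoder
    gc.collect()
    torch.cuda.empty_cache()

    # 2) Sample a batch of 64px previews from stage 1 using only the stored embeddings
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16)
    stage_1.enable_model_cpu_offload()
    previews = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
                       num_images_per_prompt=8, output_type="pt").images

    # 3) Only the previews you actually like get loaded back through stage 2 (and then stage 3)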


The entire T5-XXL model is 11B but you don't need the decoder.


>Can anyone explain why it needs so much ram in the first place though?

The T5-XXL text encoder is really large. Also, we do not quantize the UNets: the UNet outputs 8-bit pixels, so quantizing the UNet to that precision would create pretty bad outputs.


> Gorbachev holding meatball pasta in both hands. 1980s synth futuristic max headroom aesthetic. Neon lights.

> Aristotle in ancient greek clothes. Toga. New york, rain, film noir, fog, art deco, neon lights, blade runner sci fi

Seems to be holding up reasonably well with the first prompt. The second was only OK.


So this one can create perfect text in images? If true, that’s insane


LDM-400M was already able to generate text (predecessor of Stable Diffusion), thanks to the fact that every token in the text encoder (trained from scratch) was available in the attention layer.


>thanks to the fact that every token in the text encoder (trained from scratch) was available in the attention layer.

>ChatGPT explain this like I'm 5


“Every word in the text can be used to help create the image.”


Interesting, there are different models: https://github.com/deep-floyd/IF#-model-zoo-

I'm also very happy about the release of the two upscalers; I can use them to upscale the results of my small 64x64 DDIM models (maybe with some finetuning).


I would be more interested in image-to-text models. Does someone know of any decent model? I saw the GPT4 demo, and they showed that they do image-to-text... but then that was actually a fake (i.e., the model was interpreting the image filename).


Can you provide a source on gpt4 image model being fake? I haven’t heard that before, though I have wondered why I haven’t heard anything about the image part and don’t have access to image processing myself.


CLIP Interrogator[1], which is also built into AUTOMATIC1111, gives quite reasonable results, at least if all you need is a prompt; it can't handle complex interactions:

Image: https://i.imgur.com/husplYZ.png

Output: "a white horse with a sign that says rexel's in space, pixelperfect, inspired by Paul Kelpe, official simpsons movie artwork, alternate album cover, in style of nanospace, by Apelles, pickles, pespective, pop surrealism, ingame, in a space cadet outfit, sifi"

[1] https://huggingface.co/spaces/pharma/CLIP-Interrogator
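
If you'd rather run it locally than on the Space, the pip package is simple enough (sketch from memory; check the repo for the current API):

    from PIL import Image
    from clip_interrogator import Config, Interrogator

    ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))  # the SD 1.x-flavoured CLIP
    image = Image.open("horse.png").convert("RGB")
    print(ci.interrogate(image))  # prompt-style caption like the one above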


For a fast-but-less-robust model, you can use a ViT encoder/GPT-2 decoder model: https://huggingface.co/nlpconnect/vit-gpt2-image-captioning

For a more-robust-but-hard-to-run model, you can use BLIP2: https://huggingface.co/Salesforce/blip2-opt-2.7b
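
Both load straight from transformers; the ViT/GPT-2 one is small enough to try on CPU (sketch):

    from transformers import pipeline

    captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
    print(captioner("photo.jpg"))  # e.g. [{'generated_text': 'a horse standing in a field'}]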


AFAIK, converting an image to a text summary isn't really a thing by itself. The related work would be "visual reasoning" which is the ability to ask things about the image in natural language and get responses back also in natural language.

I believe the current SOTA test for NLVR is VQAv2[0] or GQA[1].

0: https://visualqa.org/ 1: https://arxiv.org/pdf/1902.09506.pdf


This is text + image -> text but pretty cool and still might be of interest to you:

https://llava-vl.github.io


MidJourney has the describe function which is kind of like that. Not sure how decent it actually is


Is this intended to replace Stable Diffusion? Somebody want to give the eli5?


This does outperform Stable Diffusion 2.1, but uses a different architecture and requires more memory and compute. Stable Diffusion runs its denoising process in a compressed "latent space" which is how it was able to be so compute-efficient compared to other diffusion models. It also uses the (relatively) small text encoder from OpenAI's CLIP model to encode user prompts. Both of these optimizations meant that it could run much faster compared to say, DALLE or Imagen, but it didn't follow complicated user prompts especially well and had trouble with things like counting and text-rendering.

DeepFloyd IF is based on Google's Imagen model, which has two key differences from Stable Diffusion: (1) it denoises in pixel space instead of a compressed latent space, and (2) it uses a 10x larger pretrained text encoder (T5-XXL-1.1) compared to SD's CLIP encoder. (1) allows it to better render high-frequency details and text, and (2) allows it to understand complex prompts much better. These improvements come at the cost of multiple times more memory usage and compute requirements compared to SD, though.
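
In concrete terms (a toy illustration of the tensor shapes involved, not real model code; the 512px figure assumes SD 1.x):

    import torch

    # Stable Diffusion: the UNet denoises a small compressed latent, then a VAE decodes it to pixels
    sd_latent = torch.randn(1, 4, 64, 64)     # what the SD UNet actually works on
    # vae.decode(sd_latent) -> (1, 3, 512, 512) pixels in a single decoding step

    # DeepFloyd IF / Imagen: denoise pixels directly, then cascade diffusion upscalers
    if_base = torch.randn(1, 3, 64, 64)       # 64px RGB output of the base model
    # stage 2 upscaler: (1, 3, 64, 64)  -> (1, 3, 256, 256)
    # stage 3 upscaler: (1, 3, 256, 256) -> (1, 3, 1024, 1024)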

In terms of "will it replace SD?"—in the short term I think yes. But I still think latent diffusion models are the future. For example, Stability is gearing up to release Stable Diffusion XL right now, a larger version of the original SD that does higher fidelity and higher resolution generations. I wouldn't be surprised if it takes the crown back from DeepFloyd when it releases, but I guess we'll have to see.


> For example, Stability is gearing up to release Stable Diffusion XL right now, a larger version of the original SD that does higher fidelity and higher resolution generations. I wouldn't be surprised if it takes the crown back from DeepFloyd when it releases, but I guess we'll have to see.

SDXL is available on StabilityAI’s hosted services already, so they can be compared head to head.


I believe the version available on DreamStudio is heavily RLHF-tuned, no? I'm mostly interested to see how the raw weights perform out of the box compared to IF, which we have to wait for the release for.


Does the increased memory footprint mean it can't be run on a normal desktop like SD?


The VRAM requirements are higher (14GB), so lots of things that can run SD won’t run this with the existing toolchain. But some of that is “aftermarket” SD optimization, and this maybe could see some of that, too.

But there are consumer cards with 14GB+ VRAM, so it’s not, even before optimization, out of reach of consumer hardware.


Damn. That's basically just the 4090 and 4080.


The 3090 has 24GB so it's an option as well.


thanks for the explanation!

denoising in latent space certainly seems like the "correct" path. My (amateur) thinking is, the more you can do in latent space, the better.


I’m not sure why “denoise in latent space at 64x64 and decode to pixel space at target resolution” is fundamentally better than “denoise in pixel space at 64x64, then upscale to pixel space at target resolution and denoise some more”.

The former seems likely to be lower compute-for-resolution, but that’s not the only consideration for “better”...


This was an excellent summary, ty.


tldr: bigger text encoder is better. SD will catch up quickly, as conditioning on a new set of precomputed text embeddings is a trivial change


I didn't think about that but you're totally right, assuming they have those embeddings cached it would be super easy to retrain SD using them. 11B parameter count is rather unfortunate though tbh, I've never been the biggest fan of "scale is all you need" even though it seems to ring irritatingly true most of the time.


Seems to be entirely a different approach for diffusion.

>DeepFloyd IF works in pixel space. The diffusion is implemented on a pixel level, unlike latent diffusion models (like Stable Diffusion), where latent representations are used.


> Stability AI releases DeepFloyd IF, a powerful text-to-image model

Hope not. This is a worse license.


As far as I can tell from Emad's discord and twitter discussion, the idea appears to be to make this a "research" release, and therefore the worse license.

At a later point the model will be renamed "StableIf", and released with a similar license to StableDiffusion.


Well, it’s a better license than SDXL is available under right now (which is “you can’t have it, but you can use it on StabilityAI’s hosted services”.)


Yeah Emad was clarifying this on the LAION discord the other day — plan is to have a better-licensed version out Eventually™, guess we'll see how long that takes.


for certain values of soon™ and eventually™


This is the dumb part about open-source models. Criminals, governments, and propaganda spreaders need not worry about the license; but legitimate users do.


This is the same problem with laws. The only people that follow them are legitimate users.


I hear this complaint often, especially in regards to gun control. Yes, there are a subset of people who do what they are going to do irrespective of laws. There is also a middle ground of people where the laws might curtail unwanted behavior. But the main purpose is it provides a basis for punishing unwanted behavior.

To make it concrete, one could argue that bank robbers rob banks even though it is illegal, so why have a law against it since law abiding people aren't going to rob banks. Does anyone really think we should remove such laws?


Second paragraph in the link:

DeepFloyd IF is a state-of-the-art text-to-image model released on a non-commercial, research-permissible license that provides an opportunity for research labs to examine and experiment with advanced text-to-image generation approaches. In line with other Stability AI models, Stability AI intends to release a DeepFloyd IF model fully open source at a future date.


Looks like music generation is on their roadmap. Fun!

https://stability.ai/careers?gh_jid=4142190101


Any web based front ends yet? I put together a system that runs a variety of web based open source AI image generation and editing tools on Vultr GPU instances. It spins up instances on demand, mounts an NFS filesystem with local caching and a COW layer, spawns the services, proxies the requests, and then spins down idle instances when I'm done. Would love to add this, suppose I could whip something up if none exists.


It'll probably be in the Auto1111 WebUI within a week.


You think? Automatic1111 is still on pytorch 1.7 and SD1.5


Stable Diffusion 2.x has been supported for a while.


Yup, lots of misinformation in this thread from those who are not in-the-know.

Automatic1111 is the defacto main UI for these kind of models. It will be supported there, quite quickly.


Automatic hasn't been updated for several weeks at this point. Several people are trying to fork the repo to make their own continuation.


What's your app / service called?!


Not publicly released. I'd be happy to show you if you send me an email. See my profile.


Here are some play-money markets on Manifold Markets tracking its release: https://manifold.markets/markets?s=relevance&f=all&q=deepflo...

35% to full release by end of month, although it may not have adjusted.



Seeing a lot of text-to-image out there recently. Does anyone know what the current state of the art is on image-to-text? Thinking something similar to Midjourney's /describe command that they added in v5


While it's not publicly available yet, I have strong suspicions that multimodal GPT-4 may actually be SOTA in image-to-text. The examples shown in the Sparks of AGI paper were extremely impressive imo, though of course those are cherry-picked so it's unclear how well the model will perform on non-cherry-picked images.


This is text + image -> text but pretty cool and still might be of interest to you:

https://llava-vl.github.io


Just entering "Describe this image" in the chat prompt got me exactly what I was looking for. Thanks!


There's a discord with tons of sample images, where we've been waiting patiently for the release, coming SOON, for 3 months now. https://discord.gg/pxewcvSvNx


What these AI companies need are some good old-fashioned leakers. We should be seeing these models show up on sketchy pirate sites, complete with garish 80s-style cracking screens crediting various '1337 haX0rs with witty pseudonyms.


Well, NovelAI was hacked and had their image generation model leaked last year.


Website design main page. Bright vibrant neon colors of the rainbow slimes, slime business, kid attention grabbing, splashes of bright neon colors. Professional looking Website page, high quality resolution 8k


What are the official and unofficial discords?

I found only this one on their subreddit: https://discord.gg/GvsvNrVkk5


Meh, results feel hodgepodge, like a bunch of models were stitched together.


Website design for slime. Professional looking, high-quality, 8k, brightest neon colors of the rainbow slimes, splashes of neon colors in background, kid attention grabbing, eye catching


Tried using it just now, and it's way better than Stable Diffusion (be it 1.5, 2.1 or SDXL).

But it's harder to get a good picture. This, fine-tuned with good RLHF, will be amazing.


What does this mean? Isn't the quality of a model determined by how easy it is to get a good picture?


Not necessarily. IMO a good model needs to follow your prompt well, and that was my problem with Stable Diffusion.

I've been trying to get a good portrait picture with "neon lights" in Stable Diffusion and it is almost impossible. Meanwhile, with the new Dall-e, that was possible. The picture, especially with SDXL, is good, but it doesn't really have neon lights...

I tried a similar prompt on DeepFloyd just now and managed to get there!


would be interesting if you could use deepfloyd first for image composition, then apply stable diffusion after for purely stylistic modifications


Definitely possible :) I've been doing this with new Dall-e + img2img with Stable Diffusion.

Explaining: I created a model of me, and wanted to create some good realistic portrait pictures. First I tried to create a model of me using some of the custom models that already exist, and the result was bad.

Then I tried SD 1.5/2.1... It was better, but I couldn't really get some of the prompts to come out right...

Then I tried new Dall-e, saved, and inserted my face with img2img on SD and it worked much better!


Does paying Hugggingface to run it on the GPU count as commercial use?


Currently down on hugging face


"Imagen free"



