> Note that the model is substantially more compute-intensive than Stable Diffusion, so it may be slower even though that space is running on an A100.
Any more specifics on this? Sampling has gotten better too. My primary concern is the amount of memory necessary for a generation (batch size = 1, or I guess "2" using classifier free guidance).
Similar to OpenAI's cascaded diffusion models and GLIDE, you can presumably run the models in sequence, unloading earlier models from memory to make room for the 2nd and 3rd stage models.
Right now, I really only need 256px resolution. So, will I be able to fit the first stage (64px) model in memory on its own with my 12 GB RTX 3060? What about the 2nd stage (64px -> 256px)? 3rd stage?
HF also wrote a blog post on how you can mess around with the model in a python notebook using their excellent Diffusers library: https://huggingface.co/blog/if
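For reference, the basic diffusers usage from that post looks roughly like this -- a sketch assuming diffusers >= 0.16 and that you've accepted the license for the gated DeepFloyd/IF-I-XL-v1.0 weights; this is only stage I, the 64px generator:

  import torch
  from diffusers import DiffusionPipeline

  # Stage I only: text -> 64x64 image; stages II/III upscale it afterwards.
  stage_1 = DiffusionPipeline.from_pretrained(
      "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
  )
  stage_1.enable_model_cpu_offload()  # lower peak VRAM at the cost of some speed

  prompt = 'a photo of a corgi holding a sign that says "IF"'
  image = stage_1(prompt=prompt).images[0]
  image.save("corgi_64px.png")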
I knew the model would have difficulty fitting into a 16GB VRAM GPU, but "you need to load and unload parts of the model pipeline to/from the GPU" is not a workaround I expected.
At that point it's probably better to write a guide on how to easily set up a VM with an A100 instead of trying to fit it into a Colab GPU.
It is able to put people on the left/right and put the correct t-shirts and facial expressions on each one. This is compared to mj which just mixes together a soup of every word you use and plops it out into the image. Huge MJ fan of course, it's amazing, but having compositional power is another step up.
Midjourney always looks very aesthetically pleasing, I guess because of their RLHF tuning with Discord data... But it doesn't really follow prompts as well as Dall-e, for example.
But in the end, people want pretty pictures. So it's a complicated situation.
The tweet they shared is from February and uses an outdated version of MJ, this is what I got from V5: https://i.imgur.com/0uxtZDe.png
Midjourney does much better overall. Composition is neat, but MJ is so incredibly far ahead in terms of quality of output, it honestly doesn't matter if you have to go and do composition manually (and with new AI based tools, that's easier than ever too. Do a bad cut and paste job then infill your way back to a coherent image)
But it didn't work? In yours there is no "Nexus", no smiling, no frowning, and the man on the right doesn't look Asian. Compared with the image in the tweet, MJ failed at this task.
That depends on what your goal was? If your goal was to get an AI model to generate copyrighted images, and misunderstand the relationship between Indians and Asians, then sure MJ failed (and I'm guessing that's the goal of the prompt).
But if I actually wanted a useful picture, I could work with what MJ gave me despite having minimal image editing skills. The DeepFloyd result looks like it's 8-12 months behind what MJ gave and wouldn't be salvageable.
Again, I don't know why everyone is so defensive. I love MJ. There's nothing wrong with admitting that other models might do certain things better. We all can use any model we want.
Yeah perhaps I was not good at judging tone. To me it's a matter of fact thing that MJ isn't good at this. They'd admit it, it's not a big deal and I'm a fan.
It's not ad hominem at all... mj isn't as good at certain types of composition as others. I don't get why people are pretending that isn't the case. I want everyone to have great models and IF is part of that progress. Perhaps calling it "word soup" was offensive? This isn't your religion, though, it's just a model. Listening in on the MJ office hours they're the farthest thing you can be from dogmatic or arrogant. They want to improve as we all do. I personally am just really inspired that everyone can advance together!
Also see downthread - the first 32 images I generated attempting to reproduce the claim that "actually MJ can do this" all failed. The person who challenged me then ignored it. This isn't really up for debate until someone sends a seed where mj can do the cube + sphere thing well.
A human interpreting the prompt would see "asian" as being in contrast to "indian" in the language of the prompt... Not a level of comprehension that can be expected of current models but maybe in a few years (months?).
I'm human. I interpreted Asian as from a non-specific part of Asia. I realise though that "Asian" has a very specific meaning in the US, but it's only the US that does this. For the rest of the world Asian means someone from Asia.
That's kind of what I meant... If a prompt specifies one "Indian", and one "Asian", that implies that the writer of the prompt doesn't think of Indian as Asian so probably from the US background.
The thing is... IF is currently just a base model, it will need serious fine-tuning before it will produce aesthetically pleasing images (like MJ certainly does).
It's interesting to see what IF can do in terms of composition, text rendering etc, it's very promising if aesthetically pleasing images can be achieved via fine-tuning (the same happened with SD... current publicly fine-tuned models can achieve much higher levels of quality and cohesion than the base models, here's the prompt in an SD2.1 based model: https://imgur.com/a/ELGMSmV ).
Of course, fine-tuning IF is likely more challenging, as both of the first two stages and the 4x SD upscaler might need to be fine-tuned...
Well... Kind of, photobashing with midjourney doesn't guarantee you the same image or even necessarily the objects in the same places, even if you increase the image weight value up to its maximum of two. ('--iw 2')
Many times you'll have no other choice but to use a diffusion model with img2img.
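If it helps, the img2img pass is only a few lines in diffusers; a rough sketch (the model id, file names, and strength value are just illustrative assumptions):

  import torch
  from PIL import Image
  from diffusers import StableDiffusionImg2ImgPipeline

  pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
  ).to("cuda")

  # Start from a rough cut-and-paste composition and let the model re-cohere it.
  init = Image.open("photobash.png").convert("RGB").resize((512, 512))
  out = pipe(
      prompt="two men facing each other, photorealistic",
      image=init,
      strength=0.6,       # how far to stray from the pasted-together source
      guidance_scale=7.5,
  ).images[0]
  out.save("coherent.png")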
I agree with OP though, the market has spoken and the vast majority of people use prompts hardly more nuanced than a 90s Mad Magazine book of Mad Libs.
I wasn't referring to photobashing, I meant firing up SD and ArtStudio for 10 minutes and getting something that looks amazing and has the desired composition.
Overall this feels like trying to get ChatGPT to do math: just let ChatGPT offload math to Wolfram.
Similarly I'd rather just offload the composition. Now we even have SAM which will happily pick out the parts of the image you want to compose
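For what it's worth, pulling subjects out with SAM is only a few lines with Meta's segment-anything package; a rough sketch (the ViT-H checkpoint file is whatever you downloaded from their repo):

  import cv2
  from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

  sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
  mask_generator = SamAutomaticMaskGenerator(sam)

  # SAM expects an RGB uint8 array; returns one dict per segment.
  image = cv2.cvtColor(cv2.imread("generation.png"), cv2.COLOR_BGR2RGB)
  masks = mask_generator.generate(image)  # keys include "segmentation", "bbox", "area"

  # Pick the segments you want to cut out and recompose / inpaint elsewhere.
  subjects = sorted(masks, key=lambda m: m["area"], reverse=True)[:2]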
Spatial composition can be done easily, if you stop bothering with pure text-to-image (SD has several tricks and UIs to place objects precisely, they are all janky but they do work, that's practically photobashing). Attribute separation is also easily done with tricks like token bucketing, so your Indian guy will look Indian, and your East Asian guy will look East Asian. All of that is easy if you abandon the ambiguous natural language and use higher-order guidance.
What's really required is semantic composition. Making subjects meaningfully and predictably interact, or combining them together. And also the coherence of the overall stitched picture, so you don't end up with several different perspective planes.
Can't wait to see what all of this is going to look like in ten years. I know we are all nitpicking right now, but these results are totally mind-blowing already.
I wonder when we can start fuzzing brains. Wire you up to a machine that measures happiness or anxiety or anger or whatever, and keep re-generating results that hone in on the given emotion.
So it seems like Midjourney looks better and has more realistic faces but this one is better at following the prompt and generating what you actually want
Midjourney follows the “facing each other” better, but DF has readable text. This may be because DF also suffers from an effect that, ISTR, is common to MJ and SD: when you specify conflicting things it understands, it prioritizes the “visible” one (the specific text, in this case) over the “compositional” one (in this case, “facing each other”).
That's not how any of this works. What do you even mean by "compositional power"? Every model speaks a different "language"; comparing prompts like this has no merit and only shows that the person making the claim lacks understanding of the subject matter.
Compositional power might mean "the image more resembles the composition you want and describe"
i.e. if you say "a red cube on a green sphere" in DeepFloyd, you will get it. If you say that in MJ, you won't. That means you have more power to compose the image you want with this tool.
No, it just means you don't understand how to prompt MJ; you don't understand its language. You might like French more, but that doesn't mean it's a better language than English. MJ even says in their FAQs that their model doesn't understand language like humans do...
The point of text to image model is for them to accept natural language (yes, in practice, they all benefit from specialized prompting done with an understanding of model quirks, but that’s not the goal.)
The way you prompt is a preference, like programming languages. Of course the layman might use the JavaScript of generative models because it's easier to start and there are a lot of tutorials, but some might prefer something more exotic which can produce the same or better quality. Whatever floats your boat, but don't try to compare it like the guy in OP's tweets.
MJ and stable also make clear that their models don't understand language like humans do.
BTW I am a HUGE fan of MJ, and attend the office hours, and have done 35k+ images there. So you may have misinterpreted how much of a supporter of it I am.
Interesting, I do not get the results you do. What additional parameters are you using? Here is a link to some of my tests, with all default settings, some in v5 some in v4. https://twitter.com/eb_french/status/1651370091869786112
None, and as an experienced user you should know that it's not one-shot, and most of the time not even few-shot... You can't compare cherry-picked press images with a few shots of a 5-second prompt. I don't know why you want to hype something up if you can't really compare it. It seems extremely attention-grifting.
Just look at their cherry picks in this discord... https://discord.com/invite/pxewcvSvNx .
It's overfitted on copyrighted images (afghan girl) and doesn't show more "compositional power" at all, most of the time ignoring half of the prompt.
> as an experienced user you should know that's it's not one shot
Being "not one shot" for most nontrivial prompts is a failure of current t2i models, its what they all strive for and its what DF supposedly does a lot better. And, while its possible to spin things pretty hard when people can't bang on it themselves, I think the indication is that it is, in fact, a major leap forward from the best current consumer-available t2i models (it looks pretty comparable to Google Imagen – a little bit worse benchmark scores – which is unsurprising since it seems to be an implementation of exactly the architecture described in Google's Imagen paper.
> It’s overfitted on images with copyright (afghan girl)
It’s…not, though. Sure, the picture with a prompt which is suggestive of that (down to even specifying the same film type) gives off a vibe that completely feels, if you haven’t recently looked at the famous picture but are familiar with it, like a “cleaned up” version of that picture, so you might intuitively feel it’s from overfitting, that it is basically reproducing the original image with slight variations.
Look at the two pictures side-by-side and there is basically nothing similar about them except exactly the things specified in the prompt, and pretty much every aspect of the way those elements of the prompt are interpreted in the DF image is unlike the other image.
It's not a major leap, not even a small one, because it's exactly like Imagen. It's Stability giving some Ukrainian refugees compute time to train "their" model for publicity. It's about the whom and not the what, as it should be.
I "feel" nothing I am telling it how it is. Look at the afghan girl example again it. Close up portrait, same clothing, same comp, expressive eyes... and most important burn in like every other overfitted image in diffusion networks.
You guys all want it to be something special and I get it, new content, new shiny toy but it's neither a good architecture nor a good implementation.
> It’s not a major leap not even a small one because it’s exactly like imagen.
I would agree, if imagen was a “consumer-available t2i model”. What’s available is a research paper with demo images from Google. The model itself is locked up inside Google, notionally because they haven’t solved filtering issues with it.
> Look at the afghan girl example again it. Close up portrait, same clothing, same comp, expressive eyes…
You look at it again, literally none of those things are the same: it’s not the same clothing (the material and color of the head scarf is different, the headscarf is the only visible clothing in the DF image, whereas that is not the case in the famous image), the condition of the head scarf is different, the hair color is different, the hair style is different, the hair texture is different, the face shape is different, the individual facial features are different, the eye color is much more brown in the DF image, the facial expression is different, the DF image has lipstick and eyeshadow, the famous image has a dirty face and no makeup, the headscarf is worn differently in the two images, the background is different, the lighting is different, and the faces are framed differently.
The similarities are (1) it’s a close-up portrait, (2) a general ethnic similarity, (3) they are both wearing a red (though very different red) head scarf, and (4) they are both looking straight into the camera. (2)-(4) are explicitly prompted; (1) is strongly implied in the prompt, which addresses nothing that isn’t related to the face/head. This isn’t “overfitting on a copyrighted image”, it’s getting what you prompt, with no other similarity to the existing image.
> You guys all want it to be something special and I get it,
I’m actually kind of annoyed, because I’ve been collecting tooling, checkpoints, and other support for, and spending quite a bit of time getting proficient in dealing with the quirks of, Stable Diffusion. But, that’s life.
> it’s neither a good architecture nor a good implementation.
I’d be interested in hearing your specific criticism of the architecture and implementation, but hopefully it’s more grounded in fact than your criticism of the one image...
I thought it was pretty definitive at the time, but when you look really closely (as Scott's opponent is likely to do), it didn't seem like a clear win yet. But that was 3 months ago, and hopefully DF is even better now.
New restriction in their License suggests the software can't be modified.
"2. All persons obtaining a copy or substantial portion of the Software,
a modified version of the Software (or substantial portion thereof), or
a derivative work based upon this Software (or substantial portion thereof)
must not delete, remove, disable, diminish, or circumvent any inference filters or
inference filter mechanisms in the Software, or any portion of the Software that
implements any such filters or filter mechanisms."
As someone who's largely "OK" with morality clauses in otherwise liberal AI licenses, I think we should start calling these "weights-available" models to distinguish from capital-F Free Software[1] ones.
I'm starting to get irritated by all these 'non-commercial' licensed models, though, because there is no such thing as a non-commercial license. In copyright law, merely having the work in question is considered a commercial benefit. So you need to specify every single act you think is 'non-commercial', and users of the license have to read and understand that. Even Creative Commons' NC clause only specifies one; they say that filesharing is not commercial. So it's just a fancy covenant not to sue BitTorrent users.
And then there's LLaMA, whose model weights were only ever shared privately with other researchers. Everyone using LLaMA publicly is likely pirating it. Actual weights-available or Free models already exist, such as BLOOM, Dolly, StableLM[0], Pythia, GPT-J, GPT-NeoX, and CerebrasGPT.
[0] Untuned only; the instruction-tuned models are frustratingly CC-BY-NC-SA because apparently nobody made an open dataset for instruction tuning.
[1] Inasmuch as an AI model trained on copyrighted data can even be considered Free.
Yep. It's getting really exhausting seeing projects falsely advertising themselves as "open source". Either be FOSS or don't be; don't pretend to be while using some nonsense like the BSL or whatever adhocery is in play here.
In the README they even call it "Modified MIT", the modification being where they turned it from a very permissive license into a fully proprietary one. Very cool model though.
>New restriction in their License suggests the software can't be modified.
To remove filters.
"Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:"
There is a similar license clause for the weights[0] as well, so I'm not sure this would apply unless you write the code and train your model from scratch.
They are technically open source. It's just that the model license prohibits commercial use and the code license prohibits bypassing the filters. So it's kind of worse than closed source in a way because it's like a tease. With no API apparently.
Theoretically large companies or rich people might be able to make a licensing agreement.
> They are technically open source. It's just that the model license prohibits commercial use and the code license prohibits bypassing the filters.
Your second sentence contradicts the first. Prohibiting commercial use and prohibiting modification are each, in and of themselves, mutually exclusive with being "technically open source" (let alone both at the same time).
That's already prohibited by, you know, those very same copyright and privacy laws. Adding those same prohibitions to the license not only makes the software nonfree, but pointlessly does so.
It’s not pointless: it means the model licensor has a claim against you, as well as whoever would for violating the referenced laws; it also means, and this is probably more important, that in some jurisdictions, the model licensor has a better defense against liability for contributory infringement if the licensee infringes.
EDIT: That said, it’s unambiguously not open source.
> it means the model licensor has a claim against you
Right, but to what end? The only reason the licensor should care one way or another is the licensor being held liable for what folks do with the software, in which case...
> it also means, and this is probably more important, that in some jurisdictions, the model licensor has a better defense against liability for contributory infringement if the licensee infringes.
Do hardware stores need to demand "thou shalt not use this tool to kill people" to their customers to avoid liability for axe murders under such jurisdictions? Or car manufacturers needing to specify "you will not use this product to run over schoolchildren at crosswalks"?
Like, I'm sure such jurisdictions exist, but I somehow doubt license terms in an EULA nobody (except for us nerds) will ever read would be sufficient in such a kangaroo court.
(EDIT: also, I'm pretty sure the standard warranty disclaimer in your average FOSS license already covers this, without making the software nonfree in the process)
> Do hardware stores need to demand "thou shalt not use this tool to kill people" to their customers to avoid liability for axe murders under such jurisdictions?
Generally, not, because vicarious liability for battery and wrongful death doesn’t work like, e.g., contributory copyright infringement.
> I'm pretty sure the standard warranty disclaimer in your average FOSS license already covers this
No, warranty disclaimers don’t cover this, because (1) it’s not a warranty issue, and (2) disclaimers, if they have legal effect at all, affect liability the disclaiming party would otherwise have to the party accepting the disclaimer, not liability the disclaiming party would have to third parties.
> Generally, not, because vicarious liability for battery and wrongful death doesn’t work like, e.g., contributory copyright infringement.
Judging by youtube-dl, it seems like it does work that way, at least in my jurisdiction; I guess we'll see if the RIAA doubles down on trying to wipe it from the face of the Earth, but considering there hasn't been much noise, I wouldn't count on it. Also, to my adjacent point, I highly doubt the RIAA would've refrained from attempting to take down youtube-dl even if youtube-dl's license prohibited its users from circumventing DRM with it.
It prohibits both commercial use, whether or not you break regional laws; and it prohibits breaking certain laws. As another user said, encoding the law into a licence is pointless but makes it non-free.
There are also problematic restrictions on your ability to modify the software under clause 2(c). Nor do you have the right to sublicense; it's not clear to me what rights somebody has if you give them a copy.
Ah I guess I read it right then I reread it wrong, my bad and thanks for the pointer! That's a shame but hopefully it's released in a more open way in the future. My interest is in building good collaborative interfaces (and games!) on top of these things.
For anyone who doesn't know, DeepFloyd is a StableDiffusion style image model that more or less replaced CLIP with a full LLM (11b params). The result is that it is much better at responding to more complex prompts.
In theory, it is also smarter at learning from its training data.
Not really, it's a cascaded diffusion model conditioned on the T5 encoder, there is nothing really in common, unless you mean that using a diffusion model is "SD style".
> It isn’t like Stable Diffusion, it’s more like Google’s Imagen model.
Yeah, it looks exactly (architecturally) like Imagen.
Google would be running circles around everyone in Generative AI (maybe OpenAI would still have a better core LLM, maybe, but portfolio-wise) if they simply had the ability to cross the gap between building technologies and writing up research papers on them and actually releasing products.
It looks like the model on Hugging Face either hasn't been published yet or was withdrawn. I got this error in their Colab notebook:
OSError: DeepFloyd/IF-I-IF-v1.0 is not a local folder and is not a valid model identifier listed on
'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or
log in with `huggingface-cli login` and pass `use_auth_token=True`.
I'm quite curious how much of the improvement on text rendering is from the switch to pixel-space diffusion vs. the switch to a much larger pretrained text encoder. I'm leaning towards the latter, which then raises the question of what happens when you try training Stable Diffusion with T5-XXL-1.1 as the text encoder instead of CLIP — does it gain the ability to do text well?
DeepFloyd IF is effectively the same architecture/text encoder as Imagen (https://imagen.research.google/), although that paper doesn't hypothesize why text works out a lot better.
Right, I'm aware of the Imagen architecture, just curious to see further research determining which aspect of it is responsible for the improved text rendering.
EDIT: According to the figure in the Imagen paper FL33TW00D's response referred me to, it looks like the text encoder size is the biggest factor in the improved model performance all-around.
The CLIP text encoder is trained to align with the pooled image embedding (a single vector), which is why most text embeddings are not very meaningful on their own (but still convey the overall semantics of the text). With T5 every text embedding is important.
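You can see the difference directly in the encoder outputs; a quick sketch with transformers (t5-small stands in for T5-XXL so it runs anywhere, and the CLIP id is the one SD 1.x uses):

  from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

  text = "a red cube on top of a green sphere"

  clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
  clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
  clip_out = clip_enc(**clip_tok(text, return_tensors="pt"))
  print(clip_out.last_hidden_state.shape)  # per-token states, [1, seq_len, 768]
  print(clip_out.pooler_output.shape)      # the single pooled vector CLIP training aligns

  t5_tok = T5Tokenizer.from_pretrained("t5-small")
  t5_enc = T5EncoderModel.from_pretrained("t5-small")
  t5_out = t5_enc(**t5_tok(text, return_tensors="pt"))
  print(t5_out.last_hidden_state.shape)    # every token embedding carries usable signal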
I think this model will result in a massive new wave of meme culture. AI's already seen success in memes up to this point, but the ability for readable text to be incorporated into images totally changes the game. Going to be an interesting next few months on the interwebz, that's for sure. Exciting times!
This could be super cool for logos. I've tried using Stable Diffusion to generate logos and it does a pretty good job of helping brainstorm, but the text is always gibberish, so you can use its ideas but you have to add your own text, which basically means creating a logo from scratch using its designs as inspiration.
Well, surely you don't expect to just take the generated stuff and use it right away? For logos you usually need something that some entity can hold the claim on, so you probably need a human touch in any case.
You can't use this to make logos for any commercial product, and it's not safe to use it for hobby projects either, based on the current model license.
> You can't use this to make logos for any commercial product
Yeah good luck figuring out that a particular logo was generated with this particular model. And if someone does good luck doing anything about it.
With this amount of fear one wouldn't dare to cross a road without three layers of bubble wrap, plus written authorisation from a lawyer plus a feasibility study from a traffic engineer.
You'll be asked to produce documents verifying your ownership and/or compliance with any licenses for all intellectual property. That includes code and graphics (logos).
Sure. And what documents and paperwork do you expect if I, the owner of the company, drew the logo myself with a bit of crayon/inkscape/gimp/blender? Those exact documents will be produced.
I think this is just for pre-release, and they will release it fully licensed for commercial use. It doesn't make sense to have a model like this that can do game-changing text and logos... but then not license it for commercial use. If they don't, that would be ridiculous.
There's fundamental tradeoffs, as there always will be when you're compressing things into an image model.
So, this is going to have new different issues. Since it's similar to Imagen, it probably can't handle long complex prompts as well, since they developed Parti afterward.
Here's my question: are there any image models where, if you prompt "1+1", you get an image showing "3"?
Well, yeah, it’s a bigger set of models (particularly the language model) that takes more resources (both to train and for inference). That’s the tradeoff.
> Here’s my question: are there any image models where, if you prompt “1+1”, you get an image showing “3”?
You want a t2i model that does arithmetic in the prompt, translates it to “text displaying the number <result>”, but also does the arithmetic wrong?
Yeah, I don’t think that combination of features is in any existing model or, really, in any of the datasets used for evaluation, or otherwise on anyone’s roadmap.
"Actually thinking about your prompt" is a necessary part of being able to make the prompts natural language instead of a long list of fantasy google image search terms.
Useful example being "my bedroom but in a new color", but some things I've typed into Midjourney that don't work include "a really long guinea pig" (you get a regular size one), "world's best coffee" (the coffee cup gets a world on it), etc. It's just too literal.
I don't think they're saying that's a goal, I think they're curious if it is the case. LLMs are bad at arithmetic, this uses a LLM to process the prompt, that class of result seems plausible.
We already knew those were going to be solved by scale like using T5 instead of the really small bad text encoder SD used, because they were solved by Imagen etc.
There's a note which suggests you might be able to get by on lower. My 3060 struggles with SD on the defaults, but works fine with float16.
There are multiple ways to speed up the inference time and lower the memory consumption even more with diffusers. To do so, please have a look at the Diffusers docs:
Optimizing for inference time [1]
Optimizing for low memory during inference [2]
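Concretely, the usual knobs look like this (a sketch against the stage-I checkpoint; the same methods exist on the SD pipelines):

  import torch
  from diffusers import DiffusionPipeline

  pipe = DiffusionPipeline.from_pretrained(
      "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
  )

  pipe.enable_attention_slicing()        # trade some speed for lower peak VRAM
  pipe.enable_model_cpu_offload()        # keep only the active submodule on the GPU
  # pipe.enable_sequential_cpu_offload() # even lower VRAM, much slower
  # pipe.enable_xformers_memory_efficient_attention()  # if xformers is installed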
If you don't mind the power consumption I noticed that older nvidia P6000's (24GB) are pretty cheap on ebay! My 16GB P5000 is pretty handy for this stuff.
An M40 24GB is less than $200, if you don't mind the trouble of getting its drivers installed, cooling it, etc. It's also important to note your motherboard must support larger VRAM addressing; many older chipsets won't be able to boot with it (i.e. some, perhaps almost all, Zen 1-era boards).
IIRC SD’s first-party code had significantly higher requirements than is possible with some of the downstream optimizations most people are using, which have lower speed and more system RAM use but reduce peak VRAM use.
The architecture here looks different, but the code is licensed in a way which still makes downstream optimization and redistribution possible, so maybe there will be something there.
They took down the blogpost, but from what I remember the model is composite and consists of a text encoder as well as 3 "stages":
1. (11B) T5-XXL text encoder [1]
2. (4.3B) Stage 1 UNet
3. (1.3B) Stage 2 upscaler (64x64 -> 256x256)
4. (?B) Stage 3 upscaler (256x256 -> 1024x1024)
Resolution numbers could be off though. Also the third stage can apparently use the existing stable diffusion x4, or a new upscaler that they aren't releasing yet (ever?).
> Once these are quantized (I assume they can be)
Based on the success of LLaMA 4bit quantization, I believe the text encoder could be. As for the other modules, I'm not sure.
edit: the text encoder is 11B, not 4.5B as I initially wrote.
You'll be able to optimize it a lot to make it fit on small systems if you are willing to modify your workflow a bit: instead of 1 prompt -> 1 image, _n_ times, do 1 prompt -> _n_ images in one batch, and only take _m_ of them further... For a given prompt, run it through the T5 model once and store the embedding; you can do that in CPU RAM if you have to, because you only need the embedding once, so you don't need a GPU which can run T5-XXL naively. Then you can get a large batch of samples from #2; 64px is enough to preview; only once you pick some do you run them through #3, and then from those through #4. Your peak VRAM should be 1 image in #2 or #4, and that can be quantized or pruned down to something that will fit on many GPUs.
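A rough sketch of that staged workflow with diffusers, following the patterns from the HF blog post (treat the exact flags, e.g. the 8-bit text-encoder load, as assumptions):

  import gc
  import torch
  from diffusers import DiffusionPipeline
  from transformers import T5EncoderModel

  prompt = "an astronaut riding a horse, oil painting"

  # 1) Encode the prompt once and cache the embeddings (8-bit keeps T5-XXL small).
  text_encoder = T5EncoderModel.from_pretrained(
      "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder",
      load_in_8bit=True, device_map="auto",
  )
  encode_pipe = DiffusionPipeline.from_pretrained(
      "DeepFloyd/IF-I-XL-v1.0", text_encoder=text_encoder, unet=None
  )
  prompt_embeds, negative_embeds = encode_pipe.encode_prompt(prompt)
  del encode_pipe, text_encoder
  gc.collect(); torch.cuda.empty_cache()

  # 2) Stage I: generate a batch of cheap 64px previews from the cached embeddings.
  stage_1 = DiffusionPipeline.from_pretrained(
      "DeepFloyd/IF-I-XL-v1.0", text_encoder=None,
      variant="fp16", torch_dtype=torch.float16,
  )
  stage_1.enable_model_cpu_offload()
  previews = stage_1(
      prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
      num_images_per_prompt=4, output_type="pt",
  ).images
  del stage_1
  gc.collect(); torch.cuda.empty_cache()

  # 3) Stage II: upscale only the previews you actually like.
  stage_2 = DiffusionPipeline.from_pretrained(
      "DeepFloyd/IF-II-L-v1.0", text_encoder=None,
      variant="fp16", torch_dtype=torch.float16,
  )
  stage_2.enable_model_cpu_offload()
  picked = previews[:1]
  images = stage_2(
      image=picked, prompt_embeds=prompt_embeds,
      negative_prompt_embeds=negative_embeds,
  ).images
  images[0].save("selected_256.png")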
>Can anyone explain why it needs so much ram in the first place though?
The T5-XXL text encoder is really large. Also, we do not quantize the UNets: the UNet outputs 8-bit pixels, so quantizing the UNet to that precision would create pretty bad outputs.
LDM-400M was already able to generate text (predecessor of Stable Diffusion), thanks to the fact that every token in the text encoder (trained from scratch) was available in the attention layer.
I'm also very happy about the release of the two upscalers; I can use them to upscale the results of my small 64x64 DDIM models (maybe with some fine-tuning).
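In case it's useful to anyone trying the same thing, here's roughly what driving the SD x4 upscaler standalone looks like in diffusers (the IF stage-II upscaler should be similar but also wants T5 prompt embeddings; file names and the prompt are placeholders):

  import torch
  from PIL import Image
  from diffusers import StableDiffusionUpscalePipeline

  up = StableDiffusionUpscalePipeline.from_pretrained(
      "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
  ).to("cuda")

  # Point it at any small sample, e.g. from a homegrown 64x64 DDIM model.
  low_res = Image.open("ddim_sample_64.png").convert("RGB")
  hi_res = up(prompt="a detailed photo", image=low_res).images[0]
  hi_res.save("ddim_sample_256.png")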
I would be more interested in image-to-text models. Does someone know of any decent model? I saw the GPT4 demo, and they showed that they do image-to-text... but then that was actually a fake (i.e., the model was interpreting the image filename).
Can you provide a source on gpt4 image model being fake? I haven’t heard that before, though I have wondered why I haven’t heard anything about the image part and don’t have access to image processing myself.
CLIP Interrogator[1], which is also built into AUTOMATIC1111, gives quite reasonable results, at least if all you need is a prompt; it can't handle complex interactions:
Output: "a white horse with a sign that says rexel's in space, pixelperfect, inspired by Paul Kelpe, official simpsons movie artwork, alternate album cover, in style of nanospace, by Apelles, pickles, pespective, pop surrealism, ingame, in a space cadet outfit, sifi"
AFAIK, converting an image to a text summary isn't really a thing by itself. The related work would be "visual reasoning" which is the ability to ask things about the image in natural language and get responses back also in natural language.
I believe the current SOTA test for NLVR is VQAv2[0] or GQA[1].
This does outperform Stable Diffusion 2.1, but uses a different architecture and requires more memory and compute. Stable Diffusion runs its denoising process in a compressed "latent space" which is how it was able to be so compute-efficient compared to other diffusion models. It also uses the (relatively) small text encoder from OpenAI's CLIP model to encode user prompts. Both of these optimizations meant that it could run much faster compared to say, DALLE or Imagen, but it didn't follow complicated user prompts especially well and had trouble with things like counting and text-rendering.
DeepFloyd IF is based on Google's Imagen model, which has two key differences from Stable Diffusion: (1) it denoises in pixel space instead of a compressed latent space, and (2) it uses a 10x larger pretrained text encoder (T5-XXL-1.1) compared to SD's CLIP encoder. (1) allows it to better render high-frequency details and text, and (2) allows it to understand complex prompts much better. These improvements come at the cost of multiple times more memory usage and compute requirements compared to SD, though.
In terms of "will it replace SD?"—in the short term I think yes. But I still think latent diffusion models are the future. For example, Stability is gearing up to release Stable Diffusion XL right now, a larger version of the original SD that does higher fidelity and higher resolution generations. I wouldn't be surprised if it takes the crown back from DeepFloyd when it releases, but I guess we'll have to see.
> For example, Stability is gearing up to release Stable Diffusion XL right now, a larger version of the original SD that does higher fidelity and higher resolution generations. I wouldn't be surprised if it takes the crown back from DeepFloyd when it releases, but I guess we'll have to see.
SDXL is available on StabilityAI’s hosted services already, so they can be compared head to head.
I believe the version available on DreamStudio is heavily RLHF-tuned, no? I'm mostly interested to see how the raw weights perform out of the box compared to IF, which we have to wait for the release for.
The VRAM requirements are higher (14GB), so lots of things that can do SD won’t do this with the existing toolchain. But some of that is “aftermarket” SD optimization, and this could maybe see some of that, too.
But there are consumer cards with 14GB+ VRAM, so it’s not, even before optimization, out of reach of consumer hardware.
I’m not sure why “denoise in latent space at 64x64 and decode to pixel space at target resolution” is fundamentally better than “denoise in pixel space at 64x64, then upscale to pixel space at target resolution and denoise some more”.
The former seems likely to be lower compute-for-resolution, but that’s not the only consideration for “better”...
I didn't think about that but you're totally right, assuming they have those embeddings cached it would be super easy to retrain SD using them. 11B parameter count is rather unfortunate though tbh, I've never been the biggest fan of "scale is all you need" even though it seems to ring irritatingly true most of the time.
Seems to be entirely a different approach for diffusion.
>DeepFloyd IF works in pixel space. The diffusion is implemented on a pixel level, unlike latent diffusion models (like Stable Diffusion), where latent representations are used.
As far as I can tell from Emad's discord and twitter discussion, the idea appears to be to make this a "research" release, and therefore the worse license.
At a later point the model will be renamed "StableIf", and released with a similar license to StableDiffusion.
Yeah Emad was clarifying this on the LAION discord the other day — plan is to have a better-licensed version out Eventually™, guess we'll see how long that takes.
This is the dumb part about open-source models. Criminals, governments, and propaganda spreaders need not worry about the license; but legitimate users do.
I hear this complaint often, especially in regards to gun control. Yes, there are a subset of people who do what they are going to do irrespective of laws. There is also a middle ground of people where the laws might curtail unwanted behavior. But the main purpose is it provides a basis for punishing unwanted behavior.
To make it concrete, one could argue that bank robbers rob banks even though it is illegal, so why have a law against it since law abiding people aren't going to rob banks. Does anyone really think we should remove such laws?
DeepFloyd IF is a state-of-the-art text-to-image model released on a non-commercial, research-permissible license that provides an opportunity for research labs to examine and experiment with advanced text-to-image generation approaches. In line with other Stability AI models, Stability AI intends to release a DeepFloyd IF model fully open source at a future date.
Any web based front ends yet? I put together a system that runs a variety of web based open source AI image generation and editing tools on Vultr GPU instances. It spins up instances on demand, mounts an NFS filesystem with local caching and a COW layer, spawns the services, proxies the requests, and then spins down idle instances when I'm done. Would love to add this, suppose I could whip something up if none exists.
Seeing a lot of text-to-image out there recently. Does anyone know what the current state of the art is on image-to-text? Thinking something similar to Midjourney's /describe command that they added in v5
While it's not publicly available yet, I have strong suspicions that multimodal GPT-4 may actually be SOTA in image-to-text. The examples shown in the Sparks of AGI paper were extremely impressive imo, though of course those are cherry-picked so it's unclear how well the model will perform on non-cherry-picked images.
There's a discord with tons of sample images, where we've been waiting patiently for the release, coming SOON, for 3 months now. https://discord.gg/pxewcvSvNx
What these AI companies need are some good old-fashioned leakers. We should be seeing these models show up on sketchy pirate sites, complete with garish 80s-style cracking screens crediting various '1337 haX0rs with witty pseudonyms.
Website design main page. Bright vibrant neon colors of the rainbow slimes, slime business, kid attention grabbing, splashes of bright neon colors. Professional looking Website page, high quality resolution 8k
Website design for slime. Professional looking, high-quality, 8k, brightest neon colors of the rainbow slimes, splashes of neon colors in background, kid attention grabbing, eye catching
Not necessarily. IMO a good model needs to follow your prompt well, and that was my problem with Stable Diffusion.
I've been trying to get a good portrait picture with "neon lights" in Stable Diffusion and it is almost impossible. Meanwhile with the new Dall-e, that was possible. The picture, especially with SDXL, is good, but it doesn't really have neon lights...
I tried a similar prompt now on DeepFloyd and managed to get there!
Definitely possible :)
I've been doing this with new Dall-e + img2img with Stable Diffusion.
Explaining: I created a model of me, and wanted to create some good realistic portrait pictures. First I tried to create a model of me using some of the custom models that already exist, and the result was bad.
Then I tried SD 1.5/2.1... It was better, but I couldn't really get some of the prompts to come out right...
Then I tried new Dall-e, saved, and inserted my face with img2img on SD and it worked much better!
(via https://news.ycombinator.com/item?id=35743727, but we've merged that thread into this earlier one)