Hacker News

Is this intended to replace Stable Diffusion? Somebody want to give the eli5?


This does outperform Stable Diffusion 2.1, but it uses a different architecture and requires more memory and compute. Stable Diffusion runs its denoising process in a compressed "latent space", which is what makes it so compute-efficient compared to other diffusion models. It also uses the (relatively) small text encoder from OpenAI's CLIP model to encode user prompts. Both of these optimizations meant that it could run much faster than, say, DALL-E or Imagen, but it didn't follow complicated user prompts especially well and had trouble with things like counting and text rendering.

DeepFloyd IF is based on Google's Imagen model, which has two key differences from Stable Diffusion: (1) it denoises in pixel space instead of a compressed latent space, and (2) it uses a 10x larger pretrained text encoder (T5-XXL-1.1) compared to SD's CLIP encoder. (1) allows it to better render high-frequency details and text, and (2) allows it to understand complex prompts much better. These improvements come at the cost of multiple times more memory usage and compute requirements compared to SD, though.
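
For a rough sense of what the "compressed latent space" buys, here is a back-of-the-envelope sketch. It assumes the commonly cited SD setup (a 512x512 RGB output and a VAE that downsamples 8x per side into 4 latent channels); those numbers are not stated above, and the helper name is made up for illustration.

```python
def tensor_elements(height: int, width: int, channels: int) -> int:
    """Number of values in one image-shaped tensor."""
    return height * width * channels

# Assumed SD-style setup: 512x512 RGB image, 8x downsampling VAE, 4 latent channels.
pixel = tensor_elements(512, 512, 3)
latent = tensor_elements(512 // 8, 512 // 8, 4)

# How many times fewer values the denoiser touches when it works on latents.
compression = pixel / latent
print(pixel, latent, compression)
```

Under those assumptions the denoiser sees about 48x fewer values per step in latent space, which is roughly where the speed advantage comes from. It says nothing about quality, which is the trade-off described above.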

In terms of "will it replace SD?"—in the short term I think yes. But I still think latent diffusion models are the future. For example, Stability is gearing up to release Stable Diffusion XL right now, a larger version of the original SD that does higher fidelity and higher resolution generations. I wouldn't be surprised if it takes the crown back from DeepFloyd when it releases, but I guess we'll have to see.


> For example, Stability is gearing up to release Stable Diffusion XL right now, a larger version of the original SD that does higher fidelity and higher resolution generations. I wouldn't be surprised if it takes the crown back from DeepFloyd when it releases, but I guess we'll have to see.

SDXL is available on StabilityAI’s hosted services already, so they can be compared head to head.


I believe the version available on DreamStudio is heavily RLHF-tuned, no? I'm mostly interested to see how the raw weights perform out of the box compared to IF, and for that we have to wait for the release.


Does the increased memory footprint mean it can't be run on a normal desktop like SD?


The VRAM requirements are higher (14GB), so lots of things that can do SD won’t do this with the existing toolchain. But some of that is “aftermarket” SD optimization, and this could maybe see some of that, too.

But there are consumer cards with 14GB+ VRAM, so it's not, even before optimization, out of reach of consumer hardware.


Damn. That's basically just the 4090 and 4080.


The 3090 has 24GB so it's an option as well.
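
A quick sanity check on which cards clear the bar, as a sketch only. The 14GB requirement and the 3090's 24GB come from this thread; the other VRAM figures are filled in from the published specs as I remember them, so treat them as assumptions (some cards also ship in more than one memory configuration).

```python
REQUIRED_GB = 14  # figure quoted earlier in this thread

# VRAM in GB. Only the 3090 number is from the thread; the rest are assumed specs.
cards = {
    "RTX 4090": 24,
    "RTX 4080": 16,
    "RTX 3090": 24,
    "RTX 3080": 10,
    "RTX 3060": 12,
}

def fits(vram_gb: int, required_gb: int = REQUIRED_GB) -> bool:
    """True if the card has at least the required VRAM (no offloading tricks)."""
    return vram_gb >= required_gb

usable = sorted(name for name, gb in cards.items() if fits(gb))
print(usable)
```

This ignores CPU offloading and reduced-precision tricks, which is the kind of "aftermarket" optimization that could lower the bar later.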


thanks for the explanation!

denoising in latent space certainly seems like the "correct" path. My (amateur) thinking is, the more you can do in latent space, the better.


I’m not sure why “denoise in latent space at 64x64 and decode to pixel space at target resolution” is fundamentally better than “denoise in pixel space at 64x64, then upscale to pixel space at target resolution and denoise some more”.

The former seems likely to be lower compute-for-resolution, but that’s not the only consideration for “better”...
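
To put rough numbers on the compute-for-resolution point, here is a sketch that only counts how many values each denoising stage works on for one image. It assumes an SD-style VAE (8x downsampling, 4 latent channels) for the latent route and a two-stage RGB cascade (64x64 base, then the target resolution) for the pixel route; the function name and the 512 target are illustrative choices, not anything from either model's code.

```python
def denoised_elements(target: int = 512) -> dict:
    """Values in the tensor each denoising stage works on (one image)."""
    # Latent route: assumes an SD-style VAE (8x downsampling, 4 channels).
    latent = (target // 8) * (target // 8) * 4
    # Pixel cascade: 64x64 RGB base, then an RGB upscaler stage at target size.
    base = 64 * 64 * 3
    upscale = target * target * 3
    return {"latent": latent, "pixel_base": base, "pixel_upscale": upscale}

sizes = denoised_elements(512)
print(sizes)
```

The base stages are about the same size; the difference is the upscaler stage, which denoises at full pixel resolution. This counts tensor sizes only and ignores model size, step counts, and output quality, so it backs up "lower compute" without settling "better".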


This was an excellent summary, ty.


tldr: bigger text encoder is better. SD will catch up quickly, as conditioning on a new set of precomputed text embeddings is a trivial change


I didn't think about that but you're totally right, assuming they have those embeddings cached it would be super easy to retrain SD using them. 11B parameter count is rather unfortunate though tbh, I've never been the biggest fan of "scale is all you need" even though it seems to ring irritatingly true most of the time.


Seems to be entirely a different approach for diffusion.

> DeepFloyd IF works in pixel space. The diffusion is implemented on a pixel level, unlike latent diffusion models (like Stable Diffusion), where latent representations are used.


> Stability AI releases DeepFloyd IF, a powerful text-to-image model

Hope not. This is a worse license.


As far as I can tell from Emad's Discord and Twitter discussion, the idea appears to be to make this a "research" release, hence the worse license.

At a later point the model will be renamed "StableIf", and released with a similar license to StableDiffusion.


Well, it's a better license than SDXL is available under right now (which is “you can’t have it, but you can use it on StabilityAI’s hosted services”).


Yeah Emad was clarifying this on the LAION discord the other day — plan is to have a better-licensed version out Eventually™, guess we'll see how long that takes.


for certain values of soon™ and eventually™


This is the dumb part about open-source models. Criminals, governments, and propaganda spreaders need not worry about the license; but legitimate users do.


This is the same problem with laws. The only people that follow them are legitimate users.


I hear this complaint often, especially with regard to gun control. Yes, there is a subset of people who will do what they are going to do irrespective of laws. There is also a middle ground of people for whom the laws might curtail unwanted behavior. But the main purpose is that a law provides a basis for punishing unwanted behavior.

To make it concrete, one could argue that bank robbers rob banks even though it is illegal, so why have a law against it, since law-abiding people aren't going to rob banks? Does anyone really think we should remove such laws?


Second paragraph in the link:

DeepFloyd IF is a state-of-the-art text-to-image model released on a non-commercial, research-permissible license that provides an opportunity for research labs to examine and experiment with advanced text-to-image generation approaches. In line with other Stability AI models, Stability AI intends to release a DeepFloyd IF model fully open source at a future date.



