> Note that the model is substantially more compute-intensive than Stable Diffusion, so it may be slower even though that space is running on an A100.
Any more specifics on this? Sampling has gotten better too. My primary concern is the amount of memory necessary for a generation (batch size = 1, or I guess "2" using classifier free guidance).
Similar to OpenAI's cascaded diffusion models and GLIDE, you can presumably run the models in sequence, unloading earlier models from memory to make room for the 2nd and 3rd stage models.
Right now, I really only need 256px resolution. So, will I be able to fit the first stage (64px) model in memory on its own with my 12 GB RTX 3060? What about the 2nd stage (64px -> 256px)? 3rd stage?
HF also wrote a blog post on how you can mess around with the model in a python notebook using their excellent Diffusers library: https://huggingface.co/blog/if
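For reference, the basic diffusers usage from that post looks roughly like this -- a sketch assuming diffusers >= 0.16 and that you've accepted the license for the gated DeepFloyd/IF-I-XL-v1.0 weights; this is only stage I, the 64px generator:

  import torch
  from diffusers import DiffusionPipeline

  # Stage I only: text -> 64x64 image; stages II/III upscale it afterwards.
  stage_1 = DiffusionPipeline.from_pretrained(
      "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
  )
  stage_1.enable_model_cpu_offload()  # lower peak VRAM at the cost of some speed

  prompt = 'a photo of a corgi holding a sign that says "IF"'
  image = stage_1(prompt=prompt).images[0]
  image.save("corgi_64px.png")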
I knew the model would have difficulty fitting into a 16GB VRAM GPU, but "you need to load and unload parts of the model pipeline to/from the GPU" is not a workaround I expected.
At that point it's probably better to write a guide on how to easily set up a VM with an A100 instead of trying to fit it into a Colab GPU.
It is able to put people on the left/right and put the correct t-shirts and facial expressions on each one. This is compared to mj which just mixes together a soup of every word you use and plops it out into the image. Huge MJ fan of course, it's amazing, but having compositional power is another step up.
Midjourney always looks very aesthetically pleasing, I guess because of their RLHF tuning with Discord data... But it doesn't really follow prompts as well as Dall-e, for example.
But in the end, people want pretty pictures. So it's a complicated situation.
The tweet they shared is from February and uses an outdated version of MJ, this is what I got from V5: https://i.imgur.com/0uxtZDe.png
Midjourney does much better overall. Composition is neat, but MJ is so incredibly far ahead in terms of quality of output, it honestly doesn't matter if you have to go and do composition manually (and with new AI based tools, that's easier than ever too. Do a bad cut and paste job then infill your way back to a coherent image)
But it didn't work? In yours there is no "Nexus", no smiling, no frowning, and the man on the right doesn't look Asian. Compared with the image in the tweet, MJ failed at this task.
That depends on what your goal was? If your goal was to get an AI model to generate copyrighted images, and misunderstand the relationship between Indians and Asians, then sure MJ failed (and I'm guessing that's the goal of the prompt).
But if I actually wanted a useful picture, I could work with what MJ gave me despite having minimal image editing skills. The DeepFloyd result looks like it's 8-12 months behind what MJ gave and wouldn't be salvageable.
Again, I don't know why everyone is so defensive. I love MJ. There's nothing wrong with admitting that other models might do certain things better. We all can use any model we want.
Yeah perhaps I was not good at judging tone. To me it's a matter of fact thing that MJ isn't good at this. They'd admit it, it's not a big deal and I'm a fan.
It's not ad hominem at all... mj isn't as good at certain types of composition as others. I don't get why people are pretending that isn't the case. I want everyone to have great models and IF is part of that progress. Perhaps calling it "word soup" was offensive? This isn't your religion, though, it's just a model. Listening in on the MJ office hours they're the farthest thing you can be from dogmatic or arrogant. They want to improve as we all do. I personally am just really inspired that everyone can advance together!
Also see downthread - the first 32 images I generated attempting to reproduce the claim that "actually MJ can do this" all failed. The person who challenged me then ignored it. This isn't really up for debate until someone sends a seed where mj can do the cube + sphere thing well.
A human interpreting the prompt would see "asian" as being in contrast to "indian" in the language of the prompt... Not a level of comprehension that can be expected of current models but maybe in a few years (months?).
I'm human. I interpreted Asian as from a non-specific part of Asia. I realise though that "Asian" has a very specific meaning in the US, but it's only the US that does this. For the rest of the world Asian means someone from Asia.
That's kind of what I meant... If a prompt specifies one "Indian", and one "Asian", that implies that the writer of the prompt doesn't think of Indian as Asian so probably from the US background.
The thing is... IF is currently just a base model, it will need serious fine-tuning before it will produce aesthetically pleasing images (like MJ certainly does).
It's interesting to see what IF can do in terms of composition, text rendering etc, it's very promising if aesthetically pleasing images can be achieved via fine-tuning (the same happened with SD... current publicly fine-tuned models can achieve much higher levels of quality and cohesion than the base models, here's the prompt in an SD2.1 based model: https://imgur.com/a/ELGMSmV ).
Of course, fine-tuning IF is likely more challenging, as both of the first two stages and the 4x SD upscaler might need to be fine-tuned...
Well... Kind of, photobashing with midjourney doesn't guarantee you the same image or even necessarily the objects in the same places, even if you increase the image weight value up to its maximum of two. ('--iw 2')
Many times you'll have no other choice but to use a diffusion model with img2img.
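If it helps, the img2img pass is only a few lines in diffusers; a rough sketch (the model id, file names, and strength value are just illustrative assumptions):

  import torch
  from PIL import Image
  from diffusers import StableDiffusionImg2ImgPipeline

  pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
  ).to("cuda")

  # Start from a rough cut-and-paste composition and let the model re-cohere it.
  init = Image.open("photobash.png").convert("RGB").resize((512, 512))
  out = pipe(
      prompt="two men facing each other, photorealistic",
      image=init,
      strength=0.6,       # how far to stray from the pasted-together source
      guidance_scale=7.5,
  ).images[0]
  out.save("coherent.png")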
I agree with OP though, the market has spoken and the vast majority of people use prompts hardly more nuanced than a 90s Mad Magazine book of Mad Libs.
I wasn't referring to photobashing, I meant firing up SD and ArtStudio for 10 minutes and getting something that looks amazing and has the desired composition.
Overall this feels like trying to get ChatGPT to do math: just let ChatGPT offload math to Wolfram.
Similarly I'd rather just offload the composition. Now we even have SAM which will happily pick out the parts of the image you want to compose
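For what it's worth, pulling subjects out with SAM is only a few lines with Meta's segment-anything package; a rough sketch (the ViT-H checkpoint file is whatever you downloaded from their repo):

  import cv2
  from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

  sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
  mask_generator = SamAutomaticMaskGenerator(sam)

  # SAM expects an RGB uint8 array; returns one dict per segment.
  image = cv2.cvtColor(cv2.imread("generation.png"), cv2.COLOR_BGR2RGB)
  masks = mask_generator.generate(image)  # keys include "segmentation", "bbox", "area"

  # Pick the segments you want to cut out and recompose / inpaint elsewhere.
  subjects = sorted(masks, key=lambda m: m["area"], reverse=True)[:2]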
Spatial composition can be done easily, if you stop bothering with pure text-to-image (SD has several tricks and UIs to place objects precisely, they are all janky but they do work, that's practically photobashing). Attribute separation is also easily done with tricks like token bucketing, so your Indian guy will look Indian, and your East Asian guy will look East Asian. All of that is easy if you abandon the ambiguous natural language and use higher-order guidance.
What's really required is semantic composition. Making subjects meaningfully and predictably interact, or combining them together. And also the coherence of the overall stitched picture, so you don't end up with several different perspective planes.
Can't wait to see what all of this is going to look like in ten years. I know we are all nitpicking right now, but these results are totally mind-blowing already.
I wonder when we can start fuzzing brains. Wire you up to a machine that measures happiness or anxiety or anger or whatever, and keep re-generating results that hone in on the given emotion.
So it seems like Midjourney looks better and has more realistic faces but this one is better at following the prompt and generating what you actually want
Midjourney follows the “facing each other” better, but DF has readable text. This may be because DF also suffers from an effect that, ISTR, is common to MJ and SD: when you specify conflicting things it understands, it prioritizes the “visible” one (the specific text, in this case) over the “compositional” one (in this case, “facing each other”).
That's not how any of this works. What do you even mean by "compositional power"? Every model speaks a different "language"; comparing prompts like this has no merit and only shows that the person making the claim lacks understanding of the subject matter.
Compositional power might mean "the image more resembles the composition you want and describe"
i.e. if you say "a red cube on a green sphere" in DeepFloyd, you will get it. If you say that in MJ, you won't. That means you have more power to compose the image you want with this tool.
No, it just means you don't understand how to prompt MJ; you don't understand its language. You might like French more, but that doesn't mean it's a better language than English. MJ even says in their FAQs that their model doesn't understand language like humans do...
The point of text to image model is for them to accept natural language (yes, in practice, they all benefit from specialized prompting done with an understanding of model quirks, but that’s not the goal.)
The way you prompt is a preference, like programming languages. Of course the layman might use the JavaScript of generative models because it's easier to start and there are a lot of tutorials, but some might prefer something more exotic which can produce the same or better quality. Whatever floats your boat, but don't try to compare it like the guy in OP's tweets.
MJ and stable also make clear that their models don't understand language like humans do.
BTW I am a HUGE fan of MJ, and attend the office hours, and have done 35k+ images there. So you may have misinterpreted how much of a supporter of it I am.
Interesting, I do not get the results you do. What additional parameters are you using? Here is a link to some of my tests, with all default settings, some in v5 some in v4. https://twitter.com/eb_french/status/1651370091869786112
None, and as an experienced user you should know that it's not one-shot, and most of the time not even few-shot... You can't compare cherry-picked press images with a few shots of a 5-second prompt. I don't know why you want to hype something up if you can't really compare it. It seems extremely attention-grifting.
Just look at their cherry picks in this discord... https://discord.com/invite/pxewcvSvNx .
It's overfitted on copyrighted images (afghan girl) and doesn't show more "compositional power" at all, most of the time ignoring half of the prompt.
> as an experienced user you should know that's it's not one shot
Being "not one shot" for most nontrivial prompts is a failure of current t2i models, its what they all strive for and its what DF supposedly does a lot better. And, while its possible to spin things pretty hard when people can't bang on it themselves, I think the indication is that it is, in fact, a major leap forward from the best current consumer-available t2i models (it looks pretty comparable to Google Imagen – a little bit worse benchmark scores – which is unsurprising since it seems to be an implementation of exactly the architecture described in Google's Imagen paper.
> It’s overfitted on images with copyright (afghan girl)
It’s…not, though. Sure, the picture with a prompt which is suggestive of that (down to even specifying the same film type) gives off a vibe that completely feels, if you haven’t recently looked at the famous picture but are familiar with it, like a “cleaned up” version of that picture, so you might intuitively feel it’s from overfitting, that it is basically reproducing the original image with slight variations.
Look at the two pictures side-by-side and there is basically nothing similar about them except exactly the things specified in the prompt, and pretty much every aspect of the way those elements of the prompt are interpreted in the DF image is unlike the other image.
It's not a major leap, not even a small one, because it's exactly like Imagen. It's Stability giving some Ukrainian refugees compute time to train "their" model for publicity. It's about the whom and not the what, as it should be.
I "feel" nothing I am telling it how it is. Look at the afghan girl example again it. Close up portrait, same clothing, same comp, expressive eyes... and most important burn in like every other overfitted image in diffusion networks.
You guys all want it to be something special and I get it, new content, new shiny toy but it's neither a good architecture nor a good implementation.
> It’s not a major leap not even a small one because it’s exactly like imagen.
I would agree, if imagen was a “consumer-available t2i model”. What’s available is a research paper with demo images from Google. The model itself is locked up inside Google, notionally because they haven’t solved filtering issues with it.
> Look at the afghan girl example again it. Close up portrait, same clothing, same comp, expressive eyes…
You look at it again, literally none of those things are the same: it’s not the same clothing (the material and color of the head scarf is different, the headscarf is the only visible clothing in the DF image, whereas that is not the case in the famous image), the condition of the head scarf is different, the hair color is different, the hair style is different, the hair texture is different, the face shape is different, the individual facial features are different, the eye color is much more brown in the DF image, the facial expression is different, the DF image has lipstick and eyeshadow, the famous image has a dirty face and no makeup, the headscarf is worn differently in the two images, the background is different, the lighting is different, and the faces are framed differently.
The similarities are (1) it’s a close-up portrait, (2) a general ethnic similarity, (3) they are both wearing a red (though very different red) head scarf, and (4) they are both looking straight into the camera. (2)-(4) are explicitly prompted; (1) is strongly implied in the prompt, which addresses nothing that isn’t related to the face/head. This isn’t “overfitting on a copyrighted image”, it’s getting what you prompt, with no other similarity to the existing image.
> You guys all want it to be something special and I get it,
I’m actually kind of annoyed, because I’ve been collecting tooling, checkpoints, and other support for, and spending quite a bit of time getting proficient in dealing with the quirks of, Stable Diffusion. But, that’s life.
> it’s neither a good architecture nor a good implementation.
I’d be interested in hearing your specific criticism of the architecture and implementation, but hopefully it’s more grounded in fact than your criticism of the one image...
I thought it was pretty definitive at the time, but when you look really closely (as Scott's opponent is likely to do), it didn't seem like a clear win yet. But that was 3 months ago, and hopefully DF is even better now.
New restriction in their License suggests the software can't be modified.
"2. All persons obtaining a copy or substantial portion of the Software,
a modified version of the Software (or substantial portion thereof), or
a derivative work based upon this Software (or substantial portion thereof)
must not delete, remove, disable, diminish, or circumvent any inference filters or
inference filter mechanisms in the Software, or any portion of the Software that
implements any such filters or filter mechanisms."
As someone who's largely "OK" with morality clauses in otherwise liberal AI licenses, I think we should start calling these "weights-available" models to distinguish from capital-F Free Software[1] ones.
I'm starting to get irritated by all these 'non-commercial' licensed models, though, because there is no such thing as a non-commercial license. In copyright law, merely having the work in question is considered a commercial benefit. So you need to specify every single act you think is 'non-commercial', and users of the license have to read and understand that. Even Creative Commons' NC clause only specifies one; they say that filesharing is not commercial. So it's just a fancy covenant not to sue BitTorrent users.
And then there's LLaMA, whose model weights were only ever shared privately with other researchers. Everyone using LLaMA publicly is likely pirating it. Actual weights-available or Free models already exist, such as BLOOM, Dolly, StableLM[0], Pythia, GPT-J, GPT-NeoX, and CerebrasGPT.
[0] Untuned only; the instruction-tuned models are frustratingly CC-BY-NC-SA because apparently nobody made an open dataset for instruction tuning.
[1] Inasmuch as an AI model trained on copyrighted data can even be considered Free.
Yep. It's getting really exhausting seeing projects falsely advertising themselves as "open source". Either be FOSS or don't be; don't pretend to be while using some nonsense like the BSL or whatever adhocery is in play here.
In the README they even call it "Modified MIT", the modification being where they turned it from a very permissive license into a fully proprietary one. Very cool model though.
>New restriction in their License suggests the software can't be modified.
To remove filters.
"Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:"
There is a similar license clause for the weights[0] as well, so I'm not sure this would apply unless you write the code and train your model from scratch.
They are technically open source. It's just that the model license prohibits commercial use and the code license prohibits bypassing the filters. So it's kind of worse than closed source in a way because it's like a tease. With no API apparently.
Theoretically large companies or rich people might be able to make a licensing agreement.
> They are technically open source. It's just that the model license prohibits commercial use and the code license prohibits bypassing the filters.
Your second sentence contradicts the first. Prohibiting commercial use and prohibiting modification are each, in and of themselves, mutually exclusive with being "technically open source" (let alone both at the same time).
That's already prohibited by, you know, those very same copyright and privacy laws. Adding those same prohibitions to the license not only makes the software nonfree, but pointlessly does so.
It’s not pointless: it means the model licensor has a claim against you, as well as whoever would for violating the referenced laws; it also means, and this is probably more important, that in some jurisdictions, the model licensor has a better defense against liability for contributory infringement if the licensee infringes.
EDIT: That said, it’s unambiguously not open source.
> it means the model licensor has a claim against you
Right, but to what end? The only reason the licensor should care one way or another is the licensor being held liable for what folks do with the software, in which case...
> it also means, and this is probably more important, that in some jurisdictions, the model licensor has a better defense against liability for contributory infringement if the licensee infringes.
Do hardware stores need to demand "thou shalt not use this tool to kill people" to their customers to avoid liability for axe murders under such jurisdictions? Or car manufacturers needing to specify "you will not use this product to run over schoolchildren at crosswalks"?
Like, I'm sure such jurisdictions exist, but I somehow doubt license terms in an EULA nobody (except for us nerds) will ever read would be sufficient in such a kangaroo court.
(EDIT: also, I'm pretty sure the standard warranty disclaimer in your average FOSS license already covers this, without making the software nonfree in the process)
> Do hardware stores need to demand "thou shalt not use this tool to kill people" to their customers to avoid liability for axe murders under such jurisdictions?
Generally, not, because vicarious liability for battery and wrongful death doesn’t work like, e.g., contributory copyright infringement.
> I'm pretty sure the standard warranty disclaimer in your average FOSS license already covers this
No, warranty disclaimers don’t cover this, because (1) it’s not a warranty issue, and (2) disclaimers, if they have legal effect at all, affect liability the disclaiming party would otherwise have to the party accepting the disclaimer, not liability the disclaiming party would have to third parties.
> Generally, not, because vicarious liability for battery and wrongful death doesn’t work like, e.g., contributory copyright infringement.
Judging by youtube-dl, it seems like it does work that way, at least in my jurisdiction; I guess we'll see if the RIAA doubles down on trying to wipe it from the face of the Earth, but considering there hasn't been much noise, I wouldn't count on it. Also, to my adjacent point, I highly doubt the RIAA would've refrained from attempting to take down youtube-dl even if youtube-dl's license prohibited its users from circumventing DRM with it.
It prohibits both commercial use, whether or not you break regional laws; and it prohibits breaking certain laws. As another user said, encoding the law into a licence is pointless but makes it non-free.
There are also problematic restrictions on your ability to modify the software under clause 2(c). Nor do you have the right to sublicense; it's not clear to me what rights somebody has if you give them a copy.
Ah I guess I read it right then I reread it wrong, my bad and thanks for the pointer! That's a shame but hopefully it's released in a more open way in the future. My interest is in building good collaborative interfaces (and games!) on top of these things.
For anyone who doesn't know, DeepFloyd is a StableDiffusion style image model that more or less replaced CLIP with a full LLM (11b params). The result is that it is much better at responding to more complex prompts.
In theory, it is also smarter at learning from its training data.
Not really, it's a cascaded diffusion model conditioned on the T5 encoder, there is nothing really in common, unless you mean that using a diffusion model is "SD style".
> It isn’t like Stable Diffusion, it’s more like Google’s Imagen model.
Yeah, it looks exactly (architecturally) like Imagen.
Google would be running circles around everyone in Generative AI (maybe OpenAI would still have a better core LLM, maybe, but portfolio-wise) if they simply had the ability to cross the gap between building technologies and writing up research papers on them and actually releasing products.
It looks like the model on Hugging Face either hasn't been published yet or was withdrawn. I got this error in their Colab notebook:
OSError: DeepFloyd/IF-I-IF-v1.0 is not a local folder and is not a valid model identifier listed on
'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or
log in with `huggingface-cli login` and pass `use_auth_token=True`.
I'm quite curious how much of the improvement on text rendering is from the switch to pixel-space diffusion vs. the switch to a much larger pretrained text encoder. I'm leaning towards the latter, which then raises the question of what happens when you try training Stable Diffusion with T5-XXL-1.1 as the text encoder instead of CLIP — does it gain the ability to do text well?
DeepFloyd IF is effectively the same architecture/text encoder as Imagen (https://imagen.research.google/), although that paper doesn't hypothesize why text works out a lot better.
Right, I'm aware of the Imagen architecture, just curious to see further research determining which aspect of it is responsible for the improved text rendering.
EDIT: According to the figure in the Imagen paper FL33TW00D's response referred me to, it looks like the text encoder size is the biggest factor in the improved model performance all-around.
The CLIP text encoder is trained to align with the pooled image embedding (a single vector), which is why most text embeddings are not very meaningful on their own (but still convey the overall semantics of the text). With T5 every text embedding is important.
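You can see the difference directly in the encoder outputs; a quick sketch with transformers (t5-small stands in for T5-XXL so it runs anywhere, and the CLIP id is the one SD 1.x uses):

  from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

  text = "a red cube on top of a green sphere"

  clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
  clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
  clip_out = clip_enc(**clip_tok(text, return_tensors="pt"))
  print(clip_out.last_hidden_state.shape)  # per-token states, [1, seq_len, 768]
  print(clip_out.pooler_output.shape)      # the single pooled vector CLIP training aligns

  t5_tok = T5Tokenizer.from_pretrained("t5-small")
  t5_enc = T5EncoderModel.from_pretrained("t5-small")
  t5_out = t5_enc(**t5_tok(text, return_tensors="pt"))
  print(t5_out.last_hidden_state.shape)    # every token embedding carries usable signal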
I think this model will result in a massive new wave of meme culture. AI's already seen success in memes up to this point, but the ability for readable text to be incorporated into images totally changes the game. Going to be an interesting next few months on the interwebz, that's for sure. Exciting times!
This could be super cool for logos. I've tried using Stable Diffusion to generate logos and it does a pretty good job of helping brainstorm, but the text is always gibberish, so you can use its ideas but you have to add your own text, which basically means creating a logo from scratch using its designs as inspiration.
Well, surely you don't expect to just take the generated stuff and use it right away? For logos you usually need something that some entity can hold the claim on, so you probably need a human touch in any case.
You can't use this to make logos for any commercial product, and it's not safe to use it for hobby projects either, based on the current model license.
> You can't use this to make logos for any commercial product
Yeah good luck figuring out that a particular logo was generated with this particular model. And if someone does good luck doing anything about it.
With this amount of fear one wouldn't dare to cross a road without three layers of bubble wrap, plus written authorisation from a lawyer plus a feasibility study from a traffic engineer.
You'll be asked to produce documents verifying your ownership and/or compliance with any licenses for all intellectual property. That includes code and graphics (logos).
Sure. And what documents and paperwork do you expect if I, the owner of the company, drew the logo myself with a bit of crayon/inkscape/gimp/blender? Those exact documents will be produced.
I think this is just for pre-release, and they will release it fully licensed for commercial use. It doesn't make sense to have a model like this that can do game-changing text and logos... but then not license it for commercial use. If they don't, that would be ridiculous.
There's fundamental tradeoffs, as there always will be when you're compressing things into an image model.
So, this is going to have new different issues. Since it's similar to Imagen, it probably can't handle long complex prompts as well, since they developed Parti afterward.
Here's my question: are there any image models where, if you prompt "1+1", you get an image showing "3"?
Well, yeah, it’s a bigger set of models (particularly the language model) that takes more resources (both to train and for inference). That’s the tradeoff.
> Here’s my question: are there any image models where, if you prompt “1+1”, you get an image showing “3”?
You want a t2i model that does arithmetic in the prompt, translates it to “text displaying the number <result>”, but also does the arithmetic wrong?
Yeah, I don’t think that combination of features is in any existing model or, really, in any of the datasets used for evaluation, or otherwise on anyone’s roadmap.
"Actually thinking about your prompt" is a necessary part of being able to make the prompts natural language instead of a long list of fantasy google image search terms.
Useful example being "my bedroom but in a new color", but some things I've typed into Midjourney that don't work include "a really long guinea pig" (you get a regular size one), "world's best coffee" (the coffee cup gets a world on it), etc. It's just too literal.
I don't think they're saying that's a goal, I think they're curious if it is the case. LLMs are bad at arithmetic, this uses a LLM to process the prompt, that class of result seems plausible.
We already knew those were going to be solved by scale like using T5 instead of the really small bad text encoder SD used, because they were solved by Imagen etc.
There's a note which suggests you might be able to get by on lower. My 3060 struggles with SD on the defaults, but works fine with float16.
There are multiple ways to speed up the inference time and lower the memory consumption even more with diffusers. To do so, please have a look at the Diffusers docs:
Optimizing for inference time [1]
Optimizing for low memory during inference [2]
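Concretely, the usual knobs look like this (a sketch against the stage-I checkpoint; the same methods exist on the SD pipelines):

  import torch
  from diffusers import DiffusionPipeline

  pipe = DiffusionPipeline.from_pretrained(
      "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
  )

  pipe.enable_attention_slicing()        # trade some speed for lower peak VRAM
  pipe.enable_model_cpu_offload()        # keep only the active submodule on the GPU
  # pipe.enable_sequential_cpu_offload() # even lower VRAM, much slower
  # pipe.enable_xformers_memory_efficient_attention()  # if xformers is installed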
If you don't mind the power consumption I noticed that older nvidia P6000's (24GB) are pretty cheap on ebay! My 16GB P5000 is pretty handy for this stuff.
An M40 24GB is less than $200, if you don't mind the trouble of getting its drivers installed, cooling it, etc. It's also important to note your motherboard must support larger VRAM addressing; many older chipsets won't be able to boot with it (i.e. some, perhaps almost all, Zen 1-era boards).
IIRC SD’s first-party code had significantly higher requirements than is possible with some of the downstream optimizations most people are using, which have lower speed and more system RAM use but reduce peak VRAM use.
The architecture here looks different, but the code is licensed in a way which still makes downstream optimization and redistribution possible, so maybe there will be something there.
They took down the blogpost, but from what I remember the model is composite and consists of a text encoder as well as 3 "stages":
1. (11B) T5-XXL text encoder [1]
2. (4.3B) Stage 1 UNet
3. (1.3B) Stage 2 upscaler (64x64 -> 256x256)
4. (?B) Stage 3 upscaler (256x256 -> 1024x1024)
Resolution numbers could be off though. Also the third stage can apparently use the existing stable diffusion x4, or a new upscaler that they aren't releasing yet (ever?).
> Once these are quantized (I assume they can be)
Based on the success of LLaMA 4bit quantization, I believe the text encoder could be. As for the other modules, I'm not sure.
edit: the text encoder is 11B, not 4.5B as I initially wrote.
You'll be able to optimize it a lot to make it fit on small systems if you are willing to modify your workflow a bit: instead of 1 prompt -> 1 image, _n_ times, do 1 prompt -> _n_ images in one batch, and only take _m_ of them further... For a given prompt, run it through the T5 model once and store the embedding; you can do that in CPU RAM if you have to, because you only need the embedding once, so you don't need a GPU which can run T5-XXL naively. Then you can get a large batch of samples from #2; 64px is enough to preview; only once you pick some do you run them through #3, and then from those through #4. Your peak VRAM should be 1 image in #2 or #4, and that can be quantized or pruned down to something that will fit on many GPUs.
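A rough sketch of that staged workflow with diffusers, following the patterns from the HF blog post (treat the exact flags, e.g. the 8-bit text-encoder load, as assumptions):

  import gc
  import torch
  from diffusers import DiffusionPipeline
  from transformers import T5EncoderModel

  prompt = "an astronaut riding a horse, oil painting"

  # 1) Encode the prompt once and cache the embeddings (8-bit keeps T5-XXL small).
  text_encoder = T5EncoderModel.from_pretrained(
      "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder",
      load_in_8bit=True, device_map="auto",
  )
  encode_pipe = DiffusionPipeline.from_pretrained(
      "DeepFloyd/IF-I-XL-v1.0", text_encoder=text_encoder, unet=None
  )
  prompt_embeds, negative_embeds = encode_pipe.encode_prompt(prompt)
  del encode_pipe, text_encoder
  gc.collect(); torch.cuda.empty_cache()

  # 2) Stage I: generate a batch of cheap 64px previews from the cached embeddings.
  stage_1 = DiffusionPipeline.from_pretrained(
      "DeepFloyd/IF-I-XL-v1.0", text_encoder=None,
      variant="fp16", torch_dtype=torch.float16,
  )
  stage_1.enable_model_cpu_offload()
  previews = stage_1(
      prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
      num_images_per_prompt=4, output_type="pt",
  ).images
  del stage_1
  gc.collect(); torch.cuda.empty_cache()

  # 3) Stage II: upscale only the previews you actually like.
  stage_2 = DiffusionPipeline.from_pretrained(
      "DeepFloyd/IF-II-L-v1.0", text_encoder=None,
      variant="fp16", torch_dtype=torch.float16,
  )
  stage_2.enable_model_cpu_offload()
  picked = previews[:1]
  images = stage_2(
      image=picked, prompt_embeds=prompt_embeds,
      negative_prompt_embeds=negative_embeds,
  ).images
  images[0].save("selected_256.png")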
>Can anyone explain why it needs so much ram in the first place though?
The T5-XXL text encoder is really large. Also, we do not quantize the UNets: the UNet outputs 8-bit pixels, so quantizing the UNet to that precision would create pretty bad outputs.
LDM-400M was already able to generate text (predecessor of Stable Diffusion), thanks to the fact that every token in the text encoder (trained from scratch) was available in the attention layer.
I'm also very happy about the release of the two upscalers; I can use them to upscale the results of my small 64x64 DDIM models (maybe with some fine-tuning).
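In case it's useful to anyone trying the same thing, here's roughly what driving the SD x4 upscaler standalone looks like in diffusers (the IF stage-II upscaler should be similar but also wants T5 prompt embeddings; file names and the prompt are placeholders):

  import torch
  from PIL import Image
  from diffusers import StableDiffusionUpscalePipeline

  up = StableDiffusionUpscalePipeline.from_pretrained(
      "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
  ).to("cuda")

  # Point it at any small sample, e.g. from a homegrown 64x64 DDIM model.
  low_res = Image.open("ddim_sample_64.png").convert("RGB")
  hi_res = up(prompt="a detailed photo", image=low_res).images[0]
  hi_res.save("ddim_sample_256.png")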
I would be more interested in image-to-text models. Does someone know of any decent model? I saw the GPT4 demo, and they showed that they do image-to-text... but then that was actually a fake (i.e., the model was interpreting the image filename).
Can you provide a source on gpt4 image model being fake? I haven’t heard that before, though I have wondered why I haven’t heard anything about the image part and don’t have access to image processing myself.
CLIP Interrogator[1], which is also built into AUTOMATIC1111, gives quite reasonable results, at least if all you need is a prompt; it can't handle complex interactions:
Output: "a white horse with a sign that says rexel's in space, pixelperfect, inspired by Paul Kelpe, official simpsons movie artwork, alternate album cover, in style of nanospace, by Apelles, pickles, pespective, pop surrealism, ingame, in a space cadet outfit, sifi"
AFAIK, converting an image to a text summary isn't really a thing by itself. The related work would be "visual reasoning" which is the ability to ask things about the image in natural language and get responses back also in natural language.
I believe the current SOTA test for NLVR is VQAv2[0] or GQA[1].
This does outperform Stable Diffusion 2.1, but uses a different architecture and requires more memory and compute. Stable Diffusion runs its denoising process in a compressed "latent space" which is how it was able to be so compute-efficient compared to other diffusion models. It also uses the (relatively) small text encoder from OpenAI's CLIP model to encode user prompts. Both of these optimizations meant that it could run much faster compared to say, DALLE or Imagen, but it didn't follow complicated user prompts especially well and had trouble with things like counting and text-rendering.
DeepFloyd IF is based on Google's Imagen model, which has two key differences from Stable Diffusion: (1) it denoises in pixel space instead of a compressed latent space, and (2) it uses a 10x larger pretrained text encoder (T5-XXL-1.1) compared to SD's CLIP encoder. (1) allows it to better render high-frequency details and text, and (2) allows it to understand complex prompts much better. These improvements come at the cost of multiple times more memory usage and compute requirements compared to SD, though.
In terms of "will it replace SD?"—in the short term I think yes. But I still think latent diffusion models are the future. For example, Stability is gearing up to release Stable Diffusion XL right now, a larger version of the original SD that does higher fidelity and higher resolution generations. I wouldn't be surprised if it takes the crown back from DeepFloyd when it releases, but I guess we'll have to see.
> For example, Stability is gearing up to release Stable Diffusion XL right now, a larger version of the original SD that does higher fidelity and higher resolution generations. I wouldn't be surprised if it takes the crown back from DeepFloyd when it releases, but I guess we'll have to see.
SDXL is available on StabilityAI’s hosted services already, so they can be compared head to head.
I believe the version available on DreamStudio is heavily RLHF-tuned, no? I'm mostly interested to see how the raw weights perform out of the box compared to IF, which we have to wait for the release for.
The VRAM requirements are higher (14GB), so lots of things that can do SD won’t do this with the existing toolchain. But some of that is “aftermarket” SD optimization, and this could maybe see some of that, too.
But there are consumer cards with 14GB+ VRAM, so it’s not, even before optimization, out of reach of consumer hardware.
I’m not sure why “denoise in latent space at 64x64 and decode to pixel space at target resolution” is fundamentally better than “denoise in pixel space at 64x64, then upscale to pixel space at target resolution and denoise some more”.
The former seems likely to be lower compute-for-resolution, but that’s not the only consideration for “better”...
I didn't think about that but you're totally right, assuming they have those embeddings cached it would be super easy to retrain SD using them. 11B parameter count is rather unfortunate though tbh, I've never been the biggest fan of "scale is all you need" even though it seems to ring irritatingly true most of the time.
Seems to be entirely a different approach for diffusion.
>DeepFloyd IF works in pixel space. The diffusion is implemented on a pixel level, unlike latent diffusion models (like Stable Diffusion), where latent representations are used.
As far as I can tell from Emad's discord and twitter discussion, the idea appears to be to make this a "research" release, and therefore the worse license.
At a later point the model will be renamed "StableIf", and released with a similar license to StableDiffusion.
Yeah Emad was clarifying this on the LAION discord the other day — plan is to have a better-licensed version out Eventually™, guess we'll see how long that takes.
This is the dumb part about open-source models. Criminals, governments, and propaganda spreaders need not worry about the license; but legitimate users do.
I hear this complaint often, especially in regards to gun control. Yes, there are a subset of people who do what they are going to do irrespective of laws. There is also a middle ground of people where the laws might curtail unwanted behavior. But the main purpose is it provides a basis for punishing unwanted behavior.
To make it concrete, one could argue that bank robbers rob banks even though it is illegal, so why have a law against it since law abiding people aren't going to rob banks. Does anyone really think we should remove such laws?
DeepFloyd IF is a state-of-the-art text-to-image model released on a non-commercial, research-permissible license that provides an opportunity for research labs to examine and experiment with advanced text-to-image generation approaches. In line with other Stability AI models, Stability AI intends to release a DeepFloyd IF model fully open source at a future date.
Any web based front ends yet? I put together a system that runs a variety of web based open source AI image generation and editing tools on Vultr GPU instances. It spins up instances on demand, mounts an NFS filesystem with local caching and a COW layer, spawns the services, proxies the requests, and then spins down idle instances when I'm done. Would love to add this, suppose I could whip something up if none exists.
Seeing a lot of text-to-image out there recently. Does anyone know what the current state of the art is on image-to-text? Thinking something similar to Midjourney's /describe command that they added in v5
While it's not publicly available yet, I have strong suspicions that multimodal GPT-4 may actually be SOTA in image-to-text. The examples shown in the Sparks of AGI paper were extremely impressive imo, though of course those are cherry-picked so it's unclear how well the model will perform on non-cherry-picked images.
There's a discord with tons of sample images, where we've been waiting patiently for the release, coming SOON, for 3 months now. https://discord.gg/pxewcvSvNx
What these AI companies need are some good old-fashioned leakers. We should be seeing these models show up on sketchy pirate sites, complete with garish 80s-style cracking screens crediting various '1337 haX0rs with witty pseudonyms.
Website design main page. Bright vibrant neon colors of the rainbow slimes, slime business, kid attention grabbing, splashes of bright neon colors. Professional looking Website page, high quality resolution 8k
Website design for slime. Professional looking, high-quality, 8k, brightest neon colors of the rainbow slimes, splashes of neon colors in background, kid attention grabbing, eye catching
Not necessarily. IMO a good model needs to follow your prompt well, and that was my problem with Stable Diffusion.
I've been trying to get a good portrait picture with "neon lights" in Stable Diffusion and it is almost impossible. Meanwhile with the new Dall-e, that was possible. The picture, especially with SDXL, is good, but it doesn't really have neon lights...
I tried a similar prompt now on DeepFloyd and managed to get there!
Definitely possible :)
I've been doing this with new Dall-e + img2img with Stable Diffusion.
Explaining: I created a model of me, and wanted to create some good realistic portrait pictures. First I tried to create a model of me using some of the custom models that already exist, and the result was bad.
Then I tried SD 1.5/2.1... It was better, but I couldn't really get some of the prompts to come out right...
Then I tried new Dall-e, saved, and inserted my face with img2img on SD and it worked much better!
(via https://news.ycombinator.com/item?id=35743727, but we've merged that thread into this earlier one)