Stable Cascade (github.com/stability-ai)
694 points by davidbarker 11 months ago | 170 comments



Been using it for a couple of hours and it seems much better at following the prompt. At first glance the quality seems worse than some SDXL models, but I’ll reserve judgement until I've had a couple more days of testing.

It’s fast too! I would reckon about 2-3x faster than non-turbo SDXL.


I'll take prompt adherence over quality any day. The machinery otherwise isn't worth it, i.e. the controlnets, openpose, depthmaps just to force a particular look or to achieve depth. The solution becomes bespoke for each generation.

Had a test of it and my opinion is it's an improvement when it comes to following prompts, and I do find the images more visually appealing.


Can we use its output as input to SDXL? Presumably it would just fill in the details, and not create whole new images.


I was thinking that exactly. You could use the same trick as the hires-fix for an adherence-fix.


Yeah chain it in comfy to a turbo model for detail


A turbo model isn't the first thing I'd think of when it comes to finalizing a picture. Have you found one that produces high-quality output?


For detail, it'd probably be better to use a full model with a small number of steps (something like KSampler Advanced node with 40 total steps, but starting at step 32-ish.) Might even try using the SDXL refiner model for that.
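The step window suggested above can be turned into a rough "how much of the schedule actually runs" figure. This is only a minimal sketch: `denoise_fraction` is a made-up helper name, and treating a step window as equivalent to a denoise strength is a rule of thumb, not an exact equivalence across samplers and schedulers.

```python
def denoise_fraction(total_steps: int, start_at_step: int) -> float:
    """Fraction of the sampling schedule that runs when sampling
    starts part-way through (hypothetical helper for illustration)."""
    if total_steps <= 0 or not 0 <= start_at_step <= total_steps:
        raise ValueError("need total_steps > 0 and 0 <= start_at_step <= total_steps")
    return (total_steps - start_at_step) / total_steps

# 40 total steps, starting at step 32, as in the comment above
print(denoise_fraction(40, 32))
```

With those numbers only the last 8 of 40 steps run, i.e. about 20% of the schedule, which is why the pass mostly adds detail rather than changing composition.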

Turbo models are decent at low-iteration-decent-results, but not so much at adding fine details to a mostly-done image.


How much VRAM does it need? They mention that the largest model uses 1.4 billion parameters more than SDXL, which in turn needs a lot of VRAM.


There was a leak from Japan yesterday, prior to this release, and in that it was suggested 20gb for the largest model.

This text was part of the Stability Japan leak (the 20gb VRAM reference was dropped in the release today):

"Stages C and B will be released in two different models. Stage C uses parameters of 1B and 3.6B, and Stage B uses parameters of 700M and 1.5B. However, if you want to minimize your hardware needs, you can also use the 1B parameter version. In Stage B, both give great results, but 1.5 billion is better at reconstructing finer details. Thanks to Stable Cascade's modular approach, the expected amount of VRAM required for inference can be kept at around 20GB, but can be reduced even further by using smaller variations (as mentioned earlier, this may reduce the final output quality)."


Thanks. I guess this means that fewer people will be able to use it on their own computer, but the improved efficiency makes it cheaper to run on servers with enough VRAM.

Maybe running stage C first, unloading it from VRAM, and then doing B and A would make it fit in 12 or even 8 GB, but I wonder if the memory transfers would negate any time savings. Might still be worth it if it produces better images though.
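To put rough numbers on that idea, here is a back-of-the-envelope sketch comparing the weights-only footprint with all stages resident versus one stage at a time. It assumes FP16 weights and the largest parameter counts from the leak quoted earlier in the thread; Stage A is left out because its size is not given here, and activations and framework overhead are ignored, so real usage will be higher.

```python
BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored as 16-bit floats

# Largest variants from the leaked text; Stage A omitted (size not stated in the thread).
stage_params = {"C": 3.6e9, "B": 1.5e9}

def weights_gb(params: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Weights-only footprint in GB (10**9 bytes); ignores activations."""
    return params * bytes_per_param / 1e9

all_resident = sum(weights_gb(p) for p in stage_params.values())
one_at_a_time = max(weights_gb(p) for p in stage_params.values())

print(f"all stages resident: {all_resident:.1f} GB")
print(f"one stage at a time: {one_at_a_time:.1f} GB")
```

Under these assumptions, loading one stage at a time caps the weights at the size of the largest stage rather than the sum, which is the whole point of swapping; whether the result fits in 8 GB depends on everything this sketch leaves out.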


Sequential model offloading isn’t too bad. It adds about a second or less to inference, assuming it still fits in main memory.


Sometimes I forget how fast modern computers are. PCIe v4 x16 has a transfer speed of 31.5 GB/s, so theoretically it should take less than 100 ms to transfer stage B and A. Maybe it's not so bad after all, it will be interesting to see what happens.
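The estimate above is easy to reproduce. A minimal sketch, assuming FP16 weights, the 1.5B-parameter Stage B from the leaked text, and the theoretical peak bandwidth quoted in the comment (real transfers will be slower):

```python
PCIE4_X16_GB_PER_S = 31.5  # theoretical peak quoted in the comment above

def transfer_ms(params: float, bytes_per_param: int = 2,
                bandwidth_gb_per_s: float = PCIE4_X16_GB_PER_S) -> float:
    """Best-case host-to-GPU copy time in milliseconds."""
    size_gb = params * bytes_per_param / 1e9
    return size_gb / bandwidth_gb_per_s * 1000

print(f"Stage B (1.5B params): {transfer_ms(1.5e9):.0f} ms")
print(f"Stage C (3.6B params): {transfer_ms(3.6e9):.0f} ms")
```

This gives roughly 95 ms for Stage B, consistent with the "less than 100 ms" figure, and a bit over 200 ms for the large Stage C.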


If you're serious about doing image gen locally you should be running a 24GB card anyway, because honestly Nvidia's current generation 24GB is the sweet spot price to performance. 3080 RAM is laughably the same as the 6-year-old 1080 Ti, and 4080 RAM is only slightly more at 16GB while the card costs about 1.5 times a second-hand 3090.

Any speed benefits of the 4080 are gonna be worthless the second it has to cycle a model in and out of ram anyway vs the 3090 in image gen.


> because honestly Nvidia's current generation 24GB is the sweet spot price to performance

How is the halo product of a range the "sweet spot"?

I think nVidia are extremely exposed on this front. The RX 7900XTX is also 24GB and under half the price (In UK at least - £800 vs £1,700 for the 4090). It's difficult to get a performance comparison on compute tasks, but I think it's around 70-80% of the 4090 given what I can find. Even a 3090, if you can find one, is £1,500.

The software isn't as stable on AMD hardware, but it does work. I'm running a RX7600 - 8GB myself, and happily doing SDXL. The main problem is that exhausting VRAM causes instability. Exceed it by a lot, and everything is handled fine, but if it's marginal... problems ensue.

The AMD engineers are actively making the experience better, and it may not be long before it's a practical alternative. If/When that happens nVidia will need to slash their prices to sell anything in this sphere, which I can't really see them doing.


>How is the halo product of a range the "sweet spot"?

Because it’s actually a bargain second hand (got another for £650 last week via Buy It Now on eBay) and cheap for the benefit it offers for any professional who needs it.

3090 is the iPhone of AI, people should be ecstatic it even exists not complaining about it.


> because honestly Nvidia's current generation 24GB is the sweet spot price to performance

You're aware the 3090 is not the current generation? You can see why I would think you were talking about the 4090?


Think it's weirder you assumed I meant 4090 when it doesn't really offer enough benefit over the 3090 to justify its cost, and you mentioned an incorrect price for the 3090 anyway so it's not like you weren't talking about 3090.


> If/When that happens nVidia will need to slash their prices to sell anything in this sphere

It's just as likely that AMD will raise prices to compensate.


You think they're going to say "Hey, compute became competitive but nothing else changed performance therefore... PRICE HIKE!"? They don't have the reputation to burn in this domain for that IMHO.

Granted you could see a supply/demand related increase from retailers if demand spiked, but that's the retailers capitalising.


The price hike would come proportionally with each new hardware generation as their software stack becomes competitive with CUDA, since it will most likely take them multiple generations to catch up. It doesn't make sense for them to sell hardware with comparable capability at a deep discount. Plenty of companies will want to diversify hardware vendors just to have a bargaining chip against Nvidia.


If it worked I imagine large batching could make it worth the load/unload time cost.


Shouldn't be a reason you couldn't do a ton of Stage C work on different images, and then swap in Stage B.
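The amortization argument can be sketched as follows. All timings are hypothetical inputs chosen for illustration, not measurements, and `per_image_seconds` is a made-up helper: Stage C runs for the whole batch, the models are swapped once, then Stage B runs for the whole batch.

```python
def per_image_seconds(batch_size: int, stage_c_s: float, stage_b_s: float,
                      swap_s: float) -> float:
    """Average wall time per image with a single model swap per batch.
    All timings are hypothetical inputs, not measurements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    total = batch_size * (stage_c_s + stage_b_s) + swap_s
    return total / batch_size

# Illustrative numbers only: 2 s per stage per image, 1 s to swap models.
for n in (1, 4, 16):
    print(n, per_image_seconds(n, 2.0, 2.0, 1.0))
```

The swap cost per image shrinks as 1/batch size, so with large batches the load/unload overhead becomes a small fraction of the total, provided the intermediate Stage C outputs for the whole batch fit in memory.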


Should use no more than 6GiB for FP16 models at each stage. The current implementation is not RAM optimized.


The large C model uses 3.6 billion parameters which is 6.7 GiB if each parameter is 16 bits.
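The arithmetic behind that figure, as a small sketch (weights only, no activations; the helper name is made up for illustration):

```python
def weights_gib(params: float, bits_per_param: int = 16) -> float:
    """Size of the raw weights in GiB (2**30 bytes)."""
    return params * bits_per_param / 8 / 2**30

print(f"{weights_gib(3.6e9):.1f} GiB")  # large Stage C at 16 bits per parameter
```

3.6 billion parameters at 2 bytes each is 7.2 GB in decimal units, which is about 6.7 GiB.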


The large C model has a fair bit of its parameters tied to text conditioning, not to the main denoising process. Similar to how we split the network for SDXL Base, I am pretty confident we can split off a non-trivial number of parameters for text conditioning, and hence load fewer than 3.6B parameters during the denoising process.


What's more, they can presumably be swapped in and out like the SDXL base + refiner, right?


Can one run it on CPU?


Stable Diffusion on a 16-core AMD CPU takes me about 2-3 hours to generate an image, just to give you a rough idea of the performance. (On the same AMD's iGPU it takes 2 minutes or so.)


WTF!

On my 5900X, so 12 cores, I was able to get SDXL to around 10-15 minutes. I did do a few things to get to that.

1. I used an AMD Zen optimised BLAS library. In particular the AMDBLIS one, although it wasn't that different to the Intel MKL one.

2. I preload the jemalloc library to get better aligned memory allocations.

3. I manually set the number of threads to 12.

This is the start of my ComfyUI CPU invocation script.

    # One OpenMP thread per physical core (12 on the 5900X)
    export OMP_NUM_THREADS=12
    # Preload the multithreaded AMD BLIS build (the Zen-optimised BLAS mentioned above)
    export LD_PRELOAD=/opt/aocl/4.1.0/aocc/lib_LP64/libblis-mt.so:$LD_PRELOAD
    # Preload jemalloc as the allocator
    export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
    export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"

Honestly, 12 threads wasn't much better than 8, and more than 12 was detrimental. I was memory bandwidth limited I think, not compute.


Even older GPUs are worth using then I take it?

For example I pulled a (2GB I think, 4 tops) 6870 out of my desktop because it's a beast (in physical size, and power consumption) and I wasn't using it for gaming or anything, figured I'd be fine just with the Intel integrated graphics. But if I wanted to play around with some models locally, it'd be worth putting it back & figuring out how to use it as a secondary card?


One counterintuitive advantage of the integrated GPU is it has access to system RAM (instead of using a dedicated and fixed amount of VRAM). That means I'm able to give the iGPU 16 GB of RAM. For me SD takes 8-9 GB of RAM when running. The system RAM is slower than VRAM which is the trade-off here.


Yeah I did wonder about that as I typed, which is why I mentioned the low amount (by modern standards anyway) on the card. OK, thanks!


2GB is really low. I've been able to use A1111 Stable Diffusion on my old gaming laptop's 1060 (6GB VRAM) and it takes a little bit less than a minute to generate an image. You would probably need to try the --lowvram flag on startup.


No, I don't think so. I think you would need more VRAM to start with.


SDXL Turbo is much better, albeit kinda fuzzy and distorted. I was able to get decent single-sample response times (~80-100s) from my 4 core ARM Ampere instance, good enough for a Discord bot with friends.


SD Turbo runs nicely on an M2 MacBook Air (as does Stable LM 2!)

Much faster models will come


Which AMD CPU/iGPU are these timings for?


AMD Ryzen 9 7950X 16-Core Processor

The iGPU is gfx1036 (RDNA 2).


If that is true, then the CPU variant must be a much worse implementation of the algorithm than the GPU variant, because the true ratio of the GPU and CPU performances is many times less than that.


Not if you want to finish the generation before you have stopped caring about the results.


You can run any ML model on CPU. The question is the performance


Very impressive.

From what I understand, Stability AI is currently VC funded. It’s bound to burn through tons of money and it’s not clear whether the business model (if any) is sustainable. Perhaps worthy of government funding.


Stability AI has been burning through tons of money for a while now, which is the reason newer models like Stable Cascade are not commercially-friendly-licensed open source anymore.

> The company is spending significant amounts of money to grow its business. At the time of its deal with Intel, Stability was spending roughly $8 million a month on bills and payroll and earning a fraction of that in revenue, two of the people familiar with the matter said.

> It made $1.2 million in revenue in August and was on track to make $3 million this month from software and services, according to a post Mostaque wrote on Monday on X, the platform formerly known as Twitter. The post has since been deleted.

https://fortune.com/2023/11/29/stability-ai-sale-intel-ceo-r...


I get the impression that a lot of open source adjacent AI companies, including Stability AI, are in the "???" phase of execution, hoping the "Profit" phase comes next.

Given how much VC money is chasing the AI space, this isn't necessarily a bad plan. Give stuff away for free while developing deep expertise, then either figure out something to sell, or pivot to proprietary, or get acquihired by a tech giant.


That is indeed the case, hence the more recent pushes toward building moats by every AI company.


> which is the reason newer models like Stable Cascade are not commercially-friendly-licensed open source anymore.

The main reason is probably Midjourney and OpenAI using their tech without any kind of contribution back. AI desperately needs a GPL equivalent…


It's highly doubtful that Midjourney and OpenAI use Stable Diffusion or other Stability models.


Midjourney 100% at least used to use Stable Diffusion: https://twitter.com/EMostaque/status/1561917541743841280

I am not sure if that is still the case.


It trialled it as an explicitly optional model for a moment a couple years ago. (or only a year? time moves so fast. somewhere in v2/v3 timeframe and around when SD came out). I am sure it is no longer the case.


DALL-E shares the same autoencoders as SD v1.x. It is probably similar to how Meta's Emu-class models work though. They tweaked the architecture quite a bit, trained on their own dataset, and reused some components (or in Emu's case, trained all the components from scratch but reused the same arch).


How do you know though?


You can't use off-the-shelf models to get the results Midjourney and DALL-E generate, even with strong finetuning.


I pay for both MJ and DALL-E (though OpenAI mostly gets my money for GPT) and don't find them to produce significantly better images than popular checkpoints on CivitAI. What I do find is that they are significantly easier to work with. (Actually, my experience with hundreds of DALL-E generations is that it's actually quite poor in quality. I'm in several IRC channels where it's the image generator of choice for some IRC bots, and I'm never particularly impressed with the visual quality.)

For MJ in particular, knowing that they at least used to use Stable Diffusion under the hood, it would not surprise me if the majority of the secret sauce is actually a middle layer that processes the prompt and converts it to one that is better for working with SD. Prompting SD to get output at the MJ quality level takes significantly more tokens, lots of refinement, heavy tweaking of negative prompting, etc. Also a stack of embeddings and LoRAs, though I would place those more in the category of finetuning like you had mentioned.


If you try diffusionGPT with regional prompting added and a GAN corrector you can get a good idea of what is possible https://diffusiongpt.github.io


That looks very impressive unless the demo is cherrypicked, would be great if this could be implemented into a frontend like Fooocus https://github.com/lllyasviel/Fooocus


What do you use it for? I haven't found a great use for it myself (outside of generating assets for landing pages / apps, where it's really really good). But I have seen endless subreddits / instagram pages dedicated to various forms of AI content, so it seems lots of people are using it for fun?


Nothing professional. I run a variety of tabletop RPGs for friends, so I mostly use it for making visual aids there. I've also got a large format printer that I was no longer using for its original purpose, so I bought a few front-loading art frames that I generate art for and rotate through periodically.

I've also used it to generate art for deskmats I got printed at https://specterlabs.co/

For commercial stuff I still pay human artists.


Whose frames do you use? Do you like them? I print my photos to frame and hang, and wouldn't at all mind being able to rotate them more conveniently and inexpensively than dedicating a frame to each allows.


https://www.spotlightdisplays.com/

I like them quite a bit, and you can get basically any size cut to fit your needs even if they don't directly offer it on the site.


Perfectly suited to go alongside the style of frame I already have lots of, and very reasonably priced off the shelf for the 13x19 my printer tops out at. Thanks so much! It'll be easier to fill that one blank wall now.


I use comfyUI/SD and MJ and I have never seen anything on the level of what I get out of MJ. Nothing at CivitAI is impressive to me next to what I get from MJ.

Of course, art is so subjective none of this has any real meaning. MJ routinely blows my mind though and it is very rare something from SD does. The secret MJ sauce is obviously all the human feedback that has gone into the model at this point.

I think AI video will be a different story though. I think that is when comfyUI/SD will destroy MJ because MJ is simply not going to be able to have an economic model with the amount of compute needed.


What IRC Channels do you frequent?


Largely some old channels from the 90s/00s that really only exist as vestiges of their former selves - not really related to their original purpose, just rooms for hanging out with friends made there back when they had a point besides being a group chat.


Midjourney has absolutely nothing to offer compared to proper finetunes. DALL-E has: it generalizes well (can make objects interact properly for example) and has great prompt adherence. But it can also be unpredictable as hell because it rewrites the prompts. DALL-E's quality is meh - it has terrible artifacts on all pixel-sized details, hallucinations on small details, and limited resolution. Controlnets, finetuning/zero-shot reference transfer, and open tooling would have made a beast of a model of it, but they aren't available.


You obviously have no idea what you are talking about or just make stupid anime porn.

For fine art MJ destroys everything and it isn't even close. I say this while using comfyUI/SD all the time.

But you do you and your anime porn.


I'm actually a person making technical decisions (art decisions in the past) in a VFX/art studio, and I'm talking about production use. No generative AI currently passes any reasonable production quality bar, but is being tried by everyone for doing the work that can't be done or is cost-prohibitive otherwise, for example animation, long series with style transfer, filler assets creation etc. Anything that only has a text prompt can be discarded instantly. You have to be able to finetune it on your own material for consistency (of course I'm not talking about dubious 3rd party models), you need higher order guidance (e.g. controlnets, especially custom ones) and many other things. In the hands of a skilled person, a trivial Krita/Photoshop plugin (Firefly, SD, SD realtime) blows anything MJ can offer out of the water, simply because it has all that and you can't do much with text, it doesn't have enough semantic capacity to express artistic intent. I'm not even starting on animation.

In fact, anything that involves non-explicitly guided one-shot generation of anything with light/shadow/colors/perspective is entirely out of the question with the current crop, because all models are hallucinating hard and aren't controllable within a single generation. There are attempts at fixing the perspective without explicit guidance, but it's going to be a long way and it's not super relevant to how things are done anyway.

And for fine art, nothing beats a human painter, doing it by throwing prompts at AI mostly misses the point. I'm not even sure what you mean by fine art in this context, actually - surely not generating artsy-looking images from a prompt for fun?


That's not really true, MJ and DALL-E are just more beginner friendly.


I think it'd be interesting to have a non-profit "model sharing" platform, where people can buy/sell compute. When you run someone's model, they get royalties on the compute you buy.


More specifically, it's so Stability AI can theoretically make a business on selling commercial access to those models through a membership: https://stability.ai/news/introducing-stability-ai-membershi...


The net flow of knowledge about text-to-image generation from OpenAI has definitely been outward. The early open source methods used CLIP, which OpenAI came up with. DALL-E (1) was also the first demonstration that we could do text to image at all. (There were some papers years earlier which could give you a red splotch if you said stop sign or something.)


> AI desperately needs a GPL equivalent

Why not just the GPL then?


The GPL was intended for computer code that gets compiled to a binary form. You can share the binary, but you also have to share the code that the binary is compiled from. Pre-trained model weights might be thought of as analogous to compiled code, and the training data may be analogous to program code, but they're not the same thing.

The model weights are shared openly, but the training data used to create these models isn't. This is at least partly because all these models, including OpenAI's, are trained on copyrighted data, so the copyright status of the models themselves is somewhat murky.

In the future we may see models that are 100% trained in the open, but foundational models are currently very expensive to train from scratch. Either prices would need to come down, or enthusiasts will need some way to share radically distributed GPU resources.


Tbh I think these models will largely be trained on synthetic datasets in the future. They are mostly trained on garbage now. We have been doing opt outs on these, has been interesting to see quality differential (or lack thereof), eg removing books3 from stableLM 3b zephyr https://stability.wandb.io/stability-llm/stable-lm/reports/S...


Why aren’t the big models trained on synthetic datasets now? What’s the bottleneck? And how do you avoid amplifying the weaknesses of LLMs when you train on LLM output vs. novel material from the comparatively very intelligent members of the human species. Would be interesting to see your take on this.


We are starting to see that, see phi2 for example

There are approaches to get the right type of augmented and generated data to feed these models right, check out our QDAIF paper we worked on for example

https://arxiv.org/pdf/2310.13032.pdf


I’ve wondered whether books3 makes a difference, and how much. If you ever train a model with a proper books3 ablation I’d be curious to know how it does. Books are an important data source, but if users find the model useful without them then that’s a good datapoint.


We did try stableLM 3b4 with books3 and it got worse, both in general and on benchmarks

Just did some pes2o ablations too which were eh


What I mean is, it’s important to train a model with and without books3. That’s the only way to know whether it was books3 itself causing the issue, or some artifact of the training process.

One thing that’s hard to measure is the knowledge contained in books3. If someone asks about certain books, it won’t be able to give an answer unless the knowledge is there in some form. I’ve often wondered whether scraping the internet is enough rather than training on books directly.

But be careful about relying too much on evals. Ultimately the only benchmark that matters is whether users find the model useful. The clearest test of this would be to train two models side by side, with and without books3, and then ask some people which they prefer.

It’s really tricky to get all of this right. But if there’s more details on the pes2o ablations I’d be curious to see.


What about CC licenses for model weights? They're common for media files (images, video, audio, ...), so maybe they're appropriate.


I've seen Emad (Stability AI founder) commenting here on HN somewhere about this before, what exactly their business model is/will be, and similar thoughts.

HN search doesn't seem to agree with me today though and I cannot find the specific comment/s I have in mind, maybe someone else has any luck? This is their user https://news.ycombinator.com/user?id=emadm


https://x.com/EMostaque/status/1649152422634221593?s=20

We now have top models of every type, sites like www.stableaudio.com, memberships, custom model deals etc so lots of demand

We're the only AI company that can make a model of any type for anyone from scratch & are the most liked / one of the most downloaded on HuggingFace (https://x.com/Jarvis_Data/status/1730394474285572148?s=20, https://x.com/EMostaque/status/1727055672057962634?s=20)

It's going OK, team working hard and shipping good models, the team are accelerating their work on building ComfyUI to bring it all together.

My favourite recent model was CheXagent, I think medical models should be open & will really save lives: https://x.com/Kseniase_/status/1754575702824038717?s=20


exactly my thought. stability should be receiving research grants


We should, we haven't yet...

Instead we've given 10m+ supercomputer hours in grants to all sorts of projects, now we have our grant team in place & there is a huge increase in available funding for folk that can actually build stuff we can tap into.


None of the researchers are associated with stability.ai, but with universities in Germany and Canada. How does this work? Is this exclusive work for stability.ai?


Dom and Pablo both work for Stability AI (Dom finishing his degree).

All the original Stable Diffusion researchers (Robin Rombach, Patrick Esser, Dominik Lorenz, Andreas Blattmann) also work for Stability AI.


Finally a good use to burn VC money!


I see in the commits that the license was changed from MIT to their own custom one: https://github.com/Stability-AI/StableCascade/commit/209a526...

Is it legal to use an older snapshot before the license was changed in accordance with the previous MIT license?


It seems pretty clear the intent was to use a non-commercial license, so it’s probably something that would go to court, if you really wanted to press the issue.

Generally courts are more holistic and look at intent, and understand that clerical errors happen. One exception to this is if a business claims it relied on the previous license and invested a bunch of resources as a result.

I believe the timing of commits is pretty important— it would be hard to claim your business made a substantial investment on a pre-announcement repo that was only MIT’ed for a few hours.


If I clone/fork that repo before the license change, and start putting any amount of time into developing my own fork in good faith, they shouldn't be allowed to claim a clerical error when they lied to me upon delivery about what I was allowed to do with the code.

Licenses are important. If you are going to expose your code to the world, make sure it has the right license. If you publish your code with the wrong license, you shouldn't be allowed to take it back. Not for an organization of this size that is going to see a new repo cloned thousands of times upon release.


No, sadly this won’t fly in court.

For the same reason you cannot publish a private corporate repo with an MIT license and then have other people claim in “good faith” to be using it.

All they need is to assert that the license was published in error, or that the person publishing it did not have the authority to publish it.

You can’t “magically” make a license stick by putting it in a repo, any more than putting a “name here” sticker on someone’s car and then claiming to own it.

The license file in the repo is simply the notice of the license.

It does not indicate a binding legal agreement.

You can, of course, challenge it in court, and IANAL, but I assure you there is precedent for incorrectly labelled repos removing and changing their licenses.


>All they need is to assert that the license was published in error, or that the person publishing it did not have the authority to publish it.

Show me where they did this. All they have is a commit that says "Update License". An update doesn't imply a correction.


It could very well fly. Agency law, promissory estoppel, ...


There’s no case law here, so if you’re volunteering to find out what a judge thinks we’d surely appreciate it!


Yes, you can continue to do what you want with that commit^ in accordance with the MIT licence it was released under. Kind of like if you buy an ebook, and then they publish a second edition but only as a hardback - the first edition ebook is still yours to read.


I think the model architecture (training code etc.) itself is still under MIT, while the weights (which were the result of training on a huge GPU cluster, as well as of the dataset they used [not sure if they have publicly talked about it]) are under this new license.


Code is MIT, weights are under the NC license for now.


The code is MIT. The model has a non-commercial license. They are separate pieces of work under different licenses. Stability AI have said that the non-commercial license is because this is a technology preview (like SDXL 0.9 was).


MIT license is not parasitic like GPL. You can close an MIT licensed codebase, but you cannot retroactively change the license of the old code.

Stability's initial commit had an MIT license, so you can fork that commit and do whatever you want with it. It's MIT licensed.

Now, the tricky part here is that they committed a change to the license that changes it from MIT to proprietary, but they didn't change any code with it. That is definitely invalid, because they cannot license the exact same codebase with two different contradictory licenses. They can only license the changes made to the codebase after the license change. I wouldn't call it "illegal", but it wouldn't stand up in court if they tried to claim that the software is proprietary, because they already distributed it verbatim with an open license.


> they didn't change any code with it. That is definitely invalid, because they cannot license the exact same codebase with two different contradictory licenses.

Why couldn't they? Of course they can. If you are the copyright owner, you can publish/sell your stuff under as many licenses as you like.


we have an optimized playground here: https://www.fal.ai/models/stable-cascade


"sign in to run"

That's a marketing opportunity being missed, especially given how crowded the space is now. The HN crowd is more likely to run it themselves when presented with signing up just to test out a single generation.


Uh, thanks for noticing it! We generally turn it off for popular models so people can see the underlying inference speed and the results but we forgot about it for this one, it should now be auth-less with a stricter rate limit just like other popular models in the gallery.


I wanted to use your service for a project, but you can only sign in through GitHub. I emailed your support about this and never got an answer; in the end I installed SD Turbo locally. I think GitHub-only auth is losing you potential customers like myself.


I just got rate-limited on my first generation. The message is "You have exceeded the request limit per minute". This was after showing me cli output suggesting that my image was being generated.

I guess my zero attempts per minute was too much. You really shouldn't post your product on HN if you aren't prepared for it to work. Reputations are hard to earn, and you're losing people's interest by directing them to a broken product.


Are you using a vpn or at a large campus or office?


No, I was connected on my own private IP address.


It uses github auth, it’s not some complex process. I can see why they would need to require accounts so it’s harder to abuse it.


After all the bellyaching from the HN crowd when PyPI started requiring 2FA, nothing surprises me anymore.


Like every other image generator I've tried, it can't do a piano keyboard [1]. I expect that some different approach is needed to be able to count the groups of black keys.

[1] https://fal.ai/models/stable-cascade?share=13d35b76-d32f-45c...


I think it's more than this. In my case, most of the images I made about basketball had more than one ball. I'm not an expert, but some fundamental constraints of human (cultural) life (like all piano keyboards having the same layout, or there being only one ball in a game) are not grasped by the training, or are grasped only partially.


These image generators can't count. Go try variations on "one ball", "two balls", etc, and you can see in what sort of ways it fucks it up.


As with human hands, coherency is fixed by scaling the model and the training.


This model is built upon the Würstchen architecture. Here is a very good explanation of how this model works by one of its authors.

https://www.youtube.com/watch?v=ogJsCPqgFMk


Great video! And here's a summary of the video :)

    Gemini Advanced> Summarize this video: https://www.youtube.com/watch?v=ogJsCPqgFMk
This video is about a new method for training text-to-image diffusion models called Würstchen. The method is significantly more efficient than previous methods, such as Stable Diffusion 1.4, and can achieve similar results with 16 times less training time and compute.

The key to Würstchen's efficiency is its use of a two-stage compression process. The first stage uses a VQ-VAE to compress images into a latent space that is 4 times smaller than the latent space used by Stable Diffusion. The second stage uses a diffusion model to further compress the latent space by another factor of 10. This results in a total compression ratio of 40, which is significantly higher than the compression ratio of 8 used by Stable Diffusion.

The compressed latent space allows the text-to-image diffusion model in Würstchen to be much smaller and faster to train than the model in Stable Diffusion. This makes it possible to train Würstchen on a single GPU in just 24,000 GPU hours, while Stable Diffusion 1.4 requires 150,000 GPU hours.

Despite its efficiency, Würstchen is able to generate images that are of comparable quality to those generated by Stable Diffusion. In some cases, Würstchen can even generate images that are of higher quality, such as images with higher resolutions or images that contain more detail.

Overall, Würstchen is a significant advance in the field of text-to-image generation. It makes it possible to train text-to-image models that are more efficient and affordable than ever before. This could lead to a wider range of applications for text-to-image generation, such as creating images for marketing materials, generating illustrations for books, or even creating personalized avatars.


Is there any way this can be used to generate multiple images of the same model? e.g. a car model rotated around (but all images are of the same generated car)


Someone with resources will have to train Zero123 [1] with this backbone.

[1] https://zero123.cs.columbia.edu/



Yes, input image => embedding => N images, and if you're thinking 3D perspectives for rendering, you'd ControlNet the N.

ref.: "The model can also understand image embeddings, which makes it possible to generate variations of a given image (left). There was no prompt given here."


The model looks different in each of those variations though. Which seems to be intentional, but the post you're responding to is asking whether it's possible to keep the model exactly the same in each render, varying only by perspective.


I remember doing some random experiments with these two researchers to find the best way to condition stage B on the latent. My very fancy cross-attn with relative 2D positional embeddings didn't work as well as just concatenating the channels of the input with the nearest upsample of the latent, so I just gave up ahah.

This model used to be known as Würstchen v3.
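
For anyone wondering what "concatenating the channels of the input with the nearest upsample of the latent" looks like, here is a toy pure-Python sketch. The function names, shapes, and list-of-lists layout are assumptions made up for illustration, not the actual Stable Cascade code; a real implementation would use tensor ops.

```python
# Toy sketch of channel-concatenation conditioning.
# Assumption: tensors are nested lists shaped [channels][height][width].
# This is an illustration only, not the Stable Cascade implementation.

def nearest_upsample(channel, out_h, out_w):
    """Resize one 2D channel (list of rows) with nearest-neighbour sampling."""
    in_h, in_w = len(channel), len(channel[0])
    return [
        [channel[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def concat_condition(x, latent):
    """x: [C_x][H][W], latent: [C_l][h][w] -> [C_x + C_l][H][W]."""
    out_h, out_w = len(x[0]), len(x[0][0])
    upsampled = [nearest_upsample(ch, out_h, out_w) for ch in latent]
    # Concatenate along the channel axis; the network then sees both.
    return x + upsampled


# Tiny example: one 4x4 input channel and one 2x2 latent channel.
x = [[[0.0] * 4 for _ in range(4)]]
latent = [[[1.0, 2.0], [3.0, 4.0]]]
conditioned = concat_condition(x, latent)
print(len(conditioned), "channels of size",
      len(conditioned[0]), "x", len(conditioned[0][0]))
```

The appeal is that it adds no attention layers at all: the conditioning is just extra input channels, and the first convolution learns how to use them.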


Will this work on AMD? Found no mention of support. Kinda an important feature for such a project, as AMD users running Stable Diffusion will be suffering diminished performance.



I believe those are CPU specific; it disregards my AMD GPU. It's too slow a process.

For what it's worth, a sort of hybrid CPU/GPU approach is giving me workable results with AUTOMATIC1111: about 30-50s to generate a 512x512 image at around 1-2 it/s.

Not enough for me to want to use local-run txt2img for daily use (why, when Bing/Poe exist and give you high-quality image gen for free in seconds), but enough that when I want some granular control or don't want to give my data to a company (I like making AI portraits of myself and friends), I can run locally.

If I could improve performance, I'd love to get into video production.


I'd say I'm most impressed by the compression. Being able to compress an image 42x is huge for portable devices or bad internet connectivity (or both!).


That is 42x spatial compression, but it needs 16 channels instead of 3 for RGB.


Even assuming 32 bit floats (the extra 4 on the end):

4*16*24*24*4 = 147,456

vs (removing the alpha channel as it's unused here)

3*3*1024*1024 = 9,437,184

Or 1/64 raw size, assuming I haven't fucked up the math/understanding somewhere (very possible at the moment).


It is actually just 2/4 bytes x 16 latent channels x 24 x 24, but the comparison to raw data needs to be taken with a grain of salt, as there is quite a bit of hallucination involved in reconstruction.
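
To make the arithmetic in this subthread concrete, here is a rough Python sketch. It assumes 8-bit RGB for the raw image and the 16 x 24 x 24 latent at 2 or 4 bytes per value mentioned above; the function names are made up for illustration, and it ignores entropy coding and the size of the decoder weights.

```python
# Back-of-the-envelope sizes for a 1024x1024 RGB image vs. its 24x24 latent.
# Assumptions: 8-bit RGB pixels, 16 latent channels, latents stored as
# fp16 (2 bytes) or fp32 (4 bytes), no further compression of either side.

def raw_rgb_bytes(width: int, height: int) -> int:
    """Uncompressed 8-bit RGB size: one byte per channel, three channels."""
    return width * height * 3


def latent_bytes(side: int, channels: int, bytes_per_value: int) -> int:
    """Size of a square latent with the given channel count and precision."""
    return side * side * channels * bytes_per_value


raw = raw_rgb_bytes(1024, 1024)
fp16 = latent_bytes(24, 16, 2)
fp32 = latent_bytes(24, 16, 4)

print(f"raw RGB:      {raw:,} bytes")
print(f"latent fp16:  {fp16:,} bytes ({raw / fp16:.1f}x smaller)")
print(f"latent fp32:  {fp32:,} bytes ({raw / fp32:.1f}x smaller)")
print(f"spatial factor per side: {1024 / 24:.2f}")
```

Under those assumptions the latent is roughly 85x (fp32) to 171x (fp16) smaller than raw RGB, but as noted, the reconstruction is lossy and partly hallucinated, so this is not comparable to a lossless ratio.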


Furthermore, each of those 16 channels would typically be multibyte floats as opposed to single-byte RGB channels. (Speaking generally; I haven't read the paper.)


I have to imagine at this point someone is working toward a fast AI based video codec that comes with a small pretrained model and can operate in a limited memory environment like a tv to offer 8k resolution with low bandwidth.


I am 65% sure this is already extremely similar to LG's upscaling approach in their most recent flagship.


I would be shocked if Netflix was not working on that.


A 42x compression is also impressive because it matches the answer to the ultimate question of life, the universe, and everything. Maybe there is some deep universal truth within this model.


I haven't been following the image generation space since the initial excitement around stable diffusion. Is there an easy to use interface for the new models coming out?

I remember setting up the Python env for Stable Diffusion, but then shortly after there were a host of nice GUIs. Are there some popular GUIs that can be used to try out newer models? Similarly, what's the best GUI for some of the older models? Preferably for macOS.


Fooocus is the fastest way to try SDXL/SDXL turbo with good quality.

ComfyUI is cool but very DIY. You don't get good results unless you wrap your head around all the augmentations and defaults.

No idea if it will support cascade.


ComfyUI is similar to Houdini in complexity, but immensely powerful. It's a joy to use.

There are also a large number of resources available for it on YouTube, GitHub (https://github.com/comfyanonymous/ComfyUI_examples), reddit (https://old.reddit.com/r/comfyui), CivitAI, Comfy Workflows (https://comfyworkflows.com/), and OpenArt Flow (https://openart.ai/workflows/).

I still use AUTO1111 (https://github.com/AUTOMATIC1111/stable-diffusion-webui) and the recently released and heavily modified fork of AUTO1111 called Forge (https://github.com/lllyasviel/stable-diffusion-webui-forge).


Our team at Stability AI builds ComfyUI, so yes, it is supported.


Auto1111 and Comfy both get updated pretty quickly to support most of the new models coming out. I expect they'll both support this soon.


Check out invoke.com


Thanks for calling us out - I'm one of the maintainers.

Not entirely sure we'll be in the Stable Cascade race quite yet. Since Auto/Comfy aren't really built for businesses, they'll get it incorporated sooner vs later.

Invoke's main focus is building open-source tools for the pros using this for work that are getting disrupted, and non-commercial licenses don't really help the ones that are trying to follow the letter of the license.

Theoretically, since we're just a deployment solution, it might come up with our larger customers who want us to run something they license from Stability, but we've had zero interest on any of the closed-license stuff so far.


fal.ai is nice and fast: https://news.ycombinator.com/item?id=39360800 Both in performance and for how quickly they integrate new models apparently: they already support Stable Cascade.


Was anyone able to get this running on Colab? I got as far as loading extras in text-to-inference, but it was complaining about a dependency.


Why are they benchmarking it with 20+10 steps vs. 50 steps for the other models?


Prior generations usually take fewer steps than vanilla SDXL to reach the same quality.

But yeah, the inference speed improvement is mediocre (at least until I look at exactly what computation is performed and can form a more informed opinion on whether it is an implementation issue or a model issue).

The prompt alignment should be better, though. It looks like the model has more parameters to work with for text conditioning.


In my observation, it yields amazing perf at higher batch sizes (4 or, better, 8). I assume it is due to memory bandwidth and the constrained latent space helping.


However, the outputs are so similar that I barely feel a need for more than 1. 2 is plenty.


I think that this model used consistency loss during training so that it can yield better results with fewer steps.


...because they feel that at 20+10 it achieves a superior output than at 50 steps for SDXL. They also benchmark it against 1 step for SDXL-Turbo.


I'm very impressed by the recent AI progress on making models smaller and more efficient. I have the feeling that every week there's something big in this space (like what we saw previously from ollama, llava, mixtral...). Apparently the space for on-device models is not fully explored yet. Very excited to see future products in that direction.


> I'm very impressed by the recent AI progress on making models smaller and more efficient.

That's an odd comment to place in a thread about an image generation model that is bigger than SDXL. Yes, it works in a smaller latent space, and yes, it's faster in the hardware configuration they've used, but it's not smaller.


my bad, I misread the post and got the impression that it's a way smaller model. Thanks for correcting me.


It is pretty good. I shared a comparison on Medium:

https://medium.com/@furkangozukara/stable-cascade-prompt-fol...

My Gradio app even works well on an 8 GB GPU with CPU offloading.


the way it's written about in the Image Reconstruction section, like it's just an image compression thing, is kind of interesting. that section and its presented use are very much about storing images and reconstructing them, when "it doesn't actually store original images" and "it can't actually give out original images" are points that get used so often in arguments as a defense for image generators. so it is just a multi-image compression file format, just a very efficient one. sure, it's "redrawing"/"rendering" its output and makes things look kinda fuzzy, but any other compressed image format does that as well. what was all that 'well it doesn't do those things' nonsense about then? clearly it can do that.


>well it doesn't do those things' nonsense about then? clearly it can do that.

There is a model that is trained to compress (very lossily) and decompress the latent, but it's not the main generative model, and of course the model doesn't store images in it. You give the encoder an image, it encodes it, and then you can decode the result with the decoder and get a very similar image. This encoder and decoder are used during training so that stage C can work on a compressed latent instead of directly at the pixel level, which is expensive. But the main generative model (stage C) should be able to generate any of the images that were present in the dataset, or it fails to do its job. Stages C, B, and A do not store any images.

The B and A stages work like an advanced image decoder, so unless you have a problem with image decoders in general, I don't see how this could be an issue (a JPEG decoder doesn't store images either, of course).


Ultimately this is abstraction not compression.


In a way it's just an algorithm that can compress either text or an image. The neat trick is that if you compress the text "brown bear hitting Vladimir Putin" and then decompress it as an image, you get an image of a bear hitting Vladimir Putin.

This principle is the idea behind all Stable Diffusion models; this one "just" achieved a much better compression ratio.


well yeah. but it's not so much about what it actually does, but how they talk about it. maybe (probably) i missed them putting out something that's described like that before, but it's just the open admission in demonstration of it. i guess they're getting more brazen, given that they're not really getting punished for what they're doing, be it piracy or infringement or whatever.


The model works on compressed data. That's all it is. Sure, it could output a picture from its training set on decompression, but only if you feed that same picture into the compressor.

In which case what are you doing, exactly? Normally you feed it a text prompt instead, which won't compress to the same thing.


Does anyone have a link to a demo online?



Thank you. Is there a demo of the "image to image" ability? It doesn't seem to be in any of the demos I see.


Can Stable Cascade be used for image compression? 1024x1024 to 24x24 is crazy.


That's definitely not lossless compression


Where can I run it if I don't have a GPU? Colab didn't work


runpod, kaggle, lambda labs, or pretty much any other server provider that gives you one or more gpus.


Wow, I like the compression part: a fixed 42x compression. That is really nice. It's slow to unpack on the fly, but the future is waiting.


That is a very tiny latent space. Wow!


What are the system requirements to run this, and in particular how much VRAM would it take?


Will this get integrated into Stable Diffusion Web UI?


Surely within days. ComfyUI's maintainer said he is readying the node for release, perhaps by this weekend. The Stable Cascade model is otherwise known as Würstchen v3 and has been floating around the open source generative image space since fall.


Third-party (using diffusers) node for ComfyUI is already available for those who can't wait for native integration.

https://github.com/kijai/ComfyUI-DiffusersStableCascade



