Prediction: This is going to reduce the cost of producing non-stylized animated motions and lead to a lot of movies that look something like Polar Express.
It's pretty good at making something that looks natural and realistic, but it doesn't seem to account for any difficult constraints like gravity, wrinkles, or whether the egg is done on this side.
Robotics also needs instant reaction times, which I don't think this kind of model is good at.
Maybe something like a rough pre-simulation step for further rapid prototyping work? I can't imagine the practical application otherwise...
Disclaimer: I have no idea what I'm talking about, and am just going through the motions, much like the diffusion model under discussion.
If we ever really try to do humanoid robotics, robots that interact with an environment made for humans in the way it's intended to be interacted with, this could actually be very valuable: imitate the general movement to quickly get the general action plan, then use live feedback loops and simulation for the details. Basically like how a child learns, but with one part in the cloud hivemind.
Without that, you have a much bigger solution space to search for the optimal approach. And you likely wouldn't even want that optimal solution anyway: if your humanoid robots have sufficiently strong actuators, the optimal solution for, to pick a simple example, navigating a staircase would likely be vertically scaling up the handrails through the slot in the middle. That might save staircase capacity and a tiny amount of battery (better aerodynamics!), but it would certainly ruin any humanoid qualities.
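To make that split concrete, here's a hypothetical sketch of the idea, assuming a text-conditioned motion model hands a coarse joint trajectory to a fast feedback controller; `MotionDiffusionModel`, its `sample` API, and the PD gains are all made up for illustration, not taken from the paper:

```python
import numpy as np

class MotionDiffusionModel:
    """Placeholder for a text-conditioned motion diffusion model (hypothetical API)."""
    def sample(self, prompt: str, horizon: int) -> np.ndarray:
        # Pretend this returns a coarse reference trajectory of joint targets,
        # shape (horizon, n_joints).
        return np.zeros((horizon, 17))

def pd_control(q, q_ref, dq, kp=50.0, kd=5.0):
    """Fast 'subconscious' layer: simple PD torques that track the reference pose."""
    return kp * (q_ref - q) - kd * dq

planner = MotionDiffusionModel()
reference = planner.sample("human walking up to a higher floor", horizon=200)

q = np.zeros(17)    # current joint angles (toy state)
dq = np.zeros(17)   # current joint velocities
dt = 0.01

for q_ref in reference:
    # The slow generative model picked q_ref; the tight loop handles balance and details.
    for _ in range(10):            # control loop runs 10x faster than the planner
        tau = pd_control(q, q_ref, dq)
        dq = dq + tau * dt         # stand-in for real dynamics / simulator feedback
        q = q + dq * dt
```

The point is only the division of labour: the generative model narrows the search to human-like motion, and the conventional loop handles the constraints it can't reason about.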
This may also be how it works for humans. The more conscious parts of the brain are involved in coming up with a general "movement plan", while fine-grained re-balancing and re-planning is taken care of subconsciously.
With the twist that for a robot that gets its rough plan from a diffusion model, which is then executed in a tight conventional control loop, I'd call the control loop more "conscious" than the diffusion model part (which is of course an extremely subjective categorization).
On the other hand, if you take another step back and think of the query input to the diffusion model, perhaps a text "human walking up to a higher floor" in our stair-climbing example, then you are absolutely right, that would be much closer to what we'd call "conscious" than I had ever imagined!
(My usual model of technical consciousness is an entirely different beast: the "simulate before acting" approach, with the additional requirement that the simulation includes a "self" and "peers", and rules/mechanisms that put them in the same category.)
Heh, your post did not make sense to me at all the first time I read it; then I re-read the grandparent comment, this time without skipping the final paragraph. I completely agree!
> complex robotic agility tasks like folding laundry or making eggs
The robotic challenge in those tasks is not insufficient actuation or control capability, but insufficient sensory proprioception.
Sure, you can compensate for that in control or by adding vision guidance, but ultimately what you need is sensors/transducers able to mimic touch, pressure, and proprioception in a way that approximates what exists in the animal kingdom.
Oh this is fantastic. I've been thinking a bit lately on how to do things other than image generation using these types of models, but you need to be quick to the punch these days. Kudos to these researchers, this is going to open the doors for a lot of applications.
Just a week ago I was wondering about this: whether diffusion models could be used first to generate a 3D character model, and then another model could describe actions that animate the character. Here are the humble beginnings of such a thing.
I was also wondering about the possibility of generating, say, an anthropomorphized cartoon otter that can be animated using a model trained on both otter and human motion to produce a result that is something in between.
It could reduce the workload for producing animated stories by one or more orders of magnitude sometime in the possibly not-too-distant future.
The original diffusion code for these projects comes from lots of research. In terms of code: OpenAI released guided-diffusion over ImageNet classes a while ago, then GLIDE, their DALLE2 predecessor. CompVis incorporated that into their Latent Diffusion variant about half a year ago. A few weeks before Stable Diffusion was released, they also published the similar “retrieval augmented diffusion” codebase.
And _then_ stable diffusion came out. The effectiveness of diffusion models for text to image is an idea that has been floating around a little longer than you might think.
I think I agree here. Stable Diffusion is "just" an optimized version of DALLE2, which itself builds directly on the previous literature. Also, the 3D model generator posted to HN yesterday builds on NeRF. And they are all using CLIP, I believe. Stable Diffusion gave us speed and knowledge about how much data, of what kind and quality, needs to be in the training set to still generate good results.
It's certainly not my intent to undermine the efforts of Robin Rombach, Andreas Blattmann, Katherine Crowson (and many others).
Katherine's work on clip-guided-diffusion over the `guided-diffusion` ImageNet checkpoints was effectively the first time the public got to see what text-to-image via diffusion, rather than purely transformer-based solutions (like DALLE1/dalle-mini), would look like. And it happened well before GLIDE was published (where it gets a mention/citation).
The CompVis team (Blattmann, Rombach, etc.) has been able not just to compete with, but to surpass (in some ways; it's nuanced) the work of the big American research labs (OpenAI in particular) with solid novel research. Their research on `VQGAN` outperformed the autoencoder from the DALLE-1 paper, and they've been competing directly in the vision space ever since.
One of these days, I think we're going to be able to feed a book to one of these models and have it create a movie/show/cartoon out of it.
Which as a writer quite excites me, even though it'll probably be quite bad at the beginning, with a flood of terrible products similar to the flood of Unity games.
Haven’t read the paper yet, but based on the videos it likely generates the rotation of each bone in the skeleton, which can then be used to animate a rigged humanoid character. So the videos you see are those poses being applied to a model.
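For a rough picture of what "applying those poses" means, here's a minimal forward-kinematics sketch assuming the model outputs per-joint local rotations; the tiny 4-joint chain, offsets, and function names are made up for illustration and not taken from the paper's code:

```python
import numpy as np

# Parent index for each joint (-1 = root); a toy 4-joint chain stands in for a full rig.
PARENTS = [-1, 0, 1, 2]
# Fixed bone offsets from each joint to its parent in the rest pose.
OFFSETS = np.array([[0, 0, 0], [0, 0.5, 0], [0, 0.5, 0], [0, 0.5, 0]], dtype=float)

def forward_kinematics(local_rotations, root_position):
    """local_rotations: (n_joints, 3, 3) rotation matrices predicted for one frame."""
    n = len(PARENTS)
    world_rot = [None] * n
    world_pos = [None] * n
    for j in range(n):  # parents are listed before their children
        if PARENTS[j] == -1:
            world_rot[j] = local_rotations[j]
            world_pos[j] = root_position
        else:
            p = PARENTS[j]
            world_rot[j] = world_rot[p] @ local_rotations[j]
            world_pos[j] = world_pos[p] + world_rot[p] @ OFFSETS[j]
    return np.stack(world_pos)  # world-space joint positions used to pose the mesh

# One animation frame: identity rotations reproduce the rest pose.
frame = np.tile(np.eye(3), (len(PARENTS), 1, 1))
print(forward_kinematics(frame, root_position=np.zeros(3)))
```

Run that per frame of the generated sequence and you get the joint positions that a renderer skins a character mesh to, which is what the demo videos show.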
Does anyone have some easy-to-understand reading material about latent diffusion principles? I'm trying to understand the core concepts, but I have a hard time finding good sources.
This video explains it at multiple complexity levels, starting with the simplest explanation. Not sure whether you are looking for something more advanced.
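For the core mechanics, here is a very rough sketch of the latent diffusion training objective under toy assumptions (the linear `encoder`/`denoiser` and the cosine noise schedule are stand-ins for a pretrained VAE/VQGAN and a U-Net; this isn't any particular codebase):

```python
import math
import torch
import torch.nn as nn

# Toy stand-ins: a real system uses a pretrained VAE/VQGAN encoder and a U-Net denoiser.
encoder = nn.Linear(784, 32)            # "compress the image into a small latent"
denoiser = nn.Linear(32 + 1 + 16, 32)   # "predict the noise from (latent, t, text)"

def add_noise(z0, t, noise):
    # Forward diffusion: blend the clean latent with Gaussian noise at level t in [0, 1].
    alpha = torch.cos(t * math.pi / 2)
    return alpha * z0 + (1 - alpha) * noise

def training_step(image, text_embedding):
    z0 = encoder(image)                   # 1) work in latent space, not pixel space
    t = torch.rand(1)                     # 2) pick a random noise level
    noise = torch.randn_like(z0)
    zt = add_noise(z0, t, noise)
    pred = denoiser(torch.cat([zt, t, text_embedding]))  # 3) predict the added noise
    return ((pred - noise) ** 2).mean()   # simple denoising objective

loss = training_step(torch.randn(784), torch.randn(16))
loss.backward()
```

The trick relative to pixel-space diffusion is step 1: the denoiser only ever sees the compressed latent, which is where most of the speed and memory savings come from; sampling runs the process in reverse and decodes the final latent back to pixels.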
Certainly impossible to implement with current hardware. You'd need a robust physical model that could withstand significant stress from any random permutation of desired motion. Biped robots aren't really ragdolls.
Besides, Boston Dynamics is doing fine with classical optimal control theory for Atlas's locomotion.
"MotionCLIP: Exposing Human Motion Generation to CLIP Space"
https://arxiv.org/abs/2203.08063
Would anyone be able to explain how the two techniques are related?