MDM: Human Motion Diffusion Model (guytevet.github.io)
226 points by Vt71fcAqt7 on Sept 30, 2022 | 38 comments



A related paper by the same lead author:

"MotionCLIP: Exposing Human Motion Generation to CLIP Space"

https://arxiv.org/abs/2203.08063

Would anyone be able to explain how the two techniques are related?


I just realized the actual research paper [0] does not seem to be linked on the page.

[0] https://arxiv.org/abs/2209.14916


There's an arXiv button at the top for me (7 hours later).


Prediction: This is going to reduce the cost of producing non-stylized animated motions and lead to a lot of movies that look something like Polar Express.


Does the model produce bone movement instructions that could be used for any rigged character?

What are the training data? Sounds like they would need tons of diverse labelled motion capture data.


I’m in awe. Is this type of thing likely to solve complex robotic agility tasks like folding laundry or making eggs?


it's pretty good at making something that looks natural and realistic, but it doesn't seem to be accounting for any difficult constraints like gravity, wrinkles, or whether the egg is done on this side

robotics also needs instant reaction time, which i don't think this kind of thing is good at

maybe something like a rough pre-simulation step for further rapid prototyping work? can't imagine the practical application..

disclaimer: i have no idea what i'm talking about, and am just going through the motions, much like the diffusion model under discussion


If we ever really try to do humanoid robotics, robots that interact with an environment made for humans in the way it's intended to be interacted with, this could actually be very valuable: imitate the general movement to quickly get the general action plan, then use live feedback loops and simulation for the details (a rough sketch of that split is below). Basically like a child learns, but with one part in the cloud hivemind.

Without that, you have a much bigger solution space to search for the optimal approach. And you likely wouldn't even want that optimal solution anyway: if your humanoid robots have sufficiently strong actuators, the optimal solution for, to pick a simple example, navigating a staircase would likely be vertically scaling the handrails through the slot in the middle. That might save staircase capacity and a tiny amount of battery (better aerodynamics!) but it would certainly ruin any humanoid qualities.
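A toy sketch of that two-level split, just to make the idea concrete. Everything here is a placeholder: the motion prior call, the 17-joint humanoid, and the gains are all made up for illustration, not anything from the paper.

```python
import numpy as np

def plan_from_motion_prior(text_prompt, horizon):
    """Placeholder for a generative motion model (e.g. a diffusion model)
    that returns a coarse sequence of target joint angles."""
    return np.zeros((horizon, 17))  # hypothetical 17-joint humanoid

def control_step(target_angles, measured_angles, kp=20.0, kd=0.5, prev_error=None):
    """Simple PD tracking of the coarse plan: the fast feedback loop that
    handles the balance/contact details the prior doesn't model."""
    error = target_angles - measured_angles
    d_error = 0.0 if prev_error is None else error - prev_error
    torques = kp * error + kd * d_error
    return torques, error

# Outer loop: slow "plan" level.  Inner loop: fast, reactive level.
plan = plan_from_motion_prior("human walking up to a higher floor", horizon=100)
measured = np.zeros(17)
prev_err = None
for target in plan:
    torques, prev_err = control_step(target, measured, prev_error=prev_err)
    # apply_torques(torques); measured = read_joint_encoders()  # robot I/O, omitted
```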


This may also be how it works for humans. The more conscious parts of the brain are involved in coming up with a general "movement plan", while fine-grained re-balancing and re-planning is taken care of subconsciously.


With the twist that for a robot that gets its rough plan from a diffusion model that's then executed in a tight conventional control loop, I'd call the control loop more "conscious" than the diffusion model part (which is of course an extremely subjective categorization).

On the other hand, if you take another step back and think of the query input to the diffusion model, perhaps a text "human walking up to a higher floor" in our stair-climbing example, then you are absolutely right, that would be much closer to what we'd call "conscious" than I had ever imagined!

(my usual model of technical consciousness is an entirely different beast, the "simulate before act" approach with the additional requirement that the simulation includes a "self" and "peers", plus rules/mechanisms that put them in the same category)


I found your response both insightful and funny.


Heh, your post did not make sense to me at all the first time I read it, then I re-read grandparent, this time without skipping the final paragraph. I completely agree!


Nah - it's going to be wonky as hell, and robotics is much more complex and exacting than this. This is useful for animation or creative stuff.


> complex robotic agility tasks like folding laundry or making eggs

The robotic challenge in those tasks is not insufficient Actuation capabilities or Control capabilities, but insufficient sensory Proprioception.

Sure, you can compensate for those in Control or by adding Vision guidance, but ultimately what you need is sensors/transducers that are able to mimic touch, pressure, and proprioception in a way that approximates what exists in the animal kingdom.


Oh this is fantastic. I've been thinking a bit lately on how to do things other than image generation using these types of models, but you need to be quick to the punch these days. Kudos to these researchers, this is going to open the doors for a lot of applications.


Just a week ago I was wondering about this: whether diffusion models could be used first to generate a 3D character model, and then another model used to describe actions that animate the character. Here are the humble beginnings of such a thing.

I was also wondering about the possibility of generating, say, an anthropomorphized cartoon otter that can be animated using a model trained on both otter and human motion to produce a result that is something in between.

It could reduce the workload for producing animated stories by one or more orders of magnitude sometime in the possibly not-too-distant future.


I'm picturing this in the next King's Quest game.


Just me or is something new like this popping up every day now?


It's because there was a major ML conference submission deadline yesterday, so now things are being announced.


https://iclr.cc/

The conference for reference


It's been about every 5-6 hours today.


The sheer volume of projects spurred on by the release of the Stable Diffusion source code is staggering.


The original diffusion code for these projects comes from lots of research. In terms of code: OpenAI released guided-diffusion over ImageNet classes a while ago, then GLIDE, their DALLE2 predecessor. CompVis incorporated that into their Latent Diffusion variant about half a year ago. A few weeks before Stable Diffusion was released, they also published the similar “retrieval augmented diffusion” codebase.

And _then_ stable diffusion came out. The effectiveness of diffusion models for text to image is an idea that has been floating around a little longer than you might think.


I think I agree here. Stable diffusion is "just" an optimized version of DALLE2, which itself directly builds on the previous literature. Also, the 3D model generator posted to HN yesterday builds on NeRF. And they are all using CLIP, I believe. Stable diffusion gave us speed and knowledge about how much data, of what kind and quality, needs to be in the training set to still generate good results.


It's certainly not my intent to undermine the efforts of Robin Rombach, Andreas Blattman, Katherine Crowson (and many others).

Katherine's work on clip-guided-diffusion over the `guided-diffusion` ImageNet checkpoints was effectively the first time the public got to see what text-to-image via diffusion instead of purely transformer-based solutions (like in DALLE1/dalle-mini) would look like. And it happened well before GLIDE was published (and gets a mention/citation).

The CompVis team (Blattman, Rombach, etc.) has been able to not just compete with, but surpass (in some ways - it's nuanced) the work of the big American research labs (OpenAI in particular) with solid novel research. Their research on `VQGAN` outperformed the Autoencoder from the DALLE-1 paper, and they've been competing directly in the vision space ever since.

Incredibly talented people.


Yet the SNR is getting more abysmal by the day.


Looking forward to seeing this in blender. :)


One of these days, I think we're going to be able to feed a book to one of these models and have it create a movie/show/cartoon out of it.

Which as a writer quite excites me, even though it'll probably be quite bad at the beginning, with a flood of terrible products similar to the flood of Unity games.


What kind of output does it generate? What format is the 3D animated model it creates?


Haven’t read the paper yet, but based on the videos it likely generates the rotation of each bone in the skeleton, which can then be used to animate a humanoid skeleton. So the videos you see are those poses being applied to a model.


Plus translation of the root bone.
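For anyone unfamiliar with how per-bone rotations plus a root translation become a pose: it's basically forward kinematics over the skeleton tree. A minimal sketch below, assuming rotation matrices per joint; the joint names, parent indices, and offsets are invented for the example and are not the paper's actual output spec.

```python
import numpy as np

# Illustrative kinematic tree: parent index per joint (-1 = root).
# Names/offsets are made up for the example, not MDM's actual skeleton.
PARENTS = [-1, 0, 1, 2]                     # pelvis -> spine -> neck -> head
OFFSETS = np.array([[0.0, 0.0,  0.0],       # rest-pose offset from parent (meters)
                    [0.0, 0.3,  0.0],
                    [0.0, 0.3,  0.0],
                    [0.0, 0.15, 0.0]])

def forward_kinematics(joint_rotations, root_translation):
    """joint_rotations: (J, 3, 3) rotation matrices; root_translation: (3,)."""
    world_rot = [None] * len(PARENTS)
    world_pos = [None] * len(PARENTS)
    for j, parent in enumerate(PARENTS):
        if parent == -1:
            world_rot[j] = joint_rotations[j]
            world_pos[j] = root_translation
        else:
            world_rot[j] = world_rot[parent] @ joint_rotations[j]
            world_pos[j] = world_pos[parent] + world_rot[parent] @ OFFSETS[j]
    return np.stack(world_pos)

# One generated frame: identity rotations, root shifted up and forward.
frame_rotations = np.repeat(np.eye(3)[None], len(PARENTS), axis=0)
positions = forward_kinematics(frame_rotations, np.array([0.0, 0.9, 0.5]))
print(positions)  # world-space joint positions used to pose the rigged character
```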


Does anyone have some easy-to-understand reading material about latent diffusion principles? I'm trying to understand the core concepts, but I have a hard time finding good sources.



Hello again, such an article is now on the front page and I immediately thought of you:

https://jalammar.github.io/illustrated-stable-diffusion/

https://news.ycombinator.com/item?id=33084205


This video explains it at multiple complexity levels, starting with the simplest explanation. Not sure whether you are looking for something more advanced.

https://www.youtube.com/watch?v=yTAMrHVG1ew


Tel Aviv University keeps kicking ass, great job.


I wonder about the applicability of these types of movements to robotics/motion planning. Anyone familiar with the topic care to comment?


Certainly impossible to implement with current hardware. You'd need a robust physical model that could withstand significant stress from any random permutation of desired motion. Biped robots aren't really ragdolls.

Besides, BD is doing fine with classical optimal control theory for Atlas' locomotion.



