Hacker News

Generalisable how? The model completely hallucinates on invalid input; it's not even high quality, and it required CSGO to work. What output do you expect from this, and what are the alternatives?
None of those questions are relevant, are they? I get the impression you've already decided this isn't good enough, which is basically agreeing with everyone else. No one is talking about what it's capable of today. Read the thread again. We're imagining the strong probability that, a few permutations later, this thing will basically be The Matrix.

It did not require CSGO; that was simply one of their examples. The very first video in the link shows a bunch of classic Atari games, and even the video showing CSGO is captioned "DIAMOND's diffusion world model can also be trained to simulate 3D environments, such as CounterStrike: Global Offensive (CSGO)" — I draw your attention to "such as" being used rather than "only".

And I thought I was fairly explicit about video data, but just in case that's ambiguous: I mean the stuff you record with your phone camera set to video mode, synchronised with the accelerometer data in place of the player's keyboard inputs.
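To make the idea concrete, here's a minimal sketch of what "synchronised with the accelerometer data" could look like in practice: pairing each video frame with the nearest-in-time accelerometer reading, the same way game footage gets paired with keyboard inputs. Everything here (function names, sample rates) is illustrative, not anything from DIAMOND's actual code.

```python
# Hypothetical sketch: align 60fps video frames with ~100Hz accelerometer
# samples by nearest timestamp, producing (frame_time, accel_vector) pairs
# analogous to (frame, keyboard_input) pairs in game-footage training data.
from bisect import bisect_left

def nearest_reading(timestamps, target):
    """Index of the sorted sensor timestamp closest to `target`."""
    i = bisect_left(timestamps, target)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[i - 1], timestamps[i]
    return i if after - target < target - before else i - 1

def pair_frames_with_accel(frame_times, accel_times, accel_values):
    """Build (frame_time, accel_vector) training pairs, nearest-neighbour in time."""
    return [(t, accel_values[nearest_reading(accel_times, t)])
            for t in frame_times]

# Toy data: 5 frames at 60fps, 10 accelerometer samples at 100Hz.
frames = [i / 60 for i in range(5)]
accel_t = [i / 100 for i in range(10)]
accel_v = [(0.0, 0.0, float(i)) for i in range(10)]

pairs = pair_frames_with_accel(frames, accel_t, accel_v)
```

In a real pipeline you'd interpolate or integrate the sensor stream rather than snap to the nearest sample, but the point is the same: the sensor channel plays the role the keyboard played for the game-footage version.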

As for output, with the model as it currently stands, I'd expect 24 hours of training video at 60fps to yield something "photorealistic, but with similar weird hallucinations". Which is still interesting, even without combining it with a ControlNet the way Stable Diffusion can.


You do the same thing at a larger scale, and instead of video game footage you use a few million hours of remote-controlled drone input from the real world.


