Hacker News | danijar's comments

For a lot of things, VLMs are good enough already to provide rewards. Give them the recent images and a text description of the task and ask whether the task was accomplished or not.

For a more general system, you can annotate videos with text descriptions of all the tasks that have been accomplished and when, then train a reward model on those to later RL against.
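A minimal sketch of the first idea, using a VLM as a binary reward model. The `query_vlm` function is a stub standing in for a real vision-language model call (any API that accepts images plus a prompt and returns text); the prompt wording and frame handling are illustrative assumptions, not a specific system.

```python
def query_vlm(images, prompt):
    # Placeholder for a real VLM call; always answers "no" here.
    # A real implementation would send the images and prompt to a model.
    return "no"

def vlm_reward(recent_frames, task_description):
    """Return 1.0 if the VLM judges the task accomplished, else 0.0."""
    prompt = (
        f"Task: {task_description}\n"
        "Looking at these recent frames, was the task accomplished? "
        "Answer yes or no."
    )
    answer = query_vlm(recent_frames, prompt).strip().lower()
    return 1.0 if answer.startswith("yes") else 0.0

reward = vlm_reward(["frame1.png", "frame2.png"], "pick up the red block")
# -> 0.0 with the stub above
```

The binary yes/no framing keeps the reward well-defined; a graded score is also possible but harder to calibrate.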


It gets diamonds at 1:48 in the top left video (might need to full screen to seek) [1].

The tools are admittedly hard to see in the videos because of the timelapse, and the MP4 compression struggles a bit with the low resolution, but they are there :)

[1]: https://danijar.com/dreamerv3/


It actually has no human data as input and learns by itself in the environment, that's the point of the accomplishment! :)


That's what humans do, right? So it's parroting us.


Yes, it's RL from scratch with sparse rewards.


I agree with you, this is just the start and Minecraft has a lot more to offer for future research!


I think learning to hold a button down in itself isn't too hard for a human or robot that's been interacting with the physical world for a while and has learned all kinds of skills in that environment.

But for an algorithm learning from scratch in Minecraft, it's more like having to guess the cheat code for a helicopter in GTA, it's not something you'd stumble upon unless you have prior knowledge/experience.

Obviously, pretraining world models for common-sense knowledge is another important research frontier, but that's for another paper.


Hi, author here! Dreamer learns to find diamonds from scratch by interacting with the environment, without access to external data. So there are no explainer videos or internet text here.

It gets a sparse reward of +1 for each of the 12 items that lead to the diamond, so there is a lot it needs to discover by itself. Fig. 5 in the paper shows the progression: https://www.nature.com/articles/s41586-025-08744-2


Since diamonds are surrounded by danger, and it loses its items when it dies, why would it not be satisfied after discovering the iron pickaxe or somesuch? Is it in a mode where it doesn't lose its items when it dies? Does it die a lot? Does it ever try digging vertically down? Does it ever discover items/tools you didn't expect it to? Open world with sparse rewards seems like such a hard problem. Also, once it gets an item, does it stop getting reward for it? I assume so. Surprised that it can work with this level of reward sparsity.


In all reinforcement learning there is (explicitly as part of a reward function, or implicitly as part of the algorithm) some impetus for exploration. It might be adding a tiny reward per square walked, a small reward for each block broken, and a larger one for each new block type broken. Or it could be just forcing a random move every N steps so the agent encounters new situations through "clumsiness".
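A minimal sketch of both flavors described above: epsilon-greedy action selection (the forced random move) and a count-based novelty bonus (the reward for encountering new things). The names and the 0.1 scale are illustrative, not from any particular paper.

```python
import random
from collections import Counter

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon take a random action (forced
    'clumsiness'); otherwise take the highest-value action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

visit_counts = Counter()

def novelty_bonus(state, scale=0.1):
    """Tiny intrinsic reward for visiting a state, decaying with
    each revisit -- one way to reward novelty, such as breaking a
    new block type."""
    visit_counts[state] += 1
    return scale / visit_counts[state] ** 0.5
```

In practice the two are often combined: the novelty bonus is added to the environment reward, while epsilon-greedy (or entropy regularization) randomizes the policy itself.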


That's right: there is usually a parameter on the action-selection function that controls the exploration-vs-exploitation balance.


When it dies it loses all items and the world resets to a new random seed. It learns to stay alive quite well but sometimes falls into lava or gets killed by monsters.

It only gets a +1 for the first iron pickaxe it makes in each world (same for all other items), so it can't hack rewards by repeating a milestone.
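A toy sketch of such a once-per-episode milestone reward (not the paper's actual code): each of the 12 items toward the diamond pays +1 the first time it appears in an episode, and repeats pay nothing, so the reward can't be hacked.

```python
# The 12 milestone items from the paper's diamond progression.
MILESTONES = [
    "log", "plank", "stick", "crafting table", "wooden pickaxe",
    "cobblestone", "stone pickaxe", "iron ore", "furnace",
    "iron ingot", "iron pickaxe", "diamond",
]

class MilestoneReward:
    """+1 the first time each milestone item is obtained in an
    episode; repeating a milestone earns nothing."""

    def __init__(self, milestones=MILESTONES):
        self.milestones = set(milestones)
        self.collected = set()

    def reset(self):
        # Called when the agent dies and the world resets.
        self.collected.clear()

    def __call__(self, new_items):
        reward = 0.0
        for item in new_items:
            if item in self.milestones and item not in self.collected:
                self.collected.add(item)
                reward += 1.0
        return reward
```

Because `reset` clears the collected set, the agent can re-earn each milestone in every new world, but never twice within one.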

Yeah it's surprising that it works from such sparse rewards. I think imagining a lot of scenarios in parallel using the world model does some of the heavy lifting here.
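A toy sketch of the imagination idea: roll the learned world model forward from one real state many times in parallel, without touching the environment. The real Dreamer uses learned neural networks operating on latent states; the stand-in model and policy below are placeholders for illustration.

```python
def imagine_rollouts(world_model, policy, start_state, horizon=15, n=32):
    """Imagine n trajectories of `horizon` steps using only the
    world model, and return the imagined return of each."""
    returns = []
    for _ in range(n):
        state, total = start_state, 0.0
        for _ in range(horizon):
            action = policy(state)
            state, reward = world_model(state, action)
            total += reward
        returns.append(total)
    return returns

# Toy stand-ins: a counter "world" that pays off once the state
# exceeds a threshold, and a policy that always picks action 0.
toy_model = lambda state, action: (state + 1, 1.0 if state >= 5 else 0.0)
toy_policy = lambda state: 0
```

Training the policy on these imagined returns is what lets a sparse real-world reward be amplified into a dense learning signal.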


> Yeah it's surprising that it works from such sparse rewards. I think imagining a lot of scenarios in parallel using the world model does some of the heavy lifting here.

This is such gold. Thanks for sharing. Immediately added to my notes.


I just want to express my condolences for how difficult it must be to correct basic misunderstandings that could be immediately resolved by reading the fourth paragraph under the section "Diamonds are forever".

Thanks for your hard work.


Haha thanks!


For the curious, from the link above:

> log, plank, stick, crafting table, wooden pickaxe, cobblestone, stone pickaxe, iron ore, furnace, iron ingot, iron pickaxe and diamond


Yes, you can decode the imagined scenarios into videos and look at them. It's quite helpful during development to see what the model gets right or wrong. See Fig. 3 in the paper: https://www.nature.com/articles/s41586-025-08744-2


So, prediction of future images from a series of images. That makes a lot of sense.

Here's the "full sized" image set [1]. The world model works on low-res images, which also makes sense: ask for too much detail and detail will be invented, which is not helpful.

[1] https://media.springernature.com/full/springer-static/image/...


To me, that's just bad scientific reporting then. As a scientist, I also found this headline a bit misleading.


All science reporting is bad. All of it.


It's necessary if you want to offer an interactive Python shell in the browser, e.g. for websites that teach programming or otherwise use programming as a means of user interaction.


Which is like 0.0001% of the web

