
Reinforcement learning is very good with games.

>> In Minecraft, the team used a protocol that gave Dreamer a ‘plus one’ reward every time it completed one of 12 progressive steps involved in diamond collection — including creating planks and a furnace, mining iron and forging an iron pickaxe.

And that is why it is never going to work in the real world: games have clear objectives with obvious rewards. The real world, not so much.



For a lot of things, VLMs are good enough already to provide rewards. Give them the recent images and a text description of the task and ask whether the task was accomplished or not.

For a more general system, you can annotate videos with text descriptions of all the tasks that have been accomplished and when, then train a reward model on those to later RL against.
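
Sketching what that might look like in practice: the snippet below uses a hypothetical query_vlm helper standing in for whatever vision-language model API you actually call. It's a rough illustration of the idea, not any specific library's interface.

    # Minimal sketch of a VLM-as-reward-function, assuming some vision-language
    # model you can query. `query_vlm` is a hypothetical placeholder, not a real
    # library call: swap in whatever API you actually use.

    from typing import Sequence

    def query_vlm(images: Sequence[bytes], prompt: str) -> str:
        """Hypothetical VLM call: send recent frames plus a question, get text back."""
        raise NotImplementedError("wire this up to your VLM of choice")

    def vlm_reward(recent_frames: Sequence[bytes], task_description: str) -> float:
        """Ask the VLM whether the described task was accomplished in the frames.

        Returns 1.0 for 'yes', 0.0 otherwise -- a sparse, somewhat noisy reward,
        but often usable for RL.
        """
        prompt = (
            f"Task: {task_description}\n"
            "Based on these frames, was the task accomplished? Answer yes or no."
        )
        answer = query_vlm(recent_frames, prompt).strip().lower()
        return 1.0 if answer.startswith("yes") else 0.0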


Plenty of real world situations have clear objectives with obvious rewards.


Example.


Fold clothes -> clothes are folded.

Take children to school -> they safely arrive on time.

Autonomous driving -> arrive at destination without crashing.

Call centre -> customers are happy.


Those don't look like rewards, or at least don't get processed as such for many people (myself included).

Or maybe there is some art to finding happiness in simple things like having folded clothes or surviving the commute?


In RL rewards can be anything you want. They don't have to be things that humans like.


Fair enough!

I guess you can always find some well-specified, measurable goal/reward, but then that choice limits the performance of your model. It's fine when you're building a very specialized system; it gets more difficult the more general you're trying to be.

For a general system meant to operate in a human environment, the goal ends up approaching "things that humans like". Case in point: that's what the overall LLM objective is - continuations that make sense to humans, in the fully general meaning of that phrase.


>> Fold clothes -> clothes are folded.

>> Take children to school -> they safely arrive on time.

>> Autonomous driving -> arrive at destination without crashing.

>> Call centre -> customers are happy.

Define a) "folded", b) "safely", c) "destination", d) "happy".

Also define the reward functions for each of the four objectives above.


Safely -> no crashes

Destination -> Like, close to the destination? I don't see how that's hard.

Happy -> you can use customer feedback for this

Folded -> this is indeed the trickiest one, but I think well within the capabilities of modern vision models.
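
To make that concrete, here is roughly how those informal definitions could be turned into a reward function for the driving case. All the names, thresholds and magnitudes are illustrative, not a recommendation.

    import math

    def driving_reward(position, destination, crashed, arrived_threshold_m=5.0):
        """Toy reward for the driving example: penalize crashes, reward arrival,
        with a small shaping term for getting closer to the destination."""
        if crashed:
            return -100.0                        # "safely -> no crashes"
        dist = math.dist(position, destination)  # Euclidean distance (Python 3.8+)
        reward = -0.001 * dist                   # mild shaping toward the goal
        if dist < arrived_threshold_m:
            reward += 10.0                       # "close to the destination"
        return reward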


>> Safely -> no crashes

Really? What about fires? Falling off cliffs? Causing others to crash?

Your "examples" are all hand-wavy and vague and no good to train an RL agent. You've also not provided a reward function.


Work a job, receive money


That's a weak example in the context of at least salaried jobs, especially in the context of RL, as the "receive money" part is usually both significantly delayed from the "work a job" part and only loosely affected by it.


The delay between action and reward is a pretty fundamental problem with RL in general. I don't think they've come up with a really good solution yet.

Of course the delay is much bigger with working a job than in most RL games, but fundamentally it's the same problem.
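
For what it's worth, the standard way RL formalizes this is the discounted return: each action gets credit for later rewards, scaled down by how far in the future they arrive. A minimal sketch, with gamma and the toy "salary at the end of the month" episode purely illustrative:

    def discounted_returns(rewards, gamma=0.99):
        """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...

        The further a reward lies in the future, the less credit the current
        action gets for it -- which is exactly where long action-reward delays
        hurt (the credit-assignment problem)."""
        returns = [0.0] * len(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    # A toy "salary at the end of the month" episode: 29 days of zero reward,
    # then one payout. By day 1 the payout is discounted by gamma^29 (~0.75 here;
    # far smaller for longer delays or lower gamma).
    print(discounted_returns([0.0] * 29 + [1.0])[:3])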


> And that is why it is never going to work in the real world: games have clear objectives with obvious rewards. The real world, not so much.

I encourage you to read deepmind's work with robots.


Oh I have. For example I remember this project:

>> Quantitatively, the QT-Opt approach succeeded in 96% of the grasp attempts across 700 trial grasps on previously unseen objects. Compared to our previous supervised-learning based grasping approach, which had a 78% success rate, our method reduced the error rate by more than a factor of five.

https://research.google/blog/scalable-deep-reinforcement-lea...

That was in 2018.

So what do you think, is vision-based robotic manipulation and grasping a solved problem, seven years later? Is QT-Opt now an established industry standard in training robots with RL?

Or was that just another project that was announced with great fanfare and hailed as a breakthrough that would surely lead to a great increase in capabilities... only to pop, fizzle, and disappear into obscurity without any real-world result a few years later? Like most of DeepMind's RL projects do?


Let's look at 2025

https://www.youtube.com/watch?v=x-exzZ-CIUw

It looks pretty awesome. Let's see what happens.


Nice robot demo. Here's another one:

https://youtu.be/03p2CADwGF8?si=BXeWXqu1_3WMS4yy

A robot assembling a puzzle with machine vision!

And it's only from the 1970's.


I don't think those demos are comparable, but cool of you to share the link!


Absolutely comparable. Consider what can be done today with hardware as powerful as in the 1970's, and it's obvious that the needle hasn't budged one tick.

But, like you say, let's wait and see. I always do the former, but I'm still waiting for the latter.


> games have clear objectives with obvious rewards. The real world, not so much.

Tell that to the people here who are trying to turn their startup ideas into money.


I don't think folks go down the startup path because the steps from idea to making money are obvious and clear.


> it is never going to work in the real world

DeepSeek used RL to train R1, so that is clearly not true. But ignoring that, what is your alternative? Supervised learning? Good luck finding labels if you don’t even know what the objective is.


No, let's not ignore DeepSeek: text is not the real world any more than Minecraft is the real world.

And why do I have to offer an alternative? If it's not working, it's not working, regardless of whether there's an alternative (that we know of) or not.



