Reinforcement learning’s foundational flaw (thegradient.pub)
132 points by andreyk on July 9, 2018 | 49 comments



In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

“What are you doing?”, asked Minsky.

“I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.

“Why is the net wired randomly?”, asked Minsky.

“I do not want it to have any preconceptions of how to play”, Sussman said.

Minsky then shut his eyes.

“Why do you close your eyes?”, Sussman asked his teacher.

“So that the room will be empty.”

At that moment, Sussman was enlightened.

"picking the right reward function" in RL is shockingly hard. It actually works OK-ish when the problem space is strictly bounded, like with a game whose rules are known.

After that, you start getting into sky-humping cheetah problems: https://www.alexirpan.com/public/rl-hard/upsidedown_half_che...

https://www.alexirpan.com/2018/02/14/rl-hard.html is a better article, perhaps, than this one.


> https://www.alexirpan.com/public/rl-hard/upsidedown_half_che...

You can argue, but it works. Goal accomplished. Nature has done stranger things than that.

Now you just have to add efficiency terms to your reward function and watch it slowly find a local minimum. That is also something nature has done plenty of times. Then hope the remaining inefficiency is okay for you.

Done.
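Roughly, the kind of hand-tuned reward this recipe ends up with looks something like the toy sketch below (my own untested Python with made-up weights, not anything from the article). The weights are exactly the part that bites you: get them wrong and the agent happily discovers flipping, vibrating, or sky-humping instead of running.

    import numpy as np

    def shaped_reward(x_before, x_after, action, dt=0.05, ctrl_weight=0.1):
        # Toy locomotion reward: forward progress minus an energy penalty.
        # Both the form and the weights are guesses; tuning them is the
        # "picking the right reward function" problem in miniature.
        forward_velocity = (x_after - x_before) / dt
        energy_cost = ctrl_weight * np.sum(np.square(action))
        return forward_velocity - energy_cost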

-

You can start with a random neural net. It's not exactly empty, but it's ok. The randomness defines which local minimum you'll find this time around while you burn through your VC money desperately hoping that the AWS bills for all the GPU time will arrive after you've found the holy grail that makes your startup worth <insert bullshit valuation>.

> "picking the right reward function" in RL is shockingly hard.

Agree, x1000. The same goes for getting enough computational power and good-quality data.

-

On another post I've added some links (AI playing Mario): https://news.ycombinator.com/item?id=17489459


> You can start with a random neural net. It's not exactly empty, but it's ok.

One interesting thing about random NNs is how much they can already do. For example, you can do single-shot image inpainting or superresolution with an untrained, randomly initialized CNN: https://dmitryulyanov.github.io/deep_image_prior In RL, you can use randomly sampled NNs to create various artificial, arbitrary 'reward functions' that force an RL agent to explore an environment and learn the dynamics; then, when you give it the real reward function, it learns to optimize it much faster. Similarly, you can sample random NNs and execute them for entire trajectories for 'deep exploration', providing demonstrations of potentially long-range strategies far more efficiently than simple random-action strategies. In 'reservoir computing', as I understand it, you don't even bother training the NN: you just randomly initialize it and train a simple model on its outputs, assuming that some of the random, highly nonlinear relationships encoded in the NN will turn out to be useful, which sounds crazy but apparently works. Makes one think about Tegmark's interpretations.
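A minimal sketch of the reservoir idea (toy numpy of my own, with arbitrary sizes and a made-up task): the random projection stays fixed forever and only a linear readout is fit on top of it.

    import numpy as np

    rng = np.random.default_rng(0)

    # Fixed random "reservoir": an untrained hidden layer.
    n_in, n_hidden = 10, 500
    W = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_hidden))
    b = rng.normal(size=n_hidden)

    def random_features(X):
        # Random nonlinear projection; never trained.
        return np.tanh(X @ W + b)

    # Toy regression problem.
    X = rng.normal(size=(1000, n_in))
    y = np.sin(X.sum(axis=1))

    # Only the linear readout is fit (ridge regression, closed form).
    H = random_features(X)
    lam = 1e-3
    readout = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)

    print("train MSE:", np.mean((H @ readout - y) ** 2))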


Re sky cheetah: yes, if you ask a stupid question, you get a stupid answer. That cheetah model is oversimplified and bad; it has a head, but there are no requirements or objectives for the head.


my point re: sky cheetah is that there are large domains where the only questions you can ask are stupid, which is to say, ill-posed.

in those situations you cannot expect RL to come up with “normal” solutions that do what you actually want. The sky-humping is a totally valid answer to the question asked; it’s merely that the reward function for “walk forward” doesn’t (and in many situations, may not ever) fully constrain the search space such that you get sane solutions.


The rl-hard.html article was discussed on HN only four months ago:

https://news.ycombinator.com/item?id=16383264

Makes for an interesting trail of additional breadcrumbs to wind into the stack :)


I have been searching for this Marvin Minsky quote for a long time. Thank you for posting it again!


I don't get it, could you please explain it?


If you wire the neural net randomly, it still has preconceptions about how to play - you just don't know what they are. You're just closing your eyes to the initial wiring, but that doesn't make the initial wiring non-existent (in the same way that the room did not become empty when Minsky closed his eyes).
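To make that concrete, here's a toy illustration of my own (untested, and nothing to do with Sussman's actual program): an untrained net already "prefers" some opening moves on an empty Tic-Tac-Toe board, purely as an artifact of its random weights.

    import numpy as np

    rng = np.random.default_rng(42)

    # A tiny untrained two-layer net: 9 board cells in, 9 move scores out.
    W1, b1 = rng.normal(size=(9, 32)), rng.normal(size=32)
    W2, b2 = rng.normal(size=(32, 9)), rng.normal(size=9)

    empty_board = np.zeros(9)
    scores = np.tanh(empty_board @ W1 + b1) @ W2 + b2
    probs = np.exp(scores) / np.exp(scores).sum()

    # Far from uniform (1/9 ~ 0.111): the random wiring already encodes
    # "preconceptions" about which opening move to play.
    print(np.round(probs, 3))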


Thanks!


This is a weirdly shallow article: lots of diagrams and bullet points that just summarize the well-known points that RL needs a lot of data and has to learn from scratch.

No mention of all the ongoing work in learning from demonstrations, or more generally incorporating any off-policy knowledge. Vague speculations about the philosophy of model free learning. Not really worth the read (as someone working in RL).


All that stuff is in part two! https://thegradient.pub/how-to-fix-rl/

It says as much at the end... To be fair, we did warn up front: "The first part, which you're reading right now, will set up what RL is and why it is fundamentally flawed. It will contain some explanation that can be skipped by AI practitioners." But personally I think the board game allegory is fun, and most people tend to forget the categorical simplicity of Go and Atari games and overhype them; it's easy to say the main points are not new, but the details are important here.


Calling model-free RL "fundamentally flawed" is just click-baiting. Too bad it worked on me; I was hoping for insight.


In your opinion, is this a way to avoid the "AI winter" that is often talked about? I'm an engineer, not involved in AI, but things like meta-reinforcement learning seem, from the info/perspective you've given, to address the problem, at least partially.


I think an AI winter is unlikely to come about this time, since non-RL stuff (supervised learning) has been so successful and useful.


Yes, some technologies are overhyped (chatbots, finance stuff), but deep learning has delivered a lot of incredible working applications. It is not just hot air or marketing hype.


Expert systems were not just hot air or marketing hype either. The usefulness of a subset of the new AI technology is irrelevant. A winter or contraction is caused by expectations not being met, and it seems, at least to me, that investors/funders have already started expecting superhuman performance in image/speech recognition, and there's a lot of expectation in robotics too, which will probably not be met by actual results any time soon.


I appreciated the point that pure RL is insufficient for many tasks. But why also downplay the achievements of pure RL where it has matched or surpassed skilled humans?

Captioned chart:

The progression of AlphaGo Zero's skill. Note that it takes a whole day and thousands of lifetimes' worth of games to get to an ELO score of 0 (which even the weakest human can achieve easily).

I'm pretty sure that a one-week-old infant's ELO score would also fall short of 0. Sure, the AI did things that no human could do in order to match and then surpass human performance. Great! Half of the fun of following AI research is seeing it refute old intuitions about how human-like systems have to be to perform well on tasks previously considered to require human intelligence.

Whatever "general intelligence" or "human level intelligence" comes to mean by the 2050s, it looks like it's going to be a lot better pruned-by-counterexample than it was in the 1950s.


While I appreciate the sentiment, I think the fact that we can learn from fewer examples demonstrates that the machine's learning process isn't as efficient as ours, and therefore isn't yet optimal. It seems like a goal should be for learning to be at least as efficient for computers as it is for humans.


We have 86 billion neurons in our brains that all crave to be used, so you can even imagine them as individual agents trying to get along.

It's like 86 billion guys trying to please that thing they simultaneously produce (our consciousness). What I want to say is: the algorithm can be dumb as f*. I call it the f-star algorithm. But the computational power in our brains is extremely high.


I don’t think throwing more computational power at the problem is the right answer to all ML problems.


Those neurons are arranged and incentivized cleverly. The structure is also very important and is necessary for the resulting intelligence.

So it's not only computational power, but also the unique structure nature found through trial and error.


I wonder how much of that cleverness will be gleaned and appropriated by those who design three-dimensional chips.


I (the author) think it is worth downplaying the achievements when perspective on their significance is lacking; Henry Kissinger recently wrote a whole op-ed founded mostly on a misunderstanding of how big a deal AlphaGo really is: https://www.theatlantic.com/magazine/archive/2018/06/henry-k...

But I did try not just to denigrate this work, but rather to both praise it and discuss its sometimes-ignored flaws.


Didn't someone just recently post a DQN solution for Montezuma's Revenge (the game that, according to this article, it cannot solve)?

> "Though DQN is great at games like Breakout, it is still not able to tackle relatively simple games like Montezuma's Revenge"

Yep:

https://www.engadget.com/2016/06/09/google-deepmind-ai-monte...

https://blog.openai.com/learning-montezumas-revenge-from-a-s...

It's far too early in this research to say exactly what can and can't be solved by RL.


The OpenAI solution uses demonstrations, though, which is the article's point: bare DQN can't solve these games, and something like demonstrations is needed.


But if you look at how it uses the demonstrations, it's quite interesting. It uses them only as a series of starting points to learn from. As far as I understand, it doesn't actually use state-action pair examples at all, which is quite different from what the phrase "uses demonstrations" brings to mind. It simply starts the simulation at places that the one single demonstration it has access to reached. In other words, the examples it learns from say nothing more than "by this point in the game you could get here", and nothing about "when you are here, you should do this".

In a sense it's pretty similar to how you'd learn a game if you watched someone play it through once. (Except backwards, perhaps.)
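Very roughly, as I understand it, the training loop is a reverse curriculum along that one demonstration, something like the pseudo-Python below (env.restore_state and the other helpers are made up for illustration, not OpenAI's actual API):

    # Sketch only: assumes an emulator that can be reset to saved snapshots.
    demo_states = load_single_demonstration()   # snapshots along one playthrough (hypothetical helper)

    start_idx = len(demo_states) - 1             # begin near the end of the demo
    while start_idx >= 0:
        snapshot = demo_states[start_idx]
        for _ in range(episodes_per_stage):
            env.restore_state(snapshot)          # start the episode from the demo snapshot
            run_episode_and_update_policy(env, policy)
        if success_rate(policy, snapshot) > threshold:
            start_idx -= 1                       # mastered this stage: move the start earlier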


yep, we in fact link to this work...

"Even 5 years later, no pure RL algorithms have cracked reasoning and memory games; on the contrary, approaches that have done well at them have either used instructions <link> or demonstrations <link> just as we mentioned would make sense to do in the board game allegory."


It also assumes access to the simulator, which is an even more problematic assumption. That's like saying you're doing image classification but assuming access to the 3D model which generated the image.


I think that analogy is a bit bogus, but if you want to make it, it's more like assuming access to a function that renders the 3D model from a variety of perspectives on command, not having access to the model itself.

(Because the RL algorithm doesn't have access to the rules by which the simulation is carried out, it only has access to the commands and the result.)

And frankly, that would be a perfectly fair and interesting classification problem, so I don't see your point.

Otherwise, how exactly do you propose learning to drive a simulation without access to the simulation? I really don't know what you're saying here.


My point is that the two problems are quite distinct. This is not a small change to how the problem is being solved, but a complete change of the problem itself. Further, the change significantly limits the feasibility of the solution, which is not made sufficiently clear by the authors of the blog post. Casual followers of AI/RL research might think that this is significant progress, while in fact it's progress on a problem that hasn't really received any attention due to its uselessness. I think there may be one or two papers with experiments on this problem, versus probably hundreds on the model-free problem.

Thanks for your analogy though. I agree that it's better than mine. I was only trying to give a rough idea, but I'll use your analogy if I have to now. :)


True, but the OpenAI article on Montezuma's Revenge stated that their approach didn't work for Pitfall and one other game.


I just skimmed the article, but it doesn't seem like there is any talk of more modern approaches.

Newer approaches have the agent learn "primitives" through curiosity. Its sub-goal is to predict future states given the current state plus an action.

This makes the problem more hierarchical and reduces the search space, which makes it feasible for more complex scenarios.

I haven't personally heard of a lot of research on this part but I imagine that transfer learning becomes more feasible as well once some "primitives" are established.
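For flavor, here's a generic sketch of the forward-model curiosity idea (toy numpy of my own, not any particular paper's code): the agent's prediction error on the next state becomes an intrinsic reward, so transitions it can't yet predict are the "interesting" ones.

    import numpy as np

    rng = np.random.default_rng(0)
    state_dim, action_dim = 4, 2

    # Forward model: predicts next state from (state, action).
    # Kept deliberately tiny (a single linear map) for the sketch.
    W = rng.normal(scale=0.1, size=(state_dim + action_dim, state_dim))

    def predict_next(state, action):
        return np.concatenate([state, action]) @ W

    def curiosity_reward(state, action, next_state):
        # Intrinsic reward = forward-model prediction error: the agent
        # gets paid for visiting transitions it cannot yet predict.
        error = next_state - predict_next(state, action)
        return float(np.sum(error ** 2))

    def update_forward_model(state, action, next_state, lr=1e-2):
        # One gradient step on the squared prediction error.
        global W
        x = np.concatenate([state, action])
        error = predict_next(state, action) - next_state
        W -= lr * np.outer(x, error)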


Have a read of part 2, where the author discusses approaches to solving the problems raised in part 1: https://thegradient.pub/how-to-fix-rl/

It seems like there's a bit of research in this area but it's not receiving the attention it may deserve. At least, that's how I interpreted the author's tone.


Yep, thanks, the idea was to highlight all the research going on and argue it deserves more attention.


Any good links on the approaches you mention, bcheung?


The article's use of the word "flaw" is overstated.

For background, here are some selected quotes from the article:

> "The first part, which you're reading right now, will set up what RL is and why it is fundamentally flawed." > "In the typical model of RL, the agent begins only with knowledge of which actions are possible; it knows nothing else about the world, and it's expected to learn the skill solely by interacting with the environment and receiving rewards after every action it takes." > "how reasonable is it to design AI models based on pure RL if pure RL makes so little intuitive sense?"

To summarize, the article claims that this particular aspect of RL is a "flaw".

I'd suggest it is more useful to call it a design choice. In many cases, this design choice has beneficial properties.

Of course, there are other ways to build learning agents. The field of RL is certainly open to alternatives, including hybrid models and/or relaxing this particular assumption.

I've seen a good number of (popular) articles about RL making rather broad claims, like this article. It appears to me that many of these articles attempt to 'reduce' RL to a smaller/narrower version of itself in order to make their claims. I hope more people start to see that RL is a set of techniques (not a monolith) that can be mixed and matched in many ways for particular applications.


To be fair, in the article itself we wind up criticizing "pure RL" (defined as the basic formulation that is typically followed, in which all learning is done from just the reward signal) and not RL as a whole. We call out a lot of awesome non-pure-RL work in the second part and suggest it deserves more attention and excitement than, e.g., AlphaGo.


Fair enough. Your article makes a lot of good points, for sure.

Here is a quote from the article I want to mention: “Trying to learn the board game 'from scratch' without explanation was absurd, right?”

No. It is hardly absurd. Sometimes it works, sometimes not. It is a great starting point, if nothing else. So, I wonder if we have different ideas of what ‘absurd’ means.

I agree that we’re in a period of hype. It requires careful work to write clearly without too much zeal or oversimplification. My opinion here is that your attempt to ‘balance’ the debate uses a lot of language that I (and others) perceive as exaggerated.


Have any of the approaches recommended in part 2 been shown to give equivalent or better results than AlphaGo Zero? E.g., matching or surpassing human performance with shorter learning time or fewer compute resources?


Enjoyed the article. But I'm a bit confused by the Venn diagram; neither StarCraft nor DOTA are deterministic or fully observable. And they are discrete only at such extreme resolutions that they may as well be continuous.


SC2 is deterministic. A replay file is simply a record of all actions performed. There is no randomness in the action resolutions, which allows actions to be the only thing that needs to be streamed over to sync game states.


Well, it's deterministic once you know the random seed, which is stored in the replay file. An agent doesn't know the seed, and hence cannot predict the exact outcome of its actions, only a probability distribution over the outcomes. So, from the agent's perspective, the game is indeed random.
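A toy illustration of that distinction (generic Python of my own, nothing SC2-specific): with the seed, the recorded actions reproduce the trajectory exactly; without it, the same actions only give you a distribution over outcomes.

    import random

    def simulate(actions, seed):
        # Toy "game": each action's effect gets a perturbation drawn from a
        # seeded RNG, so the engine is deterministic given (seed, actions),
        # which is all a replay file needs to store.
        rng = random.Random(seed)
        return sum(a + rng.choice([-1, 0, 1]) for a in actions)

    actions = [1, 2, 3]
    assert simulate(actions, seed=123) == simulate(actions, seed=123)  # replay is exact

    # The agent, not knowing the seed, only sees a spread of possible outcomes.
    print(sorted({simulate(actions, seed=s) for s in range(20)}))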


True, not all properties are there to the same extent as in, e.g., Go, but they are there to some extent, and more broadly StarCraft and Dota are still 'game-like', i.e. not as complex as a self-driving car operating in the real world. The Venn diagram is tricky to get right...


The article claims that RL is simplistic because it uses an unreasonable amount of data. However, recent advances are significant precisely because they can use an unreasonable amount of data. As an example, I don't expect to be as good as Michael Jordan no matter how much I play basketball, or to beat Garry Kasparov no matter how much I play chess. There's a fundamental flaw in my learning algorithm that prevents me from becoming good at something even with infinite experience.

Recent RL research on policy gradients, on-policy vs. off-policy methods, function approximation, and model-based vs. model-free learning is all about how to get good at something with a lot of practice. RL has been around for a long time, and discussions about higher-level learning and planning have been had over and over. One doesn't discount the other. One deals with how to structure the learning problem so that you can continue to get better with more experience (the RL problem), while the other is about how to use higher-level learning to speed it up.


I think the author of this article has fundamentally missed the mark. He talks about humans as if they come out of the womb being able to play Chess. On the contrary, we try and fail to make even simple sounds, and later, words, phrases, crawling, walking, etc.


We do have imitation learning. I think the article is missing some important parts of RL. One way to train a network is to use experience from others, or even the agent's own past experience, but why is doing it from scratch interesting? Because learning from scratch lets us tackle many more problems where we don't have any prior information or skills, because we avoid the biases in the data, and because we can discover new things (as happened with AlphaGo Zero and its 'tactics' in Go). Now we have a method that can be applied to any problem similar to a board game without any other information, just the rules of the domain.


All that stuff is in part two! https://thegradient.pub/how-to-fix-rl/

"In part two, we’ll overview the different approaches within AI that can address those limitations (chiefly, meta-learning and zero-shot learning). And finally, we'll get to a survey of monumentally exciting work based on these approaches, and conclude with what that work implies for the future of RL and AI as a whole. "


I'd be curious to know how far you can go with marrying GOFAI systems for high level planning work and RL systems for pattern recognition and tactical actions.



