Reinforcement Learning with Prediction-Based Rewards (blog.openai.com)
165 points by lainon on Oct 31, 2018 | 38 comments



I like the idea of having the agent be attracted to the unpredictable, but I guess there should be something to ensure that unpredictability doesn't dominate action selection. For an interesting/funny example, check out their two videos: "Agent in a maze without a noisy TV" and "Agent in a maze with a noisy TV".


How about an approach where the agent's reward is not the predictability itself but its first derivative? This way the agent will be attracted to the parts of the environment where it can improve, and will avoid white-noise parts, since its model of the world doesn't generalize to those.
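Something like that fits in a few lines. Here's a rough sketch of the idea (mine, not from any of the papers here; the forward-dynamics model and the single-gradient-step update are just assumptions): the intrinsic reward is how much the prediction error drops after one update, so white noise, which the model never improves on, earns roughly zero.

    import torch
    import torch.nn as nn

    class LearningProgressReward:
        """Intrinsic reward = decrease in prediction error after one
        gradient step, i.e. the first derivative of predictability."""

        def __init__(self, obs_dim, act_dim, lr=1e-3):
            self.model = nn.Sequential(
                nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
                nn.Linear(128, obs_dim))
            self.opt = torch.optim.Adam(self.model.parameters(), lr=lr)

        def reward(self, obs, act, next_obs):
            inp = torch.cat([obs, act], dim=-1)
            err_before = ((self.model(inp) - next_obs) ** 2).mean()
            self.opt.zero_grad()
            err_before.backward()
            self.opt.step()
            with torch.no_grad():
                err_after = ((self.model(inp) - next_obs) ** 2).mean()
            # reward only where the model actually got better; on pure
            # noise err_before ~= err_after, so the reward stays near zero
            return (err_before - err_after).item()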

Juergen Schmidhuber (co-author of the original LSTM paper) had a very similar idea: http://people.idsia.ch/~juergen/driven2009.pdf

"This drive maximizes interestingness, the first derivative of subjective beauty or compressibility, that is, the steepness of the learning curve. It motivates exploring infants, pure mathematicians, composers, artists, dancers, comedians, yourself, and (since 1990) artificial systems."


If you read it, that's exactly what they address here. They say they address the noisy-TV problem, and the video shows why they needed to.

> These choices make RND immune to the noisy-TV problem.
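For anyone skimming: the mechanism is a fixed, randomly initialized target network plus a predictor trained to match its outputs, with the predictor's error as the intrinsic reward. A rough sketch (layer sizes and optimizer settings are my guesses, not the paper's):

    import torch
    import torch.nn as nn

    def mlp(in_dim, out_dim):
        return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                             nn.Linear(128, out_dim))

    class RND:
        def __init__(self, obs_dim, feat_dim=64, lr=1e-4):
            self.target = mlp(obs_dim, feat_dim)
            for p in self.target.parameters():
                p.requires_grad_(False)   # the target is never trained
            self.predictor = mlp(obs_dim, feat_dim)
            self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

        def reward_and_update(self, next_obs):
            # intrinsic reward = error against the fixed random target
            err = ((self.predictor(next_obs) - self.target(next_obs)) ** 2).mean()
            self.opt.zero_grad()
            err.backward()
            self.opt.step()
            return err.item()

The target is a deterministic function of the observation, so even on a noisy TV the predictor can eventually fit it; there is no irreducibly unpredictable label to keep chasing.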


I would imagine they would need some kind of breakaway factor that allows the agent to decide that, despite the unpredictability, what it's trying to explore might not be worth the reward, or that there is no reward behind it.


Finally someone beat Montezuma's Revenge without imitating human demonstrations! Very cool. I wonder, then, why the algorithm fails so hard on Pitfall. I would expect them to be similar problems.


Looks like the AI winter update needs another update:

https://blog.piekniewski.info/2018/10/29/ai-winter-update/


All right, you led me into an interesting rabbit hole there. I saw the AI winter page, and after reading a bit, the guy (Filip Piekniewski) said: "...the only problem really worth solving in AI is Moravec's paradox, which is exactly the opposite of what DeepMind or OpenAI are doing...". I don't know what he thinks the exact meaning of said paradox is, but in my ignorance I went to check Wikipedia, which gave me: "Moravec's paradox is the discovery... that, contrary to traditional assumptions, high-level reasoning requires very little computation, but low-level sensorimotor skills require enormous computational resources."

Well, this sounded pretty obvious, if you accept "high-level reasoning" as meaning... well, let's see what Wikipedia says a few lines later, quoting Moravec himself: "it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers,...". So I can see the same contradiction in both the Wikipedia definition and Moravec's statement. In my opinion, and I think in the common-sense opinion, "intelligence tests" sounds either very broad or very narrow, but either way useless for making the point, even more so when "playing checkers" sits in the same sentence. Common sense says there is an ocean of distance between playing checkers and the "high-level reasoning" of Wikipedia's definition (which makes me laugh, sorry for the sincerity).

The amount of memory required to do real, what-I-call high-level reasoning could be huge just to store the code of the heuristic algorithms, not to mention all the "concepts" of "things" and their possible ways of "interaction", all the interactions and their possible subject-"things", and the mechanisms we have no idea are in place, probably driven by the most primitive reaction-reasoning mechanisms, who in heavens knows... Just my opinion after biting the bait : ), in the form of a light comment. Cheers.


Just to add a bit about the "AI winter reality check" vs. "Stephen Hawking hype" point: I think both are right and wrong at the same time. I do believe that the analysis of human reasoning and its simulation in computers can be achieved with much less computing power than most insects' sensorial systems need (so in this sense I agree with Moravec, but...). Computing power can accelerate learning in a system, but that's a question of time and nothing more; it doesn't really matter whether it takes a week or a year to "set-train-load" a reasoning, human-level AI, the achievement would be amazing anyway.

That, in my opinion, is the mistake in the whole hype wave's direction: neural nets are just for acceleration and for the generation of datasets (what, you say!? yes, generation, that's what they are really for, totally contrary to the world's opinion : ) !!!), and obviously for the "lesser utility" of autonomous programming of sensorial systems, which is what they are being used for right now. But that has nothing to do with "reasoning intelligence", which would be, in my humble opinion, let's say, a program that could take part in and make some insightful contribution to the famous talk between Albert Einstein and Rabindranath Tagore. [1] That said, I do believe a non-neural-net "real deal" AI is just around the corner, not that far off, really, and here maybe I lean a bit toward the "Hawking hype" people, but completely differently from them: I believe it has nothing to do with neural nets (which are just tools), and is, let's say, "non-neural-network heuristic based".


Your text is unreadable on mobile. Can you remove the code tag?


Sorry, it was not my intention to add the code tag. I tried to edit it to remove the scroll bar, but I couldn't. I have to learn more about this; first-timer here.


I think the code tag shows up when you indent a line of text. Check whether there are any spaces at the start of your message.


No worries!


Montezuma's Revenge isn't beaten; the agent only clears the first level.


OK, it beats average human performance.


By the way, the fact that "average" and "best" human performance are presented as meaningful benchmarks is one of the biggest signs that modern AI is driven by hype rather than science.

For example, speech recognition AI is supposedly within a fraction of a percent of "average human level", and yet auto-generated captions are awful. They have no punctuation, they don't distinguish between different speakers, they aren't visually grouped, and they fail miserably at slang. So it turns out researchers are measuring only the one aspect of the problem their algorithm is good at, and ignoring the rest.

On the flip side, we have animal intelligence. Bees aren't nearly as smart as humans. So surely modern AI, which surpasses humans at this and that, would have no problem outperforming a bee with its 960,000 neurons, right? But in reality, there is nothing that even approaches a bee's versatile intelligence. Of course, modern AI researchers would just hand-wave this away, saying the problem is not well defined. Convenient.


> For example, speech recognition AI is supposedly within a fraction of a percent of "average human level", and yet auto-generated captions are awful. They have no punctuation, they don't distinguish between different speakers, they aren't visually grouped, and they fail miserably at slang. So it turns out researchers are measuring only the one aspect of the problem their algorithm is good at, and ignoring the rest.

YouTube captioning != SOTA, any more than Google Translate for years and years represented anything close to the NMT SOTA.


Well, for a certain definition of 'human performance'... I believe that's carried over from the DQN paper and is something like 'an ordinary video game player given a few hours'. When it comes to the ALE, you should usually treat the 'human performance' numbers as lower bounds.

(In this case, if an agent can beat 'human performance' by only clearing 1 of 9 total levels, one is entitled to a little skepticism about how useful 'human performance' is as a benchmark for this particular game. Focus on the improvement over other DRL agents, not that.)



It was literally only a few days ago that Google/DeepMind posted about a new approach to curiosity: https://ai.googleblog.com/2018/10/curiosity-and-procrastinat...

That blog post also mentions the noisy TV problem.

I am not skilled enough to describe how these approaches differ.


From the Google one:

> The environment also contains a TV for which the agent has the remote control. There is a limited number of channels (each with a distinct show)... even if the order of shows appearing on the screen is random and unpredictable, all those shows are already in memory!

So in Google's "Curiosity and Procrastination in Reinforcement Learning", they could not handle a TV showing pure noise (snow), since the agent could not remember all that noise.


This approach will be thwarted if the time is shown in the observation space. If that's the case, then every kth state will give a novelty reward and every other state will give zero, no matter what the agent does.


I don't think that's a reasonable conclusion to draw from the fact that the TV wasn't pure noise. Why would the agent not be able to determine that every noisy frame was one step away from every other?


The Google paper uses memory. You can't remember a never-ending set of 2D random noise; there are limitless variations.


They don't remember all the previous scenes, just the ones that are "novel" enough. From the post:

> But how do we decide whether the agent is seeing the same thing as an existing memory? Checking for an exact match could be meaningless: in a realistic environment, the agent rarely sees exactly the same thing twice. For example, even if the agent returned to exactly the same room, it would still see this room under a different angle compared to its memories.

> Instead of checking for an exact match in memory, we use a deep neural network that is trained to measure how similar two experiences are.
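Roughly, that works out to something like this (a simplification of mine, not their code; assume `embed` and `comparator` are networks already trained as the post describes, with the comparator outputting the probability that two observations are reachable from each other within a few steps):

    class EpisodicNovelty:
        def __init__(self, embed, comparator, threshold=0.5):
            self.embed = embed            # obs -> feature vector
            self.comparator = comparator  # (feat_a, feat_b) -> reachability prob
            self.threshold = threshold
            self.memory = []

        def reward(self, obs):
            feat = self.embed(obs)
            for m in self.memory:
                if self.comparator(feat, m).item() > self.threshold:
                    return 0.0            # close to something remembered
            self.memory.append(feat.detach())
            return 1.0                    # novel: store it and reward it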


OK. Though it still depends on a limited number of TV shows.

If there is an unlimited number of shows and the agent walks right up to it, I think it's still trapped.



I'm becoming more and more convinced that reinforcement learning is equivalent to AGI; we just haven't finished optimizing it yet.


"AGI" is not well defined however RL is more like Freudian perspective on human mind which has long fell out of favor in psychology. For example, one of the opular classical view assumed that all of our behaviors are driven by ultimate quest of survival and reproduction. But then how do you explain suicides or soldiers going in certain death combats? How do you explain people who never want to have child? How do you explain artist giving up big money Wall Street job for simple life of doing some obscure art? How do you explain people sitting on the beach to just do nothing?

Here is some interesting commentary on the topic:

Is Global Reinforcement Learning (RL) a Fantasy?

https://www.lesswrong.com/posts/QEgGp7gaogAQjqgRi/is-global-...


We're optimized for the survival of our genes, not ourselves. Sometimes that means sacrificing ourselves, or not having children, etc.


Unsolved bugs


David Silver has said that explicitly.

But "RL" is currently used to describe both problems (MDPs etc), and solutions (RL algos).

RL problems are defined very broadly and have so many subtypes (episodic/non, with/without goals, continuous/discrete, ...) that you can frame most things as an RL problem. And "AGI" would by definition be able to solve any RL problem.

So imo to some extent this is tautological.
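To make the "most things can be framed as an RL problem" point concrete, here's the bare interface (a minimal Gym-style sketch of my own; the names follow common convention, not any particular library): anything you can reduce to this loop of observations, actions, and scalar rewards is, by definition, an RL problem.

    class Env:
        """Anything implementing this interface is an RL problem."""
        def reset(self):         # -> initial observation
            raise NotImplementedError
        def step(self, action):  # -> (observation, reward, done)
            raise NotImplementedError

    def run_episode(env, policy):
        # the loop every RL problem shares, whatever the domain
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        return total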


I love how this post gets like 20 points while the one about decensoring hentai yesterday got like 200.


Happy to see more work based on prediction. I've been of the opinion that predictive rewards should largely bypass the hand-tuned rewards we have been using for reinforcement learning so far, or at least speed up learning by providing a much richer signal to use for training.


> We also noticed significant improvements in performance of RND every time we discovered and fixed a bug [...]. Getting such details right was a significant part of achieving high performance even with algorithms conceptually similar to prior work.

Details.


This is a really interesting method that mimics how humans respond to boredom, and it achieves fantastic results in RL.

However, it's pretty disappointing that OpenAI, a non-profit organization that claims to want to distribute AI as evenly as possible, does not release the source code of its research to make its findings reproducible by others. This paper is an example, but so is other high-profile work, such as their Dota bot.

Edit: My mistake, this paper's code was open-sourced; see the comment below.


(I work at OpenAI.)

The code is linked from the top of the blog post: https://github.com/openai/random-network-distillation.

While we do produce lots of open-source code, our mandate is much broader than that: https://blog.openai.com/openai-charter/. We are attempting to build safe AGI and distribute its benefits.

In the short term, that means we focus on building working systems and will sometimes, but not always, release source code.

In the long term, per the charter: "we expect that safety and security concerns will reduce our traditional publishing in the future, while increasing the importance of sharing safety, policy, and standards research."



Note also there are many other repositories available in https://github.com/openai from various papers & projects, such as https://github.com/openai/baselines.





