The title is "click-baity" but I've used this repo in the past to verify parts of my code and it is highly recommended. The code is amazingly clean!
Once you've verified that the implementations are correct, it is easier to start the journey to reproduce SOTA on harder problems by playing around with the side-tricks that are often employed.
As far as I can tell it's only the Hacker News title that mentions PyTorch 4 (in fact the repo description says "PyTorch0.4 tutorial ..."), and I think it's far more likely that it was just a typo/misunderstanding than some sort of insidious marketing technique.
I did a great deal of reading on Q-learning around the time of the original AlphaGo; it looks like that was covered in a previous repo (RL-Adventures-1).
This new one doesn't seem to mention Q-learning. Is that because all these examples are implicitly based on Q-learning, or are these totally new alternatives?
In Deep Q Network (DQN), the Q network itself is the policy. You have one output for each possible action, and the neural network estimates the Q value of each action in the current state. You act by selecting the action with the highest Q output. This doesn't work if the action space is continuous, e.g. motor torques for a robot.
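A minimal PyTorch sketch of that setup (layer sizes and dimensions here are just placeholders, roughly CartPole-sized):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a Q network with one output per discrete action.
class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),  # one Q value per action
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)
q_values = q_net(state)           # shape: (1, num_actions)
action = q_values.argmax(dim=1)   # greedy policy: act on the highest Q value
```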
You might think you can fix this by making the action an input to the Q network and keeping only one output; then you could find the action with the highest output. But due to the nonlinearity in the neural network, this is an intractable nonconvex optimization problem.
So instead, you train a neural network to output the action given the state. The algorithms are harder to understand, because Q-learning is kind of like supervised learning but policy gradients really aren't. A lot of algorithms (A2C, DDPG, TRPO, etc.) still use a one-output Q network (as described in the previous paragraph), but it is just a part of the learning algorithm*. Once training is done, you throw away this Q network. The learned behavior is entirely contained in the policy network. These methods are usually called policy gradient methods (see the sketch after the footnote below).
This article covers policy gradient methods only.
* it's possible to do "pure" policy gradients using only the empirical return, but the Q network helps reduce the variance of the gradient estimate and stabilize the learning.
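To make the actor/critic split concrete, here is a rough DDPG-style sketch (network sizes, dimensions, and the tanh action bound are placeholder assumptions); the one-output Q critic only exists during training and is thrown away afterwards:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the actor/critic split for continuous actions.
class Actor(nn.Module):  # the policy: state -> action
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),  # bounded continuous action
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):  # one-output Q network: (state, action) -> scalar
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

actor = Actor(state_dim=3, action_dim=1)
critic = Critic(state_dim=3, action_dim=1)

state = torch.randn(1, 3)
action = actor(state)            # acting only ever needs the actor
q_value = critic(state, action)  # the critic is only used to train the actor
```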
Looking at the material, I think you may be underestimating what is meant by "zero" here. To fully follow these notes you would need significant research-level knowledge of general machine learning techniques, along with a good deal of specific experience working with neural networks (both in theory and in application).
I think someone with a general graduate-level background in most fields of ML or computer science could acquire a working knowledge of the SOTA in a particular associated sub-topic after reading through a half dozen or so research papers and code. That doesn't seem unreasonable or surprising.
He's also underestimating what it means to be a hero. Most of the experiments here are on CartPole, which is a very basic environment (not even MNIST-level complexity). Most papers in RL use Atari, which requires some amount of engineering that this repo does not have.
Come on man, the title is obviously clickbait; you can only begin to build an understanding of a field with a single tutorial, even if it's super in-depth.