The title is "click-baity" but I've used this repo in the past to verify parts of my code and it is highly recommended. The code is amazingly clean!
Once you've verified that the implementations are correct, it is easier to start the journey to reproduce SOTA on harder problems by playing around with the side-tricks that are often employed.
As far as I can tell it's only the Hacker News title that mentions PyTorch 4 (in fact the repo description says "PyTorch0.4 tutorial ..."), and I think it's far more likely that it was just a typo/misunderstanding than some sort of insidious marketing technique.
I did a great deal of reading on Q-learning around the time of the original AlphaGo; it looks like that was covered in a previous repo (RL-Adventures-1).
This new one doesn't seem to mention Q-learning. Is that because all these examples are implicitly based on Q-learning, or are these totally new alternatives?
In Deep Q Network (DQN), the Q network itself is the policy. You have one output for each possible action, and the neural network estimates the Q value of each action in the current state. You act by selecting the action with the highest Q output. This doesn't work if the action space is continuous, e.g. motor torques for a robot.
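A minimal PyTorch sketch of that setup (layer sizes and dimensions here are just placeholders, roughly CartPole-sized):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a Q network with one output per discrete action.
class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),  # one Q value per action
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)
q_values = q_net(state)           # shape: (1, num_actions)
action = q_values.argmax(dim=1)   # greedy policy: act on the highest Q value
```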
You might think you can fix this by making the action an input to the Q network and keeping only one output; then you could find the action with the highest output. But due to the nonlinearity in the neural network, this is an intractable nonconvex optimization problem.
So instead, you train a neural network to output the action given the state. The algorithms are harder to understand, because Q-learning is kind of like supervised learning but policy gradients really aren't. A lot of algorithms (A2C, DDPG, TRPO, etc.) still use a one-output Q network (as described in the previous paragraph), but it is just a part of the learning algorithm*. Once training is done, you throw away this Q network. The learned behavior is entirely contained in the policy network. These methods are usually called policy gradient methods (see the sketch after the footnote below).
This article covers policy gradient methods only.
* it's possible to do "pure" policy gradients using only the empirical return, but the Q network helps reduce the variance of the gradient estimate and stabilize the learning.
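To make the actor/critic split concrete, here is a rough DDPG-style sketch (network sizes, dimensions, and the tanh action bound are placeholder assumptions); the one-output Q critic only exists during training and is thrown away afterwards:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the actor/critic split for continuous actions.
class Actor(nn.Module):  # the policy: state -> action
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),  # bounded continuous action
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):  # one-output Q network: (state, action) -> scalar
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

actor = Actor(state_dim=3, action_dim=1)
critic = Critic(state_dim=3, action_dim=1)

state = torch.randn(1, 3)
action = actor(state)            # acting only ever needs the actor
q_value = critic(state, action)  # the critic is only used to train the actor
```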
Looking at the material, I think you may be underestimating what is meant by "zero" here. To fully follow these notes you would need significant research-level knowledge of general machine learning techniques, along with a good deal of specific experience working with neural networks (both in theory and in application).
I think someone with a general graduate-level background in most fields of ML or computer science could acquire a working knowledge of the SOTA in a particular associated sub-topic after reading through a half dozen or so research papers and code. That doesn't seem unreasonable or surprising.
He's also underestimating what it means to be a hero. Most of the experiments here are on CartPole, which is a very basic environment (not even MNIST-level complexity). Most papers in RL use Atari, which requires some amount of engineering that this repo does not have.
Come on man, the title is obviously clickbait; you can only begin to build an understanding of a field with a single tutorial, even if it's super in-depth.