While studying for an RL course, I created a reference for several algorithms with a brief description of the limitation each one addresses. Example:
Problem: SARSA pushes Q-values towards the values of the policy it is currently following (exploratory actions included), but ideally we'd want the optimal values.
Solution: Use the best next action in the TD-target calculation -> Q-learning
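To make the difference concrete, here is a minimal sketch of the two TD targets side by side. The function names, the nested-list Q-table, and the numbers are my own assumptions for illustration, not taken from the course material, and terminal states are ignored for brevity.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma=0.9):
    # On-policy: bootstrap from the action the current policy actually takes next.
    return reward + gamma * Q[next_state][next_action]


def q_learning_target(Q, reward, next_state, gamma=0.9):
    # Off-policy: bootstrap from the best (greedy) action in the next state.
    return reward + gamma * max(Q[next_state])


def td_update(Q, state, action, target, alpha=0.1):
    # Shared update rule: move Q(s, a) a step of size alpha towards the target.
    Q[state][action] += alpha * (target - Q[state][action])


# Hypothetical Q-table: 2 states x 2 actions, indexed as Q[state][action].
Q = [[0.0, 0.0], [1.0, 5.0]]

# Suppose the (exploring) policy picks action 0 in the next state 1.
sarsa_t = sarsa_target(Q, reward=1.0, next_state=1, next_action=0)
q_t = q_learning_target(Q, reward=1.0, next_state=1)

# Apply the Q-learning target to the entry for state 0, action 0.
td_update(Q, state=0, action=0, target=q_t)
```

With these numbers the SARSA target follows the exploratory action and is lower than the Q-learning target, which always uses the largest Q-value in the next state. Only the target differs; the update step is the same for both.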
Haha, cool, thank you! I had some notes ready but didn't get around to finishing it sooner. Besides, I'm sure the course slides were much better material for exam prep ;)
Perhaps someone else will find it helpful!