The interesting result here is not the performance against human players, which DeepMind tries to oversell. The interesting bit is their approach to bridging the sim-2-real gap by iteratively training in simulation and fine-tuning in real-world games. The approach is described here:
i-Sim2Real: Reinforcement Learning of Robotic Policies in Tight Human-Robot Interaction Loops
https://sites.google.com/view/is2r
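To make the alternating structure concrete, here's a toy sketch in Python. To be clear, this is my own reconstruction from the paper's high-level description, not their code: the three "zones", the function names, and the frequency-counting stand-in for actual RL training are all made up. The point is only the shape of the loop: cheap sim training against a model of the human, expensive real play, then the real data refines the human model for the next round.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-in for the real human opponent: their true (unknown)
    # distribution over ball placements, discretised into 3 table zones.
    TRUE_HUMAN = np.array([0.6, 0.3, 0.1])

    def train_in_sim(human_model, n_rollouts=1000):
        # "Train" the robot policy in simulation against the current model
        # of the human; here the policy is just predicted zone frequencies.
        samples = rng.choice(3, size=n_rollouts, p=human_model)
        counts = np.bincount(samples, minlength=3)
        return counts / counts.sum()

    def collect_real_episodes(n_episodes=50):
        # Deploy against the real human and record where their shots land.
        return rng.choice(3, size=n_episodes, p=TRUE_HUMAN)

    def fit_human_model(model, real_shots, lr=0.5):
        # Pull the behaviour model toward the empirically observed shots,
        # shrinking the sim-2-real gap for the next iteration.
        empirical = np.bincount(real_shots, minlength=3) / len(real_shots)
        return (1 - lr) * model + lr * empirical

    human_model = np.ones(3) / 3  # start from a uniform guess
    for i in range(5):
        policy = train_in_sim(human_model)    # cheap: simulation
        shots = collect_real_episodes()       # expensive: real play
        human_model = fit_human_model(human_model, shots)
        print(i, np.round(human_model, 2))

Run it and you'll see the human model converge toward the true distribution over the iterations, which is the whole trick: each round of real play makes the next round of sim training less wrong.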
The sim-2-real gap is a real obstacle to adopting RL for robotics, and anything that pushes the envelope is worth the trouble. On the other hand, I can't tell how well this approach will work outside of table tennis.
Note that on top of the RL work there is, AFAICT, a metric shit ton of good, old-fashioned engineering supporting the decision-making ability of the learned policies. E.g. the "High Level Controller" (HLC) that selects "Low Level Controllers" (LLCs) using good, old-fashioned AI tools (tree search with heuristics) seems to me to be hand-crafted rather than learned, and it involves a load of expert knowledge driving information gathering. So far, no bitter lessons to taste here.
Oh and of course the HLC is a symbolic component while the LLCs are learned by a neural net. Once more DeepMind sneakily introduces a neuro-symbolic approach but keeps quiet about it. They've done that since the days of AlphaGo. No idea why they're so ashamed of it, since it really looks like it's working very well for them.
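To illustrate what I mean by that split, here's a minimal sketch. Caveat: the real HLC reportedly uses tree search with heuristics, which I'm flattening to plain if-then rules, and every controller name below is invented. The point is just the shape of the architecture: a symbolic dispatcher on top, learned policies (here, stubs) underneath.

    # Learned LLCs: each would be a neural-net skill policy in the real
    # system; stubbed out here with placeholder functions.
    def forehand_topspin(ball):
        return "forehand topspin at %.1f m/s" % ball["speed"]

    def backhand_drive(ball):
        return "backhand drive at %.1f m/s" % ball["speed"]

    def defensive_lob(ball):
        return "defensive lob"

    LLCS = {"fh": forehand_topspin, "bh": backhand_drive, "lob": defensive_lob}

    def hlc_select(ball):
        # Symbolic HLC: hand-written rules over the estimated ball state
        # decide which learned LLC produces the actual motor commands.
        if ball["speed"] > 8.0:
            return LLCS["lob"]      # fast incoming ball: play safe
        if ball["side"] == "right":
            return LLCS["fh"]
        return LLCS["bh"]

    ball = {"speed": 5.2, "side": "right"}
    print(hlc_select(ball)(ball))

Swap the stubs for trained networks and the if-then rules for heuristic-guided search and you have, as far as I can tell, the published architecture: symbolic on top, neural underneath.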