DeepMind's MuZero teaches itself how to win at Atari, chess, shogi, and Go (venturebeat.com)
63 points by jonbaer on Nov 22, 2019 | 18 comments




After all these iterations of Alpha-[blank], [blank]-Zero, and now MuZero, I'm wondering:

If I'm interested in building a toy version following the DeepMind spec, which can be trained to reach superhuman capabilities on a particular board game (Reversi, chess, checkers, possibly even Go given enough compute), which of these "versions" of the project would be the easiest for me to understand/implement? (Assume I have a basic understanding of the high-level concepts and lots of enthusiasm, but I'm not an expert.)

My understanding is that AlphaZero is not just stronger than AlphaGo but also architecturally simpler and more efficient. That's what I'm looking for -- the implementation with the highest result/difficulty ratio.


AlphaGo Master, unsurprisingly, was significantly stronger than AlphaGo Zero. AlphaZero, although it can play multiple games, was weaker still. In both cases, they compared the 40-block version of the one with the 20-block version of the other (they had to double the network size to approach the level of the predecessor).

Recently, KataGo has reached similar levels of strength using a small fraction of the resources: https://arxiv.org/abs/1902.10565

It depends on what you mean by "more efficient." The significance of AlphaZero was that you can reach good results in a variety of domains even without human expert knowledge to provide supervised learning data or engineer features. It's efficient in terms of engineering resources.

A precisely tailored approach can always get better results.


Has it been improved? AlphaZero previously overtook AlphaGo Master: https://en.wikipedia.org/wiki/AlphaGo_Zero#Comparison_with_p...


The 40-block version of AlphaGo Zero is stronger than the 20-block version of AlphaGo Master.


This is a bit outside my comfort zone, so I'm not sure I quite get what these blocks are. Has any version of AlphaGo Master bested AlphaGo Zero?


> which of these "versions" of the project would be the easiest for me to understand/implement?

I have the same question. Not sure I have an answer yet, but this paper includes some pseudocode that implements the algorithm: https://arxiv.org/src/1911.08265v1/anc/pseudocode.py

I'm planning on trying to train something simple like TicTacToe, both to see if it works and to understand how it works.
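
For what it's worth, here's the kind of minimal environment I have in mind. This is a rough sketch: the method names are my guess at the interface the pseudocode's Game class expects, so check pseudocode.py for the real one.

    # Hypothetical tic-tac-toe environment (all names illustrative).
    class TicTacToe:
        def __init__(self):
            self.board = [0] * 9   # 0 = empty, 1 = X, -1 = O
            self.to_play = 1       # X moves first

        def legal_actions(self):
            return [i for i in range(9) if self.board[i] == 0]

        def apply(self, action):
            self.board[action] = self.to_play
            self.to_play = -self.to_play

        def winner(self):
            lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
                     (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
            for a, b, c in lines:
                if self.board[a] != 0 and self.board[a] == self.board[b] == self.board[c]:
                    return self.board[a]  # 1 if X won, -1 if O won
            return 0                      # no winner (yet)

        def terminal(self):
            return self.winner() != 0 or not self.legal_actions()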


Pick a simple game so your search space is smaller and you won't need 10,000 GPUs to get anything done.


Do they have any sort of chart showing that the Zero variants are able to learn more games/state spaces with less domain-specific information and lower compute and memory requirements? For instance, if we are on an exponential tradeoff curve (which seems possible given the enormous number of GPUs), it's hard to see how this will scale to human-level intelligence.

These one-off experiments make it hard to know whether AI is truly progressing. Naively, since the leaves of the decision tree grow exponentially with depth, I'd assume we are facing an inherently unscalable problem: our current gains come from advances in hardware, but they are only linear gains for exponential hardware improvements. If Moore's law is giving out, then even with parallel computation we might end up turning the Earth into a giant GPU array before reaching parity with human intelligence.


I assume this has been tried, but what happens if you give MuZero a goal like "keep the system/process that spawns me running as long as possible"?


Why do you assume this has been tried? It's not even clear what the game is. In this setting, what state and actions would the algorithm have access to?


In some games it could find an equilibrium where it keeps the game going indefinitely by moving back and forth, for example (though that won't work in a game like Go [1]).

1: https://en.wikipedia.org/wiki/Rules_of_Go#Ko_and_Superko


Just released: a walkthrough of the MuZero pseudocode -- https://link.medium.com/KB3f4RAu51


It's unclear to me how MuZero was able to use less compute to achieve AlphaZero-level performance on Go.


From the preprint [1]:

> In Go, MuZero slightly exceeded the performance of AlphaZero, despite using less computation per node in the search tree (16 residual blocks per evaluation in MuZero compared to 20 blocks in AlphaZero). This suggests that MuZero may be caching its computation in the search tree and using each additional application of the dynamics model to gain a deeper understanding of the position.
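
Concretely, here's a rough sketch of what "computation per node" means, using the paper's h/g/f notation rather than the actual code: every node expansion is one pass through the learned dynamics network plus one through the prediction network, with no game rules consulted.

    # Sketch of how MuZero's three learned functions compose in search.
    # h, g, f follow the paper's notation; everything else is illustrative.

    def expand_root(observation, h, f):
        s0 = h(observation)    # representation: observation -> hidden state
        policy, value = f(s0)  # prediction: hidden state -> (p, v)
        return s0, policy, value

    def expand_child(parent_state, action, g, f):
        # dynamics: one "imagined" step entirely inside the learned model
        reward, s_next = g(parent_state, action)
        policy, value = f(s_next)
        return s_next, reward, policy, value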

It also strikes me as possible that just not giving the system the rules to start with might have allowed it to explore more efficient strategies.

[1] https://arxiv.org/pdf/1911.08265.pdf


"They trained the system for five hypothetical steps and a million mini-batches (i.e., small batches of training data) of size 2,048 in board games and size 1,024 in Atari, which amounted to 800 simulations per move for each search in Go, chess, and shogi and 50 simulations for each search in Atari"

Because of this, I'm presuming.
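
Spelled out as a config, the quoted settings look roughly like this (the key names are my own shorthand, not the paper's):

    # The quoted training settings as a sketch (key names are mine).
    config = {
        "num_unroll_steps": 5,          # "five hypothetical steps"
        "training_batches": 1_000_000,  # a million mini-batches
        "batch_size_board": 2048,       # Go, chess, shogi
        "batch_size_atari": 1024,
        "simulations_board": 800,       # MCTS simulations per move
        "simulations_atari": 50,
    }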


It can play a million times against itself in the virtual world every day.

But applying that in the real world takes years.


If we end up following this approach, it is clear that it will be combined with some sort of virtual-world building: a machine builds an approximate world from real-world data, runs the simulation inside that virtual world for eons, ends up with the best possible (but still inferior, since the virtual world is not real) action model, goes back to the real world, adjusts the model, and repeats.
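
As a sketch, that loop would look something like classic Dyna-style model-based RL (every name here is illustrative, not from any paper):

    def collect_episode(env, policy):
        """Roll out one episode in env (real or learned) and return it."""
        episode, obs, done = [], env.reset(), False
        while not done:
            action = policy.act(obs)
            obs, reward, done = env.step(action)
            episode.append((obs, action, reward))
        return episode

    def world_model_loop(real_env, model, policy, imagined_per_real=1000):
        while True:
            # 1. gather real experience and refine the approximate world model
            model.fit(collect_episode(real_env, policy))
            # 2. train for "eons" inside the cheap learned simulator
            for _ in range(imagined_per_real):
                policy.improve(collect_episode(model, policy))
            # 3. back to the real world: collect again, adjust the model, repeat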

It's even possible that our brains do the same thing, by the way. How many times do you run the scenario of a job interview in your head before you go? How many times does it run virtually in your subconscious? How many times does it happen in dreams? And more profoundly, how often are those scenarios in our heads wildly inaccurate and simplified, and yet they still help us act in the real world?



