DeepMind's MuZero teaches itself how to win at Atari, chess, shogi, and Go (venturebeat.com)
63 points by jonbaer on Nov 22, 2019 | 18 comments




After all these iterations of Alpha-[blank], [blank]-Zero, and now MuZero, I'm wondering:

If I'm interested in building a toy version following the DeepMind spec, which can be trained to reach superhuman capabilities on a particular board game (Reversi, chess, checkers, possibly even Go given enough compute), which of these "versions" of the project would be the easiest for me to understand/implement? (Assume I have a basic understanding of the high-level concepts and lots of enthusiasm, but I'm not an expert.)

My understanding is that AlphaZero is not just stronger than AlphaGo but also architecturally simpler and more efficient. That's what I'm looking for -- the implementation with the highest result/difficulty ratio.


AlphaGo Master, unsurprisingly, was significantly stronger than AlphaGo Zero. AlphaZero, although it can play multiple games, was weaker still. In both cases, they compared the 40-block version of the one with the 20-block version of the other (they had to double the network size to approach the level of the predecessor).

Recently, KataGo has reached similar levels of strength using a small fraction of the resources: https://arxiv.org/abs/1902.10565

It depends on what you mean by "more efficient." The significance of AlphaZero was that you can reach good results in a variety of domains even without human expert knowledge to provide supervised learning data or engineer features. It's efficient in terms of engineering resources.

A precisely tailored approach can always get better results.


Has it been improved? AlphaZero previously overtook AlphaGo Master: https://en.wikipedia.org/wiki/AlphaGo_Zero#Comparison_with_p...


The 40-block version of AlphaGo Zero is stronger than the 20-block version of AlphaGo Master.


This is a bit outside my comfort zone, so I'm not sure I quite get what these blocks are. Has any version of AlphaGo Master bested AlphaGo Zero?


> which of these "versions" of the project would be the easiest for me to understand/implement?

I have the same question. Not sure I have an answer yet, but this paper includes some pseudocode that implements the algorithm: https://arxiv.org/src/1911.08265v1/anc/pseudocode.py

I'm planning on trying to train something simple like TicTacToe, both to see if it works and to understand how it works.
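
For what it's worth, here's the kind of minimal environment I have in mind. This is a rough sketch: the method names are my guess at the interface the pseudocode's Game class expects, so check pseudocode.py for the real one.

    # Hypothetical tic-tac-toe environment (all names illustrative).
    class TicTacToe:
        def __init__(self):
            self.board = [0] * 9   # 0 = empty, 1 = X, -1 = O
            self.to_play = 1       # X moves first

        def legal_actions(self):
            return [i for i in range(9) if self.board[i] == 0]

        def apply(self, action):
            self.board[action] = self.to_play
            self.to_play = -self.to_play

        def winner(self):
            lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
                     (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
            for a, b, c in lines:
                if self.board[a] != 0 and self.board[a] == self.board[b] == self.board[c]:
                    return self.board[a]  # 1 if X won, -1 if O won
            return 0                      # no winner (yet)

        def terminal(self):
            return self.winner() != 0 or not self.legal_actions()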


Pick a simple game so your search space is smaller and you won't need 10,000 GPUs to get anything done.


Do they have any sort of chart showing that the Zero variants are able to learn more games/state spaces with less domain-specific information and lower compute and memory requirements? For instance, if we are on an exponential tradeoff curve (which seems possible given the enormous number of GPUs), it's hard to see how this will scale to human-level intelligence.

These one-off experiments make it hard to know whether AI is truly progressing. Naively, since the leaves of the decision tree grow exponentially with depth, I'd assume we are facing an inherently unscalable problem: our current gains come from advances in hardware, but they are only linear gains for exponential hardware improvements. If Moore's law is giving out, then even with parallel computation we might end up turning the Earth into a giant GPU array before reaching parity with human intelligence.


I assume this has been tried, but what happens if you give MuZero a goal like "keep the system/process that spawns me running as long as possible"?


Why do you assume this has been tried? It's not even clear what the game is. In this setting, what state and actions would the algorithm have access to?


In some games it could find an equilibrium where it keeps the game going indefinitely by moving back and forth, for example (though that won't work in a game like Go [1]).

1: https://en.wikipedia.org/wiki/Rules_of_Go#Ko_and_Superko


Just released: a walkthrough of the MuZero pseudocode -- https://link.medium.com/KB3f4RAu51


It's unclear to me how MuZero was able to use less compute to achieve AlphaZero-level performance on Go.


From the preprint [1]:

> In Go, MuZero slightly exceeded the performance of AlphaZero, despite using less computation per node in the search tree (16 residual blocks per evaluation in MuZero compared to 20 blocks in AlphaZero). This suggests that MuZero may be caching its computation in the search tree and using each additional application of the dynamics model to gain a deeper understanding of the position.
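
Concretely, here's a rough sketch of what "computation per node" means, using the paper's h/g/f notation rather than the actual code: every node expansion is one pass through the learned dynamics network plus one through the prediction network, with no game rules consulted.

    # Sketch of how MuZero's three learned functions compose in search.
    # h, g, f follow the paper's notation; everything else is illustrative.

    def expand_root(observation, h, f):
        s0 = h(observation)    # representation: observation -> hidden state
        policy, value = f(s0)  # prediction: hidden state -> (p, v)
        return s0, policy, value

    def expand_child(parent_state, action, g, f):
        # dynamics: one "imagined" step entirely inside the learned model
        reward, s_next = g(parent_state, action)
        policy, value = f(s_next)
        return s_next, reward, policy, value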

It also strikes me as possible that just not giving the system the rules to start with might have allowed it to explore more efficient strategies.

[1] https://arxiv.org/pdf/1911.08265.pdf


"They trained the system for five hypothetical steps and a million mini-batches (i.e., small batches of training data) of size 2,048 in board games and size 1,024 in Atari, which amounted to 800 simulations per move for each search in Go, chess, and shogi and 50 simulations for each search in Atari"

Because of this, I'm presuming.
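
Spelled out as a config, the quoted settings look roughly like this (the key names are my own shorthand, not the paper's):

    # The quoted training settings as a sketch (key names are mine).
    config = {
        "num_unroll_steps": 5,          # "five hypothetical steps"
        "training_batches": 1_000_000,  # a million mini-batches
        "batch_size_board": 2048,       # Go, chess, shogi
        "batch_size_atari": 1024,
        "simulations_board": 800,       # MCTS simulations per move
        "simulations_atari": 50,
    }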


It can play a million times against itself in the virtual world every day.

But applying that in the real world takes years.


If we end up following this approach, it is clear that it will be combined with some sort of virtual-world building: a machine builds an approximate world from real-world data, runs the simulation inside that virtual world for eons, ends up with the best possible (but still inferior, since the virtual world is not real) action model, goes back to the real world, adjusts the model, and repeats.
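
As a sketch, that loop would look something like classic Dyna-style model-based RL (every name here is illustrative, not from any paper):

    def collect_episode(env, policy):
        """Roll out one episode in env (real or learned) and return it."""
        episode, obs, done = [], env.reset(), False
        while not done:
            action = policy.act(obs)
            obs, reward, done = env.step(action)
            episode.append((obs, action, reward))
        return episode

    def world_model_loop(real_env, model, policy, imagined_per_real=1000):
        while True:
            # 1. gather real experience and refine the approximate world model
            model.fit(collect_episode(real_env, policy))
            # 2. train for "eons" inside the cheap learned simulator
            for _ in range(imagined_per_real):
                policy.improve(collect_episode(model, policy))
            # 3. back to the real world: collect again, adjust the model, repeat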

It's even possible that our brains do the same thing, by the way. How many times do you run the scenario of a job interview in your head before you go? How many times does it run virtually in your subconscious? How many times does it happen in dreams? And more profoundly, how often are those scenarios in our heads wildly inaccurate and simplified, and yet they still help us act in the real world?



