- MCTS is executed after encountering each new state to select the agent’s action for that state; it is executed again to select the action for the next state, and so on. As in a rollout algorithm, each execution is an iterative process that simulates many trajectories starting from the current state and running to a terminal state (or until discounting makes any further reward negligible as a contribution to the return). The core idea of MCTS is to successively focus multiple simulations starting at the current state by extending the initial portions of trajectories that have received high evaluations from earlier simulations. MCTS does not have to retain approximate value functions or policies from one action selection to the next, though in many implementations it retains selected action values likely to be useful for its next execution.
- Monte Carlo value estimates are maintained only for the subset of state-action pairs that are most likely to be reached in a few steps, which form a tree rooted at the current state, as illustrated in Figure 8.10. MCTS incrementally extends the tree by adding nodes representing states that look promising based on the results of the simulated trajectories. Any simulated trajectory will pass through the tree and then exit it at some leaf node. Outside the tree and at the leaf nodes the rollout policy is used for action selections, but at the states inside the tree something better is possible. For these states we have value estimates for at least some of the actions, so we can pick among them using an informed policy, called the tree policy, that balances exploration and exploitation.
- Selection: Starting from the root node, repeatedly select the child node with the highest UCB value until a leaf node is reached.
- Expansion: If the selected leaf node has never been visited, proceed directly to simulation; otherwise, expand it by adding all possible child nodes and select one of them.
- Simulation: From the selected node, simulate a random playout using the rollout policy until a terminal state is reached.
- Backup: Back up the value of the terminal state through the nodes along the selected path, updating their visit counts and value estimates.
- Repeat the above steps until the time budget is exhausted or the specified number of iterations is reached (a minimal sketch follows below).
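A minimal Python sketch of this loop, assuming a hypothetical `env` interface (`legal_actions`, `step`, `is_terminal`, `terminal_value`) and a UCB1 tree policy. It is a single-perspective sketch; a two-player implementation would flip the sign of the backed-up value at alternating levels.

```python
import math
import random

class Node:
    """One state in the MCTS tree (hypothetical minimal interface)."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []        # expanded successor states
        self.visits = 0
        self.total_value = 0.0

    def ucb1(self, c=1.4):
        # Unvisited children are selected first (infinite UCB).
        if self.visits == 0:
            return float("inf")
        exploit = self.total_value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(root, env, n_iterations=1000):
    for _ in range(n_iterations):
        # 1. Selection: walk down the tree with the tree policy (UCB1).
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb1)

        # 2. Expansion: if the leaf has been visited before, add its children.
        if node.visits > 0 and not env.is_terminal(node.state):
            node.children = [Node(env.step(node.state, a), parent=node)
                             for a in env.legal_actions(node.state)]
            node = random.choice(node.children)

        # 3. Simulation: random rollout from the selected node to a terminal state.
        state = node.state
        while not env.is_terminal(state):
            state = env.step(state, random.choice(env.legal_actions(state)))
        value = env.terminal_value(state)

        # 4. Backup: propagate the simulated return up the selected path.
        while node is not None:
            node.visits += 1
            node.total_value += value
            node = node.parent

    # Act from the root, e.g. greedily with respect to visit counts.
    return max(root.children, key=lambda n: n.visits)
```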
- All games of perfect information have an optimal value function, $v^*(s)$, which determines the outcome of the game, from every board position or state $s$, under perfect play by all players.
- These games may be solved by recursively computing the optimal value function in a search tree containing approximately $b^d$ possible sequences of moves, where $b$ is the game's breadth (number of legal moves per position) and $d$ is its depth (game length).
- Exhaustive search is infeasible, but two general principles can reduce the effective search space:
- First, the depth of the search may be reduced by position evaluation: truncating the search tree at state $s$ and replacing the subtree below $s$ by an approximate value function $v(s) \approx v^*(s)$ that predicts the outcome from state $s$.
- Second, the breadth of the search may be reduced by sampling actions from a policy $p(a|s)$ that is a probability distribution over possible moves $a$ in position $s$.
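A toy Python illustration of both ideas (the `game` interface, evaluator `v(s)`, and move prior `p(s, a)` are hypothetical stand-ins for the networks discussed later): truncate the search at a fixed depth and score the frontier with $v(s)$, and expand only the top-$k$ moves under the policy.

```python
def reduced_search(state, game, v, p, depth=4, k=5):
    """Depth-limited search with value truncation and policy-based move pruning."""
    if game.is_terminal(state):
        return game.terminal_value(state)
    if depth == 0:
        return v(state)                              # depth reduction: v(s) ~ v*(s)

    # Breadth reduction: expand only the k moves most probable under p(a|s).
    moves = sorted(game.legal_actions(state),
                   key=lambda a: p(state, a), reverse=True)[:k]

    # Negamax convention: a position's value is the best reply, negated for the opponent.
    return max(-reduced_search(game.step(state, a), game, v, p, depth - 1, k)
               for a in moves)
```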
- Monte Carlo tree search (MCTS) uses Monte Carlo rollouts to estimate the value of each state in a search tree.
- As more simulations are executed, the search tree grows larger and the relevant values become more accurate.
- The policy used to select actions during search is also improved over time, by selecting children with higher values.
- We pass in the board position as a $19 \times 19$ image and use convolutional layers to construct a representation of the position.
- We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.
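A minimal PyTorch sketch of the policy-network side of this idea (layer sizes, plane count, and depth are illustrative, not the paper's architecture); the value network would use the same convolutional trunk with a scalar output head instead of a distribution over board points.

```python
import torch
import torch.nn as nn

BOARD = 19

class PolicyNet(nn.Module):
    """Convolutional policy network over the 19x19 board (illustrative sizes)."""
    def __init__(self, in_planes=17, channels=64, n_conv=4):
        super().__init__()
        layers = [nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU()]
        for _ in range(n_conv - 1):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        # 1x1 convolution down to one plane, then a distribution over board points.
        layers += [nn.Conv2d(channels, 1, 1), nn.Flatten()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, in_planes, 19, 19)
        logits = self.net(x)                 # (batch, 361)
        return torch.log_softmax(logits, dim=1)
```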
- We begin by training a supervised learning (SL) policy network $p_\sigma$ directly from expert human moves.
- We also train a fast policy $p_\pi$ that can rapidly sample actions during rollouts.
- Next, we train a reinforcement learning (RL) policy network $p_\rho$ that improves the SL policy network by optimizing the final outcome of games of self-play.
- This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy.
- Finally, we train a value network $v_\theta$ that predicts the winner of games played by the RL policy network against itself.
- For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning.
- The SL policy network $p_\sigma(a|s)$ alternates between convolutional layers with weights $\sigma$, and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves $a$.
- The policy network is trained on randomly sampled state-action pairs $(s, a)$, using stochastic gradient ascent to maximize the likelihood of the human move $a$ selected in state $s$.
- The network predicted expert moves on a held-out test set with an accuracy of 57.0% using all input features, and 55.7% using only raw board position and move history as inputs, compared to the state-of-the-art from other research groups of 44.4%.
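A hedged sketch of that training step (batch size and learning rate are placeholders; `net` is assumed to return log-probabilities over moves, e.g. the `PolicyNet` sketch above). Maximizing the log-likelihood of the human move is the same as minimizing the negative log-likelihood.

```python
import torch.nn.functional as F

def sl_update(net, opt, states, human_moves):
    """One stochastic-gradient step on randomly sampled (s, a) pairs.

    states: (batch, planes, 19, 19) tensor; human_moves: (batch,) expert move indices.
    """
    log_probs = net(states)                       # log p_sigma(a|s)
    loss = F.nll_loss(log_probs, human_moves)     # minimizing NLL == maximizing likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```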
- We also trained a faster but less accurate rollout policy $p_\pi(a|s)$, using a linear softmax of small pattern features (see Extended Data Table 4) with weights $\pi$; this achieved an accuracy of 24.2%, using just $2\,\mu s$ to select an action, rather than 3 ms for the policy network.
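A toy NumPy sketch of such a rollout policy (the pattern-feature extractor `phi` is a hypothetical stand-in for the small local features the paper uses): one linear score per legal move, a softmax, and a fast sample.

```python
import numpy as np

def rollout_action(state, legal_moves, phi, w, rng=None):
    """Sample a move from a linear-softmax rollout policy p_pi(a|s).

    phi(state, a): cheap local pattern features for a candidate move; w: weight vector.
    """
    rng = rng or np.random.default_rng()
    scores = np.array([w @ phi(state, a) for a in legal_moves])   # linear score per move
    scores -= scores.max()                                        # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return legal_moves[rng.choice(len(legal_moves), p=probs)]
```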
- The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL).
- We play games between the current policy network $p_\rho$ and a randomly selected previous iteration of the policy network.
- Randomizing from a pool of opponents in this way stabilizes training by preventing overfitting to the current policy.
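A schematic REINFORCE-style sketch of this stage (the `play_game` helper, the single-game update, and the outcome convention $z \in \{+1, -1\}$ are assumptions; the actual pipeline batches games and uses a value baseline): play against a randomly chosen past checkpoint, then scale each move's log-probability gradient by the final outcome.

```python
import copy
import random
import torch

def rl_update(net, opt, opponent_pool, play_game):
    """One policy-gradient update of p_rho from a single self-play game.

    play_game(current, opponent) is assumed to return the (state, move) pairs
    chosen by `current` and the final outcome z in {+1, -1} from its perspective.
    """
    opponent = random.choice(opponent_pool)       # random past iteration, not the latest net
    trajectory, z = play_game(net, opponent)

    loss = torch.zeros(())
    for state, move in trajectory:
        log_probs = net(state.unsqueeze(0))       # log p_rho(.|s_t), shape (1, 361)
        loss = loss - z * log_probs[0, move]      # ascend on z * log p_rho(a_t|s_t)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Periodically freeze a copy of the current network into the opponent pool.
    opponent_pool.append(copy.deepcopy(net))
```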

- We evaluated the performance of the RL policy network in game play, sampling each move $a_t \sim p_\rho(\cdot \mid s_t)$ from its output probability distribution over actions.
- We also assessed variants of AlphaGo that evaluated positions using just the value network ($\lambda = 0$) or just rollouts ($\lambda = 1$) (see Fig. 4b). Even without rollouts, AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go. However, the mixed evaluation ($\lambda = 0.5$) performed best, winning 95% of games against other variants.
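The leaf evaluation behind these $\lambda$ settings, as described in the AlphaGo paper, mixes the value network's prediction with the rollout outcome $z_L$:

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$

so $\lambda = 0$ relies only on the value network, $\lambda = 1$ only on rollouts, and $\lambda = 0.5$ weights the two equally.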
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play.
Here we introduce AlphaGo Zero, an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo Zero becomes its own teacher: a neural network is trained to predict AlphaGo Zero’s own move selections and the winner of its games. This neural network improves the strength of the tree search, resulting in highly accurate move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
The program plays a game $s_1, \ldots, s_T$ against itself.
The neural network takes the raw board position $s$ as input and outputs both move probabilities and a value, $(\mathbf{p}, v) = f_\theta(s)$.
The new parameters are used in the next iteration of self-play.
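Concretely, per the AlphaGo Zero paper, the new parameters are obtained by gradient descent on a loss over sampled self-play positions $(s, \boldsymbol{\pi}, z)$ that matches the network's move probabilities $\mathbf{p}$ to the search probabilities $\boldsymbol{\pi}$ and its value $v$ to the game winner $z$, with an $L_2$ regularization term:

$$(\mathbf{p}, v) = f_\theta(s), \qquad l = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \mathbf{p} + c\, \lVert \theta \rVert^2$$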
Our new method uses a deep neural network $f_\theta$ with parameters $\theta$.
This neural network combines the roles of both policy network and value network into a single architecture. The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities.
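A minimal PyTorch sketch of one such residual block (the channel count is illustrative): two batch-normalized convolutions with a skip connection, as described above. The full network stacks many of these blocks beneath the policy and value heads.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv -> BN -> ReLU -> Conv -> BN, plus a skip connection, then ReLU."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return self.relu(h + x)               # skip connection merges input and output
```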
The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position $s$, an MCTS search is executed, guided by the neural network $f_\theta$.
MCTS may be viewed as a self-play algorithm that, given neural network parameters $\theta$ and a root position $s$, computes a vector of search probabilities recommending moves to play, $\boldsymbol{\pi} = \alpha_\theta(s)$.
The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialized to random weights $\theta_0$.
Each simulation traverses the MCTS search tree by selecting moves that maximize an upper confidence bound $Q(s, a) + U(s, a)$, where $U(s, a) \propto P(s, a) / (1 + N(s, a))$, until a leaf node is encountered.
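The specific bound used in the AlphaGo papers is a PUCT-style rule of roughly the form

$$a = \underset{a'}{\arg\max}\,\Big( Q(s, a') + U(s, a') \Big), \qquad U(s, a) = c_{\text{puct}}\, P(s, a)\, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)},$$

where $P(s, a)$ is the network's prior probability for the move, $N(s, a)$ is the visit count, and $c_{\text{puct}}$ is a constant controlling the level of exploration.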
We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behavior and continued without human intervention for approximately three days.
Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s of thinking time per move. Network parameters were updated from mini-batches of positions sampled from these self-play games.
Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 h of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.








