# Mastering the game of Go without human knowledge

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.

Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts. However, expert data sets are often expensive, unreliable or simply unavailable. Even when reliable data sets are available, they may impose a ceiling on the performance of systems trained in this manner. By contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games, such as Atari and 3D virtual environments. However, the most challenging domains in terms of human intellect—such as the game of Go, widely viewed as a grand challenge for artificial intelligence—require a precise and sophisticated lookahead in vast search spaces. Fully general methods have not previously achieved human-level performance in these domains.

AlphaGo was the first program to achieve superhuman performance in Go. The published version, which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan used two deep neural networks: a policy network that outputs move probabilities and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte Carlo tree search (MCTS) to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte Carlo rollouts using a fast rollout policy) to evaluate positions in the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach (see Methods), and defeated Lee Sedol, the winner of 18 international titles, in March 2016.

Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it uses only the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.

## Reinforcement learning in AlphaGo Zero

Our new method uses a deep neural network $f_\theta$ with parameters $\theta$. This neural network takes as an input the raw board representation $s$ of the position and its history, and outputs both move probabilities and a value, $(p, v) = f_\theta(s)$. The vector of move probabilities $p$ represents the probability of selecting each move $a$ (including pass), $p_a = \Pr(a \mid s)$. The value $v$ is a scalar evaluation, estimating the probability of the current player winning from position $s$. This neural network combines the roles of both policy network and value network into a single architecture. The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities (see Methods).
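
To make the interface concrete, the sketch below shows the input/output contract of $f_\theta$ as a plain Python stub. The board size, the move count including pass, the uniform outputs and the input encoding in the example call are placeholders for illustration; the real model is a deep residual convolutional network whose exact input planes are specified in Methods.

```python
import numpy as np

BOARD_SIZE = 19                              # standard Go board
NUM_MOVES = BOARD_SIZE * BOARD_SIZE + 1      # one entry per intersection, plus pass


def f_theta(s: np.ndarray) -> tuple[np.ndarray, float]:
    """Placeholder for the dual-headed network (p, v) = f_theta(s).

    `s` is the raw board representation: stacked binary feature planes
    encoding the current position and its recent history.
    """
    p = np.full(NUM_MOVES, 1.0 / NUM_MOVES)  # move probabilities, sum to 1
    v = 0.0                                  # scalar evaluation for the player to move
    return p, v


# Example call with an empty board of 17 binary feature planes (an assumption
# about the exact input encoding; the true feature planes are given in Methods).
p, v = f_theta(np.zeros((17, BOARD_SIZE, BOARD_SIZE)))
```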

The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position $s$, an MCTS search is executed, guided by the neural network $f_\theta$. The MCTS search outputs probabilities $\pi$ of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities $p$ of the neural network $f_\theta(s)$; MCTS may therefore be viewed as a powerful policy improvement operator. Self-play with search—using the improved MCTS-based policy to select each move, then using the game winner $z$ as a sample of the value—may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the neural network's parameters are updated to make the move probabilities and value $(p, v) = f_\theta(s)$ more closely match the improved search probabilities and self-play winner $(\pi, z)$; these new parameters are used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.
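
As a rough schematic of this policy iteration, the snippet below plays toy games in which every component (`run_mcts`, `update_parameters`, the three-move "game" itself) is an illustrative stub rather than the actual AlphaGo Zero implementation; the point is only the loop structure: search produces improved probabilities $\pi$, the game outcome $z$ labels every position, and the network is then fitted to $(\pi, z)$.

```python
import random


def run_mcts(state, theta):
    """Stub for the search: uniform probabilities over 3 dummy moves."""
    return [1.0 / 3] * 3


def update_parameters(theta, examples):
    """Stub for the gradient step that fits (p, v) to the (pi, z) targets."""
    return theta


def self_play_iteration(theta, num_games=10, game_length=6):
    examples = []
    for _ in range(num_games):
        state, trajectory = (), []
        for t in range(game_length):
            pi = run_mcts(state, theta)                    # policy improvement
            move = random.choices(range(3), weights=pi)[0]
            trajectory.append((state, pi, t))
            state = state + (move,)
        z = random.choice([+1, -1])                        # policy evaluation: game outcome
        # label each position with the outcome from the point of view of the player to move
        examples += [(s, pi, z if t % 2 == 0 else -z) for s, pi, t in trajectory]
    return update_parameters(theta, examples)


theta = self_play_iteration(theta=None)
```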

The MCTS uses the neural network $f_\theta$ to guide its simulations (see Fig. 2). Each edge $(s, a)$ in the search tree stores a prior probability $P(s, a)$, a visit count $N(s, a)$, and an action value $Q(s, a)$. Each simulation starts from the root state and iteratively selects moves that maximize an upper confidence bound $Q(s, a) + U(s, a)$, where $U(s, a) \propto P(s, a) / (1 + N(s, a))$, until a leaf node $s'$ is encountered. This leaf position is expanded and evaluated only once by the network to generate both prior probabilities and evaluation, $(P(s', \cdot), V(s')) = f_\theta(s')$. Each edge $(s, a)$ traversed in the simulation is updated to increment its visit count $N(s, a)$, and to update its action value to the mean evaluation over these simulations, $Q(s, a) = \frac{1}{N(s, a)} \sum_{s' \mid s, a \to s'} V(s')$, where $s, a \to s'$ indicates that a simulation eventually reached $s'$ after taking move $a$ from position $s$.
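
A compact, self-contained sketch of this search loop is given below, written against a throwaway toy game and a dummy evaluator rather than Go or the real network. The exploration constant `C_PUCT`, the helper names and the toy game are assumptions made for illustration, but the select / expand / backup structure mirrors the description above.

```python
import math

C_PUCT = 1.0  # exploration constant: the proportionality factor in U(s, a)


class Node:
    def __init__(self, prior):
        self.P = prior          # prior probability of the edge leading to this node
        self.N = 0              # visit count
        self.W = 0.0            # total value accumulated through this edge
        self.children = {}      # move -> Node

    @property
    def Q(self):                # mean evaluation over simulations through this edge
        return self.W / self.N if self.N else 0.0


def simulate(node, state, next_state, evaluate):
    """Run one simulation from `node`; return the leaf value from the
    perspective of the player to move at `node`. (A full implementation
    would also stop at terminal positions and use the game result.)"""
    if not node.children:                       # leaf: expand and evaluate once
        priors, value = evaluate(state)
        for move, p in priors.items():
            node.children[move] = Node(p)
        return value

    # select the child edge maximizing the upper confidence bound Q + U
    sqrt_total = math.sqrt(sum(child.N for child in node.children.values()) + 1)
    move, child = max(
        node.children.items(),
        key=lambda mc: mc[1].Q + C_PUCT * mc[1].P * sqrt_total / (1 + mc[1].N),
    )
    # the child's value is from the opponent's perspective, hence the sign flip
    value = -simulate(child, next_state(state, move), next_state, evaluate)

    child.N += 1                                # backup: update edge statistics
    child.W += value
    return value


# --- toy usage: states are tuples of moves played, with a dummy evaluator ---
def toy_next_state(state, move):
    return state + (move,)


def toy_evaluate(state):
    priors = {m: 1.0 / 3 for m in range(3)}     # uniform priors over 3 moves
    return priors, 0.0                          # neutral value estimate


root = Node(prior=1.0)
for _ in range(100):
    simulate(root, (), toy_next_state, toy_evaluate)
print({m: c.N for m, c in root.children.items()})   # visit counts at the root
```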

MCTS may be viewed as a self-play algorithm that, given neural network parameters $\theta$ and a root position $s$, computes a vector of search probabilities recommending moves to play, $\pi = \alpha_\theta(s)$, proportional to the exponentiated visit count for each move, $\pi_a \propto N(s, a)^{1/\tau}$, where $\tau$ is a temperature parameter.
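
For example, converting root visit counts into search probabilities with a temperature parameter can be done as follows (a small helper written for illustration; the visit counts in the example are arbitrary).

```python
def search_probabilities(visit_counts, tau=1.0):
    """pi_a proportional to N(s, a)^(1/tau); small tau sharpens towards the most-visited move."""
    weights = {a: n ** (1.0 / tau) for a, n in visit_counts.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


counts = {"move A": 620, "move B": 310, "move C": 70}
print(search_probabilities(counts, tau=1.0))   # roughly proportional to the counts
print(search_probabilities(counts, tau=0.1))   # nearly deterministic: mass on "move A"
```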

The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialized to random weights $\theta_0$. At each subsequent iteration $i \geq 1$, games of self-play are generated (Fig. 1a). At each time-step $t$, an MCTS search $\pi_t = \alpha_{\theta_{i-1}}(s_t)$ is executed using the previous iteration of neural network $f_{\theta_{i-1}}$ and a move is played by sampling the search probabilities $\pi_t$. A game terminates at step $T$ when both players pass, when the search value drops below a resignation threshold or when the game exceeds a maximum length; the game is then scored to give a final reward of $r_T \in \{-1, +1\}$ (see Methods for details). The data for each time-step $t$ is stored as $(s_t, \pi_t, z_t)$, where $z_t = \pm r_T$ is the game winner from the perspective of the current player at step $t$. In parallel (Fig. 1b), new network parameters $\theta_i$ are trained from data $(s, \pi, z)$ sampled uniformly among all time-steps of the last iteration(s) of self-play. The neural network $(p, v) = f_{\theta_i}(s)$ is adjusted to minimize the error between the predicted value $v$ and the self-play winner $z$, and to maximize the similarity of the neural network move probabilities $p$ to the search probabilities $\pi$. Specifically, the parameters $\theta$ are adjusted by gradient descent on a loss function $l$ that sums over the mean-squared error and cross-entropy losses, respectively:

$(p, v) = f_\theta(s) \quad \text{and} \quad l = (z - v)^2 - \pi^{\mathrm{T}} \log p + c \lVert\theta\rVert^2 \qquad (1)$

where c is a parameter controlling the level of L2 weight regularization (to prevent overfitting).
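
A direct numpy transcription of the loss in equation (1), evaluated on a single example, might look like the following; the numbers and the default value of `c` are arbitrary placeholders, and in training this quantity would be differentiated with respect to $\theta$ and averaged over a mini-batch.

```python
import numpy as np


def loss(z, v, pi, p, theta, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2 for one (s, pi, z) example."""
    value_error = (z - v) ** 2                 # mean-squared error term
    policy_error = -np.dot(pi, np.log(p))      # cross-entropy between search and network probabilities
    regularization = c * np.sum(theta ** 2)    # L2 weight regularization
    return value_error + policy_error + regularization


pi = np.array([0.7, 0.2, 0.1])        # MCTS search probabilities (targets)
p = np.array([0.5, 0.3, 0.2])         # network move probabilities
theta = np.array([0.1, -0.2, 0.05])   # stand-in for the network weights
print(loss(z=+1.0, v=0.6, pi=pi, p=p, theta=theta))
```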

## Empirical analysis of AlphaGo Zero training

We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behaviour and continued without human intervention for approximately three days.

Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks (see Methods for further details).
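
Collected into one place, the self-play and optimization settings stated above amount to the following summary (a descriptive restatement only, with key names chosen for readability; settings not listed here are given in Methods).

```python
# Training configuration for the 3-day AlphaGo Zero run, as reported above.
ALPHAGO_ZERO_3DAY_RUN = {
    "self_play_games": 4_900_000,        # games generated during training
    "mcts_simulations_per_move": 1_600,  # ~0.4 s of thinking time per move
    "mini_batches": 700_000,
    "mini_batch_size": 2_048,            # positions per mini-batch
    "residual_blocks": 20,
    "training_duration_days": 3,         # approximate, starting from random play
}
```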

Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning, as a function of training time, on an Elo scale. Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting that have been suggested in previous literature. Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 h. In comparison, AlphaGo Lee was trained over several months. After 72 h, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2 h time controls and match conditions that were used in the man–machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 tensor processing units (TPUs), whereas AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Fig. 1 and Supplementary Information).

To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS Server dataset; this achieved state-of-the-art prediction accuracy compared to previous work (see Extended Data Tables 1 and 2 for current and previous results, respectively). Supervised learning achieved a better initial performance, and was better at predicting human professional moves (Fig. 3). Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 h of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.

To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the previous neural network architecture used in AlphaGo Lee (see Fig. 4). Four neural networks were created, using either separate policy and value networks, as were used in AlphaGo Lee, or combined policy and value networks, as used in AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the residual network architecture from AlphaGo Zero. Each network was trained to minimize the same loss function (equation (1)), using a fixed dataset of self-play games generated by AlphaGo Zero after 72 h of self-play training. Using a residual network was more accurate, achieved lower error and improved performance in AlphaGo by over 600 Elo. Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo. This is partly due to improved computational efficiency, but more importantly the dual objective regularizes the network to a common representation that supports multiple use cases.

## Knowledge learned by AlphaGo Zero

AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included not only fundamental elements of human Go knowledge, but also non-standard strategies beyond the scope of traditional Go knowledge.

Figure 5 shows a timeline indicating when professional joseki (corner sequences) were discovered (Fig. 5a and Extended Data Fig. 2); ultimately AlphaGo Zero preferred new joseki variants that were previously unknown (Fig. 5b and Extended Data Fig. 3). Figure 5c shows several fast self-play games played at different stages of training (see Supplementary Information). Tournament length games played at regular intervals throughout training are shown in Extended Data Fig. 4 and in the Supplementary Information. AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts, including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho (‘ladder’ capture sequences that may span the whole board)—one of the first elements of Go knowledge learned by humans—were only understood by AlphaGo Zero much later in training.

## Final performance of AlphaGo Zero

We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days.

Over the course of training, 29 million games of self-play were generated. Parameters were updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained 40 residual blocks. The learning curve is shown in Fig. 6a. Games played at regular intervals throughout training are shown in Extended Data Fig. 5 and in the Supplementary Information.

We evaluated the fully trained AlphaGo Zero using an internal tournament against AlphaGo Fan, AlphaGo Lee and several previous Go programs. We also played games against the strongest existing program, AlphaGo Master—a program based on the algorithm and architecture presented in this paper but using human data and features (see Methods)—which defeated the strongest human professional players 60–0 in online games in January 2017. In our evaluation, all programs were allowed 5 s of thinking time per move; AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs; AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUs and 48 TPUs, respectively. We also included a player based solely on the raw neural network of AlphaGo Zero; this player simply selected the move with maximum probability.

Figure 6b shows the performance of each program on an Elo scale. The raw neural network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan.
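
To put these ratings in perspective, under the conventional logistic Elo model the expected score of a player rated $d$ points above an opponent is $1 / (1 + 10^{-d/400})$; the snippet below applies this standard conversion to the gaps reported here (this is only the textbook Elo formula, not a measurement from the paper, and the paper's own rating calibration is described in Methods).

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the conventional Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


ratings = {
    "AlphaGo Zero": 5185,
    "AlphaGo Master": 4858,
    "AlphaGo Lee": 3739,
    "AlphaGo Fan": 3144,
    "raw network (no search)": 3055,
}
for name, r in ratings.items():
    if name != "AlphaGo Zero":
        print(f"AlphaGo Zero vs {name}: expected score "
              f"{expected_score(ratings['AlphaGo Zero'], r):.3f}")
```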

Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100-game match with 2-h time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Fig. 6 and Supplementary Information).

## Conclusion

Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules. Furthermore, a pure reinforcement learning approach requires just a few more hours to train, and achieves much better asymptotic performance, compared to training on human expert data. Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained from human data using handcrafted features, by a large margin.

Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.
