Mastering the game of Go with deep neural networks and tree search

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.


All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game, from every board position or state s, under perfect play by all players. These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game’s breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b ≈ 35, d ≈ 80) and especially Go (b ≈ 250, d ≈ 150) [1], exhaustive search is infeasible [2,3], but the effective search space can be reduced by two general principles. First, the depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s) ≈ v*(s) that predicts the outcome from state s. This approach has led to superhuman performance in chess [4], checkers [5] and othello [6], but it was believed to be intractable in Go due to the complexity of the game [7]. Second, the breadth of the search may be reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s. For example, Monte Carlo rollouts [8] search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts can provide an effective position evaluation, achieving superhuman performance in backgammon [8] and Scrabble [9], and weak amateur level play in Go [10].

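As a concrete illustration of the second principle, here is a minimal sketch, in Python, of position evaluation by averaging Monte Carlo rollouts. The game interface (legal_moves, play, is_terminal, winner, player_to_move) and the policy callable are hypothetical stand-ins for illustration, not part of the paper.

```python
import random

def rollout_value(state, policy, n_rollouts=100):
    """Estimate the value of `state` by averaging the outcomes of random
    playouts sampled from `policy` for both players.

    `state` is assumed to expose hypothetical legal_moves(), play(move)
    (returning a new state), is_terminal(), winner() and player_to_move
    members; none of these names come from the AlphaGo paper.
    """
    total = 0.0
    for _ in range(n_rollouts):
        s = state
        while not s.is_terminal():
            moves = s.legal_moves()
            # Sample a move from the policy's distribution p(a|s).
            probs = [policy(s, a) for a in moves]
            a = random.choices(moves, weights=probs, k=1)[0]
            s = s.play(a)
        # +1 if the player to move at the root wins, -1 otherwise.
        total += 1.0 if s.winner() == state.player_to_move else -1.0
    return total / n_rollouts
```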

Monte Carlo tree search (MCTS) [11,12] uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function [12]. The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves [13]. These policies are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play [13–15]. However, prior work has been limited to shallow policies [13–15] or value functions [16] based on a linear combination of input features.


Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example, image classification [17], face recognition [18], and playing Atari games [19]. They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localized representations of an image [20]. We employ a similar architecture for the game of Go. We pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.

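For illustration, a minimal sketch of how a board position can be encoded as an image-like stack of binary feature planes before being passed to the convolutional layers. The three-plane encoding below is an assumption made for brevity; the networks described in the paper consume a richer set of input planes (Extended Data Table 2).

```python
import numpy as np

def encode_board(board, player):
    """Encode a Go position as a stack of 19x19 binary feature planes.

    `board` is a 19x19 array of 0 (empty), 1 (black), 2 (white); `player`
    is 1 or 2. This three-plane encoding (own stones, opponent stones,
    empty points) is a simplification of the paper's input representation.
    """
    board = np.asarray(board)
    opponent = 3 - player
    planes = np.stack([
        (board == player).astype(np.float32),    # own stones
        (board == opponent).astype(np.float32),  # opponent stones
        (board == 0).astype(np.float32),         # empty points
    ])
    return planes  # shape (3, 19, 19), fed to the convolutional layers
```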

We train the neural networks using a pipeline consisting of several stages of machine learning (Fig. 1). We begin by training a supervised learning (SL) policy network p_σ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Similar to prior work [13,15], we also train a fast policy p_π that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network p_ρ that improves the SL policy network by optimizing the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network v_θ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.


Supervised learning of policy networks
For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning [13,21–24]. The SL policy network p_σ(a|s) alternates between convolutional layers with weights σ, and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves a. The input s to the policy network is a simple representation of the board state (see Extended Data Table 2). The policy network is trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s


\Delta\sigma \propto \frac{\partial \log p_\sigma(a \mid s)}{\partial \sigma} \qquad (1)
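A minimal sketch of the update in equation (1). A linear-softmax policy over per-move feature vectors stands in for the 13-layer convolutional network, so that the gradient of the log-likelihood has a simple closed form; the feature representation is hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sl_policy_update(sigma, features, expert_move, lr=0.01):
    """One step of stochastic gradient ascent on log p_sigma(a|s), eq. (1).

    `features` is a (n_moves, n_features) matrix of per-move feature
    vectors for the sampled state s, and `expert_move` is the index of
    the human move a. For this linear-softmax stand-in, the gradient of
    the log-likelihood is phi(s, a) - sum_b p(b|s) phi(s, b).
    """
    logits = features @ sigma
    p = softmax(logits)
    grad = features[expert_move] - p @ features
    return sigma + lr * grad
```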

We trained a 13-layer policy network, which we call the SL policy network, from 30 million positions from the KGS Go Server. The network predicted expert moves on a held out test set with an accuracy of 57.0% using all input features, and 55.7% using only raw board position and move history as inputs, compared to the state-of-the-art from other research groups of 44.4% at date of submission [24] (full results in Extended Data Table 3). Small improvements in accuracy led to large improvements in playing strength (Fig. 2a); larger networks achieve better accuracy but are slower to evaluate during search. We also trained a faster but less accurate rollout policy p_π(a|s), using a linear softmax of small pattern features (see Extended Data Table 4) with weights π; this achieved an accuracy of 24.2%, using just 2 μs to select an action, rather than 3 ms for the policy network.

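The point of the rollout policy is that selecting a move costs little more than a table lookup. The sketch below assumes a hypothetical weight table keyed by hashed 3×3 local patterns; the real rollout policy uses the small pattern features of Extended Data Table 4, but the structure (cheap per-move features, then a linear softmax) is the same.

```python
import numpy as np

def fast_rollout_move(board, candidate_moves, pattern_weights, rng):
    """Sample a move from a linear-softmax rollout policy p_pi(a|s).

    `pattern_weights` is a hypothetical dict mapping a hashed 3x3 local
    pattern around a candidate move to a learned weight; a cheap lookup
    per move replaces a full network forward pass.
    """
    def local_pattern(board, move):
        r, c = move
        patch = board[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
        return hash(patch.tobytes())

    logits = np.array([pattern_weights.get(local_pattern(board, m), 0.0)
                       for m in candidate_moves])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return candidate_moves[rng.choice(len(candidate_moves), p=probs)]
```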

Reinforcement learning of policy networks
The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL) [25,26]. The RL policy network p_ρ is identical in structure to the SL policy network, and its weights ρ are initialized to the same values, ρ = σ. We play games between the current policy network p_ρ and a randomly selected previous iteration of the policy network. Randomizing from a pool of opponents in this way stabilizes training by preventing overfitting to the current policy. We use a reward function r(s) that is zero for all non-terminal time steps t < T. The outcome z_t = ±r(s_T) is the terminal reward at the end of the game from the perspective of the current player at time step t: +1 for winning and −1 for losing. Weights are then updated at each time step t by stochastic gradient ascent in the direction that maximizes expected outcome [25]


\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho}\, z_t \qquad (2)
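A minimal sketch of the REINFORCE-style update in equation (2), again with a linear-softmax policy standing in for the network: every move of a finished self-play game is reinforced in proportion to the terminal outcome z_t.

```python
import numpy as np

def rl_policy_update(rho, game_trajectory, lr=0.01):
    """Policy-gradient update of eq. (2) over one finished self-play game.

    `game_trajectory` is a list of (features, action, z_t) tuples, where
    `features` is the (n_moves, n_features) matrix for state s_t, `action`
    the index of the move a_t actually played, and z_t = +1 or -1 the
    terminal outcome from the current player's perspective.
    """
    for features, action, z_t in game_trajectory:
        logits = features @ rho
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # Gradient of log p_rho(a_t|s_t), scaled by the outcome z_t.
        grad = (features[action] - p @ features) * z_t
        rho = rho + lr * grad
    return rho
```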

We evaluated the performance of the RL policy network in game play, sampling each move a_t ~ p_ρ(·|s_t) from its output probability distribution over actions. When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi [14], a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi. In comparison, the previous state-of-the-art, based only on supervised learning of convolutional networks, won 11% of games against Pachi [23] and 12% against a slightly weaker program, Fuego [24].


Reinforcement learning of value networks
The final stage of the training pipeline focuses on position evaluation, estimating a value function v^p(s) that predicts the outcome from position s of games played by using policy p for both players [28–30]

v^p(s) = \mathbb{E}\left[ z_t \mid s_t = s,\ a_{t \ldots T} \sim p \right] \qquad (3)

Ideally, we would like to know the optimal value function under perfect play v*(s); in practice, we instead estimate the value function v^{p_ρ} for our strongest policy, using the RL policy network p_ρ. We approximate the value function using a value network v_θ(s) with weights θ, v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s). This neural network has a similar architecture to the policy network, but outputs a single prediction instead of a probability distribution. We train the weights of the value network by regression on state-outcome pairs (s, z), using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted value v_θ(s), and the corresponding outcome z

\Delta\theta \propto \frac{\partial v_\theta(s)}{\partial \theta}\,\bigl(z - v_\theta(s)\bigr) \qquad (4)
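A minimal sketch of the stochastic gradient descent step in equation (4). A tanh-squashed linear model stands in for the convolutional value network so that the gradient ∂v_θ(s)/∂θ is explicit; the feature vector is hypothetical.

```python
import numpy as np

def value_update(theta, features, z, lr=0.01):
    """One SGD step on the squared error (z - v_theta(s))^2, eq. (4).

    `features` is a feature vector for a single position s and z the
    eventual game outcome (+1 / -1). v_theta(s) = tanh(theta . features)
    is a stand-in for the value network; its gradient w.r.t. theta is
    (1 - v^2) * features.
    """
    v = np.tanh(theta @ features)
    grad_v = (1.0 - v ** 2) * features       # d v_theta(s) / d theta
    return theta + lr * grad_v * (z - v)     # move v_theta(s) towards z
```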

The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting. The problem is that successive positions are strongly correlated, differing by just one stone, but the regression target is shared for the entire game. When trained on the KGS data set in this way, the value network memorized the game outcomes rather than generalizing to new positions, achieving a minimum MSE of 0.37 on the test set, compared to 0.19 on the training set. To mitigate this problem, we generated a new self-play data set consisting of 30 million distinct positions, each sampled from a separate game. Each game was played between the RL policy network and itself until the game terminated. Training on this data set led to MSEs of 0.226 and 0.234 on the training and test set respectively, indicating minimal overfitting. Figure 2b shows the position evaluation accuracy of the value network compared to Monte Carlo rollouts using the fast rollout policy p_π; the value function was consistently more accurate. A single evaluation of v_θ(s) also approached the accuracy of Monte Carlo rollouts using the RL policy network p_ρ, but using 15,000 times less computation.
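The de-correlation fix amounts to keeping a single randomly chosen position, paired with the final outcome, from every self-play game. A sketch, assuming a hypothetical self_play_game callable:

```python
import random

def build_value_dataset(self_play_game, n_games):
    """Build a training set with one (position, outcome) pair per game.

    `self_play_game` is a hypothetical callable returning (positions, z),
    where `positions` is the sequence of states of one game between the
    RL policy network and itself and z is the final outcome (the sign
    bookkeeping per player to move is omitted here). Sampling a single
    position per game removes the strong correlation between successive
    positions that caused the value network to overfit on complete games.
    """
    dataset = []
    for _ in range(n_games):
        positions, z = self_play_game()
        state = random.choice(positions)
        dataset.append((state, z))
    return dataset
```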

Searching with policy and value networks
AlphaGo combines the policy and value networks in an MCTS algorithm (Fig. 3) that selects actions by lookahead search. Each edge (s, a) of the search tree stores an action value Q(s, a), visit count N(s, a), and prior probability P(s, a). The tree is traversed by simulation (that is, descending the tree in complete games without backup), starting from the root state. At each time step t of each simulation, an action a_t is selected from state s_t

a_t = \arg\max_a \bigl( Q(s_t, a) + u(s_t, a) \bigr) \qquad (5)

so as to maximize action value plus a bonus

u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)} \qquad (6)

that is proportional to the prior probability but decays with repeated visits to encourage exploration. When the traversal reaches a leaf node s_L at step L, the leaf node may be expanded. The leaf position s_L is processed just once by the SL policy network p_σ. The output probabilities are stored as prior probabilities P for each legal action a, P(s, a) = p_σ(a|s). The leaf node is evaluated in two very different ways: first, by the value network v_θ(s_L); and second, by the outcome z_L of a random rollout played out until terminal step T using the fast rollout policy p_π; these evaluations are combined, using a mixing parameter λ, into a leaf evaluation V(s_L)

V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda z_L \qquad (7)

At the end of simulation, the action values and visit counts of all traversed edges are updated. Each edge accumulates the visit count and mean evaluation of all simulations passing through that edge

N(s, a) = \sum_{i=1}^{n} 1(s, a, i) \qquad (8)

Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} 1(s, a, i)\, V(s_L^i) \qquad (9)

where s_L^i is the leaf node from the ith simulation, and 1(s, a, i) indicates whether an edge (s, a) was traversed during the ith simulation. Once the search is complete, the algorithm chooses the most visited move from the root position.
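The sketch below ties equations (5)–(9) together in one synchronous simulation loop. policy_net, value_net and rollout are hypothetical callables standing in for p_σ, v_θ and a playout with p_π, the game interface is the same hypothetical one used earlier, and details such as the exploration constant and the sign flipping between the two players' perspectives are simplified.

```python
class Node:
    def __init__(self, prior):
        self.P = prior      # prior probability P(s, a) from the SL policy network
        self.N = 0          # visit count N(s, a)
        self.Q = 0.0        # mean action value Q(s, a)
        self.children = {}  # move -> Node

def simulate(root_state, root, policy_net, value_net, rollout, lam=0.5, c=5.0):
    """One MCTS simulation: select (eqs 5-6), expand, evaluate (eq. 7), back up (eqs 8-9)."""
    state, node, path = root_state, root, []
    # Selection: descend the tree, maximising Q(s, a) + u(s, a) at each step.
    while node.children:
        move, node = max(node.children.items(),
                         key=lambda kv: kv[1].Q + c * kv[1].P / (1 + kv[1].N))
        path.append(node)
        state = state.play(move)
    # Expansion: the SL policy network provides the prior P(s, a) for each legal move.
    for move, prior in policy_net(state).items():
        node.children[move] = Node(prior)
    # Evaluation: mix the value network and a fast rollout, eq. (7).
    v = (1 - lam) * value_net(state) + lam * rollout(state)
    # Backup: update visit counts and running mean values along the path, eqs (8)-(9).
    for n in path:
        n.N += 1
        n.Q += (v - n.Q) / n.N
    return v

def best_move(root):
    """Once the search is complete, play the most-visited move from the root."""
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]
```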

It is worth noting that the SL policy network p_σ performed better in AlphaGo than the stronger RL policy network p_ρ, presumably because humans select a diverse beam of promising moves, whereas RL optimizes for the single best move. However, the value function v_θ(s) ≈ v^{p_ρ}(s) derived from the stronger RL policy network performed better in AlphaGo than a value function v_θ(s) ≈ v^{p_σ}(s) derived from the SL policy network.

Evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics. To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1,202 CPUs and 176 GPUs. The Methods section provides full details of asynchronous and distributed MCTS.
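For illustration only, a toy sketch of dispatching simulations from a pool of search threads while a lock guards the shared tree statistics. The production system described above additionally uses virtual loss to diversify the threads and queues network evaluations for batched, asynchronous execution on GPUs; none of that is shown here.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def parallel_search(root_state, root, simulate, n_simulations=1600, n_threads=40):
    """Run MCTS simulations from a pool of search threads.

    `simulate` is the single-simulation routine sketched above. A single
    coarse-grained lock serialises access to the shared tree statistics,
    which keeps the sketch simple (and effectively serialises the work);
    the real implementation interleaves CPU simulations with GPU network
    evaluations instead.
    """
    lock = threading.Lock()

    def worker(_):
        with lock:
            simulate(root_state, root)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(worker, range(n_simulations)))
    return root
```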

Evaluating the playing strength of AlphaGo

To evaluate AlphaGo, we ran an internal tournament among variants of AlphaGo and several other Go programs, including the strongest commercial programs Crazy Stone [13] and Zen, and the strongest open source programs Pachi [14] and Fuego [15]. All of these programs are based on high-performance MCTS algorithms. In addition, we included the open source program GnuGo, a Go program using state-of-the-art search methods that preceded MCTS. All programs were allowed 5 s of computation time per move.

The results of the tournament (see Fig. 4a) suggest that single- machine AlphaGo is many dan ranks stronger than any previous Go program, winning 494 out of 495 games (99.8%) against other Go programs. To provide a greater challenge to AlphaGo, we also played games with four handicap stones (that is, free moves for the opponent); AlphaGo won 77%, 86%, and 99% of handicap games against Crazy Stone, Zen and Pachi, respectively. The distributed version of AlphaGo was significantly stronger, winning 77% of games against single-machine AlphaGo and 100% of its games against other programs.

We also assessed variants of AlphaGo that evaluated positions using just the value network (λ = 0) or just rollouts (λ = 1) (see Fig. 4b). Even without rollouts AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go. However, the mixed evaluation (λ = 0.5) performed best, winning ≥95% of games against other variants. This suggests that the two position-evaluation mechanisms are complementary: the value network approximates the outcome of games played by the strong but impractically slow p_ρ, while the rollouts can precisely score and evaluate the outcome of games played by the weaker but faster rollout policy p_π. Figure 5 visualizes the evaluation of a real game position by AlphaGo.

Finally, we evaluated the distributed version of AlphaGo against Fan Hui, a professional 2 dan, and the winner of the 2013, 2014 and 2015 European Go championships. Over 5–9 October 2015 AlphaGo and Fan Hui competed in a formal five-game match. AlphaGo won the match 5 games to 0 (Fig. 6 and Extended Data Table 1). This is the first time that a computer Go program has defeated a human professional player, without handicap, in the full game of Go, a feat that was previously believed to be at least a decade away [3,7,31].

Discussion

In this work we have developed a Go program, based on a combination of deep neural networks and tree search, that plays at the level of the strongest human players, thereby achieving one of artificial intelligence’s “grand challenges” [31–33]. We have developed, for the first time, effective move selection and position evaluation functions for Go, based on deep neural networks that are trained by a novel combination of supervised and reinforcement learning. We have introduced a new search algorithm that successfully combines neural network evaluations with Monte Carlo rollouts. Our program AlphaGo integrates these components together, at scale, in a high-performance tree search engine.

During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov [4]; compensating by selecting those positions more intelligently, using the policy network, and evaluating them more precisely, using the value network, an approach that is perhaps closer to how humans play. Furthermore, while Deep Blue relied on a handcrafted evaluation function, the neural networks of AlphaGo are trained directly from gameplay purely through general-purpose supervised and reinforcement learning methods.

Go is exemplary in many ways of the difficulties faced by artificial intelligence [33,34]: a challenging decision-making task, an intractable search space, and an optimal solution so complex it appears infeasible to directly approximate using a policy or value function. The previous major breakthrough in computer Go, the introduction of MCTS, led to corresponding advances in many other domains; for example, general game-playing, classical planning, partially observed planning, scheduling, and constraint satisfaction [35,36]. By combining tree search with policy and value networks, AlphaGo has finally reached a professional level in Go, providing hope that human-level performance can now be achieved in other seemingly intractable artificial intelligence domains.
