Reading notes: "Mastering the game of Go without human knowledge"

Title: "Mastering the game of Go without human knowledge" — mastering the game of Go without human data or prior knowledge

--------------------------------------------------------------------------------------------------------------------------------

Original text:

        Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts 1–4. However, expert data is often expensive, unreliable, or simply unavailable. Even when reliable data is available it may impose a ceiling on the performance of systems trained in this manner 5. In contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games such as Atari 6,7 and 3D virtual environments 8–10. However, the most challenging domains in terms of human intellect – such as the game of Go, widely viewed as a grand challenge for artificial intelligence 11 – require precise and sophisticated lookahead in vast search spaces. Fully general methods have not previously achieved human-level performance in these domains.

---------------------------------------------------------------------------------------------------------------------------------

My notes:

Human expert data is expensive to obtain, sometimes unavailable, and may impose a ceiling on the performance of systems trained on it. Hence the rise of deep reinforcement learning. In highly challenging domains such as Go, however, fully general methods had not yet reached human-level performance.

---------------------------------------------------------------------------------------------------------------------------------

Original text:

AlphaGo was the first program to achieve superhuman performance in Go. The published version 12, which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan utilised two deep neural networks: a policy network that outputs move probabilities, and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte-Carlo Tree Search (MCTS) 13–15 to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte-Carlo rollouts using a fast rollout policy) to evaluate positions in the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach (see Methods), and defeated Lee Sedol, the winner of 18 international titles, in March 2016.

---------------------------------------------------------------------------------------------------------------------------------

My notes:
This paragraph introduces the original AlphaGo program and its components: ① a policy network that outputs move probabilities (trained first by supervised learning to predict expert moves, then refined by reinforcement learning); ② a value network that outputs a position evaluation (trained to predict the winner of the policy network's self-play games); ③ Monte-Carlo Tree Search (MCTS), which provides lookahead search, using the policy network to narrow the search to high-probability moves and the value network to evaluate positions in the tree. AlphaGo Lee used a similar approach.

---------------------------------------------------------------------------------------------------------------------------------

Original text:

Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee 12 in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it only uses the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.

---------------------------------------------------------------------------------------------------------------------------------

My notes:

Introduces AlphaGo Zero. Key features: ① trained solely by self-play reinforcement learning, starting from random play, with no supervision or human data; ② uses only the black and white stones on the board as input features; ③ uses a single neural network instead of separate policy and value networks; ④ uses a simpler tree search that relies on this single network to evaluate positions and sample moves, with no Monte-Carlo rollouts; ⑤ introduces a new reinforcement learning algorithm that incorporates lookahead search inside the training loop (not described precisely here; see Methods).

---------------------------------------------------------------------------------------------------------------------------------

Original text:

Reinforcement Learning in AlphaGo Zero

Our new method uses a deep neural network fθ with parameters θ. This neural network takes as an input the raw board representation s of the position and its history, and outputs both move probabilities and a value, (p, v) = fθ(s). The vector of move probabilities p represents the probability of selecting each move (including pass), pa = P r(a|s). The value v is a scalar evaluation, estimating the probability of the current player winning from position s. This neural network combines the roles of both policy network and value network 12 into a single architecture. The neural network consists of many residual blocks 4 of convolutional layers 16,17 with batch normalisation 18 and rectifier non-linearities 19 (see Methods).

---------------------------------------------------------------------------------------------------------------------------------

My notes:
This paragraph introduces the neural network used in AlphaGo Zero's reinforcement learning. First, its input and output: the input is "the raw board representation" — an unprocessed encoding of the current position and its history (a data structure recording the state of every point on the board: empty, black or white, directly reflecting the stone distribution and the progress of the game); the output is the pair "move probabilities and a value". Second, its overall structure: the roles of the policy network and the value network are merged into a single architecture, built from many residual blocks of convolutional layers with batch normalisation and rectifier non-linearities (details in Methods).

pa = Pr(a|s): the probability of playing each move a (including pass) from the current state s. The network learns the patterns and strategies of the game and assigns a probability to every move; the probabilities sum to 1.

value v: a single scalar estimating the probability that the current player wins from position s. It does not depend on any particular move choice; it is an overall assessment of the position.
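
To make the input/output pair (p, v) = fθ(s) concrete, below is a minimal PyTorch sketch of a single network with a shared residual tower and separate policy and value heads. The 17 input planes, 64 channels and 4 residual blocks are my own assumptions for brevity; the paper's network uses 20 or 40 residual blocks, and its exact input encoding is given in Methods.

```python
# A minimal sketch (not the paper's configuration) of a single network that maps
# a board encoding s to (move probabilities p, value v).
# Assumptions: 19x19 board, 17 input feature planes, 64 channels, 4 residual blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19
MOVES = BOARD * BOARD + 1          # every intersection plus "pass"

class ResBlock(nn.Module):
    """One residual block: conv -> BN -> ReLU -> conv -> BN, plus a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.b1 = nn.BatchNorm2d(ch)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.b2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        y = F.relu(self.b1(self.c1(x)))
        y = self.b2(self.c2(y))
        return F.relu(x + y)

class DualNet(nn.Module):
    """(p, v) = f_theta(s): one shared tower, a policy head and a value head."""
    def __init__(self, planes=17, ch=64, blocks=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(planes, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU())
        self.tower = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        self.policy = nn.Sequential(          # move-probability head (logits over all moves)
            nn.Conv2d(ch, 2, 1), nn.BatchNorm2d(2), nn.ReLU(),
            nn.Flatten(), nn.Linear(2 * BOARD * BOARD, MOVES))
        self.value = nn.Sequential(           # scalar evaluation head
            nn.Conv2d(ch, 1, 1), nn.BatchNorm2d(1), nn.ReLU(),
            nn.Flatten(), nn.Linear(BOARD * BOARD, ch), nn.ReLU(),
            nn.Linear(ch, 1), nn.Tanh())      # in [-1, 1], matching the game outcome z

    def forward(self, s):
        h = self.tower(self.stem(s))
        p_logits = self.policy(h)             # apply softmax to obtain p, including pass
        v = self.value(h).squeeze(-1)         # expected outcome for the current player
        return p_logits, v

# Usage: p_logits, v = DualNet()(torch.zeros(1, 17, BOARD, BOARD))
```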

---------------------------------------------------------------------------------------------------------------------------------

Original text:

Reinforcement Learning in AlphaGo Zero

The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position s, an MCTS search is executed, guided by the neural network fθ. The MCTS search outputs probabilities π of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network fθ(s); MCTS may therefore be viewed as a powerful policy improvement operator 20,21. Self-play with search – using the improved MCTS-based policy to select each move, then using the game winner z as a sample of the value – may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure 22,23: the neural network’s parameters are updated to make the move probabilities and value (p, v) = fθ(s) more closely match the improved search probabilities and self-play winner (π, z); these new parameters are used in the next iteration of self-play to make the search even stronger. Figure 1 illustrates the self-play training pipeline.

---------------------------------------------------------------------------------------------------------------------------------

My notes:

In AlphaGo Zero the neural network is trained from games of self-play. At each position, an MCTS search guided by the neural network is executed and outputs probabilities for the next move; these search probabilities usually select much stronger moves than the raw network output, which is why MCTS can be viewed as a powerful policy improvement operator. Self-play with search — using the improved MCTS-based policy to select each move, and using the game winner z as a sample of the value — can be viewed as a powerful policy evaluation operator.

The main idea of the reinforcement learning algorithm is to apply these search operators repeatedly in a policy iteration loop: the network parameters are updated so that the predicted move probabilities and value (p, v) move closer to the improved search probabilities and self-play winner (π, z); the updated parameters are then used in the next iteration of self-play, making the search even stronger (a schematic sketch follows).
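
The following is only a schematic Python sketch of that policy iteration loop, not the paper's implementation: self_play_game and train_step are hypothetical placeholders for the MCTS self-play and the gradient-descent update described in the text, and all the counts are arbitrary.

```python
# A schematic sketch of the policy-iteration loop only; self_play_game and
# train_step are hypothetical placeholders, and the iteration counts are arbitrary.
import random
from typing import List, Tuple

Example = Tuple[object, List[float], float]   # (state s, search probabilities pi, winner z)

def self_play_game(theta) -> List[Example]:
    """Placeholder: play one game with MCTS guided by f_theta and
    return (s_t, pi_t, z_t) for every time-step t."""
    return []

def train_step(theta, batch: List[Example]):
    """Placeholder: one gradient-descent step pulling (p, v) towards (pi, z)."""
    return theta

def policy_iteration(theta, iterations=10, games_per_iter=100, train_batches=50):
    data: List[Example] = []
    for _ in range(iterations):
        # Policy improvement / evaluation: self-play with the current search-based policy.
        for _ in range(games_per_iter):
            data.extend(self_play_game(theta))
        # Fit the network to the improved targets (pi, z) so the next search is stronger.
        for _ in range(train_batches):
            batch = random.sample(data, k=min(32, len(data)))
            theta = train_step(theta, batch)
    return theta
```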

---------------------------------------------------------------------------------------------------------------------------------

Original text:

Reinforcement Learning in AlphaGo Zero

The Monte-Carlo tree search uses the neural network fθ to guide its simulations (see Figure 2). Each edge (s, a) in the search tree stores a prior probability P(s, a), a visit count N(s, a), and an action-value Q(s, a). Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a)/(1 + N(s, a)) 12,24, until a leaf node s′ is encountered. This leaf position is expanded and evaluated just once by the network to generate both prior probabilities and evaluation, (P(s′, ·), V(s′)) = fθ(s′). Each edge (s, a) traversed in the simulation is updated to increment its visit count N(s, a), and to update its action-value to the mean evaluation over these simulations, Q(s, a) = 1/N(s, a) · Σ_{s′ | s,a→s′} V(s′), where s, a → s′ indicates that a simulation eventually reached s′ after taking move a from position s.

---------------------------------------------------------------------------------------------------------------------------------

My notes:

This paragraph describes how MCTS uses the neural network fθ to guide its simulations; for a detailed walkthrough see my other post on the Monte-Carlo tree search algorithm used in "Mastering the game of Go without human knowledge".

Note that when MCTS combines the "average past return" Q(s, a) with the "expected gain" U(s, a), it uses the neural network's prediction to supply the prior probability P(s, a) and combines it with the visit count to compute the exploration term; this is how candidate child nodes are scored for selection.
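
As a concrete illustration, here is a minimal sketch of that selection rule Q(s, a) + U(s, a). The excerpt only states U(s, a) ∝ P(s, a)/(1 + N(s, a)); the c_puct constant and the √(parent visits) factor follow the commonly used PUCT form (the paper's ref. 24), so treat the exact expression as an assumption rather than the paper's formula.

```python
# A minimal sketch of the selection rule Q(s, a) + U(s, a); c_puct and the
# sqrt(parent visits) factor are assumptions following the common PUCT form.
import math
from typing import Dict, Tuple

def puct_score(q: float, p: float, n: int, n_parent: int, c_puct: float = 1.0) -> float:
    u = c_puct * p * math.sqrt(n_parent) / (1 + n)    # exploration term U(s, a)
    return q + u                                       # exploitation + exploration

def select_action(edges: Dict[int, Tuple[float, float, int]], c_puct: float = 1.0) -> int:
    """edges maps action a -> (Q(s, a), P(s, a), N(s, a)); pick the argmax of Q + U."""
    n_parent = sum(n for _, _, n in edges.values())
    return max(edges, key=lambda a: puct_score(*edges[a], n_parent=n_parent, c_puct=c_puct))
```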

---------------------------------------------------------------------------------------------------------------------------------

Original text:

Reinforcement Learning in AlphaGo Zero

MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = αθ(s), proportional to the exponentiated visit count for each move, πa ∝ N(s, a)^(1/τ), where τ is a temperature parameter.

---------------------------------------------------------------------------------------------------------------------------------

This passage describes how the Monte-Carlo search tree is built and used. Tree construction:

In MCTS, each edge (s, a) of the search tree represents taking action a from state s. Each edge stores three quantities:

        Action value Q(s, a): the mean evaluation of action a over previous simulations.

        Visit count N(s, a): the number of times action a has been selected.

        Prior probability P(s, a): the probability assigned to action a before any simulation information is available.

Simulation: ① each simulation starts from the root state and selects actions step by step until a leaf node s′ is reached; ② action selection is guided by an exploration term U(s, a), which is proportional to P(s, a) and inversely proportional to 1 + N(s, a); ③ this term is combined with the action value Q(s, a), and the action maximising Q(s, a) + U(s, a) is selected.

Leaf evaluation: ① when a simulation reaches a leaf node s′, the node is expanded and the new state is evaluated once by the neural network fθ; ② the network outputs the prior probability distribution P(s′, ·) and the value V(s′), the latter estimating the current player's chance of winning from that state.

Tree update: every edge (s, a) traversed during a simulation has its statistics updated: the visit count N(s, a) is incremented, and the action value Q(s, a) is set to the mean of the evaluations V(s′) over all leaves s′ reached by taking action a from state s.

Summary: MCTS continually refines the search tree through simulations guided by the neural network, which provides both the prior probabilities when a node is expanded and the evaluation of each newly reached state (see the sketch below).
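
To tie the update rules and the temperature formula together, here is a minimal Python sketch. The Edge layout, the sign flip during backup (the player to move alternates along the path) and the default τ = 1 are my own simplifications, not details given in the excerpt.

```python
# A minimal sketch of the backup step and of turning root visit counts into search
# probabilities pi_a ∝ N(s, a)^(1/tau). Edge layout, sign flip and tau = 1 are assumptions.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Edge:
    prior: float             # P(s, a), from the network when the node was expanded
    visits: int = 0          # N(s, a)
    value_sum: float = 0.0   # running sum of leaf evaluations V(s')

    @property
    def q(self) -> float:    # Q(s, a): mean evaluation over simulations through this edge
        return self.value_sum / self.visits if self.visits else 0.0

def backup(path: List[Edge], leaf_value: float) -> None:
    """Update every edge traversed in one simulation with the leaf evaluation V(s')."""
    for edge in reversed(path):
        edge.visits += 1
        edge.value_sum += leaf_value
        leaf_value = -leaf_value     # assumption: flip perspective between the two players

def search_probabilities(root_edges: Dict[int, Edge], tau: float = 1.0) -> Dict[int, float]:
    """pi_a proportional to N(s, a)^(1/tau); tau -> 0 approaches greedy move choice."""
    weights = {a: e.visits ** (1.0 / tau) for a, e in root_edges.items()}
    total = sum(weights.values()) or 1.0
    return {a: w / total for a, w in weights.items()}
```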

---------------------------------------------------------------------------------------------------------------------------------

Original text:

Reinforcement Learning in AlphaGo Zero

The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialised to random weights θ0. At each subsequent iteration i ≥ 1, games of self-play are generated (Figure 1a). At each time-step t, an MCTS search πt = αθi−1(st) is executed using the previous iteration of neural network fθi−1, and a move is played by sampling the search probabilities πt. A game terminates at step T when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length; the game is then scored to give a final reward of rT ∈ {−1, +1} (see Methods for details). The data for each time-step t is stored as (st, πt, zt) where zt = ±rT is the game winner from the perspective of the current player at step t. In parallel (Figure 1b), new network parameters θi are trained from data (s, π, z) sampled uniformly among all time-steps of the last iteration(s) of self-play. The neural network (p, v) = fθi(s) is adjusted to minimise the error between the predicted value v and the self-play winner z, and to maximise the similarity of the neural network move probabilities p to the search probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over mean-squared error and cross-entropy losses respectively,

(p, v) = fθ(s),    l = (z − v)² − π⊤ log p + c‖θ‖²        (Equation 1)

where c is a parameter controlling the level of L2 weight regularisation (to prevent overfitting).

---------------------------------------------------------------------------------------------------------------------------------

My notes: this passage describes the following steps.

① Initially the neural network's weights are randomised, so it starts with no prior knowledge.

② Data is generated by self-play: at each time-step, an MCTS search guided by the previous iteration's network parameters is executed, and the move is chosen by sampling from the resulting search probabilities.

③ Termination conditions: (a) both players pass; (b) the search value drops below a resignation threshold; (c) the game exceeds a maximum length. The finished game is then scored to give a final reward rT ∈ {−1, +1}, i.e. a loss or a win.

④ Data storage: for every time-step t of the game (not only the final one), the tuple (st, πt, zt) is stored — the board state (the layout of black and white stones), the search probabilities, and the game winner zt = ±rT from the perspective of the player to move at step t; z is only filled in once the game has finished.

⑤ How the network is trained on these samples and what it outputs:

The left-hand formula, (p, v) = fθ(s): the input is the current board state s, and the outputs are the move probabilities p and the predicted value v of winning.

The right-hand formula is the loss function: loss = (self-play winner − predicted value)² − (search probabilities)⊤ × log(move probabilities) + c × (squared norm of the weights), where the last term, weighted by c, prevents overfitting.
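
As an illustration, here is a minimal PyTorch sketch of that loss. The default c = 1e-4 and the use of raw policy logits are my assumptions; in practice the L2 term is often delegated to the optimiser's weight decay instead.

```python
# A minimal sketch of the loss l = (z - v)^2 - pi^T log p + c * ||theta||^2.
# The default c = 1e-4 and the use of raw policy logits are assumptions.
import torch
import torch.nn.functional as F

def alphago_zero_loss(p_logits, v, pi, z, params, c=1e-4):
    """p_logits: raw policy outputs, v: predicted value, pi: MCTS search probabilities,
    z: self-play winner in {-1, +1}, params: iterable of network weight tensors."""
    value_loss = F.mse_loss(v, z)                                                 # (z - v)^2
    policy_loss = -(pi * F.log_softmax(p_logits, dim=-1)).sum(dim=-1).mean()      # cross-entropy
    l2 = c * sum((w ** 2).sum() for w in params)                                  # L2 regularisation
    return value_loss + policy_loss + l2
```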

---------------------------------------------------------------------------------------------------------------------------------

(Training data is sampled uniformly over all time-steps of the most recent self-play iteration(s).)

Neural network input: the board state

Optimisation: gradient descent on the loss (mean-squared error + cross-entropy + L2 regularisation)

Neural network outputs: the move probabilities and the current player's probability of winning

Quantities stored in the Monte-Carlo search tree:

Action value — the mean of the (network-estimated) win values from earlier simulations, i.e. their sum divided by the visit count

Visit count — how many times the edge has been selected so far

Prior probability — the move probability produced by the neural network

Monte-Carlo tree update

Node selection combines an upper confidence bound based on the prior probability: an exploitation term (proportional to the action value) plus an exploration term (inversely related to the visit count).

---------------------------------------------------------------------------------------------------------------------------------

Original text:

2 Empirical Analysis of AlphaGo Zero Training

We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behaviour and continued without human intervention for approximately 3 days.

Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4s thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks (see Methods for further details).

Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning, as a function of training time, on an Elo scale 25. Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting suggested in prior literature 26–28. Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; for comparison, AlphaGo Lee was trained over several months. After 72 hours, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the 2 hour time controls and match conditions as were used in the man-machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 Tensor Processing Units (TPUs) 29, while AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Figure 5 and Supplementary Information).

To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS data-set; this achieved state-of-the-art prediction accuracy compared to prior work 12,30–33 (see Extended Data Table 1 and 2 respectively). Supervised learning achieved better initial performance, and was better at predicting the outcome of human professional games (Figure 3). Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 hours of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.

---------------------------------------------------------------------------------------------------------------------------------

My notes:

In practice: the details of the training run (1,600 MCTS simulations per move, about 0.4 s of thinking time, 700,000 mini-batches of 2,048 positions, a 20-block network, roughly 3 days of training), plus a comparison with supervised learning on human expert data (prediction accuracy and other metrics). The conclusion is that self-play reinforcement learning performs better overall than supervised learning, even though the supervised network predicts human moves more accurately.
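
Purely for reference, the training figures reported in this excerpt gathered in one place; the dictionary and its field names are just an illustrative summary, not any API.

```python
# The training figures reported in this excerpt, gathered purely for reference;
# the dictionary and its field names are an illustrative summary, not any API.
ALPHAGO_ZERO_3_DAY_RUN = {
    "training_duration_days": 3,
    "self_play_games": 4_900_000,
    "mcts_simulations_per_move": 1_600,
    "thinking_time_per_move_s": 0.4,
    "parameter_updates_mini_batches": 700_000,
    "positions_per_mini_batch": 2_048,
    "residual_blocks": 20,
    "evaluation_hardware": "a single machine with 4 TPUs",
}
```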

---------------------------------------------------------------------------------------------------------------------------------

To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the previous neural network architecture used in AlphaGo Lee (see Figure 4). Four neural networks were created, using either separate policy and value networks, as in AlphaGo Lee, or combined policy and value networks, as in AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the residual network architecture from AlphaGo Zero. Each network was trained to minimise the same loss function (Equation 1) using a fixed data-set of self-play games generated by AlphaGo Zero after 72 hours of self-play training. Using a residual network was more accurate, achieved lower error, and improved performance in AlphaGo by over 600 Elo. Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo. This is partly due to improved computational efficiency, but more importantly the dual objective regularises the network to a common representation that supports multiple use cases.

---------------------------------------------------------------------------------------------------------------------------------

My notes:

To separate the contributions of architecture and algorithm, the combined policy-and-value network and residual architecture of AlphaGo Zero were compared with the separate policy and value networks and convolutional architecture of AlphaGo Lee (four networks in total, trained on the same self-play data with the same loss).

The residual network was more accurate, achieved lower error, and improved playing strength by over 600 Elo. Combining the policy and value into a single network slightly reduced move prediction accuracy, but reduced the value error and boosted playing strength by roughly another 600 Elo.

---------------------------------------------------------------------------------------------------------------------------------

Original text:

3 Knowledge Learned by AlphaGo Zero

AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included fundamental elements of human Go knowledge, and also non-standard strategies beyond the scope of traditional Go knowledge.

Figure 5 shows a timeline indicating when professional joseki (corner sequences) were discovered (Figure 5a, Extended Data Figure 1); ultimately AlphaGo Zero preferred new joseki variants that were previously unknown (Figure 5b, Extended Data Figure 2). Figure 5c and the Supplementary Information show several fast self-play games played at different stages of training. Tournament length games played at regular intervals throughout training are shown in Extended Data Figure 3 and Supplementary Information. AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts including fuseki (opening), tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho (“ladder” capture sequences that may span the whole board) – one of the first elements of Go knowledge learned by humans – were only understood by AlphaGo Zero much later in training.

---------------------------------------------------------------------------------------------------------------------------------

My notes:

AlphaGo Zero not only learned traditional Go knowledge and strategies, but also discovered non-standard strategies of its own. The rest of the passage reviews which Go concepts (joseki, fuseki, tesuji, life-and-death, ladders, etc.) were discovered and when during training.

---------------------------------------------------------------------------------------------------------------------------------

Original text:

4 Final Performance of AlphaGo Zero

We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days. Over the course of training, 29 million games of self-play were generated. Parameters were updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained 40 residual blocks. The learning curve is shown in Figure 6a. Games played at regular intervals throughout training are shown in Extended Data Figure 4 and Supplementary Information. We evaluated the fully trained AlphaGo Zero using an internal tournament against AlphaGo Fan, AlphaGo Lee, and several previous Go programs. We also played games against the strongest existing program, AlphaGo Master – a program based on the algorithm and architecture presented in this paper but utilising human data and features (see Methods) – which defeated the strongest human professional players 60–0 in online games 34 in January 2017. In our evaluation, all programs were allowed 5 seconds of thinking time per move; AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs; AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUs and 48 TPUs respectively. We also included a player based solely on the raw neural network of AlphaGo Zero; this player simply selected the move with maximum probability. Figure 6b shows the performance of each program on an Elo scale. The raw neural network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan. Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100 game match with 2 hour time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Figure 6 and Supplementary Information).

---------------------------------------------------------------------------------------------------------------------------------

My notes:

After scaling up the training (a 40-block network, about 40 days, 29 million self-play games), AlphaGo Zero was evaluated against AlphaGo Master, AlphaGo Lee, AlphaGo Fan and earlier Go programs under controlled match conditions; AlphaGo Zero performed best, reaching 5,185 Elo and beating AlphaGo Master 89–11 in a 100-game match.

--------------------------------------------------------------------------------------------------------------------------------

Original text:

5 Conclusion

Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules. Furthermore, a pure reinforcement learning approach requires just a few more hours to train, and achieves much better asymptotic performance, compared to training on human expert data. Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained from human data using handcrafted features, by a large margin. Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.

---------------------------------------------------------------------------------------------------------------------------------

My notes:

This paragraph states that a pure reinforcement learning approach is fully feasible even in the most challenging domains: with no human examples or guidance, and no domain knowledge beyond the basic rules, a few extra hours of training are enough to reach superhuman level. In the space of a few days, AlphaGo Zero was able to rediscover much of human Go knowledge, as well as novel strategies that offer new insights into this ancient game. (Through self-play alone it both inherits and innovates, without being given any game records.)

(The Methods section contains many details of parameter tuning; my own experience there is limited, so readers are encouraged to study it for themselves.)
