Bilingual AlphaGo Paper: Mastering the Game of Go with Deep Neural Networks and Tree Search (reply "AlphaGo论文" on the official account for a PDF of the bilingual version)

This post describes how Google DeepMind used deep neural networks and tree search to create AlphaGo, the first program to defeat a human professional player in full-sized Go. The paper details the training of the policy and value networks and the Monte Carlo tree search algorithm that combines them, demonstrating a breakthrough for artificial intelligence in this highly complex game.


Original post by 秦陇纪 (Qin Longji), 数据简化DataSimp official account

DataSimp editor's note: On 28 January 2016, Google DeepMind's Go team published paper nature16961, "Mastering the game of Go with deep neural networks and tree search", in Nature, describing the AlphaGo algorithm. It is a classic of artificial intelligence (AI, machine learning, deep learning) and well worth keeping and studying. This article presents a Chinese-English parallel translation of the AlphaGo paper; an implementation and analysis on our own platform will follow. After following the official account 数据简化DataSimp, reply with the keyword "AlphaGo论文" or "谷歌围棋论文" in the input box to get the PDF download link. Appreciation and support are welcome.

The DataSimp community shares scientific knowledge, technical applications, industry activities, and profiles of people and organizations covering information and data processing and analysis, the frontiers of data-science research, the state of data resources, and the foundations of data simplification. Contributions are welcome: help advance data science and technology, raise readers' data skills, and improve society's information efficiency. Getting things done needs a platform and ideas that keep pace; stopping at everyday concerns leaves no strength to advance civilization, and advancing civilization cannot stop at knocking on doors and shouting. Too many unrealized designs and daydreams waste a lifetime, so engineering ability is crucial. Qin Longji offers this as mutual encouragement. Happy studying!

Bilingual AlphaGo Paper: Mastering the Game of Go with Deep Neural Networks and Tree Search (51,654 characters)

Contents

A. Google AlphaGo paper: Mastering the Game of Go with Deep Neural Networks and Tree Search (42,766 characters)

1. Introduction

2. Supervised Learning of Policy Networks

3. Reinforcement Learning of Policy Networks

4. Reinforcement Learning of Value Networks

5. Searching with Policy and Value Networks

6. Evaluating the Playing Strength of AlphaGo

7. Discussion

8. Methods

9. References

10. Acknowledgements

11. Author Contributions

12. Extended Data

13. Supplementary Information

14. Comments

B. Metrics and Translation Notes for the Classic Deep-Neural-Network AlphaGo Paper (8,122 characters)

1. A Nature Research Journal: Article Metrics

2. Translation Notes on the Classic Deep-Neural-Network AlphaGo Paper

References (3,287 characters); Appendix (1,236 characters): About the DataSimp Community


A. Google AlphaGo Paper: Mastering the Game of Go with Deep Neural Networks and Tree Search (42,766 characters)

Mastering the Game of Go with Deep Neural Networks and Tree Search

Text | Google DeepMind; Translation | 秦陇纪 (Qin Longji) et al., 数据简化DataSimp, 2016-03-16 (Wed) to 11-05 (Mon)

Title: Mastering the game of Go with deep neural networks and tree search

Authors: David Silver1*, Aja Huang1*, Chris J. Maddison1, Arthur Guez1, Laurent Sifre1, George van den Driessche1, Julian Schrittwieser1, Ioannis Antonoglou1, Veda Panneershelvam1, Marc Lanctot1, Sander Dieleman1, Dominik Grewe1, John Nham2, Nal Kalchbrenner1, Ilya Sutskever2, Timothy Lillicrap1, Madeleine Leach1, Koray Kavukcuoglu1, Thore Graepel1 & Demis Hassabis1

Affiliations: 1. Google DeepMind, 5 New Street Square, London EC4A 3TW, UK. 2. Google, 1600 Amphitheatre Parkway, Mountain View, California 94043, USA. *These authors contributed equally to this work.

Abstract: The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses 'value networks' to evaluate board positions and 'policy networks' to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

(Translator's note 1: The paper has 15 parts: 0 Abstract, 1 Introduction, 2 Supervised learning of policy networks, 3 Reinforcement learning of policy networks, 4 Reinforcement learning of value networks, 5 Searching with policy and value networks, 6 Evaluating the playing strength of AlphaGo, 7 Discussion (References 1-38), 8 Methods (9: References 39-62), 10 Acknowledgements, 11 Author Information (Author Contributions), 12 Extended data (extended data figures and tables), 13 Supplementary information (rights and permissions, about this article, further reading) and 14 Comments. The References comprise 38 items for the main text and Discussion plus 24 for the Methods section, 62 in total. The journal's online materials also include the paper PDF, 6 large figures, 6 game PPTs and 1 supplementary-information archive; the paper is at http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html. This new-media article omits Part 8 and some other material; the complete ZIP archive of all materials can be downloaded from this community.)

 

1. Introduction

All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game, from every board position or state s, under perfect play by all players. These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game's breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b≈35, d≈80)1 and especially Go (b≈250, d≈150)1, exhaustive search is infeasible2,3, but the effective search space can be reduced by two general principles. First, the depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s)≈v*(s) that predicts the outcome from state s. This approach has led to super-human performance in chess4, checkers5 and othello6, but it was believed to be intractable in Go owing to the complexity of the game7. Second, the breadth of the search may be reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s. For example, Monte Carlo rollouts8 search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts can provide an effective position evaluation, achieving super-human performance in backgammon8 and Scrabble9, and weak amateur-level play in Go10.
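
To make the second principle concrete, here is a minimal Python sketch of rollout-based position evaluation. It is illustrative only, not the paper's implementation: the `state` object and its methods (`player_to_move`, `is_terminal`, `legal_moves`, `play`, `winner`) are a hypothetical game interface, and `policy(state, move)` stands for any move-probability function p(a|s).

```python
import random

def rollout_value(state, policy, max_moves=400):
    """Play one game to the end from `state`, sampling both players' moves from
    `policy` with no branching and no lookahead; return +1 if the player to move
    in the original position wins, -1 otherwise."""
    player = state.player_to_move()
    current = state
    for _ in range(max_moves):
        if current.is_terminal():
            break
        moves = current.legal_moves()
        weights = [policy(current, m) for m in moves]
        current = current.play(random.choices(moves, weights=weights, k=1)[0])
    return 1 if current.winner() == player else -1

def evaluate_position(state, policy, n_rollouts=100):
    """Average many rollouts to obtain an approximate value v(s) close to v*(s)."""
    return sum(rollout_value(state, policy) for _ in range(n_rollouts)) / n_rollouts
```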

Monte Carlo tree search (MCTS)11,12 uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function12. The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves13. These policies are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play13-15. However, prior work has been limited to shallow policies13-15 or value functions16 based on a linear combination of input features.
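
For readers unfamiliar with the generic algorithm, the sketch below shows the plain UCT form of MCTS (selection, expansion, rollout evaluation, backup). It reuses the hypothetical `rollout_value` helper and `state` interface from the previous sketch, and it illustrates vanilla MCTS only, not AlphaGo's variant, which replaces the evaluation and prior-probability steps with the value and policy networks described later.

```python
import math
import random

class Node:
    """One search-tree node: visit count, accumulated value, children, untried moves."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                                   # move -> Node
        self.untried = [] if state.is_terminal() else list(state.legal_moves())
        self.visits, self.value_sum = 0, 0.0

def uct_select(node, c=1.4):
    """Pick the child with the best exploration/exploitation trade-off (UCB1)."""
    def score(child):
        return (child.value_sum / child.visits
                + c * math.sqrt(math.log(node.visits) / child.visits))
    return max(node.children.values(), key=score)

def mcts(root_state, rollout_policy, n_simulations=1000):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: descend while the node is fully expanded and non-terminal.
        while not node.untried and node.children:
            node = uct_select(node)
        # 2. Expansion: add one child for a move not yet tried from this node.
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.state.play(move), parent=node)
            node.children[move] = child
            node = child
        # 3. Evaluation: score the leaf with a Monte Carlo rollout
        #    (the result is from the viewpoint of the player to move at the leaf).
        result = rollout_value(node.state, rollout_policy)
        # 4. Backup: each node stores value from the viewpoint of the player
        #    who moved into it, so the sign flips at every step up the tree.
        while node is not None:
            node.visits += 1
            node.value_sum += -result
            result = -result
            node = node.parent
    # Play the most-visited move at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```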

Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example, image classification17, face recognition18 and playing Atari games19. They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localised representations of an image20. We employ a similar architecture for the game of Go. We pass in the board position as a 19×19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.
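
As a rough illustration of this idea (not the architecture reported in the paper: the number of input feature planes, layers and filters below are made-up values), a convolutional policy network over the 19×19 board could be sketched in PyTorch as follows.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Toy convolutional policy network: the board, encoded as a stack of 19x19
    feature planes, passes through 3x3 convolutions and ends in a probability
    distribution over the 361 intersections."""
    def __init__(self, in_planes=4, width=64, n_layers=4):
        super().__init__()
        layers = [nn.Conv2d(in_planes, width, kernel_size=3, padding=1), nn.ReLU()]
        for _ in range(n_layers - 1):
            layers += [nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU()]
        layers.append(nn.Conv2d(width, 1, kernel_size=1))    # one logit per board point
        self.body = nn.Sequential(*layers)

    def forward(self, boards):                   # boards: (batch, in_planes, 19, 19)
        logits = self.body(boards).flatten(1)    # (batch, 361)
        return torch.softmax(logits, dim=1)      # p(a|s) as a map over the board

# Example: move probabilities for one randomly filled "board".
probs = PolicyNet()(torch.randn(1, 4, 19, 19))
print(probs.shape, float(probs.sum()))           # torch.Size([1, 361]) ~1.0
```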

We train the neural networks using a pipeline consisting of several stages of machine learning (Figure 1). We begin by training a supervised learning (SL) policy network pσ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Similar to prior work13,15, we also train a fast rollout policy pπ that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network pρ that improves the SL policy network by optimising the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network vθ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.
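
The reinforcement learning stage can be caricatured with a tiny policy-gradient update. The snippet below is a sketch under the same assumptions as the previous block: it reuses the toy `PolicyNet` and substitutes random tensors for real self-play records, showing only the shape of a REINFORCE-style objective that raises the probability of moves played by the eventual winner and lowers it for the loser.

```python
import torch

policy = PolicyNet()                               # toy network from the sketch above
opt = torch.optim.SGD(policy.parameters(), lr=1e-3)

# Fake self-play records, purely to keep the example runnable:
boards = torch.randn(32, 4, 19, 19)                # positions reached during self-play
moves = torch.randint(0, 361, (32,))               # index of the move actually played
outcomes = torch.randn(32).sign()                  # +1 if the mover eventually won, else -1

probs = policy(boards)                                              # (32, 361)
log_p = torch.log(probs.gather(1, moves.unsqueeze(1)).squeeze(1) + 1e-12)
loss = -(outcomes * log_p).mean()                  # policy-gradient objective
opt.zero_grad()
loss.backward()
opt.step()
```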

Figure 1: Neural network training pipeline and architecture. a, A fast rollout policy pπ and a supervised learning (SL) policy network pσ are trained to predict human expert moves in a data set of positions. A reinforcement learning (RL) policy network pρ is initialised to the SL policy network, and is then improved by policy gradient learning to maximize the outcome (i.e. winning more games) against previous versions of the policy network. A new data set is generated by playing games of self-play with the RL policy network. Finally, a value network vθ is trained by regression to predict the expected outcome (i.e. whether the current player wins) in positions from the self-play data set. b, Schematic representation of the neural network architecture used in AlphaGo. The policy network takes a representation of the board position s as its input, passes it through many convolutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a probability distribution pσ(a|s) or pρ(a|s) over legal moves a, represented by a probability map over the board. The value network similarly uses many convolutional layers with parameters θ, but outputs a scalar value vθ(s′) that predicts the expected outcome in position s′.

