Go Game
High-Level Ideas
Training and Execution
Policy Network
State (of AlphaGo Zero)
Policy Network
AlphaGo Zero
AlphaGo
Initialize Policy Network by Behavior Cloning
Note that:
Specific steps:
After behavior cloning, two situations can arise when the network plays:
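The behavior-cloning step above can be sketched as supervised learning: the policy network is trained to imitate human moves by minimizing cross-entropy. Below is a minimal illustrative sketch with a toy linear-softmax "network"; all sizes, names, and the synthetic data are assumptions, not the actual AlphaGo architecture.

```python
import numpy as np

# Toy stand-in for the policy network: a linear softmax over board features.
# Shapes are illustrative only (the real input is a 19x19 board encoding).
rng = np.random.default_rng(0)
N_FEATURES, N_ACTIONS = 16, 9

W = np.zeros((N_FEATURES, N_ACTIONS))

def policy(state):
    """Softmax policy pi(a | s; W)."""
    logits = state @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def clone_step(state, expert_action, lr=0.1):
    """One behavior-cloning update: minimize cross-entropy to the human move."""
    global W
    p = policy(state)
    grad = np.outer(state, p)        # d(cross-entropy)/dW for a softmax
    grad[:, expert_action] -= state  # subtract the one-hot target term
    W -= lr * grad

# Train on a few fake "human" (state, action) pairs.
states = rng.normal(size=(200, N_FEATURES))
actions = rng.integers(0, N_ACTIONS, size=200)
for s, a in zip(states, actions):
    clone_step(s, a)
```

The update is exactly the gradient of the cross-entropy loss, so the network's output distribution moves toward the human player's move distribution.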
Train Policy Network Using Policy Gradient
Reinforcement learning of policy network
Policy Gradient
Specific training steps:
Play Go using the policy network
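The policy-gradient training above can be sketched with REINFORCE: play a full game, then update every step with the gradient of log-probability scaled by the game outcome (+1 for a win, -1 for a loss, as in AlphaGo). Everything else below — the linear-softmax policy, the sizes, and the random "self-play" data — is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
N_FEATURES, N_ACTIONS = 16, 9
W = rng.normal(scale=0.1, size=(N_FEATURES, N_ACTIONS))

def policy(state):
    """Softmax policy pi(a | s; W)."""
    logits = state @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_update(trajectory, reward, lr=0.01):
    """Policy-gradient step: sum of grad log pi(a|s) * reward over the game."""
    global W
    for state, action in trajectory:
        p = policy(state)
        grad_logp = -np.outer(state, p)  # d(log pi)/dW for a softmax policy
        grad_logp[:, action] += state
        W += lr * reward * grad_logp     # gradient ascent on expected reward

# One fake self-play game: random states, actions sampled from the policy.
traj = []
for _ in range(30):
    s = rng.normal(size=N_FEATURES)
    a = int(rng.choice(N_ACTIONS, p=policy(s)))
    traj.append((s, a))
reinforce_update(traj, reward=+1.0)  # pretend this player won
```

Because the same reward is applied to every step, moves from won games become more likely and moves from lost games less likely.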
Train the Value Network
Policy Value Networks (AlphaGo Zero)
Train the value network
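Value-network training can be sketched as regression: push v(s) toward the game outcome z (+1 win / -1 loss) with a squared loss. The linear model with a tanh output below is a stand-in for the real convolutional network, and the synthetic outcomes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N_FEATURES = 16
w = np.zeros(N_FEATURES)

def value(state):
    """Toy value function v(s) in (-1, 1)."""
    return np.tanh(state @ w)

def value_step(state, z, lr=0.05):
    """Gradient step on the squared loss (v(s) - z)^2 / 2."""
    global w
    v = value(state)
    dv_dw = (1.0 - v ** 2) * state   # tanh derivative times the input
    w -= lr * (v - z) * dv_dw

# Fake training data: outcomes correlated with feature 0 only.
states = rng.normal(size=(500, N_FEATURES))
z = np.sign(states[:, 0])
for s, zi in zip(states, z):
    value_step(s, zi)
```

After training, the toy model's prediction correlates with the (synthetic) outcome signal, which is all the value network is asked to do: estimate the probability-like chance of winning from a position.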
Monte Carlo Tree Search
Main idea: "look far ahead"
Each Monte Carlo Tree Search (MCTS) simulation consists of 4 steps:
Specifically:
My understanding: after many searches, the first term dominates and the score is essentially determined by Q(a). Actions with large Q(a) are therefore likely to be searched again, so their N(a) grows. This is why N(a) can reflect how good an action is.
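The selection rule discussed above can be sketched as score(a) = Q(a) + η · π(a|s) / (1 + N(a)): the exploitation term Q(a) plus an exploration bonus that shrinks as the visit count N(a) grows. The η value and the toy numbers below are invented for illustration.

```python
import numpy as np

def select(Q, N, prior, eta=2.0):
    """Pick the action maximizing Q(a) + eta * prior(a) / (1 + N(a))."""
    score = Q + eta * prior / (1.0 + N)
    return int(np.argmax(score))

Q = np.array([0.1, 0.5, 0.3])      # mean values from past simulations
N = np.array([10, 80, 30])         # visit counts
prior = np.array([0.2, 0.5, 0.3])  # policy-network probabilities pi(a|s)

a = select(Q, N, prior)            # action explored in this simulation
best_move = int(np.argmax(N))      # the move actually played uses N(a)
```

As N(a) grows, the bonus term vanishes and selection is driven by Q(a); this is the feedback loop described above that makes N(a) a reliable proxy for action quality.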
MCTS: Summary
Summary
Training and Execution
AlphaGo Zero vs. AlphaGo
This raises a question: