Chapter 8 Planning and Learning with Tabular Methods


Model-based methods rely on planning: they take a model as input and produce or improve a policy by interacting with the simulated environment.
Model-free methods rely on learning: they take real experience from the environment as input.

This chapter combines the two approaches.


Integrated Architectures
8.1 Models and Planning

Distribution model: a model that describes every possible outcome together with its probability.
Sample model: a model that produces a single outcome, sampled according to its probability.

The MDP assumed in dynamic programming is a distribution model, while the model used in the blackjack example is a sample model.

In either case, the model is used to simulate the environment and produce simulated experience.
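A minimal sketch of the distinction, assuming a tiny hand-written MDP (all names here are illustrative, not from the book):

```python
import random

# A distribution model: for each (state, action), every possible (next_state, reward) with its probability.
distribution_model = {
    ("s0", "a"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
}

def sample_model(state, action):
    """A sample model built on top of the distribution model: return one (next_state, reward)."""
    outcomes = distribution_model[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, s_next, r = random.choices(outcomes, weights=probs)[0]
    return s_next, r
```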

Planning is any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment:
planning


The following is based on David Silver's reinforcement learning lecture slides.

Model in detail

  • A model $\mathcal{M}$ is a representation of an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, parameterized by $\eta$.
  • Assume the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are known.
  • A model $\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ therefore represents the state transitions $\mathcal{P}_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$:

    $$S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t), \qquad R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$$

  • It is also assumed that state transitions and rewards are conditionally independent:

    $$\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\,\mathbb{P}[R_{t+1} \mid S_t, A_t]$$

The goal of model learning is to estimate the model $\mathcal{M}_\eta$ from experience; this is a supervised learning problem.

  • Learning $s, a \to r$ is a regression problem.
  • Learning $s, a \to s'$ is a density estimation problem.
  • Pick a loss function, e.g. mean-squared error or KL divergence.
  • Find the parameters $\eta$ that minimize the empirical loss.

Examples of models:
- Table Lookup Model (the model used in the book and in the survey below)
- Linear Expectation Model
- Linear Gaussian Model
- Gaussian Process Model
- Deep Belief Network Model

A brief look at the Table Lookup Model
The model is an explicit MDP $\langle \hat{\mathcal{P}}, \hat{\mathcal{R}} \rangle$, where $N(s,a)$ is the total number of visits to the pair $(s,a)$:

$$\hat{\mathcal{P}}^a_{s,s'} = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s'), \qquad \hat{\mathcal{R}}^a_s = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t = s, a)\, R_t
$$
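A minimal sketch of a table-lookup model that implements these counts (the class and method names are mine, not the book's); it can also be used as a sample model:

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Estimate transition probabilities and expected rewards by counting real transitions."""
    def __init__(self):
        self.n = defaultdict(int)            # N(s, a)
        self.n_next = defaultdict(int)       # N(s, a, s')
        self.r_sum = defaultdict(float)      # sum of rewards observed for (s, a)

    def update(self, s, a, r, s_next):
        self.n[(s, a)] += 1
        self.n_next[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r

    def p_hat(self, s, a, s_next):
        return self.n_next[(s, a, s_next)] / self.n[(s, a)]

    def r_hat(self, s, a):
        return self.r_sum[(s, a)] / self.n[(s, a)]

    def sample(self, s, a):
        """Use the table as a sample model: draw s' according to P-hat, return the estimated reward."""
        outcomes = [(sp, c) for (ss, aa, sp), c in self.n_next.items() if (ss, aa) == (s, a)]
        states, weights = zip(*outcomes)
        return random.choices(states, weights=weights)[0], self.r_hat(s, a)
```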


Planning in detail

Planning with a Model

  • Given a model $\mathcal{M}_\eta = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
  • Solve the MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
  • Using any planning algorithm:
    • Value iteration
    • Policy iteration
    • Tree search (the MCTS discussed in the book and below)

Sample-based planning draws simulated experience from the model and applies model-free RL to it, e.g. Monte Carlo control, Sarsa, or Q-learning.


Integrating Learning and Planning


The approach emphasized in the book is state-space planning: searching the state space for an optimal policy, or for an optimal path to a goal.

The methods discussed next share two basic ideas:

  1. All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy.
  2. They compute value functions by applying updates or backup operations to simulated experience.

basic idea planning

Random-sample one-step tabular Q-planning
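A minimal sketch of random-sample one-step tabular Q-planning, assuming a sample model with a `sample(s, a)` method such as the table-lookup model above (the parameter values are illustrative):

```python
import random

def q_planning_step(Q, model, states, actions, alpha=0.1, gamma=0.95):
    """One planning update: pick (s, a) at random, query the model, apply a Q-learning backup."""
    s = random.choice(states)
    a = random.choice(actions)
    s_next, r = model.sample(s, a)
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```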

8.2 Dyna: Integrated Planning, Acting, and Learning

A reinforcement learning architecture that combines the two:
policy summarized
Direct RL: learn and improve the policy directly from real experience with the environment.
Indirect RL: use the experience to learn a model, then feed the model to planning, which learns and improves the policy.
The general Dyna Architecture
In Dyna-Q, acting, model learning, and direct RL each require relatively little computation; the bulk of the time in each step is consumed by the planning process.
Tabular Dyna-Q
Steps (d), (e), and (f) perform direct reinforcement learning, model learning, and planning, respectively.

The model is first learned from real experience, and planning then uses the model to further update the Q-values, as sketched below.
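A minimal sketch of one time step of Tabular Dyna-Q, assuming an `env.step(s, a)` interface and a deterministic table model mapping `(s, a)` to `(r, s')` (these names and the parameter values are mine, not the book's):

```python
import random

def dyna_q_step(Q, model, env, s, actions, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    # (a)-(c): choose an action epsilon-greedily and take it in the real environment
    a = random.choice(actions) if random.random() < eps else max(actions, key=lambda x: Q[(s, x)])
    s_next, r = env.step(s, a)
    # (d) direct RL: Q-learning update from the real transition
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)])
    # (e) model learning: assuming a deterministic environment, just record the transition
    model[(s, a)] = (r, s_next)
    # (f) planning: n updates from randomly chosen, previously observed state-action pairs
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, a2)] for a2 in actions) - Q[(ps, pa)])
    return s_next
```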

An example of the effect of planning:
Dyna-Q example
The more planning steps per real step, the faster the agent converges.

8.3 When the Model Is Wrong

So far we have assumed the environment does not change. If it does change, the model becomes wrong, and the policy derived from it will generally not explore the locations that have changed.
Dyna-Q+ example
A heuristic method for this problem is implemented by Dyna-Q+:

give transitions that have not been tried for a long time an extra exploration bonus.

A practical way to do this:

In particular, if the modeled reward for a transition is $r$, and the transition has not been tried in $\tau$ time steps, then planning updates are done as if that transition produced a reward of $r + \kappa\sqrt{\tau}$, for some small $\kappa$.
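A minimal sketch of that bonus inside the planning loop, assuming the agent also records the last real time step at which each pair was tried (the `last_tried` table is my own illustration):

```python
import math

def planning_reward(r, t, last_tried, s, a, kappa=1e-3):
    """Dyna-Q+ bonus: pretend long-untried transitions pay a little extra reward."""
    tau = t - last_tried.get((s, a), 0)   # time steps since (s, a) was last tried for real
    return r + kappa * math.sqrt(tau)
```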

8.4 Prioritized Sweeping

Until now, planning sampled state-action pairs from the model uniformly at random, which is inefficient: the sizes of the value changes are far from uniform, so the pairs whose updates would change the most should be processed first, using a queue.

A queue is maintained of every state-action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change.

Prioritized sweeping for a deterministic environment
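A minimal sketch of that planning loop for the deterministic table model used above, assuming a `predecessors(s)` helper that returns previously observed pairs predicted to lead to $s$ (helper names and parameters are mine):

```python
import heapq

def prioritized_sweeping(Q, model, predecessors, actions, n_updates, alpha=0.1, gamma=0.95, theta=1e-4):
    pq = []  # max-priority queue implemented as a min-heap of (-priority, (s, a))
    def push(s, a):
        r, s_next = model[(s, a)]
        p = abs(r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)])
        if p > theta:
            heapq.heappush(pq, (-p, (s, a)))
    for (s, a) in model:                  # seed the queue with all known pairs
        push(s, a)
    for _ in range(n_updates):
        if not pq:
            break
        _, (s, a) = heapq.heappop(pq)
        r, s_next = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)])
        for (sb, ab) in predecessors(s):  # predecessors of s may now need updating too
            push(sb, ab)
    return Q
```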

8.5 Expected vs. Sample Updates

Viewed as one-step updates, the methods so far vary along three dimensions (a fourth is introduced later):

  1. whether they update state values or action values;
  2. whether they estimate the value for the optimal policy or for an arbitrary given policy (i.e. $q_*$, $v_*$, $q_\pi$, $v_\pi$);
  3. whether the updates are expected updates, considering all possible events that might happen, or sample updates, considering a single sample of what might happen.

Backup diagrams for all the one-step updates considered in this book.

Expected update: its correctness is limited only by the accuracy of the successor action values,

$$Q(s,a) \leftarrow \sum_{s', r} \hat{p}(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} Q(s', a')\bigr]$$

Sample update: additionally affected by sampling error, but much cheaper to compute,

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,\bigl[R + \gamma \max_{a'} Q(S', a') - Q(s,a)\bigr]$$

Experiments show that on states with many possible successors (a large branching factor), many sample updates do better than a single expected update for the same amount of computation, as illustrated in the sketch below.
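A minimal sketch contrasting the two updates, assuming a distribution model exposed as a function returning (probability, reward, next state) triples and a sample model as above (both interfaces are my own illustration):

```python
def expected_update(Q, dist_model, s, a, actions, gamma=0.95):
    """Back up over every possible outcome, weighted by its probability."""
    Q[(s, a)] = sum(p * (r + gamma * max(Q[(sp, a2)] for a2 in actions))
                    for p, r, sp in dist_model(s, a))

def sample_update(Q, sample_model, s, a, actions, alpha=0.1, gamma=0.95):
    """Back up from a single sampled outcome; cheaper, but subject to sampling error."""
    sp, r = sample_model.sample(s, a)
    Q[(s, a)] += alpha * (r + gamma * max(Q[(sp, a2)] for a2 in actions) - Q[(s, a)])
```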

8.6 Trajectory Sampling

In other words, one simulates explicit individual trajectories and performs updates at the state or state-action pairs encountered along the way. We call this way of generating experience and updates trajectory sampling.

Compare the on-policy and uniform distributions of updates: starting from the initial state $S$, each state has $b$ possible next states. On-policy sampling updates the states encountered while following the current policy; uniform sampling picks each state with equal probability and updates its value.
In all cases, on-policy sampling performs better in the short term, but in the long run uniform sampling does better.

8.7 Real-time Dynamic Programming

RTDP updates the values of states visited in actual or simulated trajectories by means of expected tabular value-iteration updates.
The update is the standard value-iteration update:

$$v_{k+1}(s) \doteq \max_a \mathbb{E}\bigl[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a\bigr] = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_k(s')\bigr]$$
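A minimal sketch of one RTDP trajectory, assuming the distribution-model and sample-model interfaces used above plus a goal test (all hypothetical):

```python
def rtdp_trajectory(V, dist_model, sample_model, s0, actions, is_goal, gamma=0.95, max_steps=1000):
    """Follow a greedy trajectory, applying an expected value-iteration update to each visited state."""
    s = s0
    for _ in range(max_steps):
        if is_goal(s):
            break
        backups = {a: sum(p * (r + gamma * V[sp]) for p, r, sp in dist_model(s, a)) for a in actions}
        V[s] = max(backups.values())          # expected update at the current state
        a = max(backups, key=backups.get)     # act greedily with respect to the updated values
        s, _ = sample_model.sample(s, a)      # move to a (possibly stochastic) successor
    return V
```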

8.8 Planning at Decision Time

Background planning: planning is not focused on the current state.

Decision-time planning: begin in the current state $S_t$ and select an action $A_t$, then do the same from $S_{t+1}$ to select $A_{t+1}$, and so on until a terminal state. More generally, planning used in this way can look much deeper than one step ahead and evaluate action choices leading to many different predicted states and reward trajectories. Unlike background planning, the planning here is focused on a particular state.

Decision-time planning is most useful in applications in which fast responses are not required.

8.9 Heuristic Search

The classical state-space planning methods of artificial intelligence are decision-time planning methods collectively known as heuristic search.

With a perfect model and an imperfect action-value function, deeper search will in fact usually yield better policies.

The backgammon example shows that even looking ahead just a few moves makes play dramatically stronger.

A human player wants their evaluations to be accurate, but cannot keep track of all the states the way a computer can; this is why heuristic search is so effective.

By concentrating computation on the relevant candidate states, decision-time planning can produce better policies than updating states at random.
Heuristic search

8.10 Rollout Algorithms

Rollout algorithms are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state. For a given rollout policy, they estimate action values by averaging the returns of the simulated trajectories.

The action with the highest estimate is selected and executed, the resulting next state becomes the current state, and the procedure repeats.

Rollout algorithms produce Monte Carlo estimates of action values only for the current state, under the given rollout policy, as in the sketch below.
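A minimal sketch of one rollout decision, assuming a sample model, a rollout policy, and a terminal-state test (all hypothetical helpers; `n_sims` is illustrative):

```python
def rollout_action(sample_model, s0, actions, rollout_policy, is_terminal, n_sims=100, gamma=1.0):
    """Estimate each action's value by averaging returns of simulated trajectories from s0."""
    def simulate(s, a):
        ret, discount = 0.0, 1.0
        while not is_terminal(s):
            s, r = sample_model.sample(s, a)
            ret += discount * r
            discount *= gamma
            a = rollout_policy(s)
        return ret
    values = {a: sum(simulate(s0, a) for _ in range(n_sims)) / n_sims for a in actions}
    return max(values, key=values.get)
```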

Reference: A Survey of Monte Carlo Tree Search Methods

Monte Carlo Tree Search (MCTS)

Monte Carlo tree search (MCTS) is a method for finding optimal decisions in a given domain by taking random samples in the decision space and building a search tree according to the results.

Monte Carlo Methods

$$Q(s,a) = \frac{1}{N(s,a)} \sum_{i=1}^{N(s)} \mathbb{1}_i(s,a)\, z_i$$

where $N(s,a)$ is the number of times action $a$ has been selected from state $s$, $N(s)$ is the number of times a game has been played out through state $s$, $z_i$ is the result of the $i$-th simulation played out from $s$, and $\mathbb{1}_i(s,a)$ is 1 if action $a$ was selected from state $s$ on the $i$-th playout.

For many reward distributions, there is no policy whose regret grows more slowly than $O(\ln n)$.

Upper Confidence Bounds (UCBs)

The simplest of these is called UCB1.
For a multi-armed bandit, choose the arm $j$ that maximizes

$$\mathrm{UCB1} = \bar{X}_j + \sqrt{\frac{2 \ln n}{n_j}}$$

where $\bar{X}_j$ is the average reward from arm $j$, $n_j$ is the number of times arm $j$ has been played, and $n$ is the overall number of plays so far.
The $\bar{X}_j$ term encourages exploitation, while the $\sqrt{2\ln n / n_j}$ term encourages exploration, as in the sketch below.
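A minimal sketch of UCB1 arm selection (the helper and its arguments are mine; each arm is played once before the bound is used):

```python
import math

def ucb1_select(counts, totals, n):
    """counts[j]: pulls of arm j; totals[j]: summed reward from arm j; n: total pulls so far."""
    for j, c in enumerate(counts):
        if c == 0:                       # play every arm once before applying the formula
            return j
    return max(range(len(counts)),
               key=lambda j: totals[j] / counts[j] + math.sqrt(2 * math.log(n) / counts[j]))
```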

A. Algorithm

The basic algorithm involves iteratively building a search tree until some predefined computational budget (typically a time, memory, or iteration constraint) is reached, at which point the search is halted and the best performing root action returned.

1. Selection: Starting at the root node, a child selection policy is recursively applied to descend through the tree until the most urgent expandable node is reached. A node is expandable if it represents a nonterminal state and has unvisited (i.e., unexpanded) children.
2. Expansion: One (or more) child nodes are added to expand the tree, according to the available actions.
3. Simulation: A simulation is run from the new node(s) according to the default policy to produce an outcome.
4. Backpropagation: The simulation result is “backed up” (i.e., backpropagated) through the selected nodes to update their statistics.

These steps can be grouped into two policies:

1. Tree policy: Select or create a leaf node from the nodes already contained within the search tree (selection and expansion).
2. Default policy: Play out the domain from a given nonterminal state to produce a value estimate (simulation).

General MCTS approach
The result of the overall search $a(\mathrm{BESTCHILD}(v_0))$ is the action $a$ that leads to the best child of the root node $v_0$.

One iteration of the general MCTS approach

Schadd describes four criteria for selecting the winning action, based on the work of Chaslot et al.:

1. max child: select the root child with the highest reward;
2. robust child: select the most visited root child;
3. max-robust child: select the root child with both the highest visit count and the highest reward; if none exists, then continue searching until an acceptable visit count is achieved;
4. secure child: select the child which maximizes a lower confidence bound.

B. Development

Plain Monte Carlo action selection comes with no game-theoretic guarantee: even after running for a long time, the action finally chosen may not be optimal.
MCTS adds a tree-based selection mechanism; using a method such as UCB1 to balance exploitation and exploration drives the probability of choosing a suboptimal action down quickly.

C. Upper Confidence Bounds for Trees (UCT)

1) The UCT Algorithm: The goal of MCTS is to approximate the true game-theoretic values of the actions that may be selected from the current state. UCB1 can be used as the tree policy.
UCB1 is simple and is guaranteed to be within a constant factor of the best possible bound on the growth of regret.

Every time a node (action) is to be selected within the existing tree, the choice may be modeled as an independent multiarmed bandit problem. A child node $j$ is selected to maximize

$$\mathrm{UCT} = \bar{X}_j + 2 C_p \sqrt{\frac{2 \ln n}{n_j}}$$

where $n$ is the number of times the current (parent) node has been visited, $n_j$ is the number of times child $j$ has been visited, and $C_p > 0$ is a constant.

The constant $C_p$ can be adjusted to lower or increase the amount of exploration; Kocsis and Szepesvári used $C_p = 1/\sqrt{2}$ in their paper.

For the rest of the iteration: if the node selected by the UCB descent has children that are not yet part of the tree, one of them is chosen at random and added to the tree. The default policy is then followed from this new node until a terminal state is reached; in the simplest case the default policy is uniformly random. Finally, the value $\Delta$ of the terminal state $s_T$ is backpropagated, updating every node visited during this iteration, from the newly added node up to the root.
Each node stores two values: the number of times it has been visited, $N(v)$, and the total reward of all playouts that passed through it, $Q(v)$. The ratio $Q(v)/N(v)$ approximates the node's game-theoretic value.

UCT1
UCT2


Algorithm 3 shows an alternative and more efficient backup method for two-player, zero-sum games with alternating moves, which is typically found in the literature.
UCT3
UCT_3_2

Given enough time and memory, UCT allows MCTS to converge to the minimax tree, which is then optimal.


Below is example UCT code, slightly modified to run on Python 3. It includes three games and comes from
www.mcts.ai

      # This is a very simple implementation of the UCT Monte Carlo Tree Search algorithm in Python 2.7.
      # The function UCT(rootstate, itermax, verbose = False) is towards the bottom of the code.
      # It aims to have the clearest and simplest possible code, and for the sake of clarity, the code
      # is orders of magnitude less efficient than it could be made, particularly by using a 
      # state.GetRandomMove() or state.DoRandomRollout() function.
      # 
      # Example GameState classes for Nim, OXO and Othello are included to give some idea of how you
      # can write your own GameState use UCT in your 2-player game. Change the game to be played in 
      # the UCTPlayGame() function at the bottom of the code.
      # 
      # Written by Peter Cowling, Ed Powley, Daniel Whitehouse (University of York, UK) September 2012.
      # 
      # Licence is granted to freely use and distribute for any sensible/legal purpose so long as this comment
      # remains in any distributed code.
      # 
      # For more information about Monte Carlo Tree Search check out our web site at www.mcts.ai
      
      from math import *
      import random
      
      class GameState:
          """ A state of the game, i.e. the game board. These are the only functions which are
              absolutely necessary to implement UCT in any 2-player complete information deterministic 
              zero-sum game, although they can be enhanced and made quicker, for example by using a 
              GetRandomMove() function to generate a random move during rollout.
              By convention the players are numbered 1 and 2.
          """
          def __init__(self):
                  self.playerJustMoved = 2 # At the root pretend the player just moved is player 2 - player 1 has the first move
      
          def Clone(self):
              """ Create a deep clone of this game state.
              """
              st = GameState()
              st.playerJustMoved = self.playerJustMoved
              return st
      
          def DoMove(self, move):
              """ Update a state by carrying out the given move.
                  Must update playerJustMoved.
              """
              self.playerJustMoved = 3 - self.playerJustMoved
      
          def GetMoves(self):
              """ Get all possible moves from this state.
              """
      
          def GetResult(self, playerjm):
              """ Get the game result from the viewpoint of playerjm. 
              """
      
          def __repr__(self):
              """ Don't need this - but good style.
              """
              pass
      
      
      class NimState:
          """ A state of the game Nim. In Nim, players alternately take 1,2 or 3 chips with the 
              winner being the player to take the last chip. 
              In Nim any initial state of the form 4n+k for k = 1,2,3 is a win for player 1
              (by choosing k) chips.
              Any initial state of the form 4n is a win for player 2.
          """
          def __init__(self, ch):
              self.playerJustMoved = 2 # At the root pretend the player just moved is p2 - p1 has the first move
              self.chips = ch
      
          def Clone(self):
              """ Create a deep clone of this game state.
              """
              st = NimState(self.chips)
              st.playerJustMoved = self.playerJustMoved
              return st
      
          def DoMove(self, move):
              """ Update a state by carrying out the given move.
                  Must update playerJustMoved.
              """
              assert move >= 1 and move <= 3 and move == int(move)
              self.chips -= move
              self.playerJustMoved = 3 - self.playerJustMoved
      
          def GetMoves(self):
              """ Get all possible moves from this state.
              """
              return [i for i in range(1,min([4, self.chips + 1]))]
      
          def GetResult(self, playerjm):
              """ Get the game result from the viewpoint of playerjm. 
              """
              assert self.chips == 0
              if self.playerJustMoved == playerjm:
                  return 1.0 # playerjm took the last chip and has won
              else:
                  return 0.0 # playerjm's opponent took the last chip and has won
      
          def __repr__(self):
              s = "Chips:" + str(self.chips) + " JustPlayed:" + str(self.playerJustMoved)
              return s
      
      class OXOState:
          """ A state of the game, i.e. the game board.
              Squares in the board are in this arrangement
              012
              345
              678
              where 0 = empty, 1 = player 1 (X), 2 = player 2 (O)
          """
          def __init__(self):
              self.playerJustMoved = 2 # At the root pretend the player just moved is p2 - p1 has the first move
              self.board = [0,0,0,0,0,0,0,0,0] # 0 = empty, 1 = player 1, 2 = player 2
      
          def Clone(self):
              """ Create a deep clone of this game state.
              """
              st = OXOState()
              st.playerJustMoved = self.playerJustMoved
              st.board = self.board[:]
              return st
      
          def DoMove(self, move):
              """ Update a state by carrying out the given move.
                  Must update playerToMove.
              """
              assert move >= 0 and move <= 8 and move == int(move) and self.board[move] == 0
              self.playerJustMoved = 3 - self.playerJustMoved
              self.board[move] = self.playerJustMoved
      
          def GetMoves(self):
              """ Get all possible moves from this state.
              """
              return [i for i in range(9) if self.board[i] == 0]
      
          def GetResult(self, playerjm):
              """ Get the game result from the viewpoint of playerjm. 
              """
              for (x,y,z) in [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]:
                  if self.board[x] == self.board[y] == self.board[z]:
                      if self.board[x] == playerjm:
                          return 1.0
                      else:
                          return 0.0
              if self.GetMoves() == []: return 0.5 # draw
              assert False # Should not be possible to get here
      
          def __repr__(self):
              s= ""
              for i in range(9): 
                  s += ".XO"[self.board[i]]
                  if i % 3 == 2: s += "\n"
              return s
      
      class OthelloState:
          """ A state of the game of Othello, i.e. the game board.
              The board is a 2D array where 0 = empty (.), 1 = player 1 (X), 2 = player 2 (O).
              In Othello players alternately place pieces on a square board - each piece played
              has to sandwich opponent pieces between the piece played and pieces already on the 
              board. Sandwiched pieces are flipped.
              This implementation modifies the rules to allow variable sized square boards and
              terminates the game as soon as the player about to move cannot make a move (whereas
              the standard game allows for a pass move). 
          """
          def __init__(self,sz = 8):
              self.playerJustMoved = 2 # At the root pretend the player just moved is p2 - p1 has the first move
              self.board = [] # 0 = empty, 1 = player 1, 2 = player 2
              self.size = sz
              assert sz == int(sz) and sz % 2 == 0 # size must be integral and even
              for y in range(sz):
                  self.board.append([0]*sz)
              self.board[sz//2][sz//2] = self.board[sz//2-1][sz//2-1] = 1
              self.board[sz//2][sz//2-1] = self.board[sz//2-1][sz//2] = 2
      
          def Clone(self):
              """ Create a deep clone of this game state.
              """
              st = OthelloState()
              st.playerJustMoved = self.playerJustMoved
              st.board = [self.board[i][:] for i in range(self.size)]
              st.size = self.size
              return st
      
          def DoMove(self, move):
              """ Update a state by carrying out the given move.
                  Must update playerToMove.
              """
              (x,y)=(move[0],move[1])
              assert x == int(x) and y == int(y) and self.IsOnBoard(x,y) and self.board[x][y] == 0
              m = self.GetAllSandwichedCounters(x,y)
              self.playerJustMoved = 3 - self.playerJustMoved
              self.board[x][y] = self.playerJustMoved
              for (a,b) in m:
                  self.board[a][b] = self.playerJustMoved
      
          def GetMoves(self):
              """ Get all possible moves from this state.
              """
              return [(x,y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == 0 and self.ExistsSandwichedCounter(x,y)]
      
          def AdjacentToEnemy(self,x,y):
              """ Speeds up GetMoves by only considering squares which are adjacent to an enemy-occupied square.
              """
              for (dx,dy) in [(0,+1),(+1,+1),(+1,0),(+1,-1),(0,-1),(-1,-1),(-1,0),(-1,+1)]:
                  if self.IsOnBoard(x+dx,y+dy) and self.board[x+dx][y+dy] == self.playerJustMoved:
                      return True
              return False
      
          def AdjacentEnemyDirections(self,x,y):
              """ Speeds up GetMoves by only considering squares which are adjacent to an enemy-occupied square.
              """
              es = []
              for (dx,dy) in [(0,+1),(+1,+1),(+1,0),(+1,-1),(0,-1),(-1,-1),(-1,0),(-1,+1)]:
                  if self.IsOnBoard(x+dx,y+dy) and self.board[x+dx][y+dy] == self.playerJustMoved:
                      es.append((dx,dy))
              return es
      
          def ExistsSandwichedCounter(self,x,y):
              """ Does there exist at least one counter which would be flipped if my counter was placed at (x,y)?
              """
              for (dx,dy) in self.AdjacentEnemyDirections(x,y):
                  if len(self.SandwichedCounters(x,y,dx,dy)) > 0:
                      return True
              return False
      
          def GetAllSandwichedCounters(self, x, y):
              """ Is (x,y) a possible move (i.e. opponent counters are sandwiched between (x,y) and my counter in some direction)?
              """
              sandwiched = []
              for (dx,dy) in self.AdjacentEnemyDirections(x,y):
                  sandwiched.extend(self.SandwichedCounters(x,y,dx,dy))
              return sandwiched
      
          def SandwichedCounters(self, x, y, dx, dy):
              """ Return the coordinates of all opponent counters sandwiched between (x,y) and my counter.
              """
              x += dx
              y += dy
              sandwiched = []
              while self.IsOnBoard(x,y) and self.board[x][y] == self.playerJustMoved:
                  sandwiched.append((x,y))
                  x += dx
                  y += dy
              if self.IsOnBoard(x,y) and self.board[x][y] == 3 - self.playerJustMoved:
                  return sandwiched
              else:
                  return [] # nothing sandwiched
      
          def IsOnBoard(self, x, y):
              return x >= 0 and x < self.size and y >= 0 and y < self.size
      
          def GetResult(self, playerjm):
              """ Get the game result from the viewpoint of playerjm. 
              """
              jmcount = len([(x,y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == playerjm])
              notjmcount = len([(x,y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == 3 - playerjm])
              if jmcount > notjmcount: return 1.0
              elif notjmcount > jmcount: return 0.0
              else: return 0.5 # draw
      
          def __repr__(self):
              s= ""
              for y in range(self.size-1,-1,-1):
                  for x in range(self.size):
                      s += ".XO"[self.board[x][y]]
                  s += "\n"
              return s
      
      class Node:
          """ A node in the game tree. Note wins is always from the viewpoint of playerJustMoved.
              Crashes if state not specified.
          """
          def __init__(self, move = None, parent = None, state = None):
              self.move = move # the move that got us to this node - "None" for the root node
              self.parentNode = parent # "None" for the root node
              self.childNodes = []
              self.wins = 0
              self.visits = 0
              self.untriedMoves = state.GetMoves() # future child nodes
              self.playerJustMoved = state.playerJustMoved # the only part of the state that the Node needs later
      
          def UCTSelectChild(self):
              """ Use the UCB1 formula to select a child node. Often a constant UCTK is applied so we have
                  lambda c: c.wins/c.visits + UCTK * sqrt(2*log(self.visits)/c.visits to vary the amount of
                  exploration versus exploitation.
              """
              s = sorted(self.childNodes, key = lambda c: c.wins/c.visits + sqrt(2*log(self.visits)/c.visits))[-1]
              return s
      
          def AddChild(self, m, s):
              """ Remove m from untriedMoves and add a new child node for this move.
                  Return the added child node
              """
              n = Node(move = m, parent = self, state = s)
              self.untriedMoves.remove(m)
              self.childNodes.append(n)
              return n
      
          def Update(self, result):
              """ Update this node - one additional visit and result additional wins. result must be from the viewpoint of playerJustmoved.
              """
              self.visits += 1
              self.wins += result
      
          def __repr__(self):
              return "[M:" + str(self.move) + " W/V:" + str(self.wins) + "/" + str(self.visits) + " U:" + str(self.untriedMoves) + "]"
      
          def TreeToString(self, indent):
              s = self.IndentString(indent) + str(self)
              for c in self.childNodes:
                   s += c.TreeToString(indent+1)
              return s
      
          def IndentString(self,indent):
              s = "\n"
              for i in range (1,indent+1):
                  s += "| "
              return s
      
          def ChildrenToString(self):
              s = ""
              for c in self.childNodes:
                   s += str(c) + "\n"
              return s
      
      
      def UCT(rootstate, itermax, verbose = False):
          """ Conduct a UCT search for itermax iterations starting from rootstate.
              Return the best move from the rootstate.
              Assumes 2 alternating players (player 1 starts), with game results in the range [0.0, 1.0]."""
      
          rootnode = Node(state = rootstate)
      
          for i in range(itermax):
              node = rootnode
              state = rootstate.Clone()
      
              # Select
              while node.untriedMoves == [] and node.childNodes != []: # node is fully expanded and non-terminal
                  node = node.UCTSelectChild()
                  state.DoMove(node.move)
      
              # Expand
              if node.untriedMoves != []: # if we can expand (i.e. state/node is non-terminal)
                  m = random.choice(node.untriedMoves) 
                  state.DoMove(m)
                  node = node.AddChild(m,state) # add child and descend tree
      
              # Rollout - this can often be made orders of magnitude quicker using a state.GetRandomMove() function
              while state.GetMoves() != []: # while state is non-terminal
                  state.DoMove(random.choice(state.GetMoves()))
      
              # Backpropagate
              while node != None: # backpropagate from the expanded node and work back to the root node
                  node.Update(state.GetResult(node.playerJustMoved)) # state is terminal. Update node with result from POV of node.playerJustMoved
                  node = node.parentNode
      
          # Output some information about the tree - can be omitted
          if (verbose): print (rootnode.TreeToString(0))
          else: print (rootnode.ChildrenToString())
      
          return sorted(rootnode.childNodes, key = lambda c: c.visits)[-1].move # return the move that was most visited
      
      def UCTPlayGame():
          """ Play a sample game between two UCT players where each player gets a different number 
              of UCT iterations (= simulations = tree nodes).
          """
          # state = OthelloState(4) # uncomment to play Othello on a square board of the given size
          state = OXOState() # uncomment to play OXO
          # state = NimState(15) # uncomment to play Nim with the given number of starting chips
          while (state.GetMoves() != []):
              print (str(state))
              if state.playerJustMoved == 1:
                  m = UCT(rootstate = state, itermax = 1000, verbose = False) # play with values for itermax and verbose = True
              else:
                  m = UCT(rootstate = state, itermax = 100, verbose = False)
              print ("Best Move: " + str(m) + "\n")
              state.DoMove(m)
          if state.GetResult(state.playerJustMoved) == 1.0:
              print ("Player " + str(state.playerJustMoved) + " wins!")
          elif state.GetResult(state.playerJustMoved) == 0.0:
              print ("Player " + str(3 - state.playerJustMoved) + " wins!")
          else: print ("Nobody wins!")
      
      if __name__ == "__main__":
          """ Play a single game to the end using UCT for both players. 
          """
          UCTPlayGame()
      