Chapter 8 Planning and Learning with Tabular Methods


Model-based methods rely on planning: they take a model as input and produce or improve a policy by interacting with the simulated environment.
Model-free methods rely on learning: they take real experience from the environment as input.

This chapter combines the two approaches.


Integrated Architectures
8.1 Models and Planning

Distribution model: a model that describes every possible outcome together with its probability.
Sample model: a model that produces a single outcome, sampled according to its probability.

The MDP assumed in dynamic programming is a distribution model, while the model used in the blackjack example is a sample model.

In either case, the model is used to simulate the environment and produce simulated experience.
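A minimal sketch of the distinction, assuming a tiny hand-written MDP (all names here are illustrative, not from the book):

```python
import random

# A distribution model: for each (state, action), every possible (next_state, reward) with its probability.
distribution_model = {
    ("s0", "a"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
}

def sample_model(state, action):
    """A sample model built on top of the distribution model: return one (next_state, reward)."""
    outcomes = distribution_model[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, s_next, r = random.choices(outcomes, weights=probs)[0]
    return s_next, r
```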

Planning is any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment:
planning


The following is based on David Silver's reinforcement learning lecture slides.

Model in detail

  • A model $\mathcal{M}$ is a representation of an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, parameterized by $\eta$.
  • Assume the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are known.
  • A model $\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ therefore represents the state transitions $\mathcal{P}_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$:

    $$S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t), \qquad R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$$

  • It is also assumed that state transitions and rewards are conditionally independent:

    $$\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\,\mathbb{P}[R_{t+1} \mid S_t, A_t]$$

The goal of model learning is to estimate the model $\mathcal{M}_\eta$ from experience; this is a supervised learning problem.

  • Learning $s, a \to r$ is a regression problem.
  • Learning $s, a \to s'$ is a density estimation problem.
  • Pick a loss function, e.g. mean-squared error or KL divergence.
  • Find the parameters $\eta$ that minimize the empirical loss.

Examples of models:
- Table Lookup Model (the model used in the book and in the survey below)
- Linear Expectation Model
- Linear Gaussian Model
- Gaussian Process Model
- Deep Belief Network Model

A brief look at the Table Lookup Model
The model is an explicit MDP $\langle \hat{\mathcal{P}}, \hat{\mathcal{R}} \rangle$, where $N(s,a)$ is the total number of visits to the pair $(s,a)$:

$$\hat{\mathcal{P}}^a_{s,s'} = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s'), \qquad \hat{\mathcal{R}}^a_s = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t = s, a)\, R_t
$$
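A minimal sketch of a table-lookup model that implements these counts (the class and method names are mine, not the book's); it can also be used as a sample model:

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Estimate transition probabilities and expected rewards by counting real transitions."""
    def __init__(self):
        self.n = defaultdict(int)            # N(s, a)
        self.n_next = defaultdict(int)       # N(s, a, s')
        self.r_sum = defaultdict(float)      # sum of rewards observed for (s, a)

    def update(self, s, a, r, s_next):
        self.n[(s, a)] += 1
        self.n_next[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r

    def p_hat(self, s, a, s_next):
        return self.n_next[(s, a, s_next)] / self.n[(s, a)]

    def r_hat(self, s, a):
        return self.r_sum[(s, a)] / self.n[(s, a)]

    def sample(self, s, a):
        """Use the table as a sample model: draw s' according to P-hat, return the estimated reward."""
        outcomes = [(sp, c) for (ss, aa, sp), c in self.n_next.items() if (ss, aa) == (s, a)]
        states, weights = zip(*outcomes)
        return random.choices(states, weights=weights)[0], self.r_hat(s, a)
```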


Planning in detail

Planning with a Model

  • Given a model $\mathcal{M}_\eta = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
  • Solve the MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
  • Using any planning algorithm:
    • Value iteration
    • Policy iteration
    • Tree search (the MCTS discussed in the book and below)

Sample-based planning draws simulated experience from the model and applies model-free RL to it, e.g. Monte Carlo control, Sarsa, or Q-learning.


Integrating Learning and Planning


The approach emphasized in the book is state-space planning: searching the state space for an optimal policy, or for an optimal path to a goal.

The methods discussed next share two basic ideas:

  1. All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy.
  2. They compute value functions by applying updates or backup operations to simulated experience.

basic idea planning

Random-sample one-step tabular Q-planning
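A minimal sketch of random-sample one-step tabular Q-planning, assuming a sample model with a `sample(s, a)` method such as the table-lookup model above (the parameter values are illustrative):

```python
import random

def q_planning_step(Q, model, states, actions, alpha=0.1, gamma=0.95):
    """One planning update: pick (s, a) at random, query the model, apply a Q-learning backup."""
    s = random.choice(states)
    a = random.choice(actions)
    s_next, r = model.sample(s, a)
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```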

8.2 Dyna: Integrated Planning, Acting, and Learning

A reinforcement learning architecture that combines the two:
policy summarized
Direct RL: learn and improve the policy directly from real experience with the environment.
Indirect RL: use the experience to learn a model, then feed the model to planning, which learns and improves the policy.
The general Dyna Architecture
In Dyna-Q, acting, model learning, and direct RL each require relatively little computation; the bulk of the time in each step is consumed by the planning process.
Tabular Dyna-Q
Steps (d), (e), and (f) perform direct reinforcement learning, model learning, and planning, respectively.

The model is first learned from real experience, and planning then uses the model to further update the Q-values, as sketched below.
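A minimal sketch of one time step of Tabular Dyna-Q, assuming an `env.step(s, a)` interface and a deterministic table model mapping `(s, a)` to `(r, s')` (these names and the parameter values are mine, not the book's):

```python
import random

def dyna_q_step(Q, model, env, s, actions, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    # (a)-(c): choose an action epsilon-greedily and take it in the real environment
    a = random.choice(actions) if random.random() < eps else max(actions, key=lambda x: Q[(s, x)])
    s_next, r = env.step(s, a)
    # (d) direct RL: Q-learning update from the real transition
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)])
    # (e) model learning: assuming a deterministic environment, just record the transition
    model[(s, a)] = (r, s_next)
    # (f) planning: n updates from randomly chosen, previously observed state-action pairs
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, a2)] for a2 in actions) - Q[(ps, pa)])
    return s_next
```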

An example of the effect of planning:
Dyna-Q example
The more planning steps per real step, the faster the agent converges.

8.3 When the Model Is Wrong

So far we have assumed the environment does not change. If it does change, the model becomes wrong, and the policy derived from it will generally not explore the locations that have changed.
Dyna-Q+ example
A heuristic method for this problem is implemented by Dyna-Q+:

give transitions that have not been tried for a long time an extra exploration bonus.

A practical way to do this:

In particular, if the modeled reward for a transition is $r$, and the transition has not been tried in $\tau$ time steps, then planning updates are done as if that transition produced a reward of $r + \kappa\sqrt{\tau}$, for some small $\kappa$.
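A minimal sketch of that bonus inside the planning loop, assuming the agent also records the last real time step at which each pair was tried (the `last_tried` table is my own illustration):

```python
import math

def planning_reward(r, t, last_tried, s, a, kappa=1e-3):
    """Dyna-Q+ bonus: pretend long-untried transitions pay a little extra reward."""
    tau = t - last_tried.get((s, a), 0)   # time steps since (s, a) was last tried for real
    return r + kappa * math.sqrt(tau)
```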

8.4 Prioritized Sweeping

Until now, planning sampled state-action pairs from the model uniformly at random, which is inefficient: the sizes of the value changes are far from uniform, so the pairs whose updates would change the most should be processed first, using a queue.

A queue is maintained of every state-action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change.

Prioritized sweeping for a deterministic environment
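A minimal sketch of that planning loop for the deterministic table model used above, assuming a `predecessors(s)` helper that returns previously observed pairs predicted to lead to $s$ (helper names and parameters are mine):

```python
import heapq

def prioritized_sweeping(Q, model, predecessors, actions, n_updates, alpha=0.1, gamma=0.95, theta=1e-4):
    pq = []  # max-priority queue implemented as a min-heap of (-priority, (s, a))
    def push(s, a):
        r, s_next = model[(s, a)]
        p = abs(r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)])
        if p > theta:
            heapq.heappush(pq, (-p, (s, a)))
    for (s, a) in model:                  # seed the queue with all known pairs
        push(s, a)
    for _ in range(n_updates):
        if not pq:
            break
        _, (s, a) = heapq.heappop(pq)
        r, s_next = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)])
        for (sb, ab) in predecessors(s):  # predecessors of s may now need updating too
            push(sb, ab)
    return Q
```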

8.5 Expected vs. Sample Updates

Viewed as one-step updates, the methods so far vary along three dimensions (a fourth is introduced later):

  1. whether they update state values or action values;
  2. whether they estimate the value for the optimal policy or for an arbitrary given policy (i.e. $q_*$, $v_*$, $q_\pi$, $v_\pi$);
  3. whether the updates are expected updates, considering all possible events that might happen, or sample updates, considering a single sample of what might happen.

Backup diagrams for all the one-step updates considered in this book.

Expected update: its correctness is limited only by the accuracy of the successor action values,

$$Q(s,a) \leftarrow \sum_{s', r} \hat{p}(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} Q(s', a')\bigr]$$

Sample update: additionally affected by sampling error, but much cheaper to compute,

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,\bigl[R + \gamma \max_{a'} Q(S', a') - Q(s,a)\bigr]$$

Experiments show that on states with many possible successors (a large branching factor), many sample updates do better than a single expected update for the same amount of computation, as illustrated in the sketch below.
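A minimal sketch contrasting the two updates, assuming a distribution model exposed as a function returning (probability, reward, next state) triples and a sample model as above (both interfaces are my own illustration):

```python
def expected_update(Q, dist_model, s, a, actions, gamma=0.95):
    """Back up over every possible outcome, weighted by its probability."""
    Q[(s, a)] = sum(p * (r + gamma * max(Q[(sp, a2)] for a2 in actions))
                    for p, r, sp in dist_model(s, a))

def sample_update(Q, sample_model, s, a, actions, alpha=0.1, gamma=0.95):
    """Back up from a single sampled outcome; cheaper, but subject to sampling error."""
    sp, r = sample_model.sample(s, a)
    Q[(s, a)] += alpha * (r + gamma * max(Q[(sp, a2)] for a2 in actions) - Q[(s, a)])
```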

8.6 Trajectory Sampling

In other words, one simulates explicit individual trajectories and performs updates at the state or state-action pairs encountered along the way. We call this way of generating experience and updates trajectory sampling.

Compare the on-policy and uniform distributions of updates: starting from the initial state $S$, each state has $b$ possible next states. On-policy sampling updates the states encountered while following the current policy; uniform sampling picks each state with equal probability and updates its value.
In all cases, on-policy sampling performs better in the short term, but in the long run uniform sampling does better.

8.7 Real-time Dynamic Programming

RTDP updates the values of states visited in actual or simulated trajectories by means of expected tabular value-iteration updates.
The update is the standard value-iteration update:

$$v_{k+1}(s) \doteq \max_a \mathbb{E}\bigl[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a\bigr] = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_k(s')\bigr]$$
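A minimal sketch of one RTDP trajectory, assuming the distribution-model and sample-model interfaces used above plus a goal test (all hypothetical):

```python
def rtdp_trajectory(V, dist_model, sample_model, s0, actions, is_goal, gamma=0.95, max_steps=1000):
    """Follow a greedy trajectory, applying an expected value-iteration update to each visited state."""
    s = s0
    for _ in range(max_steps):
        if is_goal(s):
            break
        backups = {a: sum(p * (r + gamma * V[sp]) for p, r, sp in dist_model(s, a)) for a in actions}
        V[s] = max(backups.values())          # expected update at the current state
        a = max(backups, key=backups.get)     # act greedily with respect to the updated values
        s, _ = sample_model.sample(s, a)      # move to a (possibly stochastic) successor
    return V
```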

8.8 Planning at Decision Time

Background planning: planning is not focused on the current state.

Decision-time planning: begin in the current state $S_t$ and select an action $A_t$, then do the same from $S_{t+1}$ to select $A_{t+1}$, and so on until a terminal state. More generally, planning used in this way can look much deeper than one step ahead and evaluate action choices leading to many different predicted states and reward trajectories. Unlike background planning, the planning here is focused on a particular state.

Decision-time planning is most useful in applications in which fast responses are not required.

8.9 Heuristic Search

The classical state-space planning methods of artificial intelligence are decision-time planning methods collectively known as heuristic search.

With a perfect model and an imperfect action-value function, deeper search will in fact usually yield better policies.

The backgammon example shows that even looking ahead just a few moves makes play dramatically stronger.

A human player wants their evaluations to be accurate, but cannot keep track of all the states the way a computer can; this is why heuristic search is so effective.

By concentrating computation on the relevant candidate states, decision-time planning can produce better policies than updating states at random.
Heuristic search

8.10 Rollout Algorithms

Rollout algorithms are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state. For a given rollout policy, they estimate action values by averaging the returns of the simulated trajectories.

The action with the highest estimate is selected and executed, the resulting next state becomes the current state, and the procedure repeats.

Rollout algorithms produce Monte Carlo estimates of action values only for the current state, under the given rollout policy, as in the sketch below.
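A minimal sketch of one rollout decision, assuming a sample model, a rollout policy, and a terminal-state test (all hypothetical helpers; `n_sims` is illustrative):

```python
def rollout_action(sample_model, s0, actions, rollout_policy, is_terminal, n_sims=100, gamma=1.0):
    """Estimate each action's value by averaging returns of simulated trajectories from s0."""
    def simulate(s, a):
        ret, discount = 0.0, 1.0
        while not is_terminal(s):
            s, r = sample_model.sample(s, a)
            ret += discount * r
            discount *= gamma
            a = rollout_policy(s)
        return ret
    values = {a: sum(simulate(s0, a) for _ in range(n_sims)) / n_sims for a in actions}
    return max(values, key=values.get)
```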

Reference: A Survey of Monte Carlo Tree Search Methods

Monte Carlo Tree Search (MCTS)

Monte Carlo tree search (MCTS) is a method for finding optimal decisions in a given domain by taking random samples in the decision space and building a search tree according to the results.

Monte Carlo Methods

$$Q(s,a) = \frac{1}{N(s,a)} \sum_{i=1}^{N(s)} \mathbb{1}_i(s,a)\, z_i$$

where $N(s,a)$ is the number of times action $a$ has been selected from state $s$, $N(s)$ is the number of times a game has been played out through state $s$, $z_i$ is the result of the $i$-th simulation played out from $s$, and $\mathbb{1}_i(s,a)$ is 1 if action $a$ was selected from state $s$ on the $i$-th playout.

For many reward distributions, there is no policy whose regret grows more slowly than $O(\ln n)$.

Upper Confidence Bounds (UCBs)

The simplest of these is called UCB1.
For a multi-armed bandit, choose the arm $j$ that maximizes

$$\mathrm{UCB1} = \bar{X}_j + \sqrt{\frac{2 \ln n}{n_j}}$$

where $\bar{X}_j$ is the average reward from arm $j$, $n_j$ is the number of times arm $j$ has been played, and $n$ is the overall number of plays so far.
The $\bar{X}_j$ term encourages exploitation, while the $\sqrt{2\ln n / n_j}$ term encourages exploration, as in the sketch below.
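A minimal sketch of UCB1 arm selection (the helper and its arguments are mine; each arm is played once before the bound is used):

```python
import math

def ucb1_select(counts, totals, n):
    """counts[j]: pulls of arm j; totals[j]: summed reward from arm j; n: total pulls so far."""
    for j, c in enumerate(counts):
        if c == 0:                       # play every arm once before applying the formula
            return j
    return max(range(len(counts)),
               key=lambda j: totals[j] / counts[j] + math.sqrt(2 * math.log(n) / counts[j]))
```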

A. Algorithm

The basic algorithm involves iteratively building a search tree until some predefined computational budget (typically a time, memory, or iteration constraint) is reached, at which point the search is halted and the best performing root action returned.

1. Selection: Starting at the root node, a child selection policy is recursively applied to descend through the tree until the most urgent expandable node is reached. A node is expandable if it represents a nonterminal state and has unvisited (i.e., unexpanded) children.
2. Expansion: One (or more) child nodes are added to expand the tree, according to the available actions.
3. Simulation: A simulation is run from the new node(s) according to the default policy to produce an outcome.
4. Backpropagation: The simulation result is “backed up” (i.e., backpropagated) through the selected nodes to update their statistics.

These steps can be grouped into two policies:

1. Tree policy: Select or create a leaf node from the nodes already contained within the search tree (selection and expansion).
2. Default policy: Play out the domain from a given nonterminal state to produce a value estimate (simulation).

General MCTS approach
The result of the overall search $a(\mathrm{BESTCHILD}(v_0))$ is the action $a$ that leads to the best child of the root node $v_0$.

One iteration of the general MCTS approach

Schadd describes four criteria for selecting the winning action, based on the work of Chaslot et al.:

1. max child: select the root child with the highest reward;
2. robust child: select the most visited root child;
3. max-robust child: select the root child with both the highest visit count and the highest reward; if none exists, then continue searching until an acceptable visit count is achieved;
4. secure child: select the child which maximizes a lower confidence bound.

B. Development

Plain Monte Carlo action selection comes with no game-theoretic guarantee: even after running for a long time, the action finally chosen may not be optimal.
MCTS adds a tree-based selection mechanism; using a method such as UCB1 to balance exploitation and exploration drives the probability of choosing a suboptimal action down quickly.

C. Upper Confidence Bounds for Trees (UCT)

1) The UCT Algorithm: The goal of MCTS is to approximate the true game-theoretic values of the actions that may be selected from the current state. UCB1 can be used as the tree policy.
UCB1 is simple and is guaranteed to be within a constant factor of the best possible bound on the growth of regret.

Every time a node (action) is to be selected within the existing tree, the choice may be modeled as an independent multiarmed bandit problem. A child node $j$ is selected to maximize

$$\mathrm{UCT} = \bar{X}_j + 2 C_p \sqrt{\frac{2 \ln n}{n_j}}$$

where $n$ is the number of times the current (parent) node has been visited, $n_j$ is the number of times child $j$ has been visited, and $C_p > 0$ is a constant.

The constant $C_p$ can be adjusted to lower or increase the amount of exploration; Kocsis and Szepesvári used $C_p = 1/\sqrt{2}$ in their paper.

For the rest of the iteration: if the node selected by the UCB descent has children that are not yet part of the tree, one of them is chosen at random and added to the tree. The default policy is then followed from this new node until a terminal state is reached; in the simplest case the default policy is uniformly random. Finally, the value $\Delta$ of the terminal state $s_T$ is backpropagated, updating every node visited during this iteration, from the newly added node up to the root.
Each node stores two values: the number of times it has been visited, $N(v)$, and the total reward of all playouts that passed through it, $Q(v)$. The ratio $Q(v)/N(v)$ approximates the node's game-theoretic value.

UCT1
UCT2


Algorithm 3 shows an alternative and more efficient backup method for two-player, zero-sum games with alternating moves, which is typically found in the literature.
UCT3
UCT_3_2

Given enough time and memory, UCT allows MCTS to converge to the minimax tree, which is then optimal.


Below is example UCT code, slightly modified to run on Python 3. It includes three games and comes from
www.mcts.ai

      # This is a very simple implementation of the UCT Monte Carlo Tree Search algorithm in Python 2.7.
      # The function UCT(rootstate, itermax, verbose = False) is towards the bottom of the code.
      # It aims to have the clearest and simplest possible code, and for the sake of clarity, the code
      # is orders of magnitude less efficient than it could be made, particularly by using a 
      # state.GetRandomMove() or state.DoRandomRollout() function.
      # 
      # Example GameState classes for Nim, OXO and Othello are included to give some idea of how you
      # can write your own GameState use UCT in your 2-player game. Change the game to be played in 
      # the UCTPlayGame() function at the bottom of the code.
      # 
      # Written by Peter Cowling, Ed Powley, Daniel Whitehouse (University of York, UK) September 2012.
      # 
      # Licence is granted to freely use and distribute for any sensible/legal purpose so long as this comment
      # remains in any distributed code.
      # 
      # For more information about Monte Carlo Tree Search check out our web site at www.mcts.ai
      
      from math import *
      import random
      
      class GameState:
          """ A state of the game, i.e. the game board. These are the only functions which are
              absolutely necessary to implement UCT in any 2-player complete information deterministic 
              zero-sum game, although they can be enhanced and made quicker, for example by using a 
              GetRandomMove() function to generate a random move during rollout.
              By convention the players are numbered 1 and 2.
          """
          def __init__(self):
                  self.playerJustMoved = 2 # At the root pretend the player just moved is player 2 - player 1 has the first move
      
          def Clone(self):
              """ Create a deep clone of this game state.
              """
              st = GameState()
              st.playerJustMoved = self.playerJustMoved
              return st
      
          def DoMove(self, move):
              """ Update a state by carrying out the given move.
                  Must update playerJustMoved.
              """
              self.playerJustMoved = 3 - self.playerJustMoved
      
          def GetMoves(self):
              """ Get all possible moves from this state.
              """
      
          def GetResult(self, playerjm):
              """ Get the game result from the viewpoint of playerjm. 
              """
      
          def __repr__(self):
              """ Don't need this - but good style.
              """
              pass
      
      
      class NimState:
          """ A state of the game Nim. In Nim, players alternately take 1,2 or 3 chips with the 
              winner being the player to take the last chip. 
              In Nim any initial state of the form 4n+k for k = 1,2,3 is a win for player 1
              (by choosing k) chips.
              Any initial state of the form 4n is a win for player 2.
          """
          def __init__(self, ch):
              self.playerJustMoved = 2 # At the root pretend the player just moved is p2 - p1 has the first move
              self.chips = ch
      
          def Clone(self):
              """ Create a deep clone of this game state.
              """
              st = NimState(self.chips)
              st.playerJustMoved = self.playerJustMoved
              return st
      
          def DoMove(self, move):
              """ Update a state by carrying out the given move.
                  Must update playerJustMoved.
              """
              assert move >= 1 and move <= 3 and move == int(move)
              self.chips -= move
              self.playerJustMoved = 3 - self.playerJustMoved
      
          def GetMoves(self):
              """ Get all possible moves from this state.
              """
              return [i for i in range(1,min([4, self.chips + 1]))]
      
          def GetResult(self, playerjm):
              """ Get the game result from the viewpoint of playerjm. 
              """
              assert self.chips == 0
              if self.playerJustMoved == playerjm:
                  return 1.0 # playerjm took the last chip and has won
              else:
                  return 0.0 # playerjm's opponent took the last chip and has won
      
          def __repr__(self):
              s = "Chips:" + str(self.chips) + " JustPlayed:" + str(self.playerJustMoved)
              return s
      
      class OXOState:
          """ A state of the game, i.e. the game board.
              Squares in the board are in this arrangement
              012
              345
              678
              where 0 = empty, 1 = player 1 (X), 2 = player 2 (O)
          """
          def __init__(self):
              self.playerJustMoved = 2 # At the root pretend the player just moved is p2 - p1 has the first move
              self.board = [0,0,0,0,0,0,0,0,0] # 0 = empty, 1 = player 1, 2 = player 2
      
          def Clone(self):
              """ Create a deep clone of this game state.
              """
              st = OXOState()
              st.playerJustMoved = self.playerJustMoved
              st.board = self.board[:]
              return st
      
          def DoMove(self, move):
              """ Update a state by carrying out the given move.
                  Must update playerToMove.
              """
              assert move >= 0 and move <= 8 and move == int(move) and self.board[move] == 0
              self.playerJustMoved = 3 - self.playerJustMoved
              self.board[move] = self.playerJustMoved
      
          def GetMoves(self):
              """ Get all possible moves from this state.
              """
              return [i for i in range(9) if self.board[i] == 0]
      
          def GetResult(self, playerjm):
              """ Get the game result from the viewpoint of playerjm. 
              """
              for (x,y,z) in [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]:
                  if self.board[x] == self.board[y] == self.board[z]:
                      if self.board[x] == playerjm:
                          return 1.0
                      else:
                          return 0.0
              if self.GetMoves() == []: return 0.5 # draw
              assert False # Should not be possible to get here
      
          def __repr__(self):
              s= ""
              for i in range(9): 
                  s += ".XO"[self.board[i]]
                  if i % 3 == 2: s += "\n"
              return s
      
      class OthelloState:
          """ A state of the game of Othello, i.e. the game board.
              The board is a 2D array where 0 = empty (.), 1 = player 1 (X), 2 = player 2 (O).
              In Othello players alternately place pieces on a square board - each piece played
              has to sandwich opponent pieces between the piece played and pieces already on the 
              board. Sandwiched pieces are flipped.
              This implementation modifies the rules to allow variable sized square boards and
              terminates the game as soon as the player about to move cannot make a move (whereas
              the standard game allows for a pass move). 
          """
          def __init__(self,sz = 8):
              self.playerJustMoved = 2 # At the root pretend the player just moved is p2 - p1 has the first move
              self.board = [] # 0 = empty, 1 = player 1, 2 = player 2
              self.size = sz
              assert sz == int(sz) and sz % 2 == 0 # size must be integral and even
              for y in range(sz):
                  self.board.append([0]*sz)
              self.board[sz//2][sz//2] = self.board[sz//2-1][sz//2-1] = 1
              self.board[sz//2][sz//2-1] = self.board[sz//2-1][sz//2] = 2
      
          def Clone(self):
              """ Create a deep clone of this game state.
              """
              st = OthelloState()
              st.playerJustMoved = self.playerJustMoved
              st.board = [self.board[i][:] for i in range(self.size)]
              st.size = self.size
              return st
      
          def DoMove(self, move):
              """ Update a state by carrying out the given move.
                  Must update playerToMove.
              """
              (x,y)=(move[0],move[1])
              assert x == int(x) and y == int(y) and self.IsOnBoard(x,y) and self.board[x][y] == 0
              m = self.GetAllSandwichedCounters(x,y)
              self.playerJustMoved = 3 - self.playerJustMoved
              self.board[x][y] = self.playerJustMoved
              for (a,b) in m:
                  self.board[a][b] = self.playerJustMoved
      
          def GetMoves(self):
              """ Get all possible moves from this state.
              """
              return [(x,y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == 0 and self.ExistsSandwichedCounter(x,y)]
      
          def AdjacentToEnemy(self,x,y):
              """ Speeds up GetMoves by only considering squares which are adjacent to an enemy-occupied square.
              """
              for (dx,dy) in [(0,+1),(+1,+1),(+1,0),(+1,-1),(0,-1),(-1,-1),(-1,0),(-1,+1)]:
                  if self.IsOnBoard(x+dx,y+dy) and self.board[x+dx][y+dy] == self.playerJustMoved:
                      return True
              return False
      
          def AdjacentEnemyDirections(self,x,y):
              """ Speeds up GetMoves by only considering squares which are adjacent to an enemy-occupied square.
              """
              es = []
              for (dx,dy) in [(0,+1),(+1,+1),(+1,0),(+1,-1),(0,-1),(-1,-1),(-1,0),(-1,+1)]:
                  if self.IsOnBoard(x+dx,y+dy) and self.board[x+dx][y+dy] == self.playerJustMoved:
                      es.append((dx,dy))
              return es
      
          def ExistsSandwichedCounter(self,x,y):
              """ Does there exist at least one counter which would be flipped if my counter was placed at (x,y)?
              """
              for (dx,dy) in self.AdjacentEnemyDirections(x,y):
                  if len(self.SandwichedCounters(x,y,dx,dy)) > 0:
                      return True
              return False
      
          def GetAllSandwichedCounters(self, x, y):
              """ Is (x,y) a possible move (i.e. opponent counters are sandwiched between (x,y) and my counter in some direction)?
              """
              sandwiched = []
              for (dx,dy) in self.AdjacentEnemyDirections(x,y):
                  sandwiched.extend(self.SandwichedCounters(x,y,dx,dy))
              return sandwiched
      
          def SandwichedCounters(self, x, y, dx, dy):
              """ Return the coordinates of all opponent counters sandwiched between (x,y) and my counter.
              """
              x += dx
              y += dy
              sandwiched = []
              while self.IsOnBoard(x,y) and self.board[x][y] == self.playerJustMoved:
                  sandwiched.append((x,y))
                  x += dx
                  y += dy
              if self.IsOnBoard(x,y) and self.board[x][y] == 3 - self.playerJustMoved:
                  return sandwiched
              else:
                  return [] # nothing sandwiched
      
          def IsOnBoard(self, x, y):
              return x >= 0 and x < self.size and y >= 0 and y < self.size
      
          def GetResult(self, playerjm):
              """ Get the game result from the viewpoint of playerjm. 
              """
              jmcount = len([(x,y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == playerjm])
              notjmcount = len([(x,y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == 3 - playerjm])
              if jmcount > notjmcount: return 1.0
              elif notjmcount > jmcount: return 0.0
              else: return 0.5 # draw
      
          def __repr__(self):
              s= ""
              for y in range(self.size-1,-1,-1):
                  for x in range(self.size):
                      s += ".XO"[self.board[x][y]]
                  s += "\n"
              return s
      
      class Node:
          """ A node in the game tree. Note wins is always from the viewpoint of playerJustMoved.
              Crashes if state not specified.
          """
          def __init__(self, move = None, parent = None, state = None):
              self.move = move # the move that got us to this node - "None" for the root node
              self.parentNode = parent # "None" for the root node
              self.childNodes = []
              self.wins = 0
              self.visits = 0
              self.untriedMoves = state.GetMoves() # future child nodes
              self.playerJustMoved = state.playerJustMoved # the only part of the state that the Node needs later
      
          def UCTSelectChild(self):
              """ Use the UCB1 formula to select a child node. Often a constant UCTK is applied so we have
                  lambda c: c.wins/c.visits + UCTK * sqrt(2*log(self.visits)/c.visits to vary the amount of
                  exploration versus exploitation.
              """
              s = sorted(self.childNodes, key = lambda c: c.wins/c.visits + sqrt(2*log(self.visits)/c.visits))[-1]
              return s
      
          def AddChild(self, m, s):
              """ Remove m from untriedMoves and add a new child node for this move.
                  Return the added child node
              """
              n = Node(move = m, parent = self, state = s)
              self.untriedMoves.remove(m)
              self.childNodes.append(n)
              return n
      
          def Update(self, result):
              """ Update this node - one additional visit and result additional wins. result must be from the viewpoint of playerJustmoved.
              """
              self.visits += 1
              self.wins += result
      
          def __repr__(self):
              return "[M:" + str(self.move) + " W/V:" + str(self.wins) + "/" + str(self.visits) + " U:" + str(self.untriedMoves) + "]"
      
          def TreeToString(self, indent):
              s = self.IndentString(indent) + str(self)
              for c in self.childNodes:
                   s += c.TreeToString(indent+1)
              return s
      
          def IndentString(self,indent):
              s = "\n"
              for i in range (1,indent+1):
                  s += "| "
              return s
      
          def ChildrenToString(self):
              s = ""
              for c in self.childNodes:
                   s += str(c) + "\n"
              return s
      
      
      def UCT(rootstate, itermax, verbose = False):
          """ Conduct a UCT search for itermax iterations starting from rootstate.
              Return the best move from the rootstate.
              Assumes 2 alternating players (player 1 starts), with game results in the range [0.0, 1.0]."""
      
          rootnode = Node(state = rootstate)
      
          for i in range(itermax):
              node = rootnode
              state = rootstate.Clone()
      
              # Select
              while node.untriedMoves == [] and node.childNodes != []: # node is fully expanded and non-terminal
                  node = node.UCTSelectChild()
                  state.DoMove(node.move)
      
              # Expand
              if node.untriedMoves != []: # if we can expand (i.e. state/node is non-terminal)
                  m = random.choice(node.untriedMoves) 
                  state.DoMove(m)
                  node = node.AddChild(m,state) # add child and descend tree
      
              # Rollout - this can often be made orders of magnitude quicker using a state.GetRandomMove() function
              while state.GetMoves() != []: # while state is non-terminal
                  state.DoMove(random.choice(state.GetMoves()))
      
              # Backpropagate
              while node != None: # backpropagate from the expanded node and work back to the root node
                  node.Update(state.GetResult(node.playerJustMoved)) # state is terminal. Update node with result from POV of node.playerJustMoved
                  node = node.parentNode
      
          # Output some information about the tree - can be omitted
          if (verbose): print (rootnode.TreeToString(0))
          else: print (rootnode.ChildrenToString())
      
          return sorted(rootnode.childNodes, key = lambda c: c.visits)[-1].move # return the move that was most visited
      
      def UCTPlayGame():
          """ Play a sample game between two UCT players where each player gets a different number 
              of UCT iterations (= simulations = tree nodes).
          """
          # state = OthelloState(4) # uncomment to play Othello on a square board of the given size
          state = OXOState() # uncomment to play OXO
          # state = NimState(15) # uncomment to play Nim with the given number of starting chips
          while (state.GetMoves() != []):
              print (str(state))
              if state.playerJustMoved == 1:
                  m = UCT(rootstate = state, itermax = 1000, verbose = False) # play with values for itermax and verbose = True
              else:
                  m = UCT(rootstate = state, itermax = 100, verbose = False)
              print ("Best Move: " + str(m) + "\n")
              state.DoMove(m)
          if state.GetResult(state.playerJustMoved) == 1.0:
              print ("Player " + str(state.playerJustMoved) + " wins!")
          elif state.GetResult(state.playerJustMoved) == 0.0:
              print ("Player " + str(3 - state.playerJustMoved) + " wins!")
          else: print ("Nobody wins!")
      
      if __name__ == "__main__":
          """ Play a single game to the end using UCT for both players. 
          """
          UCTPlayGame()
      