[Book Notes, Ch. 5] Reinforcement Learning: An Introduction, 2nd Edition

Preface: for Chapters 1 and 2 click here; for Chapter 3 click here.
Note: each entry in the table of contents corresponds to a page of the PDF ("LPage" is the page number at the top-left of the book, since I later insert blank pages between pages to do the exercises, lol; e.g. LPage28 is the book page numbered 28 at the top left, and RPage29 is the book page numbered 29 at the top right). Text inside 【】 brackets is usually a question I left for myself, or a thought on how this connects to my own direction, mainly the control/theory layer of autonomous driving; anything ending in a question mark is... my question.

Last updated: 2021/01/18

Recommended reading:
1. English – PDF link
2. Chinese – official JD.com purchase link
Code references:
1. GitHub: Python code for all of the book's figures
2. GitHub: reference solutions for the book's exercises

Recap and Introduction

In the previous chapter on dynamic programming (DP), we had complete knowledge of the environment, including the probabilities governing what happens for each action in each state. This chapter on Monte Carlo methods addresses the case where we do not know the environment but do have information from previous behavior: learning from actual experience OR learning from simulated experience.
(Aside: reading further, what exactly does "complete knowledge of the environment", i.e. the environment model, refer to? The return is something we define ourselves; is it that the state reached after an action is not known for certain?)

5.1 Monte Carlo Prediction

We use Monte Carlo methods to learn the state-value function. So the missing knowledge of the environment means we do not use the MDP transition dynamics? In other words, we do not know the full picture of how states transition.
The computation: recall that the value of a state is the expected return (the expected cumulative future discounted reward) starting from that state. In plain words, the state value is the expected return: for a given state, add up the rewards from that point to the end of the episode.
Then the returns obtained for each state are accumulated separately and averaged. For example, for (N|(0,0)), going north at point (0,0): if 100 runs produce 100 returns for that state, adding them up and averaging gives the estimate.
As more returns are observed, the average should converge to the expected value. That is the core idea of Monte Carlo.

  • first-visit MC method: estimate $v_\pi(s)$ as the average of the returns following first visits to $s$
  • every-visit MC method: average the returns following every visit to $s$

In the pseudocode below, the second loop is "Loop for each step of episode"; only when the state has not appeared earlier in the episode is the return $G$ appended to $Returns(S_t)$, and averaging those returns gives the state value (a minimal sketch follows the figure).
[Figure: first-visit MC prediction pseudocode]
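As a minimal sketch of first-visit MC prediction (my own illustration, independent of the Blackjack code below; an episode is assumed to be given as a list of (S_t, R_{t+1}) pairs and gamma is the discount factor):

from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    # episodes: list of episodes, each a list of (state, reward) pairs,
    # where reward is the R_{t+1} received after visiting that state
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # sweep backwards so G accumulates the return from each time step;
        # later writes overwrite earlier ones, so each state ends up keeping
        # the return that follows its FIRST visit in the episode
        for state, reward in reversed(episode):
            G = gamma * G + reward
            first_visit_return[state] = G
        for state, G_first in first_visit_return.items():
            returns_sum[state] += G_first
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V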

Example 5.1

First, the definitions of the elements of the whole Blackjack game, e.g. the actions: hit / stand.

import numpy as np
import matplotlib
matplotlib.use('Agg') # non-interactive backend so plots are not displayed
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# actions: hit or stand
ACTION_HIT = 0
ACTION_STAND = 1  #  "stick" in the book
ACTIONS = [ACTION_HIT, ACTION_STAND]

# policy for player
POLICY_PLAYER = np.zeros(22, dtype=int)  # np.int is deprecated in recent NumPy; use the builtin int
for i in range(12, 20):
    POLICY_PLAYER[i] = ACTION_HIT
POLICY_PLAYER[20] = ACTION_STAND
POLICY_PLAYER[21] = ACTION_STAND

# function form of target policy of player
def target_policy_player(usable_ace_player, player_sum, dealer_card):
    return POLICY_PLAYER[player_sum]

# function form of behavior policy of player
def behavior_policy_player(usable_ace_player, player_sum, dealer_card):
    if np.random.binomial(1, 0.5) == 1:
        return ACTION_STAND
    return ACTION_HIT

# policy for dealer: the dealer's fixed strategy is to keep hitting and to stand at 17 or above
POLICY_DEALER = np.zeros(22)
for i in range(12, 17):
    POLICY_DEALER[i] = ACTION_HIT
for i in range(17, 22):
    POLICY_DEALER[i] = ACTION_STAND

# get a new card
def get_card():
    card = np.random.randint(1, 14)  # card id 1-13: ace (1), 2-10, and J/Q/K (11-13)
    card = min(card, 10)  # J/Q/K all count as 10; a 1 is an ace, which may be usable as 11
    return card

# get the value of a card (11 for ace).
def card_value(card_id):
    return 11 if card_id == 1 else card_id

# play a game
# @policy_player: specify policy for player
# @initial_state: [whether player has a usable Ace, sum of player's cards, one card of dealer]
# @initial_action: the initial action
def play(policy_player, initial_state=None, initial_action=None):
    # player status

    # sum of player
    player_sum = 0

    # trajectory of player
    player_trajectory = []

    # whether player uses Ace as 11
    usable_ace_player = False

    # dealer status
    dealer_card1 = 0
    dealer_card2 = 0
    usable_ace_dealer = False

    if initial_state is None:
        # generate a random initial state
        # !!!! A possible issue: by the rules both player and dealer start with two cards, but this
        # while loop means the player may receive more than two cards, e.g. after being dealt 2 and 3
        # the player keeps getting cards -- is that wrong? !!!!
        # On closer reading, the point is not that everyone holds exactly two cards; the player has
        # already started hitting, and one of the player's three state variables is his current sum
        # (12-21), so sums below 12 are simply skipped over.
        while player_sum < 12:
            # if sum of player is less than 12, always hit
            card = get_card()
            player_sum += card_value(card)

            # If the player's sum is larger than 21, he may hold one or two aces.
            if player_sum > 21:
                assert player_sum == 22
                # last card must be ace
                player_sum -= 10
            else:
                usable_ace_player |= (1 == card)
                # |= : usable_ace_player stays True if it was already True OR this card is an ace;
                # using OR prevents the flag from being cleared when a later card is not an ace
        # initialize cards of dealer, suppose dealer will show the first card he gets
        dealer_card1 = get_card()
        dealer_card2 = get_card()

    else:
        # use specified initial state
        usable_ace_player, player_sum, dealer_card1 = initial_state
        dealer_card2 = get_card()

    # initial state of the game
    state = [usable_ace_player, player_sum, dealer_card1]

    # initialize dealer's sum
    dealer_sum = card_value(dealer_card1) + card_value(dealer_card2)
    usable_ace_dealer = 1 in (dealer_card1, dealer_card2)
    # if the dealer's sum is larger than 21, he must hold two aces.
    if dealer_sum > 21:
        assert dealer_sum == 22
        # use one Ace as 1 rather than 11
        dealer_sum -= 10
    assert dealer_sum <= 21
    assert player_sum <= 21

    # game starts!

    # player's turn
    while True:
        if initial_action is not None:
            action = initial_action
            initial_action = None
        else:
            # get action based on current sum
            action = policy_player(usable_ace_player, player_sum, dealer_card1)

        # track player's trajectory for importance sampling
        player_trajectory.append([(usable_ace_player, player_sum, dealer_card1), action])

        if action == ACTION_STAND:
            break
        # if hit, get new card
        card = get_card()
        # Keep track of the ace count. the usable_ace_player flag is insufficient alone as it cannot
        # distinguish between having one ace or two.
        ace_count = int(usable_ace_player)
        if card == 1:
            ace_count += 1
        player_sum += card_value(card)
        # If the player has a usable ace, use it as 1 to avoid busting and continue.
        while player_sum > 21 and ace_count:
            player_sum -= 10
            ace_count -= 1
        # player busts
        if player_sum > 21:
            return state, -1, player_trajectory
        assert player_sum <= 21
        usable_ace_player = (ace_count == 1)

    # dealer's turn
    while True:
        # get action based on current sum
        action = POLICY_DEALER[dealer_sum]
        if action == ACTION_STAND:
            break
        # if hit, get a new card
        new_card = get_card()
        ace_count = int(usable_ace_dealer)
        if new_card == 1:
            ace_count += 1
        dealer_sum += card_value(new_card)
        # If the dealer has a usable ace, use it as 1 to avoid busting and continue.
        while dealer_sum > 21 and ace_count:
            dealer_sum -= 10
            ace_count -= 1
        # dealer busts
        if dealer_sum > 21:
            return state, 1, player_trajectory
        usable_ace_dealer = (ace_count == 1)

    # compare the sum between player and dealer
    assert player_sum <= 21 and dealer_sum <= 21
    if player_sum > dealer_sum:
        return state, 1, player_trajectory
    elif player_sum == dealer_sum:
        return state, 0, player_trajectory
    else:
        return state, -1, player_trajectory
# Monte Carlo Sample with On-Policy
def monte_carlo_on_policy(episodes):
    states_usable_ace = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by zero
    states_usable_ace_count = np.ones((10, 10))
    states_no_usable_ace = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by zero
    states_no_usable_ace_count = np.ones((10, 10))
    for i in tqdm(range(0, episodes)):
        _, reward, player_trajectory = play(target_policy_player)
        for (usable_ace, player_sum, dealer_card), _ in player_trajectory:
            # player sums start at 12, so subtract 12 to map the sum to row index 0
            player_sum -= 12
            dealer_card -= 1
            if usable_ace:
                states_usable_ace_count[player_sum, dealer_card] += 1
                states_usable_ace[player_sum, dealer_card] += reward
            else:
                states_no_usable_ace_count[player_sum, dealer_card] += 1
                states_no_usable_ace[player_sum, dealer_card] += reward
    return states_usable_ace / states_usable_ace_count, states_no_usable_ace / states_no_usable_ace_count

The parts above go into the same Python file; the following is the plotting code:

def figure_5_1():
    states_usable_ace_1, states_no_usable_ace_1 = monte_carlo_on_policy(10000)
    states_usable_ace_2, states_no_usable_ace_2 = monte_carlo_on_policy(500000)

    states = [states_usable_ace_1,
              states_usable_ace_2,
              states_no_usable_ace_1,
              states_no_usable_ace_2]

    titles = ['Usable Ace, 10000 Episodes',
              'Usable Ace, 500000 Episodes',
              'No Usable Ace, 10000 Episodes',
              'No Usable Ace, 500000 Episodes']

    _, axes = plt.subplots(2, 2, figsize=(40, 30))
    plt.subplots_adjust(wspace=0.1, hspace=0.2)
    axes = axes.flatten()

    for state, title, axis in zip(states, titles, axes):
        fig = sns.heatmap(np.flipud(state), cmap="YlGnBu", ax=axis, xticklabels=range(1, 11),
                          yticklabels=list(reversed(range(12, 22))))
        fig.set_ylabel('player sum', fontsize=30)
        fig.set_xlabel('dealer showing', fontsize=30)
        fig.set_title(title, fontsize=30)

    plt.savefig('../images/figure_5_1.png')
    plt.close()

The average state values computed under the fixed policy:
[Figure 5.1: state-value estimates under the fixed policy, with/without a usable ace, after 10,000 and 500,000 episodes]
An important point about Monte Carlo: the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state.

5.2 Monte Carlo Estimation of Action Values

In 5.1, the Blackjack example already fixed the policy: hit if the sum is below 20, otherwise stand. Under that policy each state has returns for only one action, so we are only judging the value of that single action in that state and never learn anything better from the other actions. To compare action values within the same state, we need to estimate the values of all actions in that state, rather than follow one particular policy.
This raises another problem: what if some actions are never visited? (Many state-action pairs may never be visited.) This is the general problem of maintaining exploration. The fix feels a bit like mutation in genetic algorithms: guarantee that every state-action pair has a nonzero probability of being selected. See 5.3, Monte Carlo ES (a minimal sketch of the nonzero-probability idea follows).
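A minimal sketch of the "nonzero selection probability" idea, using an ε-greedy choice over estimated action values (this is the ε-soft alternative the book discusses, not the exploring-starts approach used in 5.3; `q_values` is a hypothetical array of the current estimates for one state):

import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1):
    # with probability epsilon choose uniformly at random, so every action
    # keeps a selection probability of at least epsilon / len(q_values)
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    # otherwise act greedily with respect to the current estimates
    return int(np.argmax(q_values))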

5.3 Monte Carlo Control

The first half of this section argues that the policy converges. What differs from 5.1 is that here we use ES (exploring starts) to learn the policy itself, rather than evaluating the state values of a fixed policy. So how is the policy learned?
[Figure: Monte Carlo ES (Exploring Starts) pseudocode]

Example 5.3 Solving Blackjack

Unlike 5.1, this example learns the policy, obtaining the optimal policy and optimal value, instead of acting under a fixed policy.
Monte Carlo ES code:

# Monte Carlo with Exploring Starts
def monte_carlo_es(episodes):
    # (playerSum, dealerCard, usableAce, action)
    state_action_values = np.zeros((10, 10, 2, 2))
    # initialize counts to 1 to avoid division by 0
    state_action_pair_count = np.ones((10, 10, 2, 2))

    # behavior policy is greedy
    def behavior_policy(usable_ace, player_sum, dealer_card):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        # get argmax of the average returns(s, a)
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :] / \
                  state_action_pair_count[player_sum, dealer_card, usable_ace, :]
        return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])

    # play for several episodes
    for episode in tqdm(range(episodes)):
        # for each episode, use a randomly initialized state and action
        initial_state = [bool(np.random.choice([0, 1])),
                       np.random.choice(range(12, 22)),
                       np.random.choice(range(1, 11))]
        initial_action = np.random.choice(ACTIONS)
        # the first episode uses the initial target policy; all later episodes use behavior_policy,
        # which is greedy with respect to the current estimates
        current_policy = behavior_policy if episode else target_policy_player
        _, reward, trajectory = play(current_policy, initial_state, initial_action)
        first_visit_check = set()
        for (usable_ace, player_sum, dealer_card), action in trajectory:
            usable_ace = int(usable_ace)
            player_sum -= 12
            dealer_card -= 1
            state_action = (usable_ace, player_sum, dealer_card, action)
            if state_action in first_visit_check:
                continue
            first_visit_check.add(state_action)
            # update values of state-action pairs
            state_action_values[player_sum, dealer_card, usable_ace, action] += reward
            state_action_pair_count[player_sum, dealer_card, usable_ace, action] += 1

    return state_action_values / state_action_pair_count

This draws the whole of Figure 5.2:

def figure_5_2():
    state_action_values = monte_carlo_es(500000)

    state_value_no_usable_ace = np.max(state_action_values[:, :, 0, :], axis=-1)
    state_value_usable_ace = np.max(state_action_values[:, :, 1, :], axis=-1)

    # get the optimal policy
    action_no_usable_ace = np.argmax(state_action_values[:, :, 0, :], axis=-1)
    action_usable_ace = np.argmax(state_action_values[:, :, 1, :], axis=-1)

    images = [action_usable_ace,
              state_value_usable_ace,
              action_no_usable_ace,
              state_value_no_usable_ace]

    titles = ['Optimal policy with usable Ace',
              'Optimal value with usable Ace',
              'Optimal policy without usable Ace',
              'Optimal value without usable Ace']

    _, axes = plt.subplots(2, 2, figsize=(40, 30))
    plt.subplots_adjust(wspace=0.1, hspace=0.2)
    axes = axes.flatten()

    for image, title, axis in zip(images, titles, axes):
        fig = sns.heatmap(np.flipud(image), cmap="YlGnBu", ax=axis, xticklabels=range(1, 11),
                          yticklabels=list(reversed(range(12, 22))))
        fig.set_ylabel('player sum', fontsize=30)
        fig.set_xlabel('dealer showing', fontsize=30)
        fig.set_title(title, fontsize=30)

    plt.savefig('../images/figure_5_2.png')
    plt.close()

5.5 Off-policy Prediction via Importance Sampling

First, understand the importance sampling ratio:
$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k})}{\prod_{k=t}^{T-1} b(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k})} = \prod_{k=t}^{T-1} \frac{\pi(A_{k} \mid S_{k})}{b(A_{k} \mid S_{k})}$$
In other words, an episode only contributes when every action the behavior policy took is one the target policy could also take; if the target policy assigns probability 0 to any of those actions, i.e. $\pi(A_k \mid S_k) = 0$, the whole product becomes 0.
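To make the ratio concrete, a small sketch of my own (not the book's or the repo's code); `target_prob(s, a)` and `behavior_prob(s, a)` are hypothetical callables returning $\pi(a \mid s)$ and $b(a \mid s)$:

def importance_sampling_ratio(trajectory, target_prob, behavior_prob):
    # trajectory: list of (state, action) pairs taken under the behavior policy b
    rho = 1.0
    for state, action in trajectory:
        pi_a = target_prob(state, action)
        b_a = behavior_prob(state, action)
        if pi_a == 0.0:
            return 0.0  # target policy would never take this action: whole product is 0
        rho *= pi_a / b_a  # the transition probabilities p(s'|s,a) cancel out
    return rho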

Example 5.4 Off-policy Estimation of a Blackjack State Value

In Blackjack there are only two actions, hit and stand, and the behavior policy here picks between them at random, so $b(A_k \mid S_k) = 0.5$ for whichever action it takes.

# Monte Carlo Sample with Off-Policy
def monte_carlo_off_policy(episodes):
    initial_state = [True, 13, 2]

    rhos = []
    returns = []

    for i in range(0, episodes):
        _, reward, player_trajectory = play(behavior_policy_player, initial_state=initial_state)

        # get the importance ratio
        numerator = 1.0
        denominator = 1.0
        for (usable_ace, player_sum, dealer_card), action in player_trajectory:
            if action == target_policy_player(usable_ace, player_sum, dealer_card):
                denominator *= 0.5
            else:
                numerator = 0.0
                break
        rho = numerator / denominator
        rhos.append(rho)
        returns.append(reward)

    rhos = np.asarray(rhos)
    returns = np.asarray(returns)
    weighted_returns = rhos * returns

    weighted_returns = np.add.accumulate(weighted_returns)
    rhos = np.add.accumulate(rhos)

    ordinary_sampling = weighted_returns / np.arange(1, episodes + 1)

    with np.errstate(divide='ignore',invalid='ignore'):
        weighted_sampling = np.where(rhos != 0, weighted_returns / rhos, 0)

    return ordinary_sampling, weighted_sampling

This evaluates the state in which the dealer is showing a deuce, the player's current sum is 13, and the player has a usable ace (i.e. an ace and a 2). The value of this state under the target policy is approximately -0.27726, so we have a true value against which to measure the squared error.

def figure_5_3():
    true_value = -0.27726  # stated in the book
    episodes = 10000
    runs = 100
    error_ordinary = np.zeros(episodes)
    error_weighted = np.zeros(episodes)
    for i in tqdm(range(0, runs)):
        ordinary_sampling_, weighted_sampling_ = monte_carlo_off_policy(episodes)
        # get the squared error
        error_ordinary += np.power(ordinary_sampling_ - true_value, 2)
        error_weighted += np.power(weighted_sampling_ - true_value, 2)
    error_ordinary /= runs
    error_weighted /= runs

    plt.plot(error_ordinary, label='Ordinary Importance Sampling')
    plt.plot(error_weighted, label='Weighted Importance Sampling')
    plt.xlabel('Episodes (log scale)')
    plt.ylabel('Mean square error')
    plt.xscale('log')
    plt.legend()

    plt.savefig('../images/figure_5_3.png')
    plt.close()

Example 5.5 Infinite Variance

I want to spell this out because at first I kept failing to understand:

  1. Why is the importance sampling ratio $\rho$ zero when the target policy and the behavior policy disagree?
  2. Why, under the target policy, would all returns be exactly 1?

Answering my own questions:

  1. As noted above, in $\prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$ both policies are evaluated at the same actions $A_k$ that were actually taken. The target policy here is deterministic, so whenever the behavior policy takes an action the target policy would not take, $\pi(A_k \mid S_k) = 0$ and the whole product is 0; only the trajectories consistent with the target policy are weighted.
  2. Because the target policy always goes left, every episode it generates eventually terminates with reward 1, and there is no discounting, so the return from state s is always exactly 1, i.e. $v_\pi(s) = 1$.
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

ACTION_BACK = 0
ACTION_END = 1

# behavior policy
def behavior_policy():
    return np.random.binomial(1, 0.5)

# target policy
def target_policy():
    return ACTION_BACK

# one turn
def play():
    # track the action for importance ratio
    trajectory = []
    while True:
        action = behavior_policy()
        trajectory.append(action)
        if action == ACTION_END:
            return 0, trajectory
        if np.random.binomial(1, 0.9) == 0:
            return 1, trajectory

def figure_5_4():
    runs = 10
    episodes = 100000
    for run in range(runs):
        rewards = []
        for episode in range(0, episodes):
            reward, trajectory = play()
            if trajectory[-1] == ACTION_END:
                rho = 0
            else:
                rho = 1.0 / pow(0.5, len(trajectory))
            rewards.append(rho * reward)
        rewards = np.add.accumulate(rewards)
        estimations = np.asarray(rewards) / np.arange(1, episodes + 1)
        plt.plot(estimations)
    plt.xlabel('Episodes (log scale)')
    plt.ylabel('Ordinary Importance Sampling')
    plt.xscale('log')

    plt.savefig('../images/figure_5_4.png')
    plt.close()

if __name__ == '__main__':
    figure_5_4()

5.6 Incremental Implementation

  • Chapter 2 used an incremental average of rewards:
    $Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$
  • Monte Carlo averages returns, accumulated backwards as:
    $G \leftarrow \gamma G + R_{t+1}$

The rest of the section discusses off-policy MC with weighted importance sampling, which requires a weighted average of the returns. Written in the Chapter 2 incremental form:
Suppose we have a sequence of returns $G_1, G_2, \dots, G_{n-1}$, all starting from the same state, each with a corresponding random weight $W_i$, and we want to update the estimate as each additional return $G_n$ arrives:
$$V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}, \quad n \ge 2$$
To update $V_n$ incrementally, let $C_n = \sum_{k=1}^{n} W_k$; then
$$V_{n+1} = V_n + \frac{W_n}{C_n}\left[G_n - V_n\right]$$
This follows from the definition above together with $C_n$: move the denominator of the first expression over to the $V_n$ side and then write out $V_{n+1}$ from the same definition. [Reading further, I found Exercise 5.10 asks for exactly this derivation, so it is attached below.]
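A quick numerical self-check (my own sketch, not the book's code) that the incremental rule reproduces the batch weighted average:

import numpy as np

def incremental_weighted_average(returns, weights):
    # V <- V + (W_n / C_n) * (G_n - V), with C_n the cumulative weight
    V, C = 0.0, 0.0
    for G, W in zip(returns, weights):
        C += W
        V += (W / C) * (G - V)
    return V

G = np.random.randn(100)
W = np.random.rand(100) + 0.01   # strictly positive weights
assert np.isclose(incremental_weighted_average(G, W),
                  np.sum(W * G) / np.sum(W))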
Here is the pseudocode for the full off-policy MC algorithm with weighted importance sampling, i.e. policy evaluation for estimating $Q \approx q_\pi$:
[Figure: off-policy MC prediction (policy evaluation) pseudocode]

The same pseudocode also applies on-policy: simply choose the behavior policy and the target policy to be the same, so that $W$ is always 1, and drop the last line of the pseudocode. The approximation $Q$ then converges to $q_\pi$. [The book says $b$ can be a different policy here, which I don't quite follow: since we have chosen the same behavior and target policy, $\pi = b$, why would the actions selected according to $b$ possibly come from a different policy?]

5.7 Off-policy Monte Carlo Control

Here is the control (policy improvement) version, for $\pi \approx \pi_*$:
[Figure: off-policy MC control pseudocode]
This section mentions a potential problem: this method learns only from the tails of episodes, after the last nongreedy action. If the earlier actions in an episode are nongreedy, learning can take a very long time, and the problem is worse when nongreedy actions appear early in long episodes. [Question: does "nongreedy" here mean the same thing as "nonoptimal"?]
[Exercise 5.12 is very interesting]

5.9 *Per-decision Importance Sampling

My initial question was mainly why (5.12) can be simplified so much via (5.13), i.e. why only the first factor and the reward remain related: [this also serves as a prose explanation for Exercise 5.13]

  1. First be clear about where $R_{t+1}$ comes from: it is the result of taking $A_t$ in $S_t$, so it depends only on that one state-action pair.
  2. When taking the expectation, correlated factors must stay together, while independent factors can be pulled out and multiplied separately.
  3. For each of the remaining factors, as explained by (5.13), the expectation of $\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$ is the sum over all actions the behavior policy might take of its probability times this ratio, which equals 1.

With the above explanation we obtain equation (5.14). This is an alternative form of importance sampling, called per-decision importance sampling.
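The key fact used in step 3, written out explicitly (my restatement of the book's equation 5.13):
$$\mathbb{E}_b\!\left[\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\right] = \sum_{a} b(a \mid S_k)\,\frac{\pi(a \mid S_k)}{b(a \mid S_k)} = \sum_{a} \pi(a \mid S_k) = 1$$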

5.10 Summary

This chapter uses Monte Carlo methods to learn value functions and optimal policies from experience. Compared with DP, the advantages are:

  • they can be used to learn optimal behavior directly from interaction with the environment, with no model of the environment’s dynamics
  • they can be used with simulation or sample models
  • it is easy and efficient to focus Monte Carlo methods on a small subset of the states
  • less harmed by violations of the Markov property

The fourth point holds because Monte Carlo methods do not bootstrap.

All Exercises

Exercise 5.1: Consider the diagrams on the right in Figure 5.1. Why does the estimated value function jump up for the last two rows in the rear? Why does it drop off for the whole last row on the left? Why are the front most values higher in the upper diagrams than in the lower?

1. It is due to the policy: the player does not stop until reaching 20 or 21, so the player keeps facing the risk of busting by hitting, which produces the low values just below 20 and 21. At 20 or 21, however, the player stops and has a very high chance of winning, especially since the dealer stops at 17 or above.
2. The value drops off for the whole last row on the left because when the dealer shows an ace, the dealer has a very good chance of beating the player, since the ace can be counted as 11. The value of the dealer's ace therefore already incorporates the dealer's advantage of being able to use it as 1 or 11; no other card has this property, so the ace is special, which creates the gap.
3. The frontmost values are higher in the upper diagrams because there the player's ace can count as either 1 or 11. This makes the player better off, similar to the effect the dealer's ace has in the leftmost column.

Exercise 5.2: Suppose every-visit MC was used instead of first-visit MC on the blackjack task. Would you expect the results to be very different? Why or why not?
No. In Blackjack no state can recur within a single episode (the player's sum only increases), so first-visit and every-visit MC are essentially the same thing.

Exercise 5.10: Derive the weighted-average update rule (5.8) from (5.7). Follow the pattern of the derivation of the unweighted rule (2.3).
[Figure: handwritten derivation of (5.8) from (5.7)]
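The same derivation written out (my own write-up, following the pattern used for (2.3)): with $C_{n-1} = \sum_{k=1}^{n-1} W_k$ and $V_n C_{n-1} = \sum_{k=1}^{n-1} W_k G_k$,
$$V_{n+1} = \frac{\sum_{k=1}^{n} W_k G_k}{C_n} = \frac{V_n C_{n-1} + W_n G_n}{C_n} = \frac{V_n (C_n - W_n) + W_n G_n}{C_n} = V_n + \frac{W_n}{C_n}\left[G_n - V_n\right]$$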

Exercise 5.12: Racetrack (programming) Consider driving a race car around a turn like those shown in Figure 5.5. You want to go as fast as possible, but not so fast as to run off the track. In our simplified racetrack, the car is at one of a discrete set of grid positions, the cells in the diagram. The velocity is also discrete, a number of grid cells moved horizontally and vertically per time step. The actions are increments to the velocity components. Each may be changed by +1, -1, or 0 in each step, for a total of nine (3x3) actions. Both velocity components are restricted to be nonnegative and less than 5, and they cannot both be zero except at the starting line. Each episode begins in one of the randomly selected start states with both velocity components zero and ends when the car crosses the finish line. The rewards are -1 for each step until the car crosses the finish line. If the car hits the track boundary, it is moved back to a random position on the starting line, both velocity components are reduced to zero, and the episode continues. Before updating the car’s location at each time step, check to see if the projected path of the car intersects the track boundary. If it intersects the finish line, the episode ends; if it intersects anywhere else, the car is considered to have hit the track boundary and is sent back to the starting line. To make the task more challenging, with probability 0.1 at each time step the velocity increments are both zero, independently of the intended increments. Apply a Monte Carlo control method to this task to compute the optimal policy from each starting state. Exhibit several trajectories following the optimal policy (but turn the noise off for these trajectories).
Racetrack map illustration:

Reference: Solving Racetrack in Reinforcement Learning using Monte Carlo Control

Final result:

How to approach the problem:
0. Analyze the problem:
[Figure: problem analysis diagram]

  1. Split the problem into components:
    a. We need to have a generator whose responsibility will be to randomly generate racetracks for us.
    b. We need to build an environment for this problem. Its main responsibilities will be to start and end episodes. It should also be able to get the new state and reward given the current state and action values.
    c. We need to have an agent (i.e. car here) that would choose an action given the state.
    d. A visualizer is also needed to visualize the generated racetracks along with the agent's location on it.
    e. Implementation of the Monte Carlo Off-Policy Control algorithm.

a. Draw the map
The map is 100x100, and frac ensures that roughly half of the map ends up as invalid cells. Inside each while loop δ is incremented by 1, so cells are progressively invalidated outward from two opposite corners. The start and finish lines are then marked on the valid cells of the last row and last column; a range like [49,32] would mean that cells 32 through 49 on that line are start-line cells.

class Generator:
    
    #HELPFUL FUNCTIONS
    def widen_hole_transformation(self,racetrack,start_cell,end_cell):
        
        δ = 1
        while(1):
            if ((start_cell[1] < δ) or (start_cell[0] < δ)):
                racetrack[0:end_cell[0],0:end_cell[1]] = -1
                break

            if ((end_cell[1]+δ > 100) or (end_cell[0]+δ > 100)):
                racetrack[start_cell[0]:100,start_cell[1]:100] = -1
                break
                
            δ += 1

        return racetrack
    
    def calculate_valid_fraction(self, racetrack):
        '''
        Returns the fraction of valid cells in the racetrack
        '''
        return (len(racetrack[racetrack==0])/10000)

    def mark_finish_states(self, racetrack):
        '''
        Marks finish states in the racetrack
        Returns racetrack
        '''
        last_col = racetrack[0:100,99]
        last_col[last_col==0] = 2
        return racetrack
    
    def mark_start_states(self, racetrack):
        '''
        Marks start states in the racetrack
        Returns racetrack
        '''
        last_row = racetrack[99,0:100]
        last_row[last_row==0] = 1
        return racetrack
    
    
    #CONSTRUCTOR
    def __init__(self):
        pass
    
    def generate_racetrack(self):
        '''
        racetrack is a 2d numpy array
        codes for racetrack:
            0,1,2 : valid racetrack cells
            -1: invalid racetrack cell
            1: start line cells
            2: finish line cells
        returns randomly generated racetrack
        '''
        racetrack = np.zeros((100,100),dtype='int')
        
        frac = 1
        while frac > 0.5:    
            
            #transformation
            random_cell = np.random.randint((100,100))
            random_hole_dims = np.random.randint((25,25))
            start_cell = np.array([max(0,x - y//2) for x,y in zip(random_cell,random_hole_dims)])
            end_cell = np.array([min(100,x+y) for x,y in zip(start_cell,random_hole_dims)])
        
            #apply_transformation
            racetrack = self.widen_hole_transformation(racetrack, start_cell, end_cell)
            frac = self.calculate_valid_fraction(racetrack)
        
        racetrack = self.mark_start_states(racetrack)
        racetrack = self.mark_finish_states(racetrack)
        
        return racetrack

With the map built, we construct what we described above as the states, actions, rewards, etc. — in other words, the training environment:

class Environment:
    
    #HELPFUL FUNCTIONS
    
    def get_new_state(self, state, action):
        '''
        Get new state after applying action on this state
        Assumption: The car keeps on moving with the current velocity and then action is applied to 
        change the velocity
        '''
        new_state = state.copy()
        new_state[0] = state[0] - state[2]
        new_state[1] = state[1] + state[3]
        new_state[2] = state[2] + action[0]
        new_state[3] = state[3] + action[1]
        return new_state
    
    def select_randomly(self,NUMPY_ARR):
        '''
        Returns a value uniform randomly from NUMPY_ARR
        Here NUMPY_ARR should be 1 dimensional
        '''
        return np.random.choice(NUMPY_ARR)
    
    def set_zero(NUMPY_ARR):
        '''
        Returns NUMPY_ARR after making zero all the elements in it
        '''
        NUMPY_ARR[:] = 0
        return NUMPY_ARR
    
    def is_finish_line_crossed(self, state, action):
        '''
        Returns True if the car crosses the finish line
                False otherwise
        '''
        new_state = self.get_new_state(state, action)
        old_cell, new_cell = state[0:2], new_state[0:2]
        
        '''
        new_cell's row index will be less
        '''
        rows = np.array(range(new_cell[0],old_cell[0]+1))
        cols = np.array(range(old_cell[1],new_cell[1]+1))
        fin = set([tuple(x) for x in self.data.finish_line])
        row_col_matrix = [(x,y) for x in rows for y in cols]
        intersect = [x for x in row_col_matrix if x in fin]
        
        return len(intersect) > 0
    
    def is_out_of_track(self, state, action):
        '''
        Returns True if the car goes out of track if action is taken on state
                False otherwise
        '''
        new_state = self.get_new_state(state, action)
        old_cell, new_cell = state[0:2], new_state[0:2]
        
        if new_cell[0] < 0 or new_cell[0] >= 100 or new_cell[1] < 0 or new_cell[1] >= 100:
            return True
        
        else:
            return self.data.racetrack[tuple(new_cell)] == -1
    
    #CONSTRUCTOR
    def __init__(self, data, gen):
        '''
        initialize step_count to be 0
        '''
        self.data = data
        self.gen = gen
        self.step_count = 0
    
    #MEMBER FUNCTIONS
    
    def reset(self):
        self.data.episode = dict({'S':[],'A':[],'probs':[],'R':[None]})
        self.step_count = 0
    
    def start(self):
        '''
        Makes the velocity of the car to be zero
        Returns the randomly selected start state.
        '''
        state = np.zeros(4,dtype='int')
        state[0] = 99
        state[1] = self.select_randomly(self.data.start_line[:,1])
        '''
        state[2] and state[3] are already zero
        '''
        return state
    
    def step(self, state, action):
        '''
        Returns the reward and new state when action is taken on state
        Checks the following 2 cases maintaining the order:
            1. car finishes race by crossing the finish line
            2. car goes out of track
        Ends the episode by returning reward as None and state as usual (which will be terminating)
        '''
        self.data.episode['A'].append(action)
        reward = -1
        
        if (self.is_finish_line_crossed(state, action)):
            new_state = self.get_new_state(state, action)
            
            self.data.episode['R'].append(reward)
            self.data.episode['S'].append(new_state)
            self.step_count += 1
            
            return None, new_state
            
        elif (self.is_out_of_track(state, action)):
            new_state = self.start()
        else:
            new_state = self.get_new_state(state, action)
        
        self.data.episode['R'].append(reward)
        self.data.episode['S'].append(new_state)
        self.step_count += 1
        
        return reward, new_state

Finally, our agent: it takes the state returned by the environment and uses some policy to choose actions. We also wrote some helper functions, such as finding the valid actions allowed by the velocity constraints for the current velocity, and mapping actions between their 1-D index and 2-D form.

class Agent:
    
    #HELPFUL FUNCTIONS
    def possible_actions(self, velocity):
        '''
        *** Performs two tasks, can be split up ***
        Universe of actions:  α = [(-1,-1),(-1,0),(0,-1),(-1,1),(0,0),(1,-1),(0,1),(1,0),(1,1)]
                            
        Uses constraints to filter out invalid actions given the velocity
        
        0 <= v_x < 5
        0 <= v_y < 5
        v_x and v_y cannot be made both zero (you can't take an action which would make them zero simultaneously)
        Returns list of possible actions given the velocity
        '''
        α = [(-1,-1),(-1,0),(0,-1),(-1,1),(0,0),(1,-1),(0,1),(1,0),(1,1)]
        α = [np.array(x) for x in α]

        β = []
        for i,x in zip(range(9),α):
            new_vel = np.add(velocity,x)
            if (new_vel[0] < 5) and (new_vel[0] >= 0) and (new_vel[1] < 5) and (new_vel[1] >= 0) and ~(new_vel[0] == 0 and new_vel[1] == 0):
                β.append(i)
        β = np.array(β)
        
        return β
    
    def map_to_1D(self,action):
        α = [(-1,-1),(-1,0),(0,-1),(-1,1),(0,0),(1,-1),(0,1),(1,0),(1,1)]
        for i,x in zip(range(9),α):
            if action[0]==x[0] and action[1]==x[1]:
                return i
    
    def map_to_2D(self,action):
        α = [(-1,-1),(-1,0),(0,-1),(-1,1),(0,0),(1,-1),(0,1),(1,0),(1,1)]
        return α[action]
    
    #CONSTRUCTOR
    def __init__(self):
        pass
    
    def get_action(self, state, policy):
        '''
        Returns action given state using policy
        '''
        return self.map_to_2D(policy(state, self.possible_actions(state[2:4])))

The visualization of the racetrack is done with pygame (the file also needs `import pygame` at the top); the code is as follows:

class Visualizer:
    
    #HELPFUL FUNCTIONS
    
    def create_window(self):
        '''
        Creates window and assigns self.display variable
        '''
        self.display = pygame.display.set_mode((self.width, self.height))
        pygame.display.set_caption("Racetrack")
    
    def setup(self):
        '''
        Does things which occur only at the beginning
        '''
        self.cell_edge = 9
        self.width = 100*self.cell_edge
        self.height = 100*self.cell_edge
        self.create_window()
        self.window = True

    def close_window(self):
        self.window = False
        pygame.quit()

    def draw(self, state = np.array([])):
        self.display.fill(0)
        for i in range(100):
            for j in range(100):
                if self.data.racetrack[i,j]!=-1:
                    if self.data.racetrack[i,j] == 0:
                        color = (255,0,0)
                    elif self.data.racetrack[i,j] == 1:
                        color = (255,255,0)
                    elif self.data.racetrack[i,j] == 2:
                        color = (0,255,0)
                    pygame.draw.rect(self.display,color,((j*self.cell_edge,i*self.cell_edge),(self.cell_edge,self.cell_edge)),1)
        
        if len(state)>0:
            pygame.draw.rect(self.display,(0,0,255),((state[1]*self.cell_edge,state[0]*self.cell_edge),(self.cell_edge,self.cell_edge)),1)
        
        pygame.display.update()
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                self.loop = False
                self.close_window()
                return 'stop'
            elif event.type == pygame.KEYDOWN and event.key == pygame.K_SPACE:
                self.loop = False
                
        return None
        
    def visualize_racetrack(self, state = np.array([])):
        '''
        Draws Racetrack in a pygame window
        '''
        if self.window == False:
            self.setup()
        self.loop = True
        while(self.loop):
            ret = self.draw(state)
            if ret!=None:
                return ret

The main task now is to train the policy we want, i.e. the Off-Policy Monte Carlo Control algorithm.

class Monte_Carlo_Control:
    
    #HELPFUL FUNCTIONS
    
    def evaluate_target_policy(self):
        env.reset()
        state = env.start()
        self.data.episode['S'].append(state)
        rew = -1
        while rew!=None:
            action = agent.get_action(state,self.generate_target_policy_action)
            rew, state = env.step(state,action)
            
        self.data.rewards.append(sum(self.data.episode['R'][1:]))
        
    
    def plot_rewards(self):
        ax, fig = plt.subplots(figsize=(30,15))
        x = np.arange(1,len(self.data.rewards)+1)
        plt.plot(x*10, self.data.rewards, linewidth=0.5, color = '#BB8FCE')
        plt.xlabel('Episode number', size = 20)
        plt.ylabel('Reward',size = 20)
        plt.title('Plot of Reward vs Episode Number',size=20)
        plt.xticks(size=20)
        plt.yticks(size=20)
        plt.savefig('RewardGraph.png')
        plt.close()
    
    def save_your_work(self):
        self.data.save_Q_vals()
        self.data.save_C_vals()
        self.data.save_π()
        self.data.save_rewards()
    
    def determine_probability_behaviour(self, state, action, possible_actions):
        best_action = self.data.π[tuple(state)]
        num_actions = len(possible_actions)
        
        if best_action in possible_actions:
            if action == best_action:
                prob = 1 - self.data.ε + self.data.ε/num_actions
            else:
                prob = self.data.ε/num_actions
        else:
            prob = 1/num_actions
        
        self.data.episode['probs'].append(prob)
    
    def generate_target_policy_action(self, state, possible_actions):
        '''
        Returns target policy action, takes state and
        returns an action using this policy
        '''
        if self.data.π[tuple(state)] in possible_actions:
            action = self.data.π[tuple(state)]
        else:
            action = np.random.choice(possible_actions)
            
        return action
    
    def generate_behavioural_policy_action(self, state, possible_actions):
        '''
        Returns behavioural policy action
        which would be ε-greedy π policy, takes state and
        returns an action using this ε-greedy π policy
        '''
        if np.random.rand() > self.data.ε and self.data.π[tuple(state)] in possible_actions:
            action = self.data.π[tuple(state)]
        else:
            action = np.random.choice(possible_actions)
        
        self.determine_probability_behaviour(state, action, possible_actions)
    
        return action
    
    #CONSTRUCTOR
    def __init__(self, data):
        '''
        Initialize, for all s ∈ S, a ∈ A(s):
            data.Q(s, a) ← arbitrary (done in Data)
            data.C(s, a) ← 0 (done in Data)
            π(s) ← argmax_a Q(s,a) 
            (with ties broken consistently) 
            (some consistent approach needs to be followed))
        '''
        self.data = data
        for i in range(100):
            for j in range(100):
                if self.data.racetrack[i,j]!=-1:
                    for k in range(5):
                        for l in range(5):
                            self.data.π[i,j,k,l] = np.argmax(self.data.Q_vals[i,j,k,l])
    
    def control(self,env,agent):
        '''
        Performs MC control using episode list [ S0 , A0 , R1, . . . , ST −1 , AT −1, RT , ST ]
        G ← 0
        W ← 1
        For t = T − 1, T − 2, . . . down to 0:
            G ← γ*G + R_t+1
            C(St, At ) ← C(St,At ) + W
            Q(St, At ) ← Q(St,At) + (W/C(St,At))*[G − Q(St,At )]
            π(St) ← argmax_a Q(St,a) (with ties broken consistently)
            If At != π(St) then exit For loop
            W ← W * (1/b(At|St))        
        '''
        env.reset()
        state = env.start()
        self.data.episode['S'].append(state)
        rew = -1
        while rew!=None:
            action = agent.get_action(state,self.generate_behavioural_policy_action)
            rew, state = env.step(state,action)
        
        G = 0
        W = 1
        T = env.step_count
    
        for t in range(T-1,-1,-1):
            G = self.data.γ * G + self.data.episode['R'][t+1]
            S_t = tuple(self.data.episode['S'][t])
            A_t = agent.map_to_1D(self.data.episode['A'][t])
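            # --- The remaining loop body was cut off above; the lines below are my sketch
            # completing it, following the pseudocode in this method's docstring.
            # Assumption: self.data.C_vals is the cumulative-weight array saved by save_C_vals(),
            # shaped like Q_vals (100, 100, 5, 5, 9). ---
            sa = S_t + (A_t,)
            self.data.C_vals[sa] += W
            self.data.Q_vals[sa] += (W / self.data.C_vals[sa]) * (G - self.data.Q_vals[sa])
            self.data.π[S_t] = np.argmax(self.data.Q_vals[S_t])
            if A_t != self.data.π[S_t]:
                break
            # b(At|St) was recorded in episode['probs'] when the behavioural action was chosen
            W *= 1.0 / self.data.episode['probs'][t]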

A step-by-step explanation of each part can be found in the referenced article linked above.


Exercise 5.13: Show the steps to derive (5.14) from (5.12).
[Figure: handwritten derivation]
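The same derivation written out (my own write-up): because $R_{t+1}$ depends only on $S_t$ and $A_t$, each later ratio factor is independent of it and, as shown in 5.9 above, has expectation 1, so
$$\mathbb{E}_b\!\left[\rho_{t:T-1} R_{t+1}\right] = \mathbb{E}_b\!\left[\frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)} R_{t+1}\right] \prod_{k=t+1}^{T-1} \mathbb{E}_b\!\left[\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\right] = \mathbb{E}_b\!\left[\rho_{t:t}\, R_{t+1}\right]$$
which is exactly equation (5.14).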
