[Book Notes, Ch. 5] Reinforcement Learning: An Introduction, 2nd Edition

Preface: for Chapters 1 and 2 click here; for Chapter 3 click here.
Note: each entry in the table of contents corresponds to a page of the PDF ("LPage" is the page number at the top-left of the book, since I later insert blank pages between pages to do the exercises, lol; e.g. LPage28 is the book page numbered 28 at the top left, and RPage29 is the book page numbered 29 at the top right). Text inside 【】 brackets is usually a question I left for myself, or a thought on how this connects to my own direction, mainly the control/theory layer of autonomous driving; anything ending in a question mark is... my question.

Last updated: 2021/01/18

Recommended reading:
1. English – PDF link
2. Chinese – official JD.com purchase link
Code references:
1. GitHub: Python code for all of the book's figures
2. GitHub: reference solutions for the book's exercises

Recap and Introduction

In the previous chapter on dynamic programming (DP), we had complete knowledge of the environment, including the probabilities governing what happens for each action in each state. This chapter on Monte Carlo methods addresses the case where we do not know the environment but do have information from previous behavior: learning from actual experience OR learning from simulated experience.
(Aside: reading further, what exactly does "complete knowledge of the environment", i.e. the environment model, refer to? The return is something we define ourselves; is it that the state reached after an action is not known for certain?)

5.1 Monte Carlo Prediction

We use Monte Carlo methods to learn the state-value function. So the missing knowledge of the environment means we do not use the MDP transition dynamics? In other words, we do not know the full picture of how states transition.
The computation: recall that the value of a state is the expected return (the expected cumulative future discounted reward) starting from that state. In plain words, the state value is the expected return: for a given state, add up the rewards from that point to the end of the episode.
Then the returns obtained for each state are accumulated separately and averaged. For example, for (N|(0,0)), going north at point (0,0): if 100 runs produce 100 returns for that state, adding them up and averaging gives the estimate.
As more returns are observed, the average should converge to the expected value. That is the core idea of Monte Carlo.

  • first-visit MC method: estimate $v_\pi(s)$ as the average of the returns following first visits to $s$
  • every-visit MC method: average the returns following every visit to $s$

In the pseudocode below, the second loop is "Loop for each step of episode"; only when the state has not appeared earlier in the episode is the return $G$ appended to $Returns(S_t)$, and averaging those returns gives the state value (a minimal sketch follows the figure).
[Figure: first-visit MC prediction pseudocode]
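As a minimal sketch of first-visit MC prediction (my own illustration, independent of the Blackjack code below; an episode is assumed to be given as a list of (S_t, R_{t+1}) pairs and gamma is the discount factor):

from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    # episodes: list of episodes, each a list of (state, reward) pairs,
    # where reward is the R_{t+1} received after visiting that state
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # sweep backwards so G accumulates the return from each time step;
        # later writes overwrite earlier ones, so each state ends up keeping
        # the return that follows its FIRST visit in the episode
        for state, reward in reversed(episode):
            G = gamma * G + reward
            first_visit_return[state] = G
        for state, G_first in first_visit_return.items():
            returns_sum[state] += G_first
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V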

Example 5.1

First, the definitions of the elements of the whole Blackjack game, e.g. the actions: hit / stand.

import numpy as np
import matplotlib
matplotlib.use('Agg') # non-interactive backend so plots are not displayed
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# actions: hit or stand
ACTION_HIT = 0
ACTION_STAND = 1  #  "stick" in the book
ACTIONS = [ACTION_HIT, ACTION_STAND]

# policy for player
POLICY_PLAYER = np.zeros(22, dtype=int)  # np.int is deprecated in recent NumPy; use the builtin int
for i in range(12, 20):
    POLICY_PLAYER[i] = ACTION_HIT
POLICY_PLAYER[20] = ACTION_STAND
POLICY_PLAYER[21] = ACTION_STAND

# function form of target policy of player
def target_policy_player(usable_ace_player, player_sum, dealer_card):
    return POLICY_PLAYER[player_sum]

# function form of behavior policy of player
def behavior_policy_player(usable_ace_player, player_sum, dealer_card):
    if np.random.binomial(1, 0.5) == 1:
        return ACTION_STAND
    return ACTION_HIT

# policy for dealer: the dealer's fixed strategy is to keep hitting and to stand at 17 or above
POLICY_DEALER = np.zeros(22)
for i in range(12, 17):
    POLICY_DEALER[i] = ACTION_HIT
for i in range(17, 22):
    POLICY_DEALER[i] = ACTION_STAND

# get a new card
def get_card():
    card = np.random.randint(1, 14)  # card id 1-13: ace (1), 2-10, and J/Q/K (11-13)
    card = min(card, 10)  # J/Q/K all count as 10; a 1 is an ace, which may be usable as 11
    return card

# get the value of a card (11 for ace).
def card_value(card_id):
    return 11 if card_id == 1 else card_id

# play a game
# @policy_player: specify policy for player
# @initial_state: [whether player has a usable Ace, sum of player's cards, one card of dealer]
# @initial_action: the initial action
def play(policy_player, initial_state=None, initial_action=None):
    # player status

    # sum of player
    player_sum = 0

    # trajectory of player
    player_trajectory = []

    # whether player uses Ace as 11
    usable_ace_player = False

    # dealer status
    dealer_card1 = 0
    dealer_card2 = 0
    usable_ace_dealer = False

    if initial_state is None:
        # generate a random initial state
        # !!!! A possible issue: by the rules both player and dealer start with two cards, but this
        # while loop means the player may receive more than two cards, e.g. after being dealt 2 and 3
        # the player keeps getting cards -- is that wrong? !!!!
        # On closer reading, the point is not that everyone holds exactly two cards; the player has
        # already started hitting, and one of the player's three state variables is his current sum
        # (12-21), so sums below 12 are simply skipped over.
        while player_sum < 12:
            # if sum of player is less than 12, always hit
            card = get_card()
            player_sum += card_value(card)

            # If the player's sum is larger than 21, he may hold one or two aces.
            if player_sum > 21:
                assert player_sum == 22
                # last card must be ace
                player_sum -= 10
            else:
                usable_ace_player |= (1 == card)
                # |= : usable_ace_player stays True if it was already True OR this card is an ace;
                # using OR prevents the flag from being cleared when a later card is not an ace
        # initialize cards of dealer, suppose dealer will show the first card he gets
        dealer_card1 = get_card()
        dealer_card2 = get_card()

    else:
        # use specified initial state
        usable_ace_player, player_sum, dealer_card1 = initial_state
        dealer_card2 = get_card()

    # initial state of the game
    state = [usable_ace_player, player_sum, dealer_card1]

    # initialize dealer's sum
    dealer_sum = card_value(dealer_card1) + card_value(dealer_card2)
    usable_ace_dealer = 1 in (dealer_card1, dealer_card2)
    # if the dealer's sum is larger than 21, he must hold two aces.
    if dealer_sum > 21:
        assert dealer_sum == 22
        # use one Ace as 1 rather than 11
        dealer_sum -= 10
    assert dealer_sum <= 21
    assert player_sum <= 21

    # game starts!

    # player's turn
    while True:
        if initial_action is not None:
            action = initial_action
            initial_action = None
        else:
            # get action based on current sum
            action = policy_player(usable_ace_player, player_sum, dealer_card1)

        # track player's trajectory for importance sampling
        player_trajectory.append([(usable_ace_player, player_sum, dealer_card1), action])

        if action == ACTION_STAND:
            break
        # if hit, get new card
        card = get_card()
        # Keep track of the ace count. the usable_ace_player flag is insufficient alone as it cannot
        # distinguish between having one ace or two.
        ace_count = int(usable_ace_player)
        if card == 1:
            ace_count += 1
        player_sum += card_value(card)
        # If the player has a usable ace, use it as 1 to avoid busting and continue.
        while player_sum > 21 and ace_count:
            player_sum -= 10
            ace_count -= 1
        # player busts
        if player_sum > 21:
            return state, -1, player_trajectory
        assert player_sum <= 21
        usable_ace_player = (ace_count == 1)

    # dealer's turn
    while True:
        # get action based on current sum
        action = POLICY_DEALER[dealer_sum]
        if action == ACTION_STAND:
            break
        # if hit, get a new card
        new_card = get_card()
        ace_count = int(usable_ace_dealer)
        if new_card == 1:
            ace_count += 1
        dealer_sum += card_value(new_card)
        # If the dealer has a usable ace, use it as 1 to avoid busting and continue.
        while dealer_sum > 21 and ace_count:
            dealer_sum -= 10
            ace_count -= 1
        # dealer busts
        if dealer_sum > 21:
            return state, 1, player_trajectory
        usable_ace_dealer = (ace_count == 1)

    # compare the sum between player and dealer
    assert player_sum <= 21 and dealer_sum <= 21
    if player_sum > dealer_sum:
        return state, 1, player_trajectory
    elif player_sum == dealer_sum:
        return state, 0, player_trajectory
    else:
        return state, -1, player_trajectory
# Monte Carlo Sample with On-Policy
def monte_carlo_on_policy(episodes):
    states_usable_ace = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by zero
    states_usable_ace_count = np.ones((10, 10))
    states_no_usable_ace = np.zeros((10, 10))
    # initialize counts to 1 to avoid division by zero
    states_no_usable_ace_count = np.ones((10, 10))
    for i in tqdm(range(0, episodes)):
        _, reward, player_trajectory = play(target_policy_player)
        for (usable_ace, player_sum, dealer_card), _ in player_trajectory:
            # player sums start at 12, so subtract 12 to map the sum to row index 0
            player_sum -= 12
            dealer_card -= 1
            if usable_ace:
                states_usable_ace_count[player_sum, dealer_card] += 1
                states_usable_ace[player_sum, dealer_card] += reward
            else:
                states_no_usable_ace_count[player_sum, dealer_card] += 1
                states_no_usable_ace[player_sum, dealer_card] += reward
    return states_usable_ace / states_usable_ace_count, states_no_usable_ace / states_no_usable_ace_count

The parts above go into the same Python file; the following is the plotting code:

def figure_5_1():
    states_usable_ace_1, states_no_usable_ace_1 = monte_carlo_on_policy(10000)
    states_usable_ace_2, states_no_usable_ace_2 = monte_carlo_on_policy(500000)

    states = [states_usable_ace_1,
              states_usable_ace_2,
              states_no_usable_ace_1,
              states_no_usable_ace_2]

    titles = ['Usable Ace, 10000 Episodes',
              'Usable Ace, 500000 Episodes',
              'No Usable Ace, 10000 Episodes',
              'No Usable Ace, 500000 Episodes']

    _, axes = plt.subplots(2, 2, figsize=(40, 30))
    plt.subplots_adjust(wspace=0.1, hspace=0.2)
    axes = axes.flatten()

    for state, title, axis in zip(states, titles, axes):
        fig = sns.heatmap(np.flipud(state), cmap="YlGnBu", ax=axis, xticklabels=range(1, 11),
                          yticklabels=list(reversed(range(12, 22))))
        fig.set_ylabel('player sum', fontsize=30)
        fig.set_xlabel('dealer showing', fontsize=30)
        fig.set_title(title, fontsize=30)

    plt.savefig('../images/figure_5_1.png')
    plt.close()

The average state values computed under the fixed policy:
[Figure 5.1: state-value estimates under the fixed policy, with/without a usable ace, after 10,000 and 500,000 episodes]
An important point about Monte Carlo: the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state.

5.2 Monte Carlo Estimation of Action Values

In 5.1, the Blackjack example already fixed the policy: hit if the sum is below 20, otherwise stand. Under that policy each state has returns for only one action, so we are only judging the value of that single action in that state and never learn anything better from the other actions. To compare action values within the same state, we need to estimate the values of all actions in that state, rather than follow one particular policy.
This raises another problem: what if some actions are never visited? (Many state-action pairs may never be visited.) This is the general problem of maintaining exploration. The fix feels a bit like mutation in genetic algorithms: guarantee that every state-action pair has a nonzero probability of being selected. See 5.3, Monte Carlo ES (a minimal sketch of the nonzero-probability idea follows).
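A minimal sketch of the "nonzero selection probability" idea, using an ε-greedy choice over estimated action values (this is the ε-soft alternative the book discusses, not the exploring-starts approach used in 5.3; `q_values` is a hypothetical array of the current estimates for one state):

import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1):
    # with probability epsilon choose uniformly at random, so every action
    # keeps a selection probability of at least epsilon / len(q_values)
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    # otherwise act greedily with respect to the current estimates
    return int(np.argmax(q_values))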

5.3 Monte Carlo Control

The first half of this section argues that the policy converges. What differs from 5.1 is that here we use ES (exploring starts) to learn the policy itself, rather than evaluating the state values of a fixed policy. So how is the policy learned?
[Figure: Monte Carlo ES (Exploring Starts) pseudocode]

Example 5.3 Solving Blackjack

Unlike 5.1, this example learns the policy, obtaining the optimal policy and optimal value, instead of acting under a fixed policy.
Monte Carlo ES code:

# Monte Carlo with Exploring Starts
def monte_carlo_es(episodes):
    # (playerSum, dealerCard, usableAce, action)
    state_action_values = np.zeros((10, 10, 2, 2))
    # initialize counts to 1 to avoid division by 0
    state_action_pair_count = np.ones((10, 10, 2, 2))

    # behavior policy is greedy
    def behavior_policy(usable_ace, player_sum, dealer_card):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        # get argmax of the average returns(s, a)
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :] / \
                  state_action_pair_count[player_sum, dealer_card, usable_ace, :]
        return np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])

    # play for several episodes
    for episode in tqdm(range(episodes)):
        # for each episode, use a randomly initialized state and action
        initial_state = [bool(np.random.choice([0, 1])),
                       np.random.choice(range(12, 22)),
                       np.random.choice(range(1, 11))]
        initial_action = np.random.choice(ACTIONS)
        # the first episode uses the initial target policy; all later episodes use behavior_policy,
        # which is greedy with respect to the current estimates
        current_policy = behavior_policy if episode else target_policy_player
        _, reward, trajectory = play(current_policy, initial_state, initial_action)
        first_visit_check = set()
        for (usable_ace, player_sum, dealer_card), action in trajectory:
            usable_ace = int(usable_ace)
            player_sum -= 12
            dealer_card -= 1
            state_action = (usable_ace, player_sum, dealer_card, action)
            if state_action in first_visit_check:
                continue
            first_visit_check.add(state_action)
            # update values of state-action pairs
            state_action_values[player_sum, dealer_card, usable_ace, action] += reward
            state_action_pair_count[player_sum, dealer_card, usable_ace, action] += 1

    return state_action_values / state_action_pair_count

This draws the whole of Figure 5.2:

def figure_5_2():
    state_action_values = monte_carlo_es(500000)

    state_value_no_usable_ace = np.max(state_action_values[:, :, 0, :], axis=-1)
    state_value_usable_ace = np.max(state_action_values[:, :, 1, :], axis=-1)

    # get the optimal policy
    action_no_usable_ace = np.argmax(state_action_values[:, :, 0, :], axis=-1)
    action_usable_ace = np.argmax(state_action_values[:, :, 1, :], axis=-1)

    images = [action_usable_ace,
              state_value_usable_ace,
              action_no_usable_ace,
              state_value_no_usable_ace]

    titles = ['Optimal policy with usable Ace',
              'Optimal value with usable Ace',
              'Optimal policy without usable Ace',
              'Optimal value without usable Ace']

    _, axes = plt.subplots(2, 2, figsize=(40, 30))
    plt.subplots_adjust(wspace=0.1, hspace=0.2)
    axes = axes.flatten()

    for image, title, axis in zip(images, titles, axes):
        fig = sns.heatmap(np.flipud(image), cmap="YlGnBu", ax=axis, xticklabels=range(1, 11),
                          yticklabels=list(reversed(range(12, 22))))
        fig.set_ylabel('player sum', fontsize=30)
        fig.set_xlabel('dealer showing', fontsize=30)
        fig.set_title(title, fontsize=30)

    plt.savefig('../images/figure_5_2.png')
    plt.close()

5.5 Off-policy Prediction via Importance Sampling

First, understand the importance sampling ratio:
$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k})}{\prod_{k=t}^{T-1} b(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k})} = \prod_{k=t}^{T-1} \frac{\pi(A_{k} \mid S_{k})}{b(A_{k} \mid S_{k})}$$
In other words, an episode only contributes when every action the behavior policy took is one the target policy could also take; if the target policy assigns probability 0 to any of those actions, i.e. $\pi(A_k \mid S_k) = 0$, the whole product becomes 0.
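To make the ratio concrete, a small sketch of my own (not the book's or the repo's code); `target_prob(s, a)` and `behavior_prob(s, a)` are hypothetical callables returning $\pi(a \mid s)$ and $b(a \mid s)$:

def importance_sampling_ratio(trajectory, target_prob, behavior_prob):
    # trajectory: list of (state, action) pairs taken under the behavior policy b
    rho = 1.0
    for state, action in trajectory:
        pi_a = target_prob(state, action)
        b_a = behavior_prob(state, action)
        if pi_a == 0.0:
            return 0.0  # target policy would never take this action: whole product is 0
        rho *= pi_a / b_a  # the transition probabilities p(s'|s,a) cancel out
    return rho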

Example 5.4 Off-policy Estimation of a Blackjack State Value

In Blackjack there are only two actions, hit and stand, and the behavior policy here picks between them at random, so $b(A_k \mid S_k) = 0.5$ for whichever action it takes.

# Monte Carlo Sample with Off-Policy
def monte_carlo_off_policy(episodes):
    initial_state = [True, 13, 2]

    rhos = []
    returns = []

    for i in range(0, episodes):
        _, reward, player_trajectory = play(behavior_policy_player, initial_state=initial_state)

        # get the importance ratio
        numerator = 1.0
        denominator = 1.0
        for (usable_ace, player_sum, dealer_card), action in player_trajectory:
            if action == target_policy_player(usable_ace, player_sum, dealer_card):
                denominator *= 0.5
            else:
                numerator = 0.0
                break
        rho = numerator / denominator
        rhos.append(rho)
        returns.append(reward)

    rhos = np.asarray(rhos)
    returns = np.asarray(returns)
    weighted_returns = rhos * returns

    weighted_returns = np.add.accumulate(weighted_returns)
    rhos = np.add.accumulate(rhos)

    ordinary_sampling = weighted_returns / np.arange(1, episodes + 1)

    with np.errstate(divide='ignore',invalid='ignore'):
        weighted_sampling = np.where(rhos != 0, weighted_returns / rhos, 0)

    return ordinary_sampling, weighted_sampling

This evaluates the state in which the dealer is showing a deuce, the player's current sum is 13, and the player has a usable ace (i.e. an ace and a 2). The value of this state under the target policy is approximately -0.27726, so we have a true value against which to measure the squared error.

def figure_5_3():
    true_value = -0.27726  # stated in the book
    episodes = 10000
    runs = 100
    error_ordinary = np.zeros(episodes)
    error_weighted = np.zeros(episodes)
    for i in tqdm(range(0, runs)):
        ordinary_sampling_, weighted_sampling_ = monte_carlo_off_policy(episodes)
        # get the squared error
        error_ordinary += np.power(ordinary_sampling_ - true_value, 2)
        error_weighted += np.power(weighted_sampling_ - true_value, 2)
    error_ordinary /= runs
    error_weighted /= runs

    plt.plot(error_ordinary, label='Ordinary Importance Sampling')
    plt.plot(error_weighted, label='Weighted Importance Sampling')
    plt.xlabel('Episodes (log scale)')
    plt.ylabel('Mean square error')
    plt.xscale('log')
    plt.legend()

    plt.savefig('../images/figure_5_3.png')
    plt.close()

Example 5.5 Infinite Variance

I want to spell this out because at first I kept failing to understand:

  1. Why is the importance sampling ratio $\rho$ zero when the target policy and the behavior policy disagree?
  2. Why, under the target policy, would all returns be exactly 1?

Answering my own questions:

  1. As noted above, in $\prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$ both policies are evaluated at the same actions $A_k$ that were actually taken. The target policy here is deterministic, so whenever the behavior policy takes an action the target policy would not take, $\pi(A_k \mid S_k) = 0$ and the whole product is 0; only the trajectories consistent with the target policy are weighted.
  2. Because the target policy always goes left, every episode it generates eventually terminates with reward 1, and there is no discounting, so the return from state s is always exactly 1, i.e. $v_\pi(s) = 1$.
#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

ACTION_BACK = 0
ACTION_END = 1

# behavior policy
def behavior_policy():
    return np.random.binomial(1, 0.5)

# target policy
def target_policy():
    return ACTION_BACK

# one turn
def play():
    # track the action for importance ratio
    trajectory = []
    while True:
        action = behavior_policy()
        trajectory.append(action)
        if action == ACTION_END:
            return 0, trajectory
        if np.random.binomial(1, 0.9) == 0:
            return 1, trajectory

def figure_5_4():
    runs = 10
    episodes = 100000
    for run in range(runs):
        rewards = []
        for episode in range(0, episodes):
            reward, trajectory = play()
            if trajectory[-1] == ACTION_END:
                rho = 0
            else:
                rho = 1.0 / pow(0.5, len(trajectory))
            rewards.append(rho * reward)
        rewards = np.add.accumulate(rewards)
        estimations = np.asarray(rewards) / np.arange(1, episodes + 1)
        plt.plot(estimations)
    plt.xlabel('Episodes (log scale)')
    plt.ylabel('Ordinary Importance Sampling')
    plt.xscale('log')

    plt.savefig('../images/figure_5_4.png')
    plt.close()

if __name__ == '__main__':
    figure_5_4()

5.6 Incremental Implementation

  • Chapter 2 used an incremental average of rewards:
    $Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$
  • Monte Carlo averages returns, accumulated backwards as:
    $G \leftarrow \gamma G + R_{t+1}$

The rest of the section discusses off-policy MC with weighted importance sampling, which requires a weighted average of the returns. Written in the Chapter 2 incremental form:
Suppose we have a sequence of returns $G_1, G_2, \dots, G_{n-1}$, all starting from the same state, each with a corresponding random weight $W_i$, and we want to update the estimate as each additional return $G_n$ arrives:
$$V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}, \quad n \ge 2$$
To update $V_n$ incrementally, let $C_n = \sum_{k=1}^{n} W_k$; then
$$V_{n+1} = V_n + \frac{W_n}{C_n}\left[G_n - V_n\right]$$
This follows from the definition above together with $C_n$: move the denominator of the first expression over to the $V_n$ side and then write out $V_{n+1}$ from the same definition. [Reading further, I found Exercise 5.10 asks for exactly this derivation, so it is attached below.]
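A quick numerical self-check (my own sketch, not the book's code) that the incremental rule reproduces the batch weighted average:

import numpy as np

def incremental_weighted_average(returns, weights):
    # V <- V + (W_n / C_n) * (G_n - V), with C_n the cumulative weight
    V, C = 0.0, 0.0
    for G, W in zip(returns, weights):
        C += W
        V += (W / C) * (G - V)
    return V

G = np.random.randn(100)
W = np.random.rand(100) + 0.01   # strictly positive weights
assert np.isclose(incremental_weighted_average(G, W),
                  np.sum(W * G) / np.sum(W))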
Here is the pseudocode for the full off-policy MC algorithm with weighted importance sampling, i.e. policy evaluation for estimating $Q \approx q_\pi$:
[Figure: off-policy MC prediction (policy evaluation) pseudocode]

The same pseudocode also applies on-policy: simply choose the behavior policy and the target policy to be the same, so that $W$ is always 1, and drop the last line of the pseudocode. The approximation $Q$ then converges to $q_\pi$. [The book says $b$ can be a different policy here, which I don't quite follow: since we have chosen the same behavior and target policy, $\pi = b$, why would the actions selected according to $b$ possibly come from a different policy?]

5.7 Off-policy Monte Carlo Control

Here is the control (policy improvement) version, for $\pi \approx \pi_*$:
[Figure: off-policy MC control pseudocode]
This section mentions a potential problem: this method learns only from the tails of episodes, after the last nongreedy action. If the earlier actions in an episode are nongreedy, learning can take a very long time, and the problem is worse when nongreedy actions appear early in long episodes. [Question: does "nongreedy" here mean the same thing as "nonoptimal"?]
[Exercise 5.12 is very interesting]

5.9 *Per-decision Importance Sampling

My initial question was mainly why (5.12) can be simplified so much via (5.13), i.e. why only the first factor and the reward remain related: [this also serves as a prose explanation for Exercise 5.13]

  1. First be clear about where $R_{t+1}$ comes from: it is the result of taking $A_t$ in $S_t$, so it depends only on that one state-action pair.
  2. When taking the expectation, correlated factors must stay together, while independent factors can be pulled out and multiplied separately.
  3. For each of the remaining factors, as explained by (5.13), the expectation of $\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$ is the sum over all actions the behavior policy might take of its probability times this ratio, which equals 1.

With the above explanation we obtain equation (5.14). This is an alternative form of importance sampling, called per-decision importance sampling.
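The key fact used in step 3, written out explicitly (my restatement of the book's equation 5.13):
$$\mathbb{E}_b\!\left[\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\right] = \sum_{a} b(a \mid S_k)\,\frac{\pi(a \mid S_k)}{b(a \mid S_k)} = \sum_{a} \pi(a \mid S_k) = 1$$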

5.10 Summary

This chapter uses Monte Carlo methods to learn value functions and optimal policies from experience. Compared with DP, the advantages are:

  • they can be used to learn optimal behavior directly from interaction with the environment, with no model of the environment’s dynamics
  • they can be used with simulation or sample models
  • it is easy and efficient to focus Monte Carlo methods on a small subset of the states
  • less harmed by violations of the Markov property

The fourth point holds because Monte Carlo methods do not bootstrap.

All Exercises

Exercise 5.1: Consider the diagrams on the right in Figure 5.1. Why does the estimated value function jump up for the last two rows in the rear? Why does it drop off for the whole last row on the left? Why are the front most values higher in the upper diagrams than in the lower?

1. It is due to the policy: the player does not stop until reaching 20 or 21, so the player keeps facing the risk of busting by hitting, which produces the low values just below 20 and 21. At 20 or 21, however, the player stops and has a very high chance of winning, especially since the dealer stops at 17 or above.
2. The value drops off for the whole last row on the left because when the dealer shows an ace, the dealer has a very good chance of beating the player, since the ace can be counted as 11. The value of the dealer's ace therefore already incorporates the dealer's advantage of being able to use it as 1 or 11; no other card has this property, so the ace is special, which creates the gap.
3. The frontmost values are higher in the upper diagrams because there the player's ace can count as either 1 or 11. This makes the player better off, similar to the effect the dealer's ace has in the leftmost column.

Exercise 5.2: Suppose every-visit MC was used instead of first-visit MC on the blackjack task. Would you expect the results to be very different? Why or why not?
No. In Blackjack no state can recur within a single episode (the player's sum only increases), so first-visit and every-visit MC are essentially the same thing.

Exercise 5.10: Derive the weighted-average update rule (5.8) from (5.7). Follow the pattern of the derivation of the unweighted rule (2.3).
[Figure: handwritten derivation of (5.8) from (5.7)]
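The same derivation written out (my own write-up, following the pattern used for (2.3)): with $C_{n-1} = \sum_{k=1}^{n-1} W_k$ and $V_n C_{n-1} = \sum_{k=1}^{n-1} W_k G_k$,
$$V_{n+1} = \frac{\sum_{k=1}^{n} W_k G_k}{C_n} = \frac{V_n C_{n-1} + W_n G_n}{C_n} = \frac{V_n (C_n - W_n) + W_n G_n}{C_n} = V_n + \frac{W_n}{C_n}\left[G_n - V_n\right]$$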

Exercise 5.12: Racetrack (programming) Consider driving a race car around a turn like those shown in Figure 5.5. You want to go as fast as possible, but not so fast as to run off the track. In our simplified racetrack, the car is at one of a discrete set of grid positions, the cells in the diagram. The velocity is also discrete, a number of grid cells moved horizontally and vertically per time step. The actions are increments to the velocity components. Each may be changed by +1, -1, or 0 in each step, for a total of nine (3x3) actions. Both velocity components are restricted to be nonnegative and less than 5, and they cannot both be zero except at the starting line. Each episode begins in one of the randomly selected start states with both velocity components zero and ends when the car crosses the finish line. The rewards are -1 for each step until the car crosses the finish line. If the car hits the track boundary, it is moved back to a random position on the starting line, both velocity components are reduced to zero, and the episode continues. Before updating the car’s location at each time step, check to see if the projected path of the car intersects the track boundary. If it intersects the finish line, the episode ends; if it intersects anywhere else, the car is considered to have hit the track boundary and is sent back to the starting line. To make the task more challenging, with probability 0.1 at each time step the velocity increments are both zero, independently of the intended increments. Apply a Monte Carlo control method to this task to compute the optimal policy from each starting state. Exhibit several trajectories following the optimal policy (but turn the noise off for these trajectories).
Racetrack map illustration:

Reference: Solving Racetrack in Reinforcement Learning using Monte Carlo Control

Final result:

How to approach the problem:
0. Analyze the problem:
[Figure: problem analysis diagram]

  1. Split the problem into components:
    a. We need to have a generator whose responsibility will be to randomly generate racetracks for us.
    b. We need to build an environment for this problem. Its main responsibilities will be to start and end episodes. It should also be able to get the new state and reward given the current state and action values.
    c. We need to have an agent (i.e. car here) that would choose an action given the state.
    d. A visualizer is also needed to visualize the generated racetracks along with the agent's location on it.
    e. Implementation of the Monte Carlo Off-Policy Control algorithm.

a. Draw the map
The map is 100x100, and frac ensures that roughly half of the map ends up as invalid cells. Inside each while loop δ is incremented by 1, so cells are progressively invalidated outward from two opposite corners. The start and finish lines are then marked on the valid cells of the last row and last column; a range like [49,32] would mean that cells 32 through 49 on that line are start-line cells.

class Generator:
    
    #HELPFUL FUNCTIONS
    def widen_hole_transformation(self,racetrack,start_cell,end_cell):
        
        δ = 1
        while(1):
            if ((start_cell[1] < δ) or (start_cell[0] < δ)):
                racetrack[0:end_cell[0],0:end_cell[1]] = -1
                break

            if ((end_cell[1]+δ > 100) or (end_cell[0]+δ > 100)):
                racetrack[start_cell[0]:100,start_cell[1]:100] = -1
                break
                
            δ += 1

        return racetrack
    
    def calculate_valid_fraction(self, racetrack):
        '''
        Returns the fraction of valid cells in the racetrack
        '''
        return (len(racetrack[racetrack==0])/10000)

    def mark_finish_states(self, racetrack):
        '''
        Marks finish states in the racetrack
        Returns racetrack
        '''
        last_col = racetrack[0:100,99]
        last_col[last_col==0] = 2
        return racetrack
    
    def mark_start_states(self, racetrack):
        '''
        Marks start states in the racetrack
        Returns racetrack
        '''
        last_row = racetrack[99,0:100]
        last_row[last_row==0] = 1
        return racetrack
    
    
    #CONSTRUCTOR
    def __init__(self):
        pass
    
    def generate_racetrack(self):
        '''
        racetrack is a 2d numpy array
        codes for racetrack:
            0,1,2 : valid racetrack cells
            -1: invalid racetrack cell
            1: start line cells
            2: finish line cells
        returns randomly generated racetrack
        '''
        racetrack = np.zeros((100,100),dtype='int')
        
        frac = 1
        while frac > 0.5:    
            
            #transformation
            random_cell = np.random.randint((100,100))
            random_hole_dims = np.random.randint((25,25))
            start_cell = np.array([max(0,x - y//2) for x,y in zip(random_cell,random_hole_dims)])
            end_cell = np.array([min(100,x+y) for x,y in zip(start_cell,random_hole_dims)])
        
            #apply_transformation
            racetrack = self.widen_hole_transformation(racetrack, start_cell, end_cell)
            frac = self.calculate_valid_fraction(racetrack)
        
        racetrack = self.mark_start_states(racetrack)
        racetrack = self.mark_finish_states(racetrack)
        
        return racetrack

With the map built, we construct what we described above as the states, actions, rewards, etc. — in other words, the training environment:

class Environment:
    
    #HELPFUL FUNCTIONS
    
    def get_new_state(self, state, action):
        '''
        Get new state after applying action on this state
        Assumption: The car keeps on moving with the current velocity and then action is applied to 
        change the velocity
        '''
        new_state = state.copy()
        new_state[0] = state[0] - state[2]
        new_state[1] = state[1] + state[3]
        new_state[2] = state[2] + action[0]
        new_state[3] = state[3] + action[1]
        return new_state
    
    def select_randomly(self,NUMPY_ARR):
        '''
        Returns a value uniform randomly from NUMPY_ARR
        Here NUMPY_ARR should be 1 dimensional
        '''
        return np.random.choice(NUMPY_ARR)
    
    def set_zero(NUMPY_ARR):
        '''
        Returns NUMPY_ARR after making zero all the elements in it
        '''
        NUMPY_ARR[:] = 0
        return NUMPY_ARR
    
    def is_finish_line_crossed(self, state, action):
        '''
        Returns True if the car crosses the finish line
                False otherwise
        '''
        new_state = self.get_new_state(state, action)
        old_cell, new_cell = state[0:2], new_state[0:2]
        
        '''
        new_cell's row index will be less
        '''
        rows = np.array(range(new_cell[0],old_cell[0]+1))
        cols = np.array(range(old_cell[1],new_cell[1]+1))
        fin = set([tuple(x) for x in self.data.finish_line])
        row_col_matrix = [(x,y) for x in rows for y in cols]
        intersect = [x for x in row_col_matrix if x in fin]
        
        return len(intersect) > 0
    
    def is_out_of_track(self, state, action):
        '''
        Returns True if the car goes out of track if action is taken on state
                False otherwise
        '''
        new_state = self.get_new_state(state, action)
        old_cell, new_cell = state[0:2], new_state[0:2]
        
        if new_cell[0] < 0 or new_cell[0] >= 100 or new_cell[1] < 0 or new_cell[1] >= 100:
            return True
        
        else:
            return self.data.racetrack[tuple(new_cell)] == -1
    
    #CONSTRUCTOR
    def __init__(self, data, gen):
        '''
        initialize step_count to be 0
        '''
        self.data = data
        self.gen = gen
        self.step_count = 0
    
    #MEMBER FUNCTIONS
    
    def reset(self):
        self.data.episode = dict({'S':[],'A':[],'probs':[],'R':[None]})
        self.step_count = 0
    
    def start(self):
        '''
        Makes the velocity of the car to be zero
        Returns the randomly selected start state.
        '''
        state = np.zeros(4,dtype='int')
        state[0] = 99
        state[1] = self.select_randomly(self.data.start_line[:,1])
        '''
        state[2] and state[3] are already zero
        '''
        return state
    
    def step(self, state, action):
        '''
        Returns the reward and new state when action is taken on state
        Checks the following 2 cases maintaining the order:
            1. car finishes race by crossing the finish line
            2. car goes out of track
        Ends the episode by returning reward as None and state as usual (which will be terminating)
        '''
        self.data.episode['A'].append(action)
        reward = -1
        
        if (self.is_finish_line_crossed(state, action)):
            new_state = self.get_new_state(state, action)
            
            self.data.episode['R'].append(reward)
            self.data.episode['S'].append(new_state)
            self.step_count += 1
            
            return None, new_state
            
        elif (self.is_out_of_track(state, action)):
            new_state = self.start()
        else:
            new_state = self.get_new_state(state, action)
        
        self.data.episode['R'].append(reward)
        self.data.episode['S'].append(new_state)
        self.step_count += 1
        
        return reward, new_state

Finally, our agent: it takes the state returned by the environment and uses some policy to choose actions. We also wrote some helper functions, such as finding the valid actions allowed by the velocity constraints for the current velocity, and mapping actions between their 1-D index and 2-D form.

class Agent:
    
    #HELPFUL FUNCTIONS
    def possible_actions(self, velocity):
        '''
        *** Performs two tasks, can be split up ***
        Universe of actions:  α = [(-1,-1),(-1,0),(0,-1),(-1,1),(0,0),(1,-1),(0,1),(1,0),(1,1)]
                            
        Uses constraints to filter out invalid actions given the velocity
        
        0 <= v_x < 5
        0 <= v_y < 5
        v_x and v_y cannot be made both zero (you can't take an action which would make them zero simultaneously)
        Returns list of possible actions given the velocity
        '''
        α = [(-1,-1),(-1,0),(0,-1),(-1,1),(0,0),(1,-1),(0,1),(1,0),(1,1)]
        α = [np.array(x) for x in α]

        β = []
        for i,x in zip(range(9),α):
            new_vel = np.add(velocity,x)
            if (new_vel[0] < 5) and (new_vel[0] >= 0) and (new_vel[1] < 5) and (new_vel[1] >= 0) and ~(new_vel[0] == 0 and new_vel[1] == 0):
                β.append(i)
        β = np.array(β)
        
        return β
    
    def map_to_1D(self,action):
        α = [(-1,-1),(-1,0),(0,-1),(-1,1),(0,0),(1,-1),(0,1),(1,0),(1,1)]
        for i,x in zip(range(9),α):
            if action[0]==x[0] and action[1]==x[1]:
                return i
    
    def map_to_2D(self,action):
        α = [(-1,-1),(-1,0),(0,-1),(-1,1),(0,0),(1,-1),(0,1),(1,0),(1,1)]
        return α[action]
    
    #CONSTRUCTOR
    def __init__(self):
        pass
    
    def get_action(self, state, policy):
        '''
        Returns action given state using policy
        '''
        return self.map_to_2D(policy(state, self.possible_actions(state[2:4])))

The visualization of the racetrack is done with pygame (the file also needs `import pygame` at the top); the code is as follows:

class Visualizer:
    
    #HELPFUL FUNCTIONS
    
    def create_window(self):
        '''
        Creates window and assigns self.display variable
        '''
        self.display = pygame.display.set_mode((self.width, self.height))
        pygame.display.set_caption("Racetrack")
    
    def setup(self):
        '''
        Does things which occur only at the beginning
        '''
        self.cell_edge = 9
        self.width = 100*self.cell_edge
        self.height = 100*self.cell_edge
        self.create_window()
        self.window = True

    def close_window(self):
        self.window = False
        pygame.quit()

    def draw(self, state = np.array([])):
        self.display.fill(0)
        for i in range(100):
            for j in range(100):
                if self.data.racetrack[i,j]!=-1:
                    if self.data.racetrack[i,j] == 0:
                        color = (255,0,0)
                    elif self.data.racetrack[i,j] == 1:
                        color = (255,255,0)
                    elif self.data.racetrack[i,j] == 2:
                        color = (0,255,0)
                    pygame.draw.rect(self.display,color,((j*self.cell_edge,i*self.cell_edge),(self.cell_edge,self.cell_edge)),1)
        
        if len(state)>0:
            pygame.draw.rect(self.display,(0,0,255),((state[1]*self.cell_edge,state[0]*self.cell_edge),(self.cell_edge,self.cell_edge)),1)
        
        pygame.display.update()
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                self.loop = False
                self.close_window()
                return 'stop'
            elif event.type == pygame.KEYDOWN and event.key == pygame.K_SPACE:
                self.loop = False
                
        return None
        
    def visualize_racetrack(self, state = np.array([])):
        '''
        Draws Racetrack in a pygame window
        '''
        if self.window == False:
            self.setup()
        self.loop = True
        while(self.loop):
            ret = self.draw(state)
            if ret!=None:
                return ret

The main task now is to train the policy we want, i.e. the Off-Policy Monte Carlo Control algorithm.

class Monte_Carlo_Control:
    
    #HELPFUL FUNCTIONS
    
    def evaluate_target_policy(self):
        env.reset()
        state = env.start()
        self.data.episode['S'].append(state)
        rew = -1
        while rew!=None:
            action = agent.get_action(state,self.generate_target_policy_action)
            rew, state = env.step(state,action)
            
        self.data.rewards.append(sum(self.data.episode['R'][1:]))
        
    
    def plot_rewards(self):
        ax, fig = plt.subplots(figsize=(30,15))
        x = np.arange(1,len(self.data.rewards)+1)
        plt.plot(x*10, self.data.rewards, linewidth=0.5, color = '#BB8FCE')
        plt.xlabel('Episode number', size = 20)
        plt.ylabel('Reward',size = 20)
        plt.title('Plot of Reward vs Episode Number',size=20)
        plt.xticks(size=20)
        plt.yticks(size=20)
        plt.savefig('RewardGraph.png')
        plt.close()
    
    def save_your_work(self):
        self.data.save_Q_vals()
        self.data.save_C_vals()
        self.data.save_π()
        self.data.save_rewards()
    
    def determine_probability_behaviour(self, state, action, possible_actions):
        best_action = self.data.π[tuple(state)]
        num_actions = len(possible_actions)
        
        if best_action in possible_actions:
            if action == best_action:
                prob = 1 - self.data.ε + self.data.ε/num_actions
            else:
                prob = self.data.ε/num_actions
        else:
            prob = 1/num_actions
        
        self.data.episode['probs'].append(prob)
    
    def generate_target_policy_action(self, state, possible_actions):
        '''
        Returns target policy action, takes state and
        returns an action using this policy
        '''
        if self.data.π[tuple(state)] in possible_actions:
            action = self.data.π[tuple(state)]
        else:
            action = np.random.choice(possible_actions)
            
        return action
    
    def generate_behavioural_policy_action(self, state, possible_actions):
        '''
        Returns behavioural policy action
        which would be ε-greedy π policy, takes state and
        returns an action using this ε-greedy π policy
        '''
        if np.random.rand() > self.data.ε and self.data.π[tuple(state)] in possible_actions:
            action = self.data.π[tuple(state)]
        else:
            action = np.random.choice(possible_actions)
        
        self.determine_probability_behaviour(state, action, possible_actions)
    
        return action
    
    #CONSTRUCTOR
    def __init__(self, data):
        '''
        Initialize, for all s ∈ S, a ∈ A(s):
            data.Q(s, a) ← arbitrary (done in Data)
            data.C(s, a) ← 0 (done in Data)
            π(s) ← argmax_a Q(s,a) 
            (with ties broken consistently) 
            (some consistent approach needs to be followed))
        '''
        self.data = data
        for i in range(100):
            for j in range(100):
                if self.data.racetrack[i,j]!=-1:
                    for k in range(5):
                        for l in range(5):
                            self.data.π[i,j,k,l] = np.argmax(self.data.Q_vals[i,j,k,l])
    
    def control(self,env,agent):
        '''
        Performs MC control using episode list [ S0 , A0 , R1, . . . , ST −1 , AT −1, RT , ST ]
        G ← 0
        W ← 1
        For t = T − 1, T − 2, . . . down to 0:
            G ← γ*G + R_t+1
            C(St, At ) ← C(St,At ) + W
            Q(St, At ) ← Q(St,At) + (W/C(St,At))*[G − Q(St,At )]
            π(St) ← argmax_a Q(St,a) (with ties broken consistently)
            If At != π(St) then exit For loop
            W ← W * (1/b(At|St))        
        '''
        env.reset()
        state = env.start()
        self.data.episode['S'].append(state)
        rew = -1
        while rew!=None:
            action = agent.get_action(state,self.generate_behavioural_policy_action)
            rew, state = env.step(state,action)
        
        G = 0
        W = 1
        T = env.step_count
    
        for t in range(T-1,-1,-1):
            G = self.data.γ * G + self.data.episode['R'][t+1]
            S_t = tuple(self.data.episode['S'][t])
            A_t = agent.map_to_1D(self.data.episode['A'][t])
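            # --- The remaining loop body was cut off above; the lines below are my sketch
            # completing it, following the pseudocode in this method's docstring.
            # Assumption: self.data.C_vals is the cumulative-weight array saved by save_C_vals(),
            # shaped like Q_vals (100, 100, 5, 5, 9). ---
            sa = S_t + (A_t,)
            self.data.C_vals[sa] += W
            self.data.Q_vals[sa] += (W / self.data.C_vals[sa]) * (G - self.data.Q_vals[sa])
            self.data.π[S_t] = np.argmax(self.data.Q_vals[S_t])
            if A_t != self.data.π[S_t]:
                break
            # b(At|St) was recorded in episode['probs'] when the behavioural action was chosen
            W *= 1.0 / self.data.episode['probs'][t]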

A step-by-step explanation of each part can be found in the referenced article linked above.


Exercise 5.13: Show the steps to derive (5.14) from (5.12).
[Figure: handwritten derivation]
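The same derivation written out (my own write-up): because $R_{t+1}$ depends only on $S_t$ and $A_t$, each later ratio factor is independent of it and, as shown in 5.9 above, has expectation 1, so
$$\mathbb{E}_b\!\left[\rho_{t:T-1} R_{t+1}\right] = \mathbb{E}_b\!\left[\frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)} R_{t+1}\right] \prod_{k=t+1}^{T-1} \mathbb{E}_b\!\left[\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\right] = \mathbb{E}_b\!\left[\rho_{t:t}\, R_{t+1}\right]$$
which is exactly equation (5.14).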
