Markov Decision Process (MDP): Blackjack (MC Off-Policy)

This post estimates the value function for Blackjack with off-policy Monte Carlo methods. It covers two importance-sampling estimators, ordinary importance sampling and weighted importance sampling, and briefly discusses discounting-aware and per-decision importance sampling.


Blackjack:

    As before, we use Blackjack as the example and estimate its value function with an off-policy method.

Problem setup:
  • $s$ : state (player); whether the player has a usable Ace, the player's card sum, and the dealer's face-up card.
  • $a$ : action; hit (1) or stick (0).
  • $r$ : reward; $\{-1, 0, 1\}$ for loss, draw, win.
  • $\gamma = 1$

On-policy vs. Off-policy

  • On-policy: a single policy; episode generation and value-function estimation are both based on that same policy.
  • Off-policy: two policies, a target policy and a behavior policy, each with its own role.
    • target policy: the policy whose (optimal) behavior we want to learn.
    • behavior policy: explores the environment and generates the episode data.
    • target policy $\neq$ behavior policy; data generated by the behavior policy is used to estimate the value function of the target policy.
    • importance sampling: the importance-sampling ratio $\rho$ converts expectations under the behavior policy into expectations under the target policy, $V_{\pi}(s) \gets \rho V_{b}(s)$.
Importance-sampling ratio

  1. Probability of a state–action sequence:

$$
\begin{aligned}
\Pr\{A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t, A_{t:T-1} \sim \pi\}
&= \pi(A_t|S_t)\,p(S_{t+1}|S_t,A_t)\,\pi(A_{t+1}|S_{t+1})\cdots p(S_T|S_{T-1},A_{T-1}) \\
&= \prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k,A_k)
\end{aligned}
$$

  • $S_t$ : the starting state
  • $\pi(A_t|S_t)$ : the action probability under the policy
  • $p(S_{t+1}|S_t,A_t)$ : the state-transition probability
  2. The importance-sampling ratio is the ratio of the sequence probabilities under the two policies:

$$
\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\,p(S_{k+1}|S_k,A_k)} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}
$$

  • $\pi$ : the target policy (e.g. a greedy policy)
  • $b$ : the behavior policy (e.g. an $\epsilon$-greedy policy)
  3. From behavior-policy values to target-policy values:

$$v_b(s) = \mathbb E\big[G_t \mid S_t = s\big]$$
$$v_{\pi}(s) = \mathbb E\big[\rho_{t:T-1}\, G_t \mid S_t = s\big]$$
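
As a small illustration of the ratio (a sketch of my own, where `pi_probs[k]` and `b_probs[k]` are assumed to hold $\pi(A_k|S_k)$ and $b(A_k|S_k)$ for one recorded episode):

import numpy as np

def importance_ratio(pi_probs, b_probs):
    # rho_{t:T-1} = prod_{k=t}^{T-1} pi(A_k|S_k) / b(A_k|S_k), for every start step t
    ratios = np.asarray(pi_probs, dtype=float) / np.asarray(b_probs, dtype=float)
    # cumulative product taken from the tail gives rho_{t:T-1} for each t
    return np.cumprod(ratios[::-1])[::-1]

# toy episode of length 3: probabilities of the actions actually taken, under pi and under b
pi_probs = [1.0, 1.0, 1.0]     # a deterministic (greedy) target policy took these actions
b_probs = [0.95, 0.95, 0.5]    # an epsilon-greedy behavior policy
rho = importance_ratio(pi_probs, b_probs)
G = -1.0                       # undiscounted return of the episode (gamma = 1)
print(rho, rho[0] * G)         # rho_{0:T-1} * G_0 is one off-policy sample of v_pi(S_0)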

Simulating the game

Record the states, actions, and rewards. A quick usage check follows the function definitions below.

import warnings
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import namedtuple
from tqdm.notebook import tqdm

warnings.filterwarnings('ignore')
# dealer policy
def dealer_policy(cards_num):
    if cards_num < 17:
        return 1
    else:
        return 0
def play_blackjack(policy_player, policy_dealer, initial_state=None, initial_action=None):
    '''
    policy_player : (usable_ace, player_sum, dealer_card) -> action
    policy_dealer : dealer_sum -> action
    return -> reward, player_trajectory
    '''
    def card_value(card):
        return 11 if card == 1 else card
    # player
    player_sum = 0
    # dealer
    dealer_card1 = 0
    dealer_card2 = 0
    # trajectory
    player_trajectory = []
    player_transition = namedtuple('Transition', ['state', 'action'])
    # False : Ace = 1, True Ace = 11
    usable_ace_player = False
    usable_ace_dealer = False
    if initial_state is None:
        while player_sum < 12:
            # keep drawing while the sum is below 12
            card = min(np.random.randint(1, 14), 10)
            #print(card)
            # below 12, a newly drawn Ace counts as 11
            player_sum += card_value(card)
            # the sum exceeded 21
            if player_sum > 21:
                # count the Ace as 1
                player_sum -= 10
            else:
                usable_ace_player |= (1 == card)
        # initialize the dealer's cards; the first one is the face-up card
        dealer_card1 = min(np.random.randint(1, 14), 10)
        dealer_card2 = min(np.random.randint(1, 14), 10)
    else:
        # start from the specified initial state
        usable_ace_player, player_sum, dealer_card1 = initial_state
        dealer_card2 = min(np.random.randint(1, 14), 10)
    
    dealer_sum = card_value(dealer_card1) + card_value(dealer_card2)
    usable_ace_dealer = 1 in (dealer_card1, dealer_card2)
    if dealer_sum > 21:
        # use Ace = 1
        dealer_sum -= 10
    # the player acts first
    while True:
        if initial_action is not None:
            player_action = initial_action
            initial_action = None
        else:
            player_action = policy_player(usable_ace_player, player_sum, dealer_card1)
        # record (state, action)
        player_sa = player_transition((usable_ace_player, player_sum, dealer_card1), player_action)
        player_trajectory.append(player_sa)
        if player_action == 0:
            break
        # draw a card; an Ace counts as 11 by default
        card = min(np.random.randint(1, 14), 10)
        #print(card)
        # Keep track of the ace count
        ace_count = int(usable_ace_player)
        if card == 1:
            ace_count += 1
        player_sum += card_value(card)
        # avoid busting: count an Ace as 1
        while player_sum > 21 and ace_count:
            player_sum -= 10
            ace_count -= 1
        if player_sum > 21:
            return -1 , player_trajectory
        usable_ace_player = (ace_count == 1)
    # dealer's turn
    while True:
        dealer_action = policy_dealer(dealer_sum)
        if dealer_action == 0:
            break
        # draw a card; an Ace counts as 11 by default
        new_card = min(np.random.randint(1, 14), 10)
        #print(new_card)
        ace_count = int(usable_ace_dealer)
        if new_card == 1:
            ace_count += 1
        dealer_sum += card_value(new_card)
        # avoid busting: count an Ace as 1
        while dealer_sum > 21 and ace_count:
            dealer_sum -= 10
            ace_count -= 1
        if dealer_sum > 21:
            return 1 , player_trajectory
        usable_ace_dealer = (ace_count == 1)
    if player_sum > dealer_sum:
        return 1 , player_trajectory
    elif player_sum == dealer_sum:
        return 0 , player_trajectory
    else:
        return -1 , player_trajectory
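
A quick usage check of the simulator (the stick-on-20 player policy below is only for illustration and is not part of the original post):

# smoke test: a player policy that sticks on 20 or 21 and hits otherwise
def naive_player_policy(usable_ace, player_sum, dealer_card):
    return 0 if player_sum >= 20 else 1

reward, trajectory = play_blackjack(naive_player_policy, dealer_policy)
print('reward:', reward)
for state, action in trajectory:
    print('state:', state, 'action:', action)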

Estimating the action-value function $q(s,a)$

Four importance-sampling variants:

  1. Ordinary importance sampling
  2. Weighted importance sampling
  3. Discounting-aware importance sampling (formulas only)
  4. Per-decision importance sampling (formulas only)
1. Ordinary importance sampling

$$Q(s,a) = \frac{\sum_{t\in \mathcal J(s,a)} \rho_{t:T(t)-1}\, G_t}{|\mathcal J(s,a)|}$$

  • $\mathcal J(s,a)$ : the set of time steps at which the state–action pair $(s,a)$ is visited
    • every-visit: every time step at which $(s,a)$ occurs
    • first-visit: only the first time step at which $(s,a)$ occurs within each episode
  • $|\mathcal J(s,a)|$ : the number of visits to $(s,a)$
    • every-visit: the total number of occurrences of $(s,a)$
    • first-visit: at most one occurrence per episode
  • $G_t$ : the return following the visit to $(s,a)$ at time step $t$
MC off-policy (ordinary importance sampling)
  • Initialize, for all $s \in \mathcal S, a \in \mathcal A(s)$:
    • $Q(s,a) \in \mathbb R$ (arbitrarily)
    • $C(s,a) \gets 0$
    • $\pi(s) \gets \underset{a}{\arg\max}\, Q(s,a)$
  • Loop, for each episode:
    • $b \gets$ any soft policy
    • Generate an episode using $b$: $S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T$
    • $G \gets 0$
    • $W \gets 1$
    • Loop, for each step of the episode, $t = T-1, T-2, \ldots, 0$:
      • $G \gets \gamma G + R_{t+1}$
      • $\pi(S_t) \gets \underset{a}{\arg\max}\, Q(S_t,a)$
      • $C(S_t,A_t) \gets C(S_t,A_t) + 1$
      • $Q(S_t,A_t) \gets Q(S_t,A_t) + \frac{W G - Q(S_t,A_t)}{C(S_t,A_t)}$
      • if $A_t \neq \pi(S_t)$, exit the inner loop (proceed to the next episode)
      • $W \gets W \frac{1}{b(A_t|S_t)}$

  • $\pi(a|s) = 1$ for $a = \underset{a}{\arg\max}\, Q(s,a)$ (and $0$ otherwise), hence
  • $\frac{\pi(a|s)}{b(a|s)} = \frac{1}{b(a|s)}$
  • $C(S_t, A_t)$ plays the role of $|\mathcal J(S_t, A_t)|$, the cumulative visit count of $(s,a)$
def monte_carlo_off_policy(episodes, gamma=1.0, epsilon=0.1, threshold=0.0001):
    # nearly greedy policy :behavior
    def soft_policy(usable_ace, player_sum, dealer_card, epsilon=epsilon):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :]
        proba = np.random.uniform(0, 1)
        if proba <= epsilon:
            action = np.random.randint(0, 2)
        else:
            action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
        return action

    # greedy policy :target
    def greedy_policy(usable_ace, player_sum, dealer_card):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :]
        action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
        return action
    
    # random policy : behavior
    def random_policy(usable_ace, player_sum, dealer_card):
        action = np.random.randint(0, 2)
        return action
    # Initialize
    state_action_values = np.zeros((10, 10, 2, 2))
    state_action_pair_count = np.ones((10, 10, 2, 2))
    # Loop for each episode
    delta_history = []
    for episode in tqdm(range(episodes)):
        old_sa = state_action_values.copy()
        # Generate an episode
        player_reward, player_traj = play_blackjack(soft_policy, dealer_policy)
        player_states = [t.state for t in player_traj]
        player_actions = [t.action for t in player_traj]
        player_rewards = [0]*len(player_states)
        player_rewards[-1] = player_reward
        # State,Action,Return
        R = 0
        Gs = []
        for r in player_rewards[::-1]:
            R = r + gamma * R
            Gs.insert(0, R)
        # Loop ::-1
        proba_b_a = 1.0
        for player_state, action, G in zip(player_states[::-1], player_actions[::-1], Gs[::-1]):
            usable_ace_player, player_sum, dealer_card = player_state
            target_action = greedy_policy(usable_ace_player, player_sum, dealer_card)  # target policy
            usable_ace = int(usable_ace_player)
            player_sum -= 12
            dealer_card -= 1
            # Update values of state-action
            if target_action == action:
                # b(greedy|s) for the epsilon-greedy behavior policy with two actions: (1 - epsilon) + epsilon/2
                proba_b_a *= (1 - epsilon + epsilon / 2)

                old_val = state_action_values[player_sum, dealer_card, usable_ace, action]
                sa_count = state_action_pair_count[player_sum, dealer_card, usable_ace, action]
                # ordinary importance sampling: Q <- Q + (W*G - Q) / N, with W = 1 / proba_b_a;
                # the count array starts at 1, so sa_count is the visit number N
                new_val = old_val + (G * (1 / proba_b_a) - old_val) / sa_count
                state_action_values[player_sum, dealer_card, usable_ace, action] = new_val
                state_action_pair_count[player_sum, dealer_card, usable_ace, action] += 1
            else :
                break
        delta = abs(state_action_values - old_sa).max()
        delta_history.append(delta)
    return state_action_values, delta_history
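
One way to run the estimator above and draw value surfaces like the figures below; the episode count and the seaborn heatmap layout are my own choices (reusing the imports at the top of the post), not the author's exact plotting code:

# run the ordinary-importance-sampling estimator (episode count chosen arbitrarily)
q_values, deltas = monte_carlo_off_policy(500000)

# greedy state values v(s) = max_a Q(s, a); axes are player sum (12-21) and dealer card (A-10)
state_values = q_values.max(axis=-1)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ace, ax, title in zip((1, 0), axes, ('usable Ace', 'no usable Ace')):
    sns.heatmap(np.flipud(state_values[:, :, ace]), cmap='YlGnBu', ax=ax,
                xticklabels=list(range(1, 11)), yticklabels=list(reversed(range(12, 22))))
    ax.set_xlabel('dealer showing')
    ax.set_ylabel('player sum')
    ax.set_title(title)
plt.show()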

Usable Ace

[figure]

No usable Ace

[figure]

Policy visualization

[figure]

2. Weighted importance sampling

$$Q(s,a) = \frac{\sum_{t\in \mathcal J(s,a)} \rho_{t:T(t)-1}\, G_t}{\sum_{t\in \mathcal J(s,a)} \rho_{t:T(t)-1}}$$

MC off-policy (weighted importance sampling)
  • Initialize, for all $s \in \mathcal S, a \in \mathcal A(s)$:
    • $Q(s,a) \in \mathbb R$ (arbitrarily)
    • $C(s,a) \gets 0$
    • $\pi(s) \gets \underset{a}{\arg\max}\, Q(s,a)$
  • Loop, for each episode:
    • $b \gets$ any soft policy
    • Generate an episode using $b$: $S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T$
    • $G \gets 0$
    • $W \gets 1$
    • Loop, for each step of the episode, $t = T-1, T-2, \ldots, 0$:
      • $G \gets \gamma G + R_{t+1}$
      • $\pi(S_t) \gets \underset{a}{\arg\max}\, Q(S_t,a)$
      • $C(S_t,A_t) \gets C(S_t,A_t) + W$
      • $Q(S_t,A_t) \gets Q(S_t,A_t) + \frac{W}{C(S_t,A_t)}\big[G - Q(S_t,A_t)\big]$
      • if $A_t \neq \pi(S_t)$, exit the inner loop (proceed to the next episode)
      • $W \gets W \frac{1}{b(A_t|S_t)}$

  • $C(S_t, A_t)$ : the running sum of the ratios, $\sum_{t\in \mathcal J(s,a)} \rho_{t:T(t)-1}$
  • $W$ : the ratio $\rho$
def monte_carlo_off_policy(episodes, gamma=1.0, epsilon=0.1, threshold=0.0001):
    # nearly greedy policy :behavior
    def soft_policy(usable_ace, player_sum, dealer_card, epsilon=epsilon):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :]
        proba = np.random.uniform(0, 1)
        if proba <= epsilon:
            action = np.random.randint(0, 2)
        else:
            action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
        return action

    # greedy policy :target
    def greedy_policy(usable_ace, player_sum, dealer_card):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :]
        action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
        return action
    # Initialize
    state_action_values = np.zeros((10, 10, 2, 2))
    state_action_pair_weights = np.zeros((10, 10, 2, 2))  # cumulative importance weights C(s, a)
    # Loop for each episode
    delta_history = []
    for episode in tqdm(range(episodes)):
        old_sa = state_action_values.copy()
        # Generate an episode
        player_reward, player_traj = play_blackjack(soft_policy, dealer_policy)
        player_states = [t.state for t in player_traj]
        player_actions = [t.action for t in player_traj]
        player_rewards = [0]*len(player_states)
        player_rewards[-1] = player_reward
        # State,Action,Return
        R = 0
        Gs = []
        for r in player_rewards[::-1]:
            R = r + gamma * R
            Gs.insert(0, R)
        # Loop ::-1
        proba_b_a = 1.0
        for player_state, action, G in zip(player_states[::-1], player_actions[::-1], Gs[::-1]):
            usable_ace_player, player_sum, dealer_card = player_state
            target_action = greedy_policy(usable_ace_player, player_sum, dealer_card)  # target policy
            usable_ace = int(usable_ace_player)
            player_sum -= 12
            dealer_card -= 1
            # Update values of state-action
            if target_action == action:
                # b(greedy|s) for the epsilon-greedy behavior policy with two actions: (1 - epsilon) + epsilon/2
                proba_b_a *= (1 - epsilon + epsilon / 2)

                W = 1 / proba_b_a
                old_val = state_action_values[player_sum, dealer_card, usable_ace, action]
                sa_weight = state_action_pair_weights[player_sum, dealer_card, usable_ace, action]
                # weighted importance sampling: C <- C + W, then Q <- Q + (W / C) * (G - Q)
                new_val = old_val + W / (sa_weight + W) * (G - old_val)
                state_action_values[player_sum, dealer_card, usable_ace, action] = new_val
                state_action_pair_weights[player_sum, dealer_card, usable_ace, action] += W
            else :
                break
        delta = abs(state_action_values - old_sa).max()
        delta_history.append(delta)
    return state_action_values, delta_history

Usable Ace

[figure]

No usable Ace

[figure]

Policy visualization

[figure]
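
To contrast the two estimators outside of Blackjack, here is a tiny synthetic one-step example (entirely my own construction, reusing the numpy import above): ordinary importance sampling is unbiased but noisier, weighted importance sampling is biased for small samples but more stable, and both converge to the target value.

# toy check: ordinary vs. weighted IS on a one-step, two-action problem
rng = np.random.default_rng(0)
b_probs_toy = np.array([0.5, 0.5])       # behavior policy over the two actions
pi_probs_toy = np.array([0.9, 0.1])      # target policy; true target value = 0.9 * 1 + 0.1 * 0 = 0.9
actions = rng.choice(2, size=1000, p=b_probs_toy)
returns = (actions == 0).astype(float)   # reward 1 for action 0, reward 0 for action 1
rho = pi_probs_toy[actions] / b_probs_toy[actions]

ordinary = np.cumsum(rho * returns) / np.arange(1, len(returns) + 1)   # unbiased, higher variance
weighted = np.cumsum(rho * returns) / np.cumsum(rho)                   # lower variance, biased early on
print(ordinary[-1], weighted[-1])        # both approach 0.9 as the sample size grows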

3. Discounting-aware importance sampling

Core idea, flat partial returns:

  • flat: no discounting.
  • partial: the return is not the complete return; it covers only part of the episode.

$$\overline{G}_{t:h} = R_{t+1} + R_{t+2} + \ldots + R_h, \qquad 0 \le t < h \le T$$

  • $\overline{G}_{t:h}$ : the flat partial return over time steps $t$ through $h$
  • $T$ : the terminal time step

The complete return can be written as a sum of flat partial returns:

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots + \gamma^{T-t-1} R_T \\
&= (1-\gamma)R_{t+1} \\
&\quad + (1-\gamma)\gamma(R_{t+1} + R_{t+2}) \\
&\quad + (1-\gamma)\gamma^2(R_{t+1} + R_{t+2} + R_{t+3}) \\
&\quad + \ldots \\
&\quad + (1-\gamma)\gamma^{T-t-2}(R_{t+1} + R_{t+2} + \ldots + R_{T-1}) \\
&\quad + \gamma^{T-t-1}(R_{t+1} + R_{t+2} + \ldots + R_T) \\
&= (1-\gamma)\sum_{h=t+1}^{T-1} \gamma^{h-t-1}\overline{G}_{t:h} + \gamma^{T-t-1}\overline{G}_{t:T}
\end{aligned}
$$
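
A quick numerical sanity check of this identity (the rewards and $\gamma$ below are arbitrary choices of mine):

# numerically verify G_t = (1-gamma) * sum_h gamma^(h-t-1) * Gbar_{t:h} + gamma^(T-t-1) * Gbar_{t:T}
rng = np.random.default_rng(1)
gamma_, t = 0.9, 0
rewards = rng.normal(size=6)               # R_{t+1}, ..., R_T, so T - t = 6
T = t + len(rewards)

G_t = sum(gamma_**k * r for k, r in enumerate(rewards))
flat = lambda h: rewards[:h - t].sum()     # Gbar_{t:h} = R_{t+1} + ... + R_h
decomposed = (1 - gamma_) * sum(gamma_**(h - t - 1) * flat(h) for h in range(t + 1, T)) \
             + gamma_**(T - t - 1) * flat(T)
print(np.isclose(G_t, decomposed))         # True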
Discounting-aware & ordinary importance sampling

$$
Q(s,a) = \frac{\sum_{t\in \mathcal J(s,a)}\Big((1-\gamma)\sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1}\overline{G}_{t:h} + \gamma^{T(t)-t-1}\rho_{t:T(t)-1}\overline{G}_{t:T(t)}\Big)}{|\mathcal{J}(s,a)|}
$$

Discounting-aware & weighted importance sampling

$$
Q(s,a) = \frac{\sum_{t\in \mathcal J(s,a)} \Big((1-\gamma)\sum_{h=t+1}^{T(t)-1}\gamma^{h-t-1} \rho_{t:h-1}\overline{G}_{t:h} + \gamma^{T(t)-t-1}\rho_{t:T(t)-1}\overline{G}_{t:T(t)}\Big)}{\sum_{t\in \mathcal J(s,a)}\Big((1-\gamma)\sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1}\rho_{t:T(t)-1}\Big)}
$$
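
Neither discounting-aware variant is implemented in this post. Purely as a sketch of how the per-visit terms could be accumulated (the function and its arguments are my own; `rho_cum[k]` is assumed to hold $\rho_{t:t+k}$ and `rewards[k]` to hold $R_{t+1+k}$):

# sketch: numerator/denominator contribution of one visit to (s, a) at time step t
def discounting_aware_terms(rewards, rho_cum, gamma_):
    T_minus_t = len(rewards)
    flat_returns = np.cumsum(rewards)               # Gbar_{t:t+1}, ..., Gbar_{t:T}
    num = den = 0.0
    for i in range(T_minus_t - 1):                  # h = t+1, ..., T-1, with i = h - t - 1
        w = (1 - gamma_) * gamma_**i * rho_cum[i]   # weight uses rho_{t:h-1}
        num += w * flat_returns[i]
        den += w
    w = gamma_**(T_minus_t - 1) * rho_cum[-1]       # tail term uses rho_{t:T-1}
    num += w * flat_returns[-1]
    den += w
    # ordinary variant: average num over all visits; weighted variant: sum(num) / sum(den)
    return num, den

# with gamma = 1 every interior weight vanishes and only the tail term remains
print(discounting_aware_terms(np.array([0.0, 0.0, 1.0]), np.ones(3), gamma_=1.0))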

4. Per-decision importance sampling

Key idea: the reward $R_{t+k}$ depends only on the actions taken before it, so the ratio factors after step $t+k-1$ cancel in expectation:

$$\mathbb E[\rho_{t:T-1} R_{t+1}] = \mathbb E[\rho_{t:t}R_{t+1}]$$
$$\mathbb E[\rho_{t:T-1}R_{t+k}] = \mathbb E[\rho_{t:t+k-1}R_{t+k}]$$
$$\mathbb E[\rho_{t:T-1}G_t] = \mathbb E[\tilde{G}_t]$$
$$\tilde{G}_t = \rho_{t:t}R_{t+1} + \gamma\rho_{t:t+1}R_{t+2} + \gamma^2\rho_{t:t+2}R_{t+3} + \ldots + \gamma^{T-t-1}\rho_{t:T-1}R_T$$
$$Q(s,a) = \frac{\sum_{t\in \mathcal{J}(s,a)} \tilde{G}_t}{|\mathcal J(s,a)|}$$
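
The post stops at the formula; a minimal sketch of computing $\tilde G_t$ from per-step ratios (my own helper, reusing the numpy import above):

# per-decision importance-sampled return G~_t for t = 0
def per_decision_return(rewards, step_ratios, gamma_):
    # rewards[k] = R_{k+1}, step_ratios[k] = pi(A_k|S_k) / b(A_k|S_k)
    rho_cum = np.cumprod(step_ratios)               # rho_{0:0}, rho_{0:1}, ..., rho_{0:T-1}
    discounts = gamma_ ** np.arange(len(rewards))   # 1, gamma, gamma^2, ...
    return float(np.sum(discounts * rho_cum * rewards))

# toy episode: three steps, gamma = 1 as in Blackjack, reward only at the end
print(per_decision_return(np.array([0.0, 0.0, 1.0]),
                          np.array([1/0.95, 1/0.95, 1/0.95]), 1.0))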
