BlackJack:
As before, we take Blackjack as the example problem and estimate the value function with an off-policy method.
Problem formulation:
- $s$ : state (the player's side); whether the player has a usable Ace, the player's card sum, and the dealer's face-up card.
- $a$ : action; hit (1) or stick (0).
- $r$ : reward; $\{-1, 0, 1\}$ for a loss, a draw, or a win.
- $\gamma = 1$
On-policy vs. off-policy
- On-policy: a single policy is used both to generate the episode data and to estimate the value function.
- Off-policy: two policies are used, a target policy and a behavior policy, each with its own role.
  - target policy : the policy being learned, intended to become the optimal policy.
  - behavior policy : explores the environment and generates the episode data.
- target policy $\neq$ behavior policy; data from the behavior policy is used to estimate the value function of the target policy.
- importance sampling : the importance sampling ratio $\rho$ converts expected values under the behavior policy into expected values under the target policy: $V_{\pi}(s) \gets \rho V_{b}(s)$
Importance sampling ratio
- The probability of a state-action sequence occurring under policy $\pi$:
$$
\begin{aligned}
Pr\{A_t,S_{t+1},A_{t+1},\dots,S_T \mid S_t,A_{t:T-1} \sim \pi\}
&= \pi(A_t|S_t)\,p(S_{t+1}|S_t,A_t)\,\pi(A_{t+1}|S_{t+1})\cdots p(S_T|S_{T-1},A_{T-1})\\
&= \prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k,A_k)
\end{aligned}
$$
- $S_t$ : the starting state
- $\pi(A_t|S_t)$ : the action probability
- $p(S_{t+1}|S_t,A_t)$ : the state-transition probability
- Importance sampling ratio: the ratio of the probabilities that the two policies assign to the same state-action sequence.
$$
\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\,p(S_{k+1}|S_k,A_k)} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}
$$
- $\pi$ : target policy (e.g. the greedy policy)
- $b$ : behavior policy (e.g. an $\epsilon$-greedy policy)
- From the behavior policy's value function to the target policy's value function:
$$v_b(s) = \mathbb E[G_t \mid S_t = s]$$
$$v_{\pi}(s) = \mathbb E[\rho_{t:T-1} G_t \mid S_t = s]$$
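To make this conversion concrete, here is a minimal sketch of ordinary importance-sampling policy evaluation for $v_{\pi}(s)$. The episode format (a list of (state, action, reward) tuples) and the helper names `target_policy_prob(s, a)` and `behavior_policy_prob(s, a)`, assumed to return $\pi(a|s)$ and $b(a|s)$, are illustrative assumptions, not part of the Blackjack code below.

    import numpy as np
    from collections import defaultdict

    def ordinary_is_v(episodes, target_policy_prob, behavior_policy_prob, gamma=1.0):
        """Estimate v_pi(s) from episodes generated by the behavior policy b."""
        returns = defaultdict(list)              # s -> list of rho_{t:T-1} * G_t
        for episode in episodes:
            G, rho = 0.0, 1.0
            weighted_returns = {}
            # walk backwards so G and rho can be accumulated incrementally
            for state, action, reward in reversed(episode):
                G = reward + gamma * G
                rho *= target_policy_prob(state, action) / behavior_policy_prob(state, action)
                weighted_returns[state] = rho * G   # keeps the earliest visit (first-visit MC)
            for state, wg in weighted_returns.items():
                returns[state].append(wg)
        # ordinary importance sampling: plain average of the weighted returns
        return {s: np.mean(ws) for s, ws in returns.items()}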
Simulating the game
Record the states, actions, and rewards along the way.
import warnings
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import namedtuple
from tqdm.notebook import tqdm

warnings.filterwarnings('ignore')

# dealer policy: hit below 17, otherwise stick
def dealer_policy(cards_num):
    if cards_num < 17:
        return 1
    else:
        return 0

def play_blackjack(policy_player, policy_dealer, initial_state=None, initial_action=None):
    '''
    policy_player : (usable_ace, player_sum, dealer_card) -> action
    policy_dealer : dealer_sum -> action
    return -> reward, trajectory
    '''
    def card_value(card):
        return 11 if card == 1 else card
    # player
    player_sum = 0
    # dealer
    dealer_card1 = 0
    dealer_card2 = 0
    # trajectory
    player_trajectory = []
    player_transition = namedtuple('Transition', ['state', 'action'])
    # False : Ace counts as 1, True : Ace counts as 11
    usable_ace_player = False
    usable_ace_dealer = False
    if initial_state is None:
        while player_sum < 12:
            # keep drawing while the sum is below 12
            card = min(np.random.randint(1, 14), 10)
            # below 12, an Ace counts as 11
            player_sum += card_value(card)
            # the sum exceeds 21
            if player_sum > 21:
                # count the Ace as 1
                player_sum -= 10
            else:
                usable_ace_player |= (1 == card)
        # deal the dealer's cards; the first one is face up
        dealer_card1 = min(np.random.randint(1, 14), 10)
        dealer_card2 = min(np.random.randint(1, 14), 10)
    else:
        # start from the given initial state
        usable_ace_player, player_sum, dealer_card1 = initial_state
        dealer_card2 = min(np.random.randint(1, 14), 10)
    dealer_sum = card_value(dealer_card1) + card_value(dealer_card2)
    usable_ace_dealer = 1 in (dealer_card1, dealer_card2)
    if dealer_sum > 21:
        # use Ace = 1
        dealer_sum -= 10
    # the player goes first
    while True:
        if initial_action is not None:
            player_action = initial_action
            initial_action = None
        else:
            player_action = policy_player(usable_ace_player, player_sum, dealer_card1)
        # record the state and the action
        player_sa = player_transition((usable_ace_player, player_sum, dealer_card1), player_action)
        player_trajectory.append(player_sa)
        if player_action == 0:
            break
        # draw a card; a new Ace counts as 11 by default
        card = min(np.random.randint(1, 14), 10)
        # keep track of the ace count
        ace_count = int(usable_ace_player)
        if card == 1:
            ace_count += 1
        player_sum += card_value(card)
        # avoid going bust by counting an Ace as 1
        while player_sum > 21 and ace_count:
            player_sum -= 10
            ace_count -= 1
        if player_sum > 21:
            return -1, player_trajectory
        usable_ace_player = (ace_count == 1)
    # the dealer's turn
    while True:
        dealer_action = policy_dealer(dealer_sum)
        if dealer_action == 0:
            break
        # draw a card; a new Ace counts as 11 by default
        new_card = min(np.random.randint(1, 14), 10)
        ace_count = int(usable_ace_dealer)
        if new_card == 1:
            ace_count += 1
        dealer_sum += card_value(new_card)
        # avoid going bust by counting an Ace as 1
        while dealer_sum > 21 and ace_count:
            dealer_sum -= 10
            ace_count -= 1
        if dealer_sum > 21:
            return 1, player_trajectory
        usable_ace_dealer = (ace_count == 1)
    if player_sum > dealer_sum:
        return 1, player_trajectory
    elif player_sum == dealer_sum:
        return 0, player_trajectory
    else:
        return -1, player_trajectory
Estimating the action-value function $q(s,a)$
Four importance-sampling variants:
- Ordinary importance sampling
- Weighted importance sampling
- Discounting-aware Importance Sampling (outline only)
- Per-decision Importance Sampling (outline only)
1. Ordinary importance sampling
$$Q(s,a) = \frac{\sum_{t\in \mathcal J(s,a)} \rho_{t:T(t)-1} G_t}{|\mathcal J(s,a)|}$$
- $\mathcal J(s,a)$ : the set of time steps at which the state-action pair $(s,a)$ is visited
  - every-visit: every time step at which $(s,a)$ occurs
  - first-visit: only the first time step at which $(s,a)$ occurs
- $|\mathcal J(s,a)|$ : the number of visits to $(s,a)$
  - every-visit: the number of times $(s,a)$ occurs
  - first-visit: 1
- $G_t$ : the return from time step $t$, at which $(s,a)$ occurs
- $T(t)$ : the termination time of the episode containing time step $t$
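As a small illustration of $\mathcal J(s,a)$ and the first-visit / every-visit distinction, the trajectory below is a made-up example (not Blackjack data):

    from collections import defaultdict

    # a made-up trajectory of (state, action) pairs, indexed by time step t
    trajectory = [('s1', 'a'), ('s2', 'b'), ('s1', 'a'), ('s2', 'a'), ('s1', 'a')]

    every_visit = defaultdict(list)   # J(s,a) under every-visit MC
    first_visit = {}                  # first occurrence only
    for t, sa in enumerate(trajectory):
        every_visit[sa].append(t)
        first_visit.setdefault(sa, t)

    print(every_visit[('s1', 'a')])   # [0, 2, 4] -> this episode contributes 3 time steps
    print(first_visit[('s1', 'a')])   # 0         -> this episode contributes 1 time step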
Off-policy MC (ordinary importance sampling)
- Initialize, for all $s \in \mathcal S, a \in \mathcal A(s)$:
  - $Q(s,a) \in \mathbb R$ (arbitrarily)
  - $\mathcal C(s,a) \gets 0$
  - $\pi(s) \gets \underset{a}{argmax}\,Q(s,a)$
- Loop, for each episode:
  - $b \gets$ soft policy
  - Generate an episode using $b$: $S_0,A_0,R_1,\dots,S_{T-1},A_{T-1},R_T$
  - $G \gets 0$
  - $W \gets 1$
  - Loop, for each step of the episode, $t = T-1, T-2, \dots, 0$:
    - $G \gets \gamma G + R_{t+1}$
    - $\pi(S_t) \gets \underset{a}{argmax}\,Q(S_t,a)$
    - $\mathcal C(S_t,A_t) \gets \mathcal C(S_t, A_t) + 1$
    - $Q(S_t,A_t) \gets Q(S_t, A_t) + \frac{W G - Q(S_t,A_t)}{\mathcal C(S_t,A_t)}$
    - if $A_t \neq \pi(S_t)$, exit the inner loop (proceed to the next episode)
    - $W \gets W\frac{1}{b(A_t|S_t)}$
      - since $\pi(a|s) = 1$ when $a = \underset{a}{argmax}\,Q(s,a)$,
      - $\frac{\pi(a|s)}{b(a|s)} = \frac{1}{b(a|s)}$
- $\mathcal C(S_t, A_t)$ : $|\mathcal J(S_t, A_t)|$, the running count of visits to $(s,a)$
def monte_carlo_off_policy(episodes, gamma=1.0, epsilon=0.1, threshold=0.0001):
    # nearly greedy policy : behavior
    def soft_policy(usable_ace, player_sum, dealer_card, epsilon=epsilon):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :]
        proba = np.random.uniform(0, 1)
        if proba <= epsilon:
            action = np.random.randint(0, 2)
        else:
            action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
        return action
    # greedy policy : target
    def greedy_policy(usable_ace, player_sum, dealer_card):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :]
        action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
        return action
    # random policy : behavior (not used below)
    def random_policy(usable_ace, player_sum, dealer_card):
        action = np.random.randint(0, 2)
        return action
    # Initialize
    state_action_values = np.zeros((10, 10, 2, 2))
    state_action_pair_count = np.ones((10, 10, 2, 2))
    # Loop for each episode
    delta_history = []
    for episode in tqdm(range(episodes)):
        old_sa = state_action_values.copy()
        # Generate an episode under the behavior policy
        player_reward, player_traj = play_blackjack(soft_policy, dealer_policy)
        player_states = [t.state for t in player_traj]
        player_actions = [t.action for t in player_traj]
        player_rewards = [0] * len(player_states)
        player_rewards[-1] = player_reward
        # returns G_t, computed backwards from the final reward
        R = 0
        Gs = []
        for r in player_rewards[::-1]:
            R = r + gamma * R
            Gs.insert(0, R)
        # loop over the episode backwards
        proba_b_a = 1.0
        for player_state, action, G in zip(player_states[::-1], player_actions[::-1], Gs[::-1]):
            usable_ace_player, player_sum, dealer_card = player_state
            target_action = greedy_policy(usable_ace_player, player_sum, dealer_card)  # target policy
            usable_ace = int(usable_ace_player)
            player_sum -= 12
            dealer_card -= 1
            # update the state-action value
            if target_action == action:
                # b(a|s) for the greedy action is approximated as 1 - epsilon
                proba_b_a *= (1 - epsilon)
                old_val = state_action_values[player_sum, dealer_card, usable_ace, action]
                sa_count = state_action_pair_count[player_sum, dealer_card, usable_ace, action]
                new_val = old_val + (G * (1 / proba_b_a) - old_val) / (sa_count + 1)
                state_action_values[player_sum, dealer_card, usable_ace, action] = new_val
                state_action_pair_count[player_sum, dealer_card, usable_ace, action] += 1
            else:
                # the greedy target policy would not take this action: the ratio is 0, stop
                break
        delta = abs(state_action_values - old_sa).max()
        delta_history.append(delta)
    return state_action_values, delta_history
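For reference, here is a minimal usage sketch of how the function above might be run and its output plotted as heatmaps; the episode count and the plotting layout are illustrative choices, not taken from the original notebook.

    # assumed usage: run the estimator and plot the value of the greedy action per state
    values, deltas = monte_carlo_off_policy(500000)
    state_values = values.max(axis=-1)   # best action value, shape (player_sum, dealer_card, usable_ace)
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    for ace, ax, title in zip((1, 0), axes, ('usable Ace', 'no usable Ace')):
        sns.heatmap(np.flipud(state_values[:, :, ace]), ax=ax, cmap='YlGnBu',
                    xticklabels=list(range(1, 11)),
                    yticklabels=list(reversed(range(12, 22))))
        ax.set_xlabel('dealer showing')
        ax.set_ylabel('player sum')
        ax.set_title(title)
    plt.show()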
[Figures: usable Ace / no usable Ace; policy visualization]
2. Weighted importance sampling
$$Q(s,a) = \frac{\sum_{t\in \mathcal J(s,a)} \rho_{t:T(t)-1} G_t}{\sum_{t\in \mathcal J(s,a)} \rho_{t:T(t)-1}}$$
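The two estimators differ only in the normalizer: ordinary importance sampling divides by the number of visits, weighted importance sampling divides by the sum of the ratios. A minimal sketch with made-up $(\rho, G)$ pairs (not Blackjack data):

    import numpy as np

    def is_estimates(weighted_returns):
        # weighted_returns : list of (rho_{t:T(t)-1}, G_t) pairs, one per visit to (s, a)
        rhos = np.array([rho for rho, _ in weighted_returns])
        gs = np.array([g for _, g in weighted_returns])
        ordinary = (rhos * gs).sum() / len(weighted_returns)   # divide by |J(s,a)|
        weighted = (rhos * gs).sum() / rhos.sum()              # divide by the sum of the ratios
        return ordinary, weighted

    # a single large ratio pushes the ordinary estimate outside [-1, 1],
    # while the weighted estimate stays within the range of the returns
    print(is_estimates([(0.5, 1.0), (1.2, -1.0), (8.0, 1.0)]))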
Off-policy MC (weighted importance sampling)
- Initialize, for all $s \in \mathcal S, a \in \mathcal A(s)$:
  - $Q(s,a) \in \mathbb R$ (arbitrarily)
  - $C(s,a) \gets 0$
  - $\pi(s) \gets \underset{a}{argmax}\,Q(s,a)$
- Loop, for each episode:
  - $b \gets$ soft policy
  - Generate an episode using $b$: $S_0,A_0,R_1,\dots,S_{T-1},A_{T-1},R_T$
  - $G \gets 0$
  - $W \gets 1$
  - Loop, for each step of the episode, $t = T-1, T-2, \dots, 0$:
    - $G \gets \gamma G + R_{t+1}$
    - $\pi(S_t) \gets \underset{a}{argmax}\,Q(S_t,a)$
    - $C(S_t,A_t) \gets C(S_t, A_t) + W$
    - $Q(S_t,A_t) \gets Q(S_t, A_t) + \frac{WG - Q(S_t,A_t)}{C(S_t,A_t)}$, or $Q(S_t,A_t) \gets Q(S_t,A_t) + \frac{W}{C(S_t,A_t)}\big[G - Q(S_t,A_t)\big]$
    - if $A_t \neq \pi(S_t)$, exit the inner loop (proceed to the next episode)
    - $W \gets W\frac{1}{b(A_t|S_t)}$
- $C(S_t, A_t)$ : $\sum_{t\in \mathcal J(s,a)} \rho_{t:T(t)-1}$, the running sum of the ratios
- $W$ : the ratio $\rho$
def monte_carlo_off_policy(episodes, gamma=1.0, epsilon=0.1, threshold=0.0001):
    # nearly greedy policy : behavior
    def soft_policy(usable_ace, player_sum, dealer_card, epsilon=epsilon):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :]
        proba = np.random.uniform(0, 1)
        if proba <= epsilon:
            action = np.random.randint(0, 2)
        else:
            action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
        return action
    # greedy policy : target
    def greedy_policy(usable_ace, player_sum, dealer_card):
        usable_ace = int(usable_ace)
        player_sum -= 12
        dealer_card -= 1
        values_ = state_action_values[player_sum, dealer_card, usable_ace, :]
        action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])
        return action
    # Initialize
    state_action_values = np.zeros((10, 10, 2, 2))
    state_action_pair_weights = np.ones((10, 10, 2, 2))
    # Loop for each episode
    delta_history = []
    for episode in tqdm(range(episodes)):
        old_sa = state_action_values.copy()
        # Generate an episode under the behavior policy
        player_reward, player_traj = play_blackjack(soft_policy, dealer_policy)
        player_states = [t.state for t in player_traj]
        player_actions = [t.action for t in player_traj]
        player_rewards = [0] * len(player_states)
        player_rewards[-1] = player_reward
        # returns G_t, computed backwards from the final reward
        R = 0
        Gs = []
        for r in player_rewards[::-1]:
            R = r + gamma * R
            Gs.insert(0, R)
        # loop over the episode backwards
        proba_b_a = 1.0
        for player_state, action, G in zip(player_states[::-1], player_actions[::-1], Gs[::-1]):
            usable_ace_player, player_sum, dealer_card = player_state
            target_action = greedy_policy(usable_ace_player, player_sum, dealer_card)  # target policy
            usable_ace = int(usable_ace_player)
            player_sum -= 12
            dealer_card -= 1
            # update the state-action value
            if target_action == action:
                # b(a|s) for the greedy action is approximated as 1 - epsilon
                proba_b_a *= (1 - epsilon)
                old_val = state_action_values[player_sum, dealer_card, usable_ace, action]
                sa_weight = state_action_pair_weights[player_sum, dealer_card, usable_ace, action]
                new_val = old_val + (G * (1 / proba_b_a) - old_val) / (sa_weight + 1 / proba_b_a)
                state_action_values[player_sum, dealer_card, usable_ace, action] = new_val
                state_action_pair_weights[player_sum, dealer_card, usable_ace, action] += 1 / proba_b_a
            else:
                # the greedy target policy would not take this action: the ratio is 0, stop
                break
        delta = abs(state_action_values - old_sa).max()
        delta_history.append(delta)
    return state_action_values, delta_history
[Figures: usable Ace / no usable Ace; policy visualization]
3. Discounting-aware Importance Sampling
The core idea is the flat partial return:
- flat : no discounting is applied.
- partial : the return is not the complete return, only part of it.
$$\overline{G}_{t:h} = R_{t+1} + R_{t+2} + \dots + R_h, \qquad 0 \le t < h \le T$$
- $\overline{G}_{t:h}$ : the flat return accumulated between time step $t$ and horizon $h$
- $T$ : the terminal time step
The complete return can be written as a sum of flat partial returns:
$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T \\
&= (1-\gamma)R_{t+1} \\
&\quad + (1-\gamma)\gamma(R_{t+1} + R_{t+2}) \\
&\quad + (1-\gamma)\gamma^2(R_{t+1} + R_{t+2} + R_{t+3}) \\
&\quad + \dots \\
&\quad + (1-\gamma)\gamma^{T-t-2}(R_{t+1} + R_{t+2} + \dots + R_{T-1}) \\
&\quad + \gamma^{T-t-1}(R_{t+1} + R_{t+2} + \dots + R_T) \\
&= (1-\gamma)\sum_{h=t+1}^{T-1} \gamma^{h-t-1}\overline{G}_{t:h} + \gamma^{T-t-1}\overline{G}_{t:T}
\end{aligned}
$$
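As a quick sanity check of this decomposition, the short sketch below compares both sides numerically for a random reward sequence; the reward values and $\gamma = 0.9$ are arbitrary choices for illustration (in the Blackjack example $\gamma = 1$, where the identity is trivial).

    import numpy as np

    rng = np.random.default_rng(0)
    gamma = 0.9                      # illustrative discount, not the Blackjack setting
    R = rng.normal(size=6)           # rewards R_{t+1}, ..., R_T with t = 0, T = 6
    T = len(R)

    # left-hand side: the ordinary discounted return G_t
    G = sum(gamma**k * R[k] for k in range(T))

    # right-hand side: weighted sum of flat partial returns G_bar_{t:h} = R_{t+1} + ... + R_h
    flat = lambda h: R[:h].sum()     # G_bar_{0:h}
    G_flat = (1 - gamma) * sum(gamma**(h - 1) * flat(h) for h in range(1, T)) \
             + gamma**(T - 1) * flat(T)

    print(np.isclose(G, G_flat))     # True: both expressions give the same return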
Discounting-aware & ordinary importance sampling
$$Q(s,a) = \frac{\sum_{t\in \mathcal J(s,a)}\Big((1-\gamma)\sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1}\overline{G}_{t:h} + \gamma^{T(t)-t-1}\rho_{t:T(t)-1}\overline{G}_{t:T(t)}\Big)}{|\mathcal{J}(s,a)|}$$
Discounting-aware & weighted importance sampling
$$Q(s,a) = \frac{\sum_{t\in \mathcal J(s,a)} \Big((1-\gamma)\sum_{h=t+1}^{T(t)-1}\gamma^{h-t-1} \rho_{t:h-1}\overline{G}_{t:h} + \gamma^{T(t)-t-1}\rho_{t:T(t)-1}\overline{G}_{t:T(t)}\Big)}{\sum_{t\in \mathcal J(s,a)}\Big((1-\gamma)\sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1}\rho_{t:T(t)-1}\Big)}$$
4. Per-decision Importance Sampling
Key idea: the factors of the importance ratio that come after the time step at which a reward is received do not change its expectation, so each reward only needs to be weighted by the ratio up to its own step:
$$\mathbb E[\rho_{t:T-1} R_{t+1}] = \mathbb E[\rho_{t:t}R_{t+1}]$$
$$\mathbb E[\rho_{t:T-1}R_{t+k}] = \mathbb E[\rho_{t:t+k-1}R_{t+k}]$$
$$\mathbb E[\rho_{t:T-1}G_t] = \mathbb E[\tilde{G}_t]$$
where
$$\tilde{G}_t = \rho_{t:t}R_{t+1} + \gamma\rho_{t:t+1}R_{t+2} + \gamma^2\rho_{t:t+2}R_{t+3} + \dots+\gamma^{T-t-1}\rho_{t:T-1}R_T$$
The per-decision (ordinary) importance-sampling estimate is then
$$Q(s,a) = \frac{\sum_{t\in \mathcal{J}(s,a)} \tilde{G}_t}{|\mathcal J(s,a)|}$$
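To make the per-decision return concrete, here is a minimal sketch that computes $\tilde{G}_t$ for every step of a behavior-policy episode; the episode format and the probability helpers `pi_prob(s, a)` and `b_prob(s, a)` are assumptions for illustration and are not part of the Blackjack code above.

    import numpy as np

    def per_decision_returns(episode, pi_prob, b_prob, gamma=1.0):
        """Compute the per-decision importance-sampled return G~_t for each time step t.

        episode : list of (state, action, reward) tuples generated by b
        pi_prob(s, a) -> pi(a|s), b_prob(s, a) -> b(a|s)   (assumed helpers)
        """
        T = len(episode)
        # per-step ratios pi(A_k|S_k) / b(A_k|S_k)
        ratios = np.array([pi_prob(s, a) / b_prob(s, a) for s, a, _ in episode])
        rewards = np.array([r for _, _, r in episode])
        g_tilde = np.zeros(T)
        for t in range(T):
            # cumulative products give rho_{t:t}, rho_{t:t+1}, ..., rho_{t:T-1}
            rho = np.cumprod(ratios[t:])
            discounts = gamma ** np.arange(T - t)
            # G~_t = sum_k gamma^{k-1} * rho_{t:t+k-1} * R_{t+k}
            g_tilde[t] = np.sum(discounts * rho * rewards[t:])
        return g_tilde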