Chapter 7 n-step Bootstrapping

滑稽树

Categories: Reinforcement Learning Notes, Game AI

Article link: https://blog.csdn.net/dengyibing/article/details/80623399


The core idea is to take several more steps forward before bootstrapping.

7.1 n-step TD Prediction

The backup diagrams of n-step methods

Extending temporal-difference learning over n steps gives what are called n-step TD methods.

n-step returns

$G_{t:t+n} \doteq R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1} R_{t+n}+\gamma^n V_{t+n-1}(S_{t+n})$

where $V_t:S \rightarrow \mathbb{R}$ is the estimate of $v_{\pi}$ at time $t$.

Because the return looks n steps ahead, the update for $S_t$ can only be made once $R_{t+n}$ has been observed and $V_{t+n-1}$ has been computed:

$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t)+\alpha[G_{t:t+n}-V_{t+n-1}(S_t)], \qquad 0 \leq t \lt T$
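The update above can be sketched in code. The following is a minimal illustration, not the book's reference implementation: the 5-state random walk environment, the function name `n_step_td_random_walk`, and all default parameters are assumptions chosen only to make the example self-contained.

```python
import random


def n_step_td_random_walk(n=3, alpha=0.1, gamma=1.0, episodes=2000, seed=0):
    """n-step TD prediction on a 5-state random walk (hypothetical test bed).

    States 1..5 are non-terminal; 0 and 6 are terminal. Every episode starts
    in state 3, moves left or right with equal probability, and yields a
    reward of +1 only when it terminates on the right.
    """
    rng = random.Random(seed)
    V = [0.0] * 7                      # V[0] and V[6] stay 0 (terminal)
    for _ in range(episodes):
        states = [3]                   # states[t] = S_t
        rewards = [0.0]                # rewards[t] = R_t (rewards[0] unused)
        T = float("inf")               # episode length, unknown until the end
        t = 0
        while True:
            if t < T:
                s_next = states[t] + rng.choice((-1, 1))
                states.append(s_next)
                rewards.append(1.0 if s_next == 6 else 0.0)
                if s_next in (0, 6):
                    T = t + 1
            tau = t - n + 1            # time whose state estimate is updated
            if tau >= 0:
                # G_{tau:tau+n}: discounted rewards, truncated at termination
                last = min(tau + n, T)
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, last + 1))
                if tau + n < T:        # bootstrap from V(S_{tau+n})
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

Note how the update for time `tau` is delayed by n steps, which is exactly the waiting described above. For this random walk the true values of states 1..5 are 1/6, 2/6, ..., 5/6, so the returned estimates should increase from left to right.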

n-step TD for estimating

Error reduction property of n-step returns: the worst error of the expected n-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$:

$\underset{s}{\max}|\mathbb{E}_{\pi}[G_{t:t+n}|S_t=s]-v_{\pi}(s)| \leq \gamma^n \underset{s}{\max}|V_{t+n-1}(s)-v_{\pi}(s)|$

This shows that all n-step TD methods converge to the correct predictions under appropriate technical conditions.
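A short derivation sketch of why the error reduction property holds, assuming only that $v_{\pi}$ satisfies the Bellman equation unrolled over n steps:

$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1}R_{t+n}+\gamma^n v_{\pi}(S_{t+n}) \mid S_t=s]$

Subtracting this from the expected n-step return cancels all reward terms, leaving only the bootstrapped term:

$\mathbb{E}_{\pi}[G_{t:t+n} \mid S_t=s]-v_{\pi}(s) = \gamma^n \, \mathbb{E}_{\pi}[V_{t+n-1}(S_{t+n})-v_{\pi}(S_{t+n}) \mid S_t=s]$

The absolute value of an expectation is at most the largest absolute value of its argument, so

$|\mathbb{E}_{\pi}[G_{t:t+n} \mid S_t=s]-v_{\pi}(s)| \leq \gamma^n \underset{s'}{\max}|V_{t+n-1}(s')-v_{\pi}(s')|$

for every $s$, and taking the maximum over $s$ on the left gives the stated bound.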

7.2 n-step Sarsa

Compared with the Sarsa introduced earlier, the only change is that the target $G$ becomes an n-step return:

$G_{t:t+n} \doteq R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1} R_{t+n}+\gamma^n Q_{t+n-1}(S_{t+n},A_{t+n}), \qquad n \geq 1, 0 \leq t \lt T-n$

The update rule is essentially unchanged:

$Q_{t+n}(S_t,A_t) \doteq Q_{t+n-1}(S_t,A_t)+\alpha[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)], \qquad 0 \leq t \lt T$

The backup diagrams for the spectrum of n-step methods for state-action values
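As a rough sketch of the n-step Sarsa return and update for a single time step, assuming a recorded episode: the trajectory layout, the dictionary-based `Q`, and the function name `n_step_sarsa_update` are illustrative assumptions, not part of the original notes.

```python
from collections import defaultdict


def n_step_sarsa_update(Q, trajectory, t, n, alpha, gamma):
    """Apply one n-step Sarsa update to Q[(S_t, A_t)] and return the target G.

    trajectory is a list of (state, action, reward) tuples in which the
    reward at index k is R_{k+1}, received after taking A_k in S_k.
    Assumption: the episode terminates right after the last tuple, so
    T = len(trajectory).
    """
    T = len(trajectory)
    end = min(t + n, T)
    # Discounted rewards R_{t+1} .. R_{min(t+n, T)}
    G = sum(gamma ** (k - t) * trajectory[k][2] for k in range(t, end))
    if t + n < T:
        # Bootstrap from Q(S_{t+n}, A_{t+n})
        s_n, a_n, _ = trajectory[t + n]
        G += gamma ** n * Q[(s_n, a_n)]
    s, a, _ = trajectory[t]
    Q[(s, a)] += alpha * (G - Q[(s, a)])
    return G


Q = defaultdict(float)
Q[("s2", "right")] = 2.0
episode = [("s0", "right", 1.0), ("s1", "right", 0.0), ("s2", "right", 3.0)]
G = n_step_sarsa_update(Q, episode, t=0, n=2, alpha=0.5, gamma=0.9)
```

Here `G` is `1.0 + 0.9 * 0.0 + 0.81 * 2.0`, and `Q[("s0", "right")]` moves halfway from 0 toward that target. When `t + n` reaches the end of the episode, the bootstrap term is dropped and the target is the plain return.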

The figure above also shows n-step Expected Sarsa. It is the same as n-step Sarsa except for the last term of the return:

$G_{t:t+n} \doteq R_{t+1}+\cdots+\gamma^{n-1}R_{t+n}+\gamma^n \bar V_{t+n-1}(S_{t+n}), \qquad t+n \lt T,$

with $G_{t:t+n} \doteq G_t$ for $t+n \geq T$, where $\bar V_t(s)$ is the expected approximate value of state $s$ under the target policy:

$\bar V_t(s) \doteq \sum_a \pi(a|s)Q_t(s,a), \qquad \text{for all } s \in S$
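The expected approximate value and the resulting return can be illustrated with a small sketch. The function names and the dictionary representation of $\pi(\cdot|s)$ and $Q(s,\cdot)$ are assumptions made for this example only.

```python
def expected_value(pi_s, q_s):
    """V-bar(s) = sum_a pi(a|s) * Q(s, a); both arguments are dicts keyed by action."""
    return sum(pi_s[a] * q_s[a] for a in pi_s)


def n_step_expected_sarsa_return(rewards, gamma, pi_s, q_s):
    """n-step Expected Sarsa return for the case t + n < T.

    rewards = [R_{t+1}, ..., R_{t+n}]; pi_s and q_s describe the target
    policy and the action-value estimates at the final state S_{t+n}.
    """
    n = len(rewards)
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    return G + gamma ** n * expected_value(pi_s, q_s)


pi_s = {"left": 0.25, "right": 0.75}
q_s = {"left": 0.0, "right": 4.0}
v_bar = expected_value(pi_s, q_s)
G = n_step_expected_sarsa_return([1.0, 2.0], 0.5, pi_s, q_s)
```

In this example `v_bar` is `0.25 * 0.0 + 0.75 * 4.0` and the return adds `0.5 ** 2 * v_bar` to the discounted rewards `1.0 + 0.5 * 2.0`.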

7.3 n-step Off-policy Learning by Importance Sampling

This section gives a good introduction to off-policy learning. Off-policy learning means learning the value function of one policy, $\pi$, while following the experience generated by another policy, $b$. Often $\pi$ is the greedy policy for the current action-value estimates, and $b$ is a more exploratory policy, perhaps $\varepsilon\text{-greedy}$.

As before, this requires the importance sampling ratio:

$\rho_{t:h} \doteq \prod_{k=t}^{\min(h,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$
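A minimal sketch of the ratio, assuming the probabilities of the actions actually taken have been recorded per time step (the list layout and the function name are hypothetical):

```python
def importance_sampling_ratio(pi_probs, b_probs, t, h):
    """rho_{t:h}: product over k = t .. min(h, T-1) of pi(A_k|S_k) / b(A_k|S_k).

    pi_probs[k] and b_probs[k] are the probabilities that the target policy
    and the behavior policy assign to the action actually taken at time k.
    Assumption: T = len(pi_probs), and b_probs[k] > 0 for every taken action.
    """
    T = len(pi_probs)
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi_probs[k] / b_probs[k]
    return rho


# Greedy target policy, uniform behavior policy over two actions.
pi_probs = [1.0, 1.0, 0.0, 1.0]   # the action at k = 2 was not greedy
b_probs = [0.5, 0.5, 0.5, 0.5]
```

With these numbers `importance_sampling_ratio(pi_probs, b_probs, 0, 1)` is 4.0, while any window that includes `k = 2` has ratio 0, so a return containing an action that $\pi$ would never take is ignored entirely. The upper limit is clipped at `T - 1`, so a horizon beyond the end of the episode is harmless.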

Update rule:

$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t)+\alpha \rho_{t:t+n-1}[G_{t:t+n}-V_{t+n-1}(S_t)], \qquad 0 \leq t \lt T$

Off-policy form of n-step Sarsa:

$Q_{t+n}(S_t,A_t) \doteq Q_{t+n-1}(S_t,A_t)+\alpha \rho_{t+1:t+n}[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)], \qquad 0 \leq t \lt T$

The ratio here starts and ends one step later than for n-step TD because a state-action pair is being updated: $A_t$ has already been selected, so importance sampling is needed only for the subsequent actions.

Off-policy n-step Sarsa

7.4 *Per-decision Off-policy Methods with Control Variates

A more sophisticated approach would use per-decision importance sampling ideas

The n-step return can be written recursively as

$G_{t:h} = R_{t+1}+\gamma G_{t+1:h}, \qquad t \lt h \lt T,$

The off-policy definition of the n-step return ending at horizon $h$ is

$G_{t:h} \doteq \rho_t(R_{t+1}+\gamma G_{t+1:h})+(1-\rho_t)V_{h-1}(S_t), \qquad t \lt h \lt T, \tag {7.13}$

with $G_{h:h} \doteq V_{h-1}(S_h)$. The second term in (7.13) is called a control variate. The control variate does not change the expected update: as explained in Section 5.9, the importance sampling ratio has expected value 1, and it is uncorrelated with the estimate, so the term $(1-\rho_t)V_{h-1}(S_t)$ has expected value zero.
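The recursion (7.13) can be evaluated backwards from the horizon. The sketch below assumes $h \lt T$ and a simple list layout for the recorded quantities; the function name and argument layout are illustrative, not from the original notes.

```python
def per_decision_return(rewards, rhos, values, v_h, gamma):
    """G_{t:h} from (7.13), computed backwards from G_{h:h} = V(S_h).

    For k = 0 .. h-t-1:
      rewards[k] = R_{t+k+1}
      rhos[k]    = rho_{t+k}   (single-step importance sampling ratio)
      values[k]  = V(S_{t+k})  (current state-value estimate)
    v_h = V(S_h). Assumption: the horizon h lies before termination.
    """
    G = v_h
    for r, rho, v in zip(reversed(rewards), reversed(rhos), reversed(values)):
        # The first term is the importance-weighted sample; the second is
        # the control variate that takes over as rho shrinks.
        G = rho * (r + gamma * G) + (1.0 - rho) * v
    return G
```

The point of the control variate shows up when a ratio is zero: `per_decision_return([1.0], [0.0], [5.0], 2.0, 0.9)` returns the current estimate 5.0 rather than 0, so the update leaves the estimate unchanged instead of shrinking it toward zero. With a ratio of 1 the same call returns the ordinary one-step target `1.0 + 0.9 * 2.0`.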

An off-policy form with control variates for action values:

$\begin{align*} G_{t:h} &\doteq R_{t+1}+\gamma(\rho_{t+1}G_{t+1:h}+\bar V_{h-1}(S_{t+1})-\rho_{t+1}Q_{h-1}(S_{t+1},A_{t+1})), \\ & = R_{t+1}+\gamma \rho_{t+1}(G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma \bar V_{h-1}(S_{t+1}), \qquad t \lt h \leq T. \end{align*}$

If $h \lt T$, the recursion ends with $G_{h:h} \doteq Q_{h-1}(S_h,A_h)$; if $h \geq T$, it ends with $G_{T-1:h} \doteq R_T$.

Control variates are a way of reducing variance.

7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

This is an off-policy method that needs no importance sampling.

The tree-backup update

The general form of the tree-backup n-step return is

$G_{t:t+n} \doteq R_{t+1}+\gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1})Q_{t+n-1}(S_{t+1},a)+\gamma \pi(A_{t+1}|S_{t+1})G_{t+1:t+n}, \qquad t \lt T-1, n \geq 2$

The $n=1$ case is the one-step Expected Sarsa return, $G_{t:t+1} \doteq R_{t+1}+\gamma \sum_a \pi(a|S_{t+1})Q_t(S_{t+1},a)$, except at the end of the episode, where $G_{T-1:t+n} \doteq R_T$.

This target is then used with the usual action-value update rule from n-step Sarsa:

$Q_{t+n}(S_t,A_t) \doteq Q_{t+n-1}(S_t,A_t)+\alpha[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)], \qquad 0 \leq t \lt T,$

n-step Tree Backup for estimating
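A sketch of how the tree-backup return can be computed backwards along a recorded trajectory. The node tuple layout, the use of `None` for a terminal state, and the function name are assumptions made for illustration.

```python
def tree_backup_return(rewards, nodes, gamma):
    """Tree-backup n-step return G_{t:t+n}, computed from the leaf backwards.

    rewards[k] = R_{t+k+1} for k = 0 .. n-1.
    nodes[k] describes S_{t+k+1} as a tuple (pi_s, q_s, action_taken), where
    pi_s and q_s are dicts keyed by action, or None if S_{t+k+1} is terminal.
    Assumption: only the last node may be terminal.
    """
    G = 0.0
    last = len(rewards) - 1
    for k in range(last, -1, -1):
        node = nodes[k]
        if node is None:
            # Episode ended here: the return is just the final reward R_T.
            G = rewards[k]
            continue
        pi_s, q_s, a_taken = node
        if k == last:
            # Leaf: full expectation over all actions (one-step Expected Sarsa).
            G = rewards[k] + gamma * sum(pi_s[a] * q_s[a] for a in pi_s)
        else:
            # Interior node: untaken actions contribute their Q estimates,
            # the taken action contributes the deeper return weighted by pi.
            off_path = sum(pi_s[a] * q_s[a] for a in pi_s if a != a_taken)
            G = rewards[k] + gamma * off_path + gamma * pi_s[a_taken] * G
    return G


node1 = ({"a": 0.5, "b": 0.5}, {"a": 2.0, "b": 4.0}, "a")
node2 = ({"a": 1.0, "b": 0.0}, {"a": 10.0, "b": 0.0}, "a")
G = tree_backup_return([1.0, 2.0], [node1, node2], gamma=1.0)
```

No behavior-policy probabilities appear anywhere, which is the sense in which the method avoids importance sampling. In the example the leaf gives `2.0 + 10.0 = 12.0`, and the root gives `1.0 + 0.5 * 4.0 + 0.5 * 12.0 = 9.0`.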

7.6 *A Unifying Algorithm: n-step $Q(\delta)$

This is similar to the methods described above: only the way of looking ahead changes, and everything else stays the same (see the figure below). At each step the algorithm either samples the action taken, as in Sarsa ($\delta = 1$), or takes the expectation over all actions, as in tree backup ($\delta = 0$). Sutton and Barto write this parameter as $\sigma$ and call the algorithm n-step $Q(\sigma)$; these notes use $\delta$.

The backup diagrams

Rewrite the tree-backup n-step return (7.16), with horizon $h = t+n$, as follows:

$\begin{align*} G_{t:h} & = R_{t+1}+\gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1})Q_{h-1}(S_{t+1},a)+\gamma \pi(A_{t+1}|S_{t+1})G_{t+1:h}\\ & = R_{t+1}+\gamma \bar V_{h-1}(S_{t+1})-\gamma \pi(A_{t+1}|S_{t+1})Q_{h-1}(S_{t+1},A_{t+1})+\gamma \pi(A_{t+1}|S_{t+1})G_{t+1:h}\\ & = R_{t+1}+\gamma \pi(A_{t+1}|S_{t+1})(G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma \bar V_{h-1}(S_{t+1}), \end{align*}$

This is exactly the n-step Sarsa return with control variates, except that the action probability $\pi(A_{t+1}|S_{t+1})$ appears in place of the importance-sampling ratio $\rho_{t+1}$. Replacing that probability with a linear blend of the two gives

$G_{t:h} \doteq R_{t+1}+\gamma(\delta_{t+1}\rho_{t+1}+(1-\delta_{t+1})\pi(A_{t+1}|S_{t+1}))(G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma \bar V_{h-1}(S_{t+1})$

for $t \lt h \leq T$. If $h \lt T$, the recursion ends with $G_{h:h} \doteq Q_{h-1}(S_h,A_h)$; if $h=T$, it ends with $G_{T-1:T} \doteq R_T$.
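The blended recursion can be sketched as follows, keeping the $\delta$ notation of these notes. The node tuple layout, the `b_prob` field for the behavior policy's probability of the taken action, and the function name are hypothetical choices for this example.

```python
def q_delta_return(rewards, nodes, gamma):
    """G_{t:h} for the blended (sampling / expectation) recursion.

    rewards[k] = R_{t+k+1} for k = 0 .. h-t-1.
    nodes[k] describes S_{t+k+1} as (pi_s, q_s, action_taken, b_prob, delta):
      pi_s, q_s : dicts keyed by action (target policy, Q estimates)
      b_prob    : behavior-policy probability of action_taken (assumed > 0)
      delta     : 1.0 for full sampling, 0.0 for pure expectation
    The last node is None when the episode terminates at the horizon (h = T).
    """
    n = len(rewards)
    if nodes[-1] is None:
        G = rewards[-1]                 # G_{T-1:T} = R_T
        start = n - 2
    else:
        _, q_s, a, _, _ = nodes[-1]
        G = q_s[a]                      # G_{h:h} = Q(S_h, A_h)
        start = n - 1
    for k in range(start, -1, -1):
        pi_s, q_s, a, b_prob, delta = nodes[k]
        v_bar = sum(pi_s[x] * q_s[x] for x in pi_s)
        rho = pi_s[a] / b_prob
        weight = delta * rho + (1.0 - delta) * pi_s[a]
        G = rewards[k] + gamma * weight * (G - q_s[a]) + gamma * v_bar
    return G


pi1, q1 = {"a": 0.5, "b": 0.5}, {"a": 2.0, "b": 4.0}
pi2, q2 = {"a": 1.0, "b": 0.0}, {"a": 10.0, "b": 0.0}


def example(delta):
    nodes = [(pi1, q1, "a", 0.5, delta), (pi2, q2, "a", 0.5, delta)]
    return q_delta_return([1.0, 2.0], nodes, gamma=1.0)
```

With `delta = 0.0` the weight reduces to $\pi(A_{t+1}|S_{t+1})$ and `example(0.0)` gives 9.0, the tree-backup return for the same data. With `delta = 1.0` the weight is $\rho_{t+1}$ and `example(1.0)` gives 14.0, the Sarsa return with control variates.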

Off-policy n-step Q(delta)
