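As a beginner, I am writing this Reinforcement Learning Fundamentals column to share my own journey of learning reinforcement learning, in the hope that it helps others. The series will keep being updated; I hope to average one post per day throughout 2021. It mainly covers the fundamentals of reinforcement learning, and a column of reinforcement learning paper readings will follow. I originally wanted each post to cover more, but then noticed that people mostly come to CSDN with specific questions, so I split much of the material into smaller pieces (which also keeps the daily workload smaller, haha, the lazy way). I still want the material to form a coherent system, so the table of contents is organized systematically by chapter, and I have also written it up as a book on GitHub. If you find it helpful and want to read from the beginning, you are welcome to follow my GitHub. Thank you! I will also be sharing a Deep Learning Fundamentals column and a Deep Learning Paper Reading column, which my friends and I put a lot of effort into a long time ago; if you are interested in deep learning, you are welcome to follow those too. Let's learn from each other! There may be many mistakes and omissions, so corrections are welcome. Do not overestimate what one year of effort can do, and do not underestimate what ten years of accumulation can do. Let this be encouragement for us all!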
Off-Policy policy gradient
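The on-policy policy gradient method described above has to discard all previously collected samples after every policy update, so it is very sample-inefficient. This raises the question of whether it can be turned into an off-policy method.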
Importance Sampling
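Importance sampling (IS below) is a method for estimating an expectation under a distribution that is hard to sample from by using samples drawn from a distribution that is easy to sample from. Building on it, data collected by another policy can be used to improve the current policy, which is what makes the method off-policy.
Suppose we can sample from $q(x)$ but need the expectation of $f(x)$ with respect to $p(x)$. Provided $q(x) > 0$ wherever $p(x) f(x) \neq 0$, it can be obtained as follows: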
$$
\begin{aligned}
E_{x \sim p(x)}[f(x)] &= \int p(x) f(x) \, dx \\
&= \int \frac{q(x)}{q(x)} p(x) f(x) \, dx \\
&= \int q(x) \frac{p(x)}{q(x)} f(x) \, dx \\
&= E_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right]
\end{aligned}
$$
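The identity above can be checked numerically. The following is a minimal sketch, not part of the original post: the choice of $p = \mathcal{N}(1, 1)$, $q = \mathcal{N}(0, 2^2)$, $f(x) = x^2$, and the helper names `normal_pdf` and `importance_sampling_estimate` are all assumptions made for illustration. For this choice the exact value is $E_p[x^2] = 1 + 1 = 2$.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_estimate(f, p_pdf, q_pdf, q_sampler, n_samples):
    """Estimate E_{x~p}[f(x)] using samples drawn from q only."""
    total = 0.0
    for _ in range(n_samples):
        x = q_sampler()                # sample from the easy distribution q
        weight = p_pdf(x) / q_pdf(x)   # importance weight p(x) / q(x)
        total += weight * f(x)
    return total / n_samples


rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = importance_sampling_estimate(
    f=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 1.0, 1.0),   # target p = N(1, 1)
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),   # proposal q = N(0, 2^2)
    q_sampler=lambda: rng.gauss(0.0, 2.0),
    n_samples=100_000,
)
print(estimate)  # should land near the exact value 2.0
```

The proposal $q$ is deliberately wider than $p$ here, which keeps the weights $p(x)/q(x)$ bounded; a proposal that is narrower than the target would tend to give a much noisier estimate.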
Off-policy policy gradient
Now bring importance sampling into policy gradient. Suppose we have data sampled from $\bar{\pi}(\tau)$. The objective for $\pi_{\theta}(\tau)$, evaluated on this data, can then be written as:
$$
J(\theta)=E_{\tau \sim \bar{\pi}(\tau)}\left[\frac{\pi_{\theta}(\tau)}{\bar{\pi}(\tau)} r(\tau)\right]
$$
The trajectory distribution $\pi_{\theta}(\tau)$ factorizes as:
$$
\pi_{\theta}(\tau)=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)
$$
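The initial-state distribution and the transition probabilities come from the environment and are the same under both policies, so they cancel, and the ratio of the two policies becomes: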
$$
\frac{\pi_{\theta}(\tau)}{\bar{\pi}(\tau)}=\frac{\prod_{t=1}^{T} \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)}{\prod_{t=1}^{T} \bar{\pi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)}
$$
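To make the cancellation concrete, here is a small sketch with made-up numbers (the probabilities and the helper names `trajectory_prob` and `policy_ratio` are assumptions, not from the original post). It computes the ratio once from full trajectory probabilities, including the environment terms, and once from the per-step action probabilities alone, and the two agree.

```python
import math


def trajectory_prob(init_prob, action_probs, transition_probs):
    """p(s_1) * prod_t pi(a_t | s_t) * p(s_{t+1} | s_t, a_t)."""
    prob = init_prob
    for pi_a, p_next in zip(action_probs, transition_probs):
        prob *= pi_a * p_next
    return prob


def policy_ratio(new_action_probs, old_action_probs):
    """prod_t pi_new(a_t | s_t) / pi_old(a_t | s_t), computed in log space."""
    log_ratio = sum(
        math.log(new) - math.log(old)
        for new, old in zip(new_action_probs, old_action_probs)
    )
    return math.exp(log_ratio)


# Hypothetical 3-step trajectory; all numbers are illustrative only.
init_prob = 0.3                       # p(s_1), set by the environment
transition_probs = [0.7, 0.2, 0.5]    # p(s_{t+1} | s_t, a_t), set by the environment
new_action_probs = [0.5, 0.4, 0.9]    # pi_theta(a_t | s_t)
old_action_probs = [0.25, 0.8, 0.9]   # pi_bar(a_t | s_t)

full_ratio = (
    trajectory_prob(init_prob, new_action_probs, transition_probs)
    / trajectory_prob(init_prob, old_action_probs, transition_probs)
)
ratio = policy_ratio(new_action_probs, old_action_probs)
print(full_ratio, ratio)  # the two values agree up to floating-point error
```

Summing log-probabilities instead of multiplying raw probabilities is a design choice for numerical stability on long trajectories; it does not change the quantity being computed.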
Now write the policy that generated the data as $\pi_{\theta}$ and the policy being optimized as $\pi_{\theta^{\prime}}$, so that

$$
\frac{p_{\theta^{\prime}}(\tau)}{p_{\theta}(\tau)}=\frac{\prod_{t=1}^{T} \pi_{\theta^{\prime}}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)}{\prod_{t=1}^{T} \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)}
$$

The gradient of the objective is then:
$$
\begin{aligned}
\nabla_{\theta^{\prime}} J\left(\theta^{\prime}\right) &= E_{\tau \sim p_{\theta}(\tau)}\left[\frac{p_{\theta^{\prime}}(\tau)}{p_{\theta}(\tau)} \nabla_{\theta^{\prime}} \log \pi_{\theta^{\prime}}(\tau) \, r(\tau)\right] \quad \text{when } \theta \neq \theta^{\prime} \\
&= E_{\tau \sim p_{\theta}(\tau)}\left[\left(\prod_{t=1}^{T} \frac{\pi_{\theta^{\prime}}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)}{\pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)}\right)\left(\sum_{t=1}^{T} \nabla_{\theta^{\prime}} \log \pi_{\theta^{\prime}}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)\left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)\right]
\end{aligned}
$$
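As a brief worked step, the gradient expression above relies on the log-derivative identity $\nabla p = p \, \nabla \log p$. The sketch below uses the same symbols, writing $p_{\theta^{\prime}}(\tau)$ for the trajectory distribution that the text also denotes $\pi_{\theta^{\prime}}(\tau)$; only $p_{\theta^{\prime}}$ depends on $\theta^{\prime}$, so the gradient passes through the expectation over $p_{\theta}$:

$$
\begin{aligned}
J\left(\theta^{\prime}\right) &= E_{\tau \sim p_{\theta}(\tau)}\left[\frac{p_{\theta^{\prime}}(\tau)}{p_{\theta}(\tau)} r(\tau)\right] \\
\nabla_{\theta^{\prime}} J\left(\theta^{\prime}\right) &= E_{\tau \sim p_{\theta}(\tau)}\left[\frac{\nabla_{\theta^{\prime}} p_{\theta^{\prime}}(\tau)}{p_{\theta}(\tau)} r(\tau)\right] \\
&= E_{\tau \sim p_{\theta}(\tau)}\left[\frac{p_{\theta^{\prime}}(\tau)}{p_{\theta}(\tau)} \nabla_{\theta^{\prime}} \log p_{\theta^{\prime}}(\tau) \, r(\tau)\right]
\end{aligned}
$$

Because the initial-state and transition terms do not depend on $\theta^{\prime}$, the log-gradient reduces to a sum over policy terms:

$$
\nabla_{\theta^{\prime}} \log p_{\theta^{\prime}}(\tau)=\sum_{t=1}^{T} \nabla_{\theta^{\prime}} \log \pi_{\theta^{\prime}}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)
$$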
Applying causality (the action at time $t$ cannot affect rewards earlier than $t$, and later actions do not affect the importance weight at time $t$) and dropping the importance weights on the future rewards gives:

$$
\nabla_{\theta^{\prime}} J\left(\theta^{\prime}\right)=E_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t=1}^{T} \nabla_{\theta^{\prime}} \log \pi_{\theta^{\prime}}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\left(\prod_{t^{\prime}=1}^{t} \frac{\pi_{\theta^{\prime}}\left(\mathbf{a}_{t^{\prime}} \mid \mathbf{s}_{t^{\prime}}\right)}{\pi_{\theta}\left(\mathbf{a}_{t^{\prime}} \mid \mathbf{s}_{t^{\prime}}\right)}\right)\left(\sum_{t^{\prime}=t}^{T} r\left(\mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)\right)\right]
$$
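The final expression can be turned into a single-trajectory estimator fairly directly. The following is a minimal sketch under stated assumptions, not the author's implementation: it assumes a tabular softmax policy with logits `theta[s][a]` (so that the gradient of $\log \pi(a \mid s)$ with respect to the logits of state $s$ is the one-hot vector of $a$ minus $\pi(\cdot \mid s)$), and the names `softmax` and `off_policy_pg` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def off_policy_pg(theta_new, theta_old, states, actions, rewards):
    """Single-trajectory estimate of the gradient w.r.t. theta_new.

    The trajectory (states, actions, rewards) is assumed to have been
    collected with the policy defined by theta_old.
    """
    n_states, n_actions = len(theta_new), len(theta_new[0])
    grad = [[0.0] * n_actions for _ in range(n_states)]

    # Reward-to-go: sum of rewards from step t to the end of the trajectory.
    reward_to_go = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        reward_to_go[t] = running

    weight = 1.0  # running product of per-step ratios up to and including step t
    for t, (s, a) in enumerate(zip(states, actions)):
        pi_new = softmax(theta_new[s])
        pi_old = softmax(theta_old[s])
        weight *= pi_new[a] / pi_old[a]
        for b in range(n_actions):
            # d log pi_new(a | s) / d theta_new[s][b] for a softmax policy
            dlog = (1.0 if b == a else 0.0) - pi_new[b]
            grad[s][b] += dlog * weight * reward_to_go[t]
    return grad


# Illustrative 2-state, 2-action example with identical old and new policies.
theta = [[0.0, 0.0], [0.0, 0.0]]
grad = off_policy_pg(theta, theta, states=[0, 1], actions=[0, 1], rewards=[1.0, 2.0])
print(grad)
```

When `theta_new` equals `theta_old`, every ratio is 1 and the estimator reduces to the on-policy reward-to-go policy gradient, which is a convenient sanity check. In practice the estimate would be averaged over many trajectories.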
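Previous: The Road to Learning Reinforcement Learning (33)_2021-02-02: Shortcomings of the REINFORCE algorithm and how to address them
Next: The Road to Learning Reinforcement Learning (35)_2021-02-04: Tips in Policy Gradient Descent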