[paper reading] IMPALA: V-trace

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

https://arxiv.org/abs/1802.01561

Original authors' code: https://github.com/deepmind/scalable_agent

V-trace

Consider a trajectory of experience $(x_t, a_t, r_t)_{t=s}^{t=s+n}$ generated by a behaviour policy $\mu$. The $n$-step V-trace target is defined as

$$v_s \overset{\mathrm{def}}{=} V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left( \prod_{i=s}^{t-1} c_i \right) \delta_t V$$

where $\delta_t V \overset{\mathrm{def}}{=} \rho_t \left( r_t + \gamma V(x_{t+1}) - V(x_t) \right)$ is an importance-weighted TD error, and $\rho_t \overset{\mathrm{def}}{=} \min \left( \bar{\rho}, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)} \right)$ and $c_i \overset{\mathrm{def}}{=} \min \left( \bar{c}, \frac{\pi(a_i|x_i)}{\mu(a_i|x_i)} \right)$ are truncated importance sampling weights, with $\bar{\rho} \geq \bar{c}$ assumed. Note that when $t = s$, the empty product $\prod_{i=s}^{t-1} c_i = 1$.
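To make the definition concrete, here is a minimal NumPy sketch that evaluates the sum above directly for a single trajectory. The function and argument names (`vtrace_targets`, `bootstrap_value`, etc.) are mine, not from the IMPALA codebase:

```python
import numpy as np

def vtrace_targets(values, rewards, rhos, cs, gamma, bootstrap_value):
    """n-step V-trace targets, computed directly from the sum above.

    values[t]       -- V(x_t) for t = 0..n-1
    bootstrap_value -- V(x_n)
    rhos[t]         -- min(rho_bar, pi(a_t|x_t) / mu(a_t|x_t))
    cs[t]           -- min(c_bar,   pi(a_t|x_t) / mu(a_t|x_t))
    """
    n = len(rewards)
    values_ext = np.append(values, bootstrap_value)
    # delta_t V = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t))
    deltas = rhos * (rewards + gamma * values_ext[1:] - values_ext[:-1])
    vs = np.array(values, dtype=float)
    for s in range(n):
        prod_c = 1.0  # empty product when t = s
        for t in range(s, n):
            vs[s] += gamma ** (t - s) * prod_c * deltas[t]
            prod_c *= cs[t]
    return vs
```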

In the fully on-policy case, $\rho_t = c_i = 1$ (assuming $\bar{\rho} \geq \bar{c} \geq 1$), so the target above can be rewritten as

$$\begin{aligned} v_s &= V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left( r_t + \gamma V(x_{t+1}) - V(x_t) \right) \\ &= r_s + \gamma r_{s+1} + \gamma^2 r_{s+2} + \cdots + \gamma^{n-1} r_{s+n-1} + \gamma^n V(x_{s+n}) \end{aligned}$$

This is exactly the on-policy $n$-step Bellman target, a property that Retrace does not have.
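As a quick sanity check of this reduction, the snippet below (reusing `vtrace_targets` from the sketch above) sets all truncated weights to 1 and verifies that $v_s$ collapses to the $n$-step Bellman target:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.99
rewards = rng.normal(size=n)
values = rng.normal(size=n)
bootstrap = rng.normal()
ones = np.ones(n)  # on-policy: rho_t = c_i = 1

vs = vtrace_targets(values, rewards, ones, ones, gamma, bootstrap)
n_step = sum(gamma ** t * rewards[t] for t in range(n)) + gamma ** n * bootstrap
assert np.isclose(vs[0], n_step)  # the TD errors telescope into the n-step return
```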

Note that the truncated importance sampling weights $c_i$ and $\rho_t$ play different roles: $\rho_t$ weights the TD error and determines the fixed point of the operator, i.e. which value function the targets converge to, whereas the $c_i$ only affect the speed of contraction (and the variance), not the fixed point.

Advantage computation in the code: https://github.com/deepmind/scalable_agent/blob/master/vtrace.py#L275
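For reference, the computation at that line amounts to the following simplified NumPy rendition. The repo additionally supports a separate clipping threshold (`clip_pg_rho_threshold`) for the policy-gradient $\rho$, which this sketch folds into `rhos`; the unweighted quantity $r_s + \gamma v_{s+1} - V_s$ is the one analysed in the derivation below:

```python
import numpy as np

def pg_advantages(vs, values, rewards, rhos, gamma, bootstrap_value):
    """Policy-gradient advantages: rho_s * (r_s + gamma * v_{s+1} - V(x_s)),
    where v_{s+1} is the V-trace target of the next state and the final
    step bootstraps with V(x_n)."""
    vs_next = np.append(vs[1:], bootstrap_value)
    return rhos * (rewards + gamma * vs_next - values)
```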

GAE-Vtrace

First, compute $gaeV_s$. Here $\delta_t$ denotes the plain TD error $r_t + \gamma V(x_{t+1}) - V(x_t)$ (without the $\rho_t$ factor), and the truncation thresholds are taken equal, $\bar{c} = \bar{\rho}$, so that $c_i = \rho_i$:

$$\begin{cases} \begin{aligned} gaeV_s &= \delta_s + \gamma \rho_{s+1} \delta_{s+1} + \gamma^2 \rho_{s+1} \rho_{s+2} \delta_{s+2} + \cdots + \gamma^{n-1} \left( \prod_{i=s+1}^{s+n-1} \rho_i \right) \delta_{s+n-1} \\ gaeV_{s+1} &= \delta_{s+1} + \gamma \rho_{s+2} \delta_{s+2} + \cdots + \gamma^{n-2} \left( \prod_{i=s+2}^{s+n-1} \rho_i \right) \delta_{s+n-1} \\ &\;\;\vdots \\ gaeV_{s+n-1} &= \delta_{s+n-1} \end{aligned} \end{cases}$$
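Rather than evaluating these sums term by term, they can be computed with a backward recursion $gaeV_t = \delta_t + \gamma \rho_{t+1}\, gaeV_{t+1}$, exactly as in GAE. A minimal sketch (names are mine):

```python
import numpy as np

def gae_vtrace(values, rewards, rhos, gamma, bootstrap_value):
    """Backward recursion gaeV_t = delta_t + gamma * rho_{t+1} * gaeV_{t+1},
    where delta_t = r_t + gamma * V(x_{t+1}) - V(x_t) is the *plain* TD error
    (the rho_t factor is applied later, in v_s = rho_s * gaeV_s + V_s)."""
    n = len(rewards)
    values_ext = np.append(values, bootstrap_value)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
    gae_v = np.zeros(n)
    gae_v[-1] = deltas[-1]  # gaeV_{s+n-1} = delta_{s+n-1}
    for t in range(n - 2, -1, -1):
        gae_v[t] = deltas[t] + gamma * rhos[t + 1] * gae_v[t + 1]
    return gae_v
```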

Then, compute $v_s$:

$$\begin{cases} \begin{aligned} v_s &= \rho_s \cdot gaeV_s + V_s \\ v_{s+1} &= \rho_{s+1} \cdot gaeV_{s+1} + V_{s+1} \\ &\;\;\vdots \\ v_{s+n-1} &= \rho_{s+n-1} \cdot gaeV_{s+n-1} + V_{s+n-1} \end{aligned} \end{cases}$$

IMPALA computes the advantage as $r_s + \gamma v_{s+1} - V_s$. Simplifying this expression:

$$\begin{aligned} r_s + \gamma v_{s+1} - V_s &= r_s + \gamma \left( \rho_{s+1} \cdot gaeV_{s+1} + V_{s+1} \right) - V_s \\ &= \gamma \rho_{s+1} \cdot gaeV_{s+1} + \left( r_s + \gamma V_{s+1} - V_s \right) \\ &= \gamma \rho_{s+1} \cdot gaeV_{s+1} + \delta_s \\ &= gaeV_s \end{aligned}$$
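So the advantage IMPALA uses is exactly $gaeV_s$. The snippet below (reusing the sketches above) checks numerically that the GAE-style formulation reproduces the direct V-trace targets when $\bar{c} = \bar{\rho}$, and that $r_s + \gamma v_{s+1} - V_s = gaeV_s$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 6, 0.99
rewards = rng.normal(size=n)
values = rng.normal(size=n)
bootstrap = rng.normal()
ratios = rng.uniform(0.5, 1.5, size=n)   # pi / mu
rhos = np.minimum(1.0, ratios)           # rho_bar = c_bar = 1, so c_i = rho_i

gae_v = gae_vtrace(values, rewards, rhos, gamma, bootstrap)
vs = rhos * gae_v + values               # v_s = rho_s * gaeV_s + V_s
vs_direct = vtrace_targets(values, rewards, rhos, rhos, gamma, bootstrap)
assert np.allclose(vs, vs_direct)

# The advantage r_s + gamma * v_{s+1} - V_s recovers gaeV_s:
vs_next = np.append(vs[1:], bootstrap)
adv = rewards + gamma * vs_next - values
assert np.allclose(adv, gae_v)
```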
