【Mathematical Foundations of Reinforcement Learning】 Lecture 10: Actor-Critic

Actor and critic

  • actor: the policy, and the process of updating the policy
  • critic: policy evaluation, based on value estimation

$$\theta_{t+1}=\theta_t+\alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q_t\left(s_t, a_t\right)$$

  • The update above is the actor: it updates the policy directly.
  • Estimating $q_t\left(s_t, a_t\right)$ is the critic's job.
    • If $q_t$ is obtained by the Monte Carlo method from the previous lecture, the algorithm is REINFORCE.
    • If $q_t$ is obtained by a TD method, as in this lecture, the algorithm is actor-critic.

【QAC pseudocode】

  • Goal: maximize $J(\theta)$, i.e., find the optimal policy.

  • At each step $t$: generate $a_t$ according to $\pi\left(a \mid s_t, \theta_t\right)$, observe $r_{t+1}, s_{t+1}$, then generate $a_{t+1}$ according to $\pi\left(a \mid s_{t+1}, \theta_t\right)$, obtaining the experience sample $\left(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\right)$.

    • Critic (value update):
      $$w_{t+1}=w_t+\alpha_w\left[r_{t+1}+\gamma q\left(s_{t+1}, a_{t+1}, w_t\right)-q\left(s_t, a_t, w_t\right)\right] \nabla_w q\left(s_t, a_t, w_t\right)$$

    • Actor (policy update):
      $$\theta_{t+1}=\theta_t+\alpha_\theta \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) q\left(s_t, a_t, w_{t+1}\right)$$
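
To make the pseudocode concrete, here is a minimal runnable sketch in Python with a tabular critic $q(s, a, w)$ and a softmax policy. The toy environment in `step`, its reward, and all hyperparameters are invented purely for illustration; they are not part of the lecture.

```python
import numpy as np

n_states, n_actions = 4, 2
gamma, alpha_w, alpha_theta = 0.9, 0.1, 0.01

theta = np.zeros((n_states, n_actions))   # actor parameters (softmax preferences)
w = np.zeros((n_states, n_actions))       # critic parameters (tabular q-values)

def pi(s):
    """Softmax policy pi(a | s, theta)."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a, rng):
    """Hypothetical toy dynamics: reward 1 for action 0 in state 0, random next state."""
    reward = 1.0 if (s == 0 and a == 0) else 0.0
    return reward, int(rng.integers(n_states))

rng = np.random.default_rng(0)
s = 0
a = rng.choice(n_actions, p=pi(s))
for t in range(10_000):
    r, s_next = step(s, a, rng)
    a_next = rng.choice(n_actions, p=pi(s_next))

    # Critic (value update): SARSA-style TD update of q(s_t, a_t, w);
    # for a tabular critic, grad_w q is 1 at the visited (s, a) entry.
    td_target = r + gamma * w[s_next, a_next]
    w[s, a] += alpha_w * (td_target - w[s, a])

    # Actor (policy update): for a softmax policy,
    # grad_theta ln pi(a | s) = e_a - pi(. | s) on the row theta[s].
    grad_ln_pi = -pi(s)
    grad_ln_pi[a] += 1.0
    theta[s] += alpha_theta * grad_ln_pi * w[s, a]   # uses the just-updated w_{t+1}

    s, a = s_next, a_next
```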

【Advantage actor-critic(A2C)】

A2C extends QAC by subtracting a baseline in order to reduce the variance of the gradient estimate.

$$\begin{aligned} \nabla_\theta J(\theta) & =\mathbb{E}_{S \sim \eta, A \sim \pi}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right) q_\pi(S, A)\right] \\ & =\mathbb{E}_{S \sim \eta, A \sim \pi}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right)\left(q_\pi(S, A)-b(S)\right)\right] \end{aligned}$$

  • The subtracted term $b(S)$ is a baseline: a function of $S$ only (not of $A$). The two expressions above are equal.

Question 1: why does the equality hold?

Answer: because
$$\mathbb{E}_{S \sim \eta, A \sim \pi}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right) b(S)\right]=0$$

$$\begin{aligned} \mathbb{E}_{S \sim \eta, A \sim \pi}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right) b(S)\right] & =\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi\left(a \mid s, \theta_t\right) \nabla_\theta \ln \pi\left(a \mid s, \theta_t\right) b(s) \\ & =\sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi\left(a \mid s, \theta_t\right) b(s) \\ & =\sum_{s \in \mathcal{S}} \eta(s) b(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi\left(a \mid s, \theta_t\right) \\ & =\sum_{s \in \mathcal{S}} \eta(s) b(s) \nabla_\theta \sum_{a \in \mathcal{A}} \pi\left(a \mid s, \theta_t\right) \\ & =\sum_{s \in \mathcal{S}} \eta(s) b(s) \nabla_\theta 1=0 \end{aligned}$$

Question 2: why introduce a baseline at all?

Answer: although the baseline does not change the expectation, it does affect the variance, and we want a baseline that makes the variance small. A simple and effective choice is
$$b(s)=\mathbb{E}_{A \sim \pi}\left[q_\pi(s, A)\right]=v_\pi(s)$$
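
A quick numerical sanity check of both points, written as a Python sketch for a single state with a softmax policy over three actions: subtracting $b(s)=v_\pi(s)$ leaves the expected gradient sample unchanged but lowers its variance. The preference vector and the action values below are made-up numbers.

```python
import numpy as np

# Illustrative setup for one state: softmax preferences and assumed action values.
rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.1])          # softmax preferences for 3 actions
q = np.array([1.0, 5.0, -2.0])              # assumed action values q_pi(s, a)

p = np.exp(theta - theta.max()); p /= p.sum()
v = p @ q                                    # baseline b(s) = v_pi(s)

def grad_ln_pi(a):
    g = -p.copy(); g[a] += 1.0               # softmax score function
    return g

samples = rng.choice(3, size=100_000, p=p)
g_raw  = np.array([grad_ln_pi(a) * q[a] for a in samples])
g_base = np.array([grad_ln_pi(a) * (q[a] - v) for a in samples])

print("mean without baseline:", g_raw.mean(axis=0))
print("mean with    baseline:", g_base.mean(axis=0))        # same expectation
print("var  without baseline:", g_raw.var(axis=0).sum())
print("var  with    baseline:", g_base.var(axis=0).sum())   # smaller
```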

✌ Using the baseline in actor-critic:

$$\begin{aligned} \theta_{t+1} & =\theta_t+\alpha \mathbb{E}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right)\left[q_\pi(S, A)-v_\pi(S)\right]\right] \\ & \doteq \theta_t+\alpha \mathbb{E}\left[\nabla_\theta \ln \pi\left(A \mid S, \theta_t\right) \delta_\pi(S, A)\right] \end{aligned}$$

where
$$\delta_\pi(S, A) \doteq q_\pi(S, A)-v_\pi(S)$$
The stochastic version of this update is
$$\begin{aligned} \theta_{t+1} & =\theta_t+\alpha \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right) \delta_t\left(s_t, a_t\right) \\ & =\theta_t+\alpha \frac{\nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right)}{\pi\left(a_t \mid s_t, \theta_t\right)} \delta_t\left(s_t, a_t\right) \\ & =\theta_t+\alpha \underbrace{\left(\frac{\delta_t\left(s_t, a_t\right)}{\pi\left(a_t \mid s_t, \theta_t\right)}\right)}_{\text {step size }} \nabla_\theta \pi\left(a_t \mid s_t, \theta_t\right) \end{aligned}$$
The effective step size is proportional to the relative advantage $\delta_t / \pi$, so actions that are better than average have their probabilities increased more strongly. Finally, the advantage is approximated by the TD error:
$$\delta_t=q_t\left(s_t, a_t\right)-v_t\left(s_t\right) \rightarrow r_{t+1}+\gamma v_t\left(s_{t+1}\right)-v_t\left(s_t\right)$$

✌ A2C (TD actor-critic) pseudocode:

  • Goal: maximize $J(\theta)$, i.e., find the optimal policy.

  • At each step $t$: generate $a_t$ according to $\pi\left(a \mid s_t, \theta_t\right)$ and observe $r_{t+1}, s_{t+1}$.

    • TD error (advantage function):
      $$\delta_t=r_{t+1}+\gamma v\left(s_{t+1}, w_t\right)-v\left(s_t, w_t\right)$$

    • Critic (value update):
      $$w_{t+1}=w_t+\alpha_w \delta_t \nabla_w v\left(s_t, w_t\right)$$

    • Actor (policy update):
      $$\theta_{t+1}=\theta_t+\alpha_\theta \delta_t \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right)$$
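
As with QAC, here is a minimal runnable Python sketch of the pseudocode above, this time with a tabular state-value critic $v(s, w)$. The toy environment and hyperparameters are again invented for illustration only.

```python
import numpy as np

n_states, n_actions = 4, 2
gamma, alpha_w, alpha_theta = 0.9, 0.1, 0.01

theta = np.zeros((n_states, n_actions))   # actor parameters (softmax preferences)
w = np.zeros(n_states)                    # critic parameters (tabular state values)

def pi(s):
    """Softmax policy pi(a | s, theta)."""
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def step(s, a, rng):
    """Hypothetical toy dynamics, same as in the QAC sketch above."""
    reward = 1.0 if (s == 0 and a == 0) else 0.0
    return reward, int(rng.integers(n_states))

rng = np.random.default_rng(0)
s = 0
for t in range(10_000):
    a = rng.choice(n_actions, p=pi(s))
    r, s_next = step(s, a, rng)

    # TD error (advantage estimate), computed with the current critic w_t
    delta = r + gamma * w[s_next] - w[s]

    # Critic (value update): grad_w v(s_t, w) is 1 at the visited state for a tabular critic
    w[s] += alpha_w * delta

    # Actor (policy update)
    grad_ln_pi = -pi(s)
    grad_ln_pi[a] += 1.0
    theta[s] += alpha_theta * delta * grad_ln_pi

    s = s_next
```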

【Off-policy actor-critic】

✌ Importance sampling:

$$\mathbb{E}_{X \sim p_0}[X]=\sum_x p_0(x)\, x=\sum_x p_1(x) \underbrace{\frac{p_0(x)}{p_1(x)}\, x}_{f(x)}=\mathbb{E}_{X \sim p_1}[f(X)]$$

  • How do we estimate $\mathbb{E}_{X \sim p_1}[f(X)]$? With the sample mean
    $$\bar{f} \doteq \frac{1}{n} \sum_{i=1}^n f\left(x_i\right), \quad \text{where } x_i \sim p_1$$

Then
$$\begin{aligned} \mathbb{E}_{X \sim p_1}[\bar{f}] & =\mathbb{E}_{X \sim p_1}[f(X)] \\ \operatorname{var}_{X \sim p_1}[\bar{f}] & =\frac{1}{n} \operatorname{var}_{X \sim p_1}[f(X)] \end{aligned}$$
so the quantity we actually want can be estimated as
$$\mathbb{E}_{X \sim p_0}[X] \approx \bar{f}=\frac{1}{n} \sum_{i=1}^n f\left(x_i\right)=\frac{1}{n} \sum_{i=1}^n \frac{p_0\left(x_i\right)}{p_1\left(x_i\right)} x_i$$

  • $\frac{p_0\left(x_i\right)}{p_1\left(x_i\right)}$ is called the importance weight.
    • If $p_1\left(x_i\right)=p_0\left(x_i\right)$, the importance weight is 1 and $\bar{f}$ reduces to the ordinary sample mean $\bar{x}$.
    • If $p_0\left(x_i\right) \geq p_1\left(x_i\right)$, the sample $x_i$ is drawn less often under $p_1$ than $p_0$ would require, so its weight is at least 1 and it counts for more.
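
A small Python sketch of the idea: $X$ takes values $+1$ and $-1$, the target distribution $p_0$ and the behavior distribution $p_1$ are made-up numbers, and we compare the plain sample mean with the importance-weighted mean.

```python
import numpy as np

# Estimate E_{X~p0}[X] using samples drawn from p1 (illustrative distributions).
rng = np.random.default_rng(0)
values = np.array([+1.0, -1.0])
p0 = np.array([0.8, 0.2])   # target distribution
p1 = np.array([0.5, 0.5])   # behavior distribution we actually sample from

idx = rng.choice(2, size=100_000, p=p1)      # x_i ~ p1
x = values[idx]
weights = p0[idx] / p1[idx]                  # importance weights p0(x_i) / p1(x_i)

print("true E_{X~p0}[X]    :", p0 @ values)           # 0.6
print("plain sample mean   :", x.mean())              # ~0.0, estimates E_{p1}[X] instead
print("importance-weighted :", (weights * x).mean())  # ~0.6
```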

✌ Off-policy policy gradient:

$$\nabla_\theta J(\theta)=\mathbb{E}_{S \sim \rho, A \sim \beta}\left[\frac{\pi(A \mid S, \theta)}{\beta(A \mid S)} \nabla_\theta \ln \pi(A \mid S, \theta) q_\pi(S, A)\right]$$

✌ Off-policy actor-critic:

$$\theta_{t+1}=\theta_t+\alpha_\theta \frac{\pi\left(a_t \mid s_t, \theta_t\right)}{\beta\left(a_t \mid s_t\right)} \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right)\left(q_t\left(s_t, a_t\right)-v_t\left(s_t\right)\right)$$

where
$$q_t\left(s_t, a_t\right)-v_t\left(s_t\right) \approx r_{t+1}+\gamma v_t\left(s_{t+1}\right)-v_t\left(s_t\right) \doteq \delta_t\left(s_t, a_t\right)$$
so the actor update can be rewritten as
$$\theta_{t+1}=\theta_t+\alpha_\theta \frac{\pi\left(a_t \mid s_t, \theta_t\right)}{\beta\left(a_t \mid s_t\right)}\, \delta_t\left(s_t, a_t\right)\, \nabla_\theta \ln \pi\left(a_t \mid s_t, \theta_t\right)$$
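
In code, a single off-policy actor step is the A2C actor step scaled by the importance weight $\pi(a_t \mid s_t, \theta_t)/\beta(a_t \mid s_t)$. The sketch below assumes the same tabular softmax setup as the earlier sketches; `beta_prob` is the probability the behavior policy assigned to the chosen action, and the TD error `delta` is assumed to come from a critic trained off-policy (not shown here).

```python
import numpy as np

def off_policy_actor_update(theta, s, a, delta, beta_prob, alpha_theta=0.01):
    """One actor step of off-policy actor-critic for a tabular softmax policy (sketch)."""
    p = np.exp(theta[s] - theta[s].max())
    p /= p.sum()                                  # target policy pi(. | s, theta)
    rho = p[a] / beta_prob                        # importance weight pi / beta

    grad_ln_pi = -p                               # softmax score function: e_a - pi(. | s)
    grad_ln_pi[a] += 1.0
    theta[s] += alpha_theta * rho * delta * grad_ln_pi
    return theta
```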

【Deterministic actor-critic(DPG)】

Previously:

the policy was represented as $\pi(a \mid s, \theta) \in[0,1]$, which can describe stochastic as well as deterministic policies, and the updates used $\nabla_\theta \ln \pi(a \mid s, \theta)$, which requires $\pi(a \mid s, \theta)>0$ for every action.

Now:

we represent a deterministic policy directly as
$$a=\mu(s, \theta) \doteq \mu(s)$$
This is a mapping from the state space to the action space; the parameter $\theta$ is often dropped and the policy is written simply as $\mu(s)$.

$$J(\theta)=\mathbb{E}_{S \sim d_0}\left[v_\mu(S)\right]=\sum_{s \in \mathcal{S}} d_0(s) v_\mu(s)$$

  • The first choice is $d_0\left(s_0\right)=1$ and $d_0(s)=0$ for $s \neq s_0$: we only care about the return obtained when starting from a particular state $s_0$.
  • The second choice is to take $d_0$ as the stationary distribution of a behavior policy; this case is off-policy.

Maximize $J(\theta)$ by gradient ascent:
$$\theta_{t+1}=\theta_t+\alpha_\theta \mathbb{E}_{S \sim \rho_\mu}\left[\nabla_\theta \mu(S)\left(\nabla_a q_\mu(S, a)\right)\big|_{a=\mu(S)}\right]$$
Replacing the expectation with a stochastic sample gives the update actually used:
$$\theta_{t+1}=\theta_t+\alpha_\theta \nabla_\theta \mu\left(s_t\right)\left(\nabla_a q_\mu\left(s_t, a\right)\right)\big|_{a=\mu\left(s_t\right)}$$
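
A minimal sketch of one such update for a one-dimensional continuous action: the deterministic policy is assumed linear, $\mu(s, \theta)=\theta s$, and the critic $q_\mu(s, a)=-(a-2 s)^2$ is hand-written so that its gradient $\nabla_a q_\mu$ is known in closed form. In practice the critic would be a learned function $q(s, a, w)$; everything here is illustrative.

```python
import numpy as np

alpha_theta = 0.01
theta = np.array([0.5])                      # policy parameter

def mu(s, theta):
    return theta[0] * s                       # deterministic policy a = mu(s, theta)

def dq_da(s, a):
    """Assumed critic: q(s, a) = -(a - 2*s)**2, so dq/da = -2*(a - 2*s)."""
    return -2.0 * (a - 2.0 * s)

s_t = 1.5                                     # current state (illustrative)
a_t = mu(s_t, theta)

# Chain rule: grad_theta q(s, mu(s, theta)) = grad_theta mu(s, theta) * dq/da |_{a = mu(s)}
grad_theta_mu = np.array([s_t])               # d(theta * s) / d(theta)
theta = theta + alpha_theta * grad_theta_mu * dq_da(s_t, a_t)
print(theta)   # repeated updates move theta toward 2, where mu(s) = 2*s maximizes q
```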
