Policy Gradient: Derivation and Problems


Policy Gradient Theory

Policy gradient methods are widely used in model-free reinforcement learning algorithms. Their basic idea is to update the policy's action distribution $\pi(a|s)$ by iterative gradient steps so that it approaches the optimal policy $\pi^*(a|s)$, where the optimal policy satisfies:
$$V^{\pi^*}(s)\ge V^{\pi}(s), \quad \forall\pi\in\Pi,\ \forall s\in\mathcal{S}$$
In practice a parameterized policy $\pi_\theta$ is used to produce the action distribution in each state, and the reinforcement-learning goal can be expressed as a function $J(\theta)$ of the policy parameters $\theta$. A common choice is the expected long-term reward of a sampled trajectory $\tau$, $R(\tau)=\sum^{T-1}_{t=0}r_t$, i.e. $J(\theta)=\mathbb{E}_\tau[R(\tau)\mid\pi_\theta]$. Differentiating the objective $J(\theta)$ with respect to the policy parameters $\theta$ gives:
$$\begin{aligned} \nabla_\theta J(\theta)&=\nabla_\theta\mathbb{E}_\tau[R(\tau)|\pi_\theta]\\ &=\nabla_\theta\int_\tau p(\tau|\pi_\theta)R(\tau)\,d\tau\\ &=\int_\tau p(\tau|\pi_\theta)\nabla_\theta\log p(\tau|\pi_\theta)R(\tau)\,d\tau\\ &=\mathbb{E}_\tau[\nabla_\theta\log p(\tau|\pi_\theta)R(\tau)] \end{aligned}$$
where the third line uses the log-derivative trick $\nabla_\theta p=p\,\nabla_\theta\log p$. Since the trajectory probability factorizes as $p(\tau|\pi_\theta)=p(s_0)\prod_{t=0}^{T-1} p(s_{t+1}|s_t, a_t)\,\pi_\theta(a_t|s_t)$ and the dynamics terms do not depend on $\theta$, the expression simplifies further:
$$\begin{aligned} \nabla_\theta J(\theta)&=\mathbb{E}_\tau[\nabla_\theta\log p(\tau|\pi_\theta)R(\tau)]\\ &=\mathbb{E}_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t|s_t)\sum_{t'=0}^{T-1}r_{t'}\right]\\ &=\mathbb{E}_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t|s_t)\left(\sum_{t'=0}^{t-1}r_{t'}+\sum^{T-1}_{t'=t}r_{t'}\right)\right]\\ &=\mathbb{E}_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t|s_t)\sum^{T-1}_{t'=t}r_{t'}\right] \end{aligned}$$
The last step holds because the rewards collected before time $t$ do not depend on $a_t$, so those terms vanish in expectation. Viewed from another angle, treat the reward $r_t$ at time $t$ as a random variable; the gradient of its expectation is:
$$\begin{aligned} \nabla_\theta\mathbb{E}_\tau[r_t|\pi_\theta]&=\nabla_\theta\int p(s_0,a_0,\cdots,s_t,a_t|\pi_\theta)\,r_t\,d(s_{0:t},a_{0:t})\\ &=\int p(s_0,a_0,\cdots,s_t,a_t|\pi_\theta)\sum^{t}_{t'=0}\nabla_\theta\log\pi_\theta(a_{t'}|s_{t'})\,r_t\,d(s_{0:t},a_{0:t})\\ &=\mathbb{E}_\tau\left[\sum^{t}_{t'=0}\nabla_\theta\log\pi_\theta(a_{t'}|s_{t'})\,r_t\right] \end{aligned}$$
Substituting this into the gradient of the objective gives:
$$\begin{aligned} \nabla_\theta\mathbb{E}_\tau[R(\tau)|\pi_\theta]&=\nabla_\theta\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}r_t\,\Big|\,\pi_\theta\right]=\sum^{T-1}_{t=0}\nabla_\theta\mathbb{E}_\tau[r_t|\pi_\theta]\\ &=\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}\left(\sum^{t}_{t'=0}\nabla_\theta\log\pi_\theta(a_{t'}|s_{t'})\,r_t\right)\right]\\ &=\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}\nabla_\theta\log\pi_\theta(a_t|s_t)\sum^{T-1}_{t'=t}r_{t'}\right] \end{aligned}$$
where the last equality swaps the order of the double summation. This matches the result obtained above.
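The final expression is exactly what a REINFORCE-style Monte Carlo estimator computes: average $\nabla_\theta\log\pi_\theta(a_t|s_t)$ weighted by the reward-to-go $\sum_{t'\ge t}r_{t'}$ over sampled trajectories. Below is a minimal sketch, assuming a linear-softmax policy over a small discrete state/action space; the dimensions and toy trajectory data are illustrative assumptions, not part of the derivation.

```python
import numpy as np

n_states, n_actions = 4, 2                       # illustrative problem size
theta = np.zeros((n_states, n_actions))          # policy parameters, one row of logits per state

def policy(theta, s):
    """pi_theta(.|s): softmax over the logits theta[s]."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a|s) for the linear-softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)                     # derivative of the -log-normalizer
    g[s, a] += 1.0                               # derivative of the chosen logit
    return g

def reinforce_gradient(theta, trajectories):
    """Monte Carlo estimate of grad_theta J(theta):
    average over trajectories of sum_t grad log pi(a_t|s_t) * sum_{t'>=t} r_{t'}."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        rewards = np.array([r for (_, _, r) in traj])
        for t, (s, a, _) in enumerate(traj):
            reward_to_go = rewards[t:].sum()     # sum_{t'>=t} r_{t'}
            grad += grad_log_pi(theta, s, a) * reward_to_go
    return grad / len(trajectories)

# Toy trajectories of (state, action, reward) tuples, purely illustrative.
trajs = [[(0, 1, 1.0), (2, 0, 0.0), (3, 1, 1.0)],
         [(1, 0, 0.0), (2, 1, 1.0)]]
print(reinforce_gradient(theta, trajs))
```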

Problems

Once the gradient of the objective is available, the policy parameters can be updated iteratively: $\theta\leftarrow\theta+\alpha \nabla_\theta J(\theta)$. The main problems with this scheme are:

  1. The choice of the step size $\alpha$ matters a lot: a step that is too large can degrade the updated policy's performance and prevent the policy from ever improving to the optimum, so the step size must be chosen so that each update increases the objective value (see the toy sketch after this list);
  2. The gradient of the objective is usually estimated from samples, which introduces high variance and makes training unstable;
  3. Sample efficiency also usually needs to be taken into account.
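To make the step-size issue in item 1 concrete, here is a toy, self-contained sketch: gradient ascent $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$ on the simple concave surrogate $J(\theta)=-\theta^2$ (an illustrative assumption, not an actual RL objective). A small $\alpha$ converges toward the maximizer, while a too-large $\alpha$ makes the iterates overshoot and diverge.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = -theta**2."""
    return -2.0 * theta

for alpha in (0.1, 1.1):                         # a "safe" step size vs. a too-large one
    theta = 3.0
    for _ in range(20):
        theta = theta + alpha * grad_J(theta)    # gradient ascent update
    print(f"alpha={alpha}: theta after 20 steps = {theta:.3e}, J(theta) = {-theta**2:.3e}")
```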

Computing the gradient of the objective above exactly requires an integral over the trajectory distribution of $\tau$, which is intractable, so the gradient is usually estimated by Monte Carlo sampling. This estimate is unbiased, but it often has high variance. To reduce the variance while keeping the estimate unbiased, the gradient can be rewritten as:
$$\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}\nabla_\theta\log\pi_\theta(a_t|s_t)\left(\sum^{T-1}_{t'=t}r_{t'}-b(s_t)\right)\right]$$
where $b(s_t)$ is a baseline function whose value depends only on $s_t$. For this expression to remain an unbiased estimate, we need $\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}\nabla_\theta\log\pi_\theta(a_t|s_t)\,b(s_t)\right]=0$, which is proved as follows:
$$\begin{aligned} &\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}\nabla_\theta\log\pi_\theta(a_t|s_t)\,b(s_t)\right]\\ &=\sum^{T-1}_{t=0}\mathbb{E}_\tau[\nabla_\theta\log\pi_\theta(a_t|s_t)\,b(s_t)]\\ &=\sum^{T-1}_{t=0}\mathbb{E}_{s_{0:t},\,a_{0:t-1}}\left[\mathbb{E}_{s_{t+1:T-1},\,a_{t:T-1}}[\nabla_\theta\log\pi_\theta(a_t|s_t)\,b(s_t)]\right]\\ &=\sum^{T-1}_{t=0}\mathbb{E}_{s_{0:t},\,a_{0:t-1}}\left[b(s_t)\,\mathbb{E}_{s_{t+1:T-1},\,a_{t:T-1}}[\nabla_\theta\log\pi_\theta(a_t|s_t)]\right]\\ &=\sum^{T-1}_{t=0}\mathbb{E}_{s_{0:t},\,a_{0:t-1}}\left[b(s_t)\int_{a_t}\nabla_\theta\pi_\theta(a_t|s_t)\,da_t\right]\\ &=\sum^{T-1}_{t=0}\mathbb{E}_{s_{0:t},\,a_{0:t-1}}\left[b(s_t)\cdot 0\right]\\ &=0 \end{aligned}$$
The second-to-last step uses $\int_{a_t}\nabla_\theta\pi_\theta(a_t|s_t)\,da_t=\nabla_\theta\int_{a_t}\pi_\theta(a_t|s_t)\,da_t=\nabla_\theta 1=0$.
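As a quick numerical sanity check of this result, the minimal sketch below uses a one-step softmax "bandit" policy (an illustrative assumption; with a single state the baseline $b(s_t)$ reduces to a constant, here the mean reward). Subtracting the baseline leaves the mean of the sampled gradient unchanged up to sampling noise while lowering its variance; the logits, rewards, and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.3, 0.1])               # logits of pi_theta over 3 actions
rewards = np.array([1.0, 3.0, 5.0])              # deterministic reward for each action

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_samples(baseline, n_samples=100_000):
    """Per-sample estimates grad_theta log pi(a) * (r(a) - baseline) for sampled actions a."""
    p = softmax(theta)
    a = rng.choice(len(p), size=n_samples, p=p)
    grad_log_pi = np.eye(len(p))[a] - p          # softmax score function for each sampled action
    return grad_log_pi * (rewards[a] - baseline)[:, None]

for b in (0.0, rewards.mean()):                  # no baseline vs. mean-reward baseline
    g = grad_samples(b)
    print(f"baseline={b}: mean={g.mean(axis=0).round(3)}, var={g.var(axis=0).round(3)}")
```

The two runs report nearly the same mean gradient, consistent with unbiasedness, while the mean-reward baseline gives a visibly smaller variance.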
