Reinforcement Learning Essentials (强化学习精要), Part 3: Policy-Gradient Algorithms (on-policy)


The previous parts covered value-based reinforcement learning algorithms: estimate the optimal value function, then derive the policy from it. That is,

$$a^* = \arg\max_a Q(s,a), \qquad a^* = \pi^*(s)$$

Such methods, however, suffer from problems like training instability.
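
For contrast, the value-based approach has no explicit policy: actions come from a greedy argmax over the learned Q-function. A minimal sketch of this idea follows; the tabular Q below is a hypothetical stand-in for a trained value function, not code from the original text.

```python
# Value-based control: the policy is implicit, obtained by acting
# greedily with respect to a learned Q(s, a).
# The Q-table below is a hypothetical stand-in for a trained value function.
import numpy as np

n_states, n_actions = 5, 3
Q = np.random.rand(n_states, n_actions)  # pretend this was learned, e.g. by Q-learning

def greedy_policy(state: int) -> int:
    """pi*(s) = argmax_a Q(s, a)"""
    return int(np.argmax(Q[state]))

print(greedy_policy(0))  # greedy action for state 0
```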


Policy-gradient methods instead compute the direction in which the policy should be updated directly: express the value function as a function of the policy parameters, compute the gradient of the value function with respect to those parameters, and update the parameters along the direction of gradient ascent.

  • Algorithms: Policy Gradient -> Actor-Critic -> A3C, A2C

Basic Principles of the Policy Gradient Method

Objective function (the value function):
$$J(\theta) = E_{\tau \sim \pi_\theta}\big[r(\tau)\big] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau$$
The objective can thus be written as a function of $\theta$. Since the order of integration and differentiation can be interchanged,
$$\nabla_\theta J(\theta) = \nabla_\theta \int \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau$$
Moreover, by the log-derivative trick,
$$\nabla_x \log y = \frac{1}{y}\,\nabla_x y$$
$$y\,\nabla_x \log y = \nabla_x y$$
$$\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)$$
Therefore,
$$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau$$
Because
$$\pi_\theta(\tau) = \pi(s_0, a_0, \ldots, s_T, a_T) = p(s_0)\prod_{t=0}^{T}\pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
we have
$$\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \log\Big[p(s_0)\prod_{t=0}^{T}\pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\Big] = \nabla_\theta\Big[\log p(s_0) + \sum_{t=0}^{T}\log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T}\log p(s_{t+1} \mid s_t, a_t)\Big]$$
Since the first and last terms do not depend on $\theta$, this reduces to
$$\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta\Big[\sum_{t=0}^{T}\log \pi_\theta(a_t \mid s_t)\Big] = \sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Substituting back, the policy gradient is an expectation over trajectories and can therefore be estimated by sampling:
$$\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}\Big[\Big(\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\, r(\tau)\Big]$$
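
This estimator translates directly into a training loop: run the current policy to collect a trajectory, sum the log-probabilities of the chosen actions, weight the sum by the trajectory return $r(\tau)$, and take a gradient-ascent step. Below is a minimal REINFORCE-style sketch; PyTorch, gymnasium, CartPole-v1, the network size, the learning rate, and the episode count are illustrative assumptions rather than details from the original text.

```python
# Minimal REINFORCE-style sketch of the Monte Carlo policy gradient:
# grad J(theta) ~ (sum_t grad log pi_theta(a_t|s_t)) * r(tau)
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(                     # pi_theta(a|s): softmax policy over logits
    nn.Linear(env.observation_space.shape[0], 64),
    nn.Tanh(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))   # log pi_theta(a_t|s_t)
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    ret = sum(rewards)                            # r(tau): total return of the trajectory
    loss = -torch.stack(log_probs).sum() * ret    # ascent on J(theta) = descent on -J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice the return is usually centered with a baseline (or replaced by a learned critic) to reduce the variance of this estimator, which is where the Actor-Critic, A2C, and A3C methods listed above come in.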
