The previous sections covered value-based reinforcement learning algorithms: estimate the optimal value function, then derive the policy from it.
That is,
$$a^* = \arg\max_a Q(s,a), \qquad a^* = \pi^*(s)$$
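As a concrete illustration, extracting the greedy policy from a learned Q function is a per-state argmax. A minimal sketch with a hypothetical 2-state, 3-action Q-table:

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
Q = np.array([
    [1.0, 3.0, 2.0],   # state 0: action 1 has the highest value
    [0.5, 0.2, 0.9],   # state 1: action 2 has the highest value
])

# a* = argmax_a Q(s, a): the greedy policy pi*(s) for every state at once
greedy_policy = Q.argmax(axis=1)
print(greedy_policy)  # [1 2]
```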
Such value-based methods suffer from issues like training instability.
Policy-gradient methods instead compute the update direction of the policy directly: write the value function as a function of the policy parameters, compute its gradient with respect to those parameters, and update the parameters along the direction of gradient ascent.
- Algorithms: Policy Gradient -> Actor-Critic -> A3C, A2C
Basic Principles of the Policy Gradient Method
Objective function (the value function):
$$J(\theta) = E_{\tau \sim \pi_\theta}[r(\tau)] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau$$
This expresses the objective as a function of $\theta$. Since integration and differentiation can be interchanged,
$$\nabla_\theta J(\theta) = \nabla_\theta \int \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau$$
Furthermore, by the log-derivative identity,
$$\nabla_x \log y = \frac{1}{y}\nabla_x y \;\Rightarrow\; y\,\nabla_x \log y = \nabla_x y \;\Rightarrow\; \nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)$$
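The identity $\nabla_x \log y = \frac{1}{y}\nabla_x y$ can be checked numerically. A small sketch, using the arbitrary positive function $y(x) = x^2$ chosen purely for illustration:

```python
import math

# Arbitrary differentiable, positive function y(x) = x^2.
def y(x):
    return x * x

def grad_y(x):
    return 2 * x  # dy/dx

x = 1.7
eps = 1e-6

# Central-difference approximation of d/dx log y(x)
num_grad_log_y = (math.log(y(x + eps)) - math.log(y(x - eps))) / (2 * eps)

# The identity: d/dx log y = (1/y) dy/dx
analytic = grad_y(x) / y(x)

print(abs(num_grad_log_y - analytic) < 1e-6)  # True
```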
it follows that
$$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau$$
Because the trajectory probability factorizes as
$$\pi_\theta(\tau) = \pi_\theta(s_0, a_0, \ldots, s_T, a_T) = p(s_0)\prod_{t=0}^{T}\pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
we have
$$\begin{aligned}
\nabla_\theta \log \pi_\theta(\tau) &= \nabla_\theta \log\Big[p(s_0)\prod_{t=0}^{T}\pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\Big]\\
&= \nabla_\theta\Big[\log p(s_0) + \sum_{t=0}^{T}\log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T}\log p(s_{t+1} \mid s_t, a_t)\Big]
\end{aligned}$$
Since the first and last terms do not depend on $\theta$,
$$\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta\sum_{t=0}^{T}\log \pi_\theta(a_t \mid s_t) = \sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
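Putting the pieces together gives the REINFORCE estimator: sample a trajectory from $\pi_\theta$ and use $r(\tau)\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ as an unbiased estimate of $\nabla_\theta J(\theta)$. A minimal sketch with a tabular softmax policy; the problem size, the sampled trajectory, and the reward value are all hypothetical, chosen only to show the update:

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
theta = rng.normal(size=(n_states, n_actions))  # policy parameters

def policy(s):
    """Softmax policy pi_theta(a|s) for a discrete state s."""
    z = theta[s] - theta[s].max()   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(s, a):
    """grad_theta log pi_theta(a|s) for a softmax policy:
    one-hot(a) - pi_theta(.|s), placed in row s of the gradient."""
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

# One hypothetical sampled trajectory (s_t, a_t) and its total reward r(tau).
trajectory = [(0, 1), (2, 0), (1, 1)]
r_tau = 1.5

# Score-function estimate: r(tau) * sum_t grad_theta log pi_theta(a_t|s_t)
grad_J = r_tau * sum(grad_log_pi(s, a) for s, a in trajectory)

# Gradient-ascent step on the policy parameters.
alpha = 0.1
theta += alpha * grad_J
```

Each row of `grad_log_pi` sums to zero, reflecting that the softmax probabilities in a state always sum to one.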