Reinforcement Learning: Policy Gradient
This is a study note.
1. Value-based vs. policy-based
- Value-based RL tends to select the state or action with the largest value; it iteratively computes the optimal value function Q and improves the policy according to that value function.
- Policy-based RL, usually divided into stochastic and deterministic policies, needs no value function. It assigns a probability distribution over actions and executes actions in the current state according to that distribution. The idea is to parameterize the policy as $\pi_{\theta}(s)$ and search for the parameters $\theta$ that maximize the expected cumulative return: $\max E[\Sigma_{t=0}^{k}R(s_t)|\pi_{\theta}]$; the resulting $\pi_{\theta}(s)$ is then the optimal policy.
Differences:
1) Whereas value-based methods parameterize the value function, policy-based methods parameterize the policy directly, which makes $\pi_{\theta}(s)$ simpler, more efficient, and easier to converge.
2) Value-based methods suit discrete action spaces. A continuous action space can be discretized, but the discretization granularity is hard to choose, and a small change in the value function can cause a large change in the policy. Policy-based methods suit continuous action spaces: rather than computing a probability for every action, the action can be sampled from, e.g., a normal distribution.
3) Policy-based methods usually adopt stochastic policies, which build exploration ($\varepsilon$) into the learned policy itself.
4) Policy-based methods rely on gradient-based optimization and can fall into local optima.
5) Evaluating a single policy from samples is not sufficient on its own and suffers from high variance.
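To make the first two differences concrete, here is a minimal sketch (with made-up Q-values and Gaussian parameters, not from the lecture) of how each family picks an action: value-based takes the argmax over a discrete action set, while a stochastic policy samples a continuous action from a distribution whose parameters the policy outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Value-based: pick the discrete action with the largest Q-value.
q_values = np.array([0.1, 0.5, 0.3])  # hypothetical Q(s, a) for 3 actions
greedy_action = int(np.argmax(q_values))
print(greedy_action)  # 1

# Policy-based (continuous action space): the policy outputs distribution
# parameters, and the action is sampled from that distribution.
mu, sigma = 0.2, 0.5                  # hypothetical policy outputs for state s
sampled_action = float(rng.normal(mu, sigma))
print(sampled_action)                 # a random draw around mu
```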
2. Policy Gradient
Trajectory: $\tau: s_1,a_1,r_1,...,s_t,a_t,r_t$
Trajectory return: $R(\tau)=\Sigma_{t=0}^{k}R(s_t,a_t|\theta)$
Objective function:
$$l(\theta)=E[\Sigma_{t=0}^{k}R(s_t,a_t)|\pi(\theta)]=\Sigma_{\tau}p(\tau|\theta)R(\tau)$$
where $p(\tau|\theta)$ is the probability distribution over trajectories.
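When the trajectory space is small enough to enumerate, the objective is literally a probability-weighted sum of returns. A tiny sketch with two made-up trajectories (the numbers are arbitrary, chosen only to make the weighted-sum structure concrete):

```python
# l(theta) = sum over trajectories of p(tau|theta) * R(tau).
# Hypothetical values: the probabilities would come from the current policy
# parameters theta, the returns from the reward function.
p = {"tau1": 0.7, "tau2": 0.3}   # p(tau|theta)
R = {"tau1": 1.0, "tau2": 5.0}   # R(tau)

l_theta = sum(p[t] * R[t] for t in p)
print(l_theta)  # 0.7*1.0 + 0.3*5.0 = 2.2
```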
Gradient ascent update: $\theta_{new}=\theta_{old}+\alpha\nabla_\theta l(\theta)$
$$\nabla_\theta l(\theta)=\nabla_{\theta}\Sigma_{\tau}p(\tau|\theta)R(\tau)=\Sigma_{\tau}\nabla_{\theta}p(\tau|\theta)R(\tau)\\
=\Sigma_{\tau}\frac{p(\tau|\theta)}{p(\tau|\theta)}\nabla_{\theta}p(\tau|\theta)R(\tau)\\
=\Sigma_{\tau}p(\tau|\theta)\frac{\nabla_{\theta}p(\tau|\theta)}{p(\tau|\theta)}R(\tau)\\
=\Sigma_{\tau}p(\tau|\theta)\nabla_{\theta}\log p(\tau|\theta)R(\tau)$$
Here $\nabla_{\theta}\log p(\tau|\theta)=\frac{1}{p(\tau|\theta)}\nabla_{\theta}p(\tau|\theta)$, which follows from the identity $\frac{d\log f(x)}{dx}=\frac{1}{f(x)}\frac{df(x)}{dx}$.
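The identity can be sanity-checked numerically with a finite difference, using an arbitrary smooth positive function $f(x)=x^2+1$ chosen purely for illustration:

```python
import math

f = lambda x: x**2 + 1      # arbitrary positive f(x); f'(x) = 2x
x, h = 1.5, 1e-6

# Left side: finite-difference derivative of log f(x).
lhs = (math.log(f(x + h)) - math.log(f(x - h))) / (2 * h)
# Right side: f'(x) / f(x).
rhs = (2 * x) / f(x)

assert abs(lhs - rhs) < 1e-6
```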
The problem finally reduces to computing the expectation of $\nabla_{\theta}\log p(\tau|\theta)R(\tau)$, which can be estimated by an empirical average over $N$ sampled trajectories (writing $\overline R_{\theta}$ for the sampled objective):
$$\nabla_{\theta}l(\theta)\approx\nabla_{\theta}\overline R_{\theta}=\frac{1}{N}\Sigma_{n=1}^{N}\nabla_{\theta}\log p(\tau^{n}|\theta)R(\tau^{n})$$
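A quick sketch of why the empirical average works: sampling trajectories from $p(\tau|\theta)$ and averaging any per-trajectory quantity converges to the exact expectation. The values and probabilities below are made up for illustration; the scalar values stand in for $\nabla_{\theta}\log p(\tau|\theta)R(\tau)$, which in practice would be vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

values = np.array([1.0, 5.0])   # hypothetical per-trajectory quantity g(tau)
probs = np.array([0.7, 0.3])    # p(tau|theta)

exact = float(np.dot(probs, values))                # true expectation E[g]
samples = rng.choice(values, size=100_000, p=probs)
estimate = float(samples.mean())                    # empirical average

assert abs(estimate - exact) < 0.05
```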
1) How do we compute $\nabla_{\theta}\log p(\tau|\theta)$?
From the trajectory, the probability factorizes as:
$$p(\tau|\theta)=p(s_1)p(a_1|s_1,\theta)p(r_1,s_2|s_1,a_1)p(a_2|s_2,\theta)p(r_2,s_3|s_2,a_2)\cdots p(a_t|s_t,\theta)p(r_t,s_{t+1}|s_t,a_t)\\
=p(s_1)\prod_{t=1}^{T}p(a_t|s_t,\theta)p(r_t,s_{t+1}|s_t,a_t)$$
Taking the logarithm turns the product into a sum:
$$\log p(\tau|\theta)=\log p(s_1)+\Sigma_{t=1}^{T}\left[\log p(a_t|s_t,\theta)+\log p(r_t,s_{t+1}|s_t,a_t)\right]$$
The initial-state and dynamics terms do not depend on $\theta$, so their gradients vanish:
$$\nabla_{\theta}\log p(\tau|\theta)=\Sigma_{t=1}^{T}\nabla_{\theta}\log p(a_t|s_t,\theta)$$
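A small sketch of this result with a hypothetical tabular softmax policy (made up for illustration): the only $\theta$-dependent part of the trajectory log-probability is the sum of per-step action log-probs, so the dynamics terms $p(r_t,s_{t+1}|s_t,a_t)$ never need to be known.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical tabular softmax policy: theta[s] holds the logits for state s.
theta = np.array([[0.5, -0.2],
                  [0.1,  0.3]])      # 2 states, 2 actions

# A sampled trajectory as (s_t, a_t) pairs; rewards and dynamics are
# irrelevant for grad_theta log p(tau|theta).
trajectory = [(0, 1), (1, 0), (0, 0)]

# sum_t log p(a_t | s_t, theta) -- the only theta-dependent part.
log_p = sum(float(np.log(softmax(theta[s])[a])) for s, a in trajectory)
print(log_p)  # a negative number (sum of log-probabilities)
```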
2) How do we finally compute the policy gradient?
$\theta_{new}=\theta_{old}+\alpha\nabla_{\theta}\overline R_{\theta_{old}}$
The derivation is as follows:
$$\nabla_{\theta}\overline R_{\theta}\approx\frac{1}{N}\Sigma_{n=1}^{N}\nabla_{\theta}\log p(\tau^{n}|\theta)R(\tau^{n})\\
=\frac{1}{N}\Sigma_{n=1}^{N}R(\tau^{n})\Sigma_{t=1}^{T}\nabla_{\theta}\log p(a_t^{n}|s_t^{n},\theta)\\
=\frac{1}{N}\Sigma_{n=1}^{N}\Sigma_{t=1}^{T}R(\tau^{n})\nabla_{\theta}\log p(a_t^{n}|s_t^{n},\theta)$$
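This estimator is the REINFORCE update. Below is a minimal runnable sketch on a made-up one-state, two-action toy problem (not from the lecture), using a tabular softmax policy; for softmax logits, $\nabla_{\theta}\log\pi(a|\theta)$ has the closed form one_hot$(a)-\pi$, which the code uses directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical toy problem: one state, two actions; action 1 yields reward 1,
# action 0 yields reward 0, for T steps per trajectory.
def reward(action):
    return float(action == 1)

theta = np.zeros(2)           # softmax logits of pi(a|theta)
alpha, N, T = 0.1, 10, 5      # step size, trajectories per batch, steps

for _ in range(200):
    grad = np.zeros_like(theta)
    for _ in range(N):
        glp = np.zeros_like(theta)    # sum_t grad_theta log pi(a_t|theta)
        ret = 0.0                     # R(tau)
        for _ in range(T):
            pi = softmax(theta)
            a = rng.choice(2, p=pi)
            ret += reward(a)
            glp += np.eye(2)[a] - pi  # softmax score function
        grad += ret * glp             # R(tau) * grad_theta log p(tau|theta)
    theta += alpha * grad / N         # gradient ascent on the estimate

print(softmax(theta))  # the policy should now strongly prefer action 1
```

Note the whole trajectory return $R(\tau^n)$ multiplies every step's score, exactly as in the last line of the derivation; common variance-reduction tricks (baselines, reward-to-go) are deliberately omitted here.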
Reference: Hung-yi Lee (李宏毅), Reinforcement Learning lectures.