Some derivations and notes on MaxEnt RL
The maximum-policy-entropy perspective

From the perspective of maximizing the entropy of the policy distribution, we require the optimal policy not only to maximize the long-term return

$$\eta(\pi)=\mathbb{E}_{\tau}\left[\sum_{t=0}^\infty\gamma^t r(s_t,a_t)\right]$$

but also to be as random as possible, i.e., to make the policy entropy $\mathcal{H}(\pi)$ as large as possible:
$$\begin{aligned} J(\pi)&=\sum_{t=0}^\infty\mathbb{E}_{(s_t,a_t)\sim\rho(\pi)}\left[ r(s_t,a_t)+\alpha\mathcal{H}(\pi(\cdot|s_t))\right]\\ &=\mathbb{E}_\tau\left[\sum^\infty_{t=0}r(s_t,a_t)-\alpha\log\pi(a_t|s_t)\right] \end{aligned}$$

The discount factor $\gamma$ is not included here. The final objective is

$$\max_\pi \; J(\pi)$$
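As a concrete illustration of the trade-off inside $J(\pi)$, consider a hypothetical one-step (bandit) case, where the infinite-horizon sums collapse to a single term $\mathbb{E}_{a\sim\pi}[r(a)]+\alpha\mathcal{H}(\pi)$ and the maximizer is the softmax policy $\pi(a)\propto e^{r(a)/\alpha}$ with optimal value $\alpha\log\sum_a e^{r(a)/\alpha}$. A minimal sketch (all numbers are made up for illustration):

```python
import numpy as np

def maxent_objective(policy, rewards, alpha=1.0):
    """One-step MaxEnt objective J(pi) = E_{a~pi}[r(a)] + alpha * H(pi).

    `policy` is a probability vector over actions, `rewards` the per-action
    reward; this is a toy single-state stand-in for the general objective.
    """
    policy = np.asarray(policy, dtype=float)
    entropy = -np.sum(policy * np.log(policy + 1e-12))
    return float(policy @ rewards + alpha * entropy)

rewards = np.array([1.0, 0.5, 0.0])
greedy = np.array([1.0, 0.0, 0.0])               # maximizes reward only
uniform = np.ones(3) / 3                          # maximizes entropy only
soft = np.exp(rewards) / np.exp(rewards).sum()    # softmax over rewards
```

With $\alpha=1$, `maxent_objective(soft, rewards)` exceeds both the greedy and the uniform policy's values and equals $\log\sum_a e^{r(a)}$, matching the closed form.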
The trajectory-distribution-matching perspective

Given a policy $\pi(a|s)$, the distribution over trajectories $\tau=(s_1,a_1,s_2,a_2,\cdots)$ obtained by interacting and sampling with this policy can be written as:
$$q(\tau)=p_1(s_1)\prod_{t=1}p(s_{t+1}|s_t,a_t)\pi(a_t|s_t)$$

Suppose there exists a fixed reward function $r(s_t, a_t)$ (an MDP usually has such a function, but its concrete form is unknown; we only observe sampled reward values during interaction). Under the MaxEnt framework, the ideal target trajectory distribution is proportional to the exponential of the accumulated reward along the trajectory:
$$p(\tau)=\frac{1}{Z}p_1(s_1)\prod_{t=1}p(s_{t+1}|s_t,a_t)e^{r(s_t, a_t)}$$

where
$$Z=\int p_1(s_1)\prod_{t=1}p(s_{t+1}|s_t,a_t)e^{r(s_t, a_t)}\,d\tau$$

is the normalization constant. The goal of MaxEnt RL is then to bring the trajectory distribution induced by the policy as close as possible to this target distribution. Measuring closeness with the KL divergence, the objective is:
$$\begin{aligned} &\quad \max_\pi\; -D_{KL}(q(\tau)\,\|\,p(\tau))\\ &=\max_\pi\;-\int q(\tau)\log\frac{\prod_{t=1}\pi(a_t|s_t)}{\prod_{t=1}e^{r(s_t,a_t)}}\,d\tau-\log Z\\ &=\max_\pi\;\mathbb{E}_{\tau\sim q(\tau)}\left[\sum_{t=1}\left(r(s_t,a_t)-\log\pi(a_t|s_t)\right)\right] \end{aligned}$$

The dynamics terms $p(s_{t+1}|s_t,a_t)$ cancel between $q$ and $p$, and the constant $\log Z$ is dropped in the last step because it does not depend on $\pi$. The result coincides with the objective derived from the maximum-policy-entropy perspective.
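The identity $-D_{KL}(q\|p)=\mathbb{E}_{q}[\sum_t r-\log\pi]-\log Z$ can be checked numerically on a tiny hypothetical two-step MDP with deterministic dynamics (so the dynamics terms cancel trivially and each trajectory is an action pair); the reward table and policy below are arbitrary:

```python
import numpy as np
from itertools import product

# Hypothetical 2-step deterministic MDP: s1 = 0, and the next state equals
# the chosen action. r is a table r(s, a); pi[s] is the policy at state s.
r = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.5, (1, 1): 0.0}
pi = {0: np.array([0.7, 0.3]), 1: np.array([0.4, 0.6])}

trajs = list(product([0, 1], repeat=2))  # trajectories as (a1, a2)

def ret(a1, a2):
    s2 = a1
    return r[(0, a1)] + r[(s2, a2)]

# Policy-induced and exponentiated-reward trajectory distributions
q = np.array([pi[0][a1] * pi[a1][a2] for a1, a2 in trajs])
p_unnorm = np.array([np.exp(ret(a1, a2)) for a1, a2 in trajs])
Z = p_unnorm.sum()
p = p_unnorm / Z

neg_kl = -np.sum(q * np.log(q / p))
maxent_return = sum(qi * (ret(a1, a2) - np.log(pi[0][a1]) - np.log(pi[a1][a2]))
                    for qi, (a1, a2) in zip(q, trajs))
# -D_KL(q || p) = E_q[sum r - log pi] - log Z, exactly
assert abs(neg_kl - (maxent_return - np.log(Z))) < 1e-10
```

Since $Z$ does not depend on $\pi$, maximizing `neg_kl` over the policy is equivalent to maximizing `maxent_return`, which is the MaxEnt objective.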
Deriving the Bellman equation under MaxEnt RL

Analogous to the standard Bellman equation, a similar Bellman equation holds under MaxEnt RL:
- First, in analogy with the MaxEnt RL objective $J(\pi)$ above, define the soft value function (we take $\alpha=1$ throughout):
$$V^\pi_{soft}(s_t=s)=\mathbb{E}_\tau\left[\sum_{T=t}^\infty\gamma^{T-t}\left(r(s_T,a_T)+\mathcal{H}(\pi(\cdot|s_T))\right)\Big|s_t = s\right]$$

Note that $T$ is the summation variable while $t$ is fixed. The entropy $\mathcal{H}$ appears here rather than $-\log \pi(a_T|s_T)$ because the expectation over $a_T$ has already been taken inside. Most importantly, under discounted rewards the policy-entropy terms are also multiplied by the discount factor (see Appendix A of the SAC paper).
- Similarly, the soft action-value function can be defined and then rewritten:
$$\begin{aligned} Q^\pi_{soft}(s_t=s,a_t=a)&=\mathbb{E}_\tau\left[r(s_t,a_t)+\gamma\sum_{T=t+1}^\infty\gamma^{T-t-1}\left(r(s_T,a_T)+\mathcal{H}(\pi(\cdot|s_T))\right)\Big|s_t=s,a_t=a\right]\\ &=r(s_t,a_t)+\gamma\,\mathbb{E}_{s_{t+1}\sim p(\cdot|s,a)}\left[V^\pi_{soft}(s_{t+1})\right] \end{aligned}$$

Because $s_t,a_t$ are given, the entropy terms in the first line start from $t+1$. The second line is exactly the MaxEnt-modified Bellman operator:
$$\mathcal{T}^\pi Q^\pi_{soft}\triangleq r(s_t,a_t)+\gamma\, \mathbb{E}_{s_{t+1}}\left[V^\pi_{soft}(s_{t+1})\right]$$

- Likewise, $V^\pi_{soft}$ can be rewritten as an equation in terms of $Q^\pi_{soft}$:

$$\begin{aligned} V^\pi_{soft}(s_t=s)&=\mathbb{E}_\tau\left[\sum_{T=t}^\infty\gamma^{T-t}\left(r(s_T,a_T)+\mathcal{H}(\pi(\cdot|s_T))\right) \Big|s_t = s\right]\\ &=\mathbb{E}_\tau\left[\sum_{T=t}^\infty\gamma^{T-t}r(s_T,a_T)+\sum^\infty_{T=t+1}\gamma^{T-t}\mathcal{H}(\pi(\cdot|s_T))+\mathcal{H}(\pi(\cdot|s_t)) \Big|s_t = s\right]\\ &=\mathbb{E}_\tau\left[r(s_t,a_t)+\gamma\sum_{T=t+1}^\infty\gamma^{T-t-1}\left(r(s_T,a_T)+\mathcal{H}(\pi(\cdot|s_T))\right)+\mathcal{H}(\pi(\cdot|s_t)) \Big|s_t = s\right]\\ &=\mathbb{E}_\tau\left[Q^\pi_{soft}(s_t,a_t)+\mathcal{H}(\pi(\cdot|s_t)) \Big|s_t = s\right]\\ &=\mathbb{E}_{a_t}\left[Q^\pi_{soft}(s_t,a_t)-\log\pi(a_t|s_t)\Big|s_t=s\right] \end{aligned}$$

- Substituting this expression for $V^\pi_{soft}$ into the equation for $Q^\pi_{soft}$ above gives:
$$\begin{aligned} Q^\pi_{soft}(s_t,a_t)&=r(s_t,a_t)+\gamma\,\mathbb{E}_{s_{t+1},a_{t+1}}\left[Q^\pi_{soft}(s_{t+1},a_{t+1})-\log\pi(a_{t+1}|s_{t+1})\right]\\ &=r(s_t,a_t)+\gamma\,\mathbb{E}_{s_{t+1}}\left[\mathcal{H}(\pi(\cdot|s_{t+1}))\right]+\gamma\,\mathbb{E}_{s_{t+1},a_{t+1}}\left[Q^\pi_{soft}(s_{t+1},a_{t+1})\right]\\ &=r^\pi_{soft}(s_t,a_t)+\gamma\,\mathbb{E}_{s_{t+1},a_{t+1}}\left[Q^\pi_{soft}(s_{t+1},a_{t+1})\right] \end{aligned}$$

This has the same form as the standard Bellman equation; the only difference is the entropy-augmented reward $r^\pi_{soft}(s_t,a_t)\triangleq r(s_t,a_t)+\gamma\,\mathbb{E}_{s_{t+1}}\left[\mathcal{H}(\pi(\cdot|s_{t+1}))\right]$.
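Both forms of the soft Bellman recursion reach the same fixed point, which can be verified by tabular soft policy evaluation. A minimal sketch on a randomly generated hypothetical MDP (all sizes and seeds arbitrary):

```python
import numpy as np

# Hypothetical tabular MDP: 2 states, 2 actions, random dynamics/rewards/policy.
rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
R = rng.uniform(size=(nS, nA))                  # r(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi(a|s)

# Form 1: iterate Q <- r + gamma * E_{s'}[ E_{a'}[Q(s',a') - log pi(a'|s')] ].
Q = np.zeros((nS, nA))
for _ in range(500):
    V = (pi * (Q - np.log(pi))).sum(axis=1)      # V(s) = E_a[Q - log pi]
    Q = R + gamma * P @ V

# Form 2: fold the entropy into the reward,
# r_soft(s,a) = r(s,a) + gamma * E_{s'}[H(pi(.|s'))], then do plain evaluation.
H = -(pi * np.log(pi)).sum(axis=1)
r_soft = R + gamma * P @ H
Q2 = np.zeros((nS, nA))
for _ in range(500):
    Q2 = r_soft + gamma * P @ (pi * Q2).sum(axis=1)
```

After convergence `Q` and `Q2` agree, confirming that the entropy bonus and the modified-reward formulations define the same soft action-value function.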
Some notes on SAC
- As with TD algorithms derived from the standard Bellman equation, SAC uses the soft Bellman equation above to build TD targets, approximating the value functions with parameterized function approximators: $V^\pi_{soft}(s)\approx V_\psi(s)$, $Q^\pi_{soft}(s,a)\approx Q_\phi(s,a)$. The soft Bellman equation then gives the objective for the value function $V_\psi(s)$:

$$J_V(\psi)=\mathbb{E}_{s\sim\mathcal{D}}\left[\frac{1}{2}\left(V_\psi(s)-\mathbb{E}_{a\sim\pi_\theta}\left[Q_\phi(s,a)-\log\pi_\theta(a|s)\right]\right)^2\right]$$

Note that the action $a$ here is freshly sampled from the policy $\pi_\theta$, not taken from the transition sample. The objective for the action-value function is:
$$\begin{aligned} J_Q(\phi)&=\mathbb{E}_{s,a\sim\mathcal{D}}\left[\frac{1}{2}\left(Q_\phi(s,a)-r(s,a)-\gamma \mathbb{E}_{s'|s,a}\left[V_\psi(s')\right]\right)^2\right]\\ &=\mathbb{E}_{s,a\sim\mathcal{D}}\left[\frac{1}{2}\left(Q_\phi(s,a)-r(s,a)-\gamma \mathbb{E}_{s'|s,a}\left[\mathbb{E}_{a'\sim\pi_\theta}\left[Q_\phi(s',a')-\log\pi_\theta(a'|s')\right]\right]\right)^2\right] \end{aligned}$$

Here $s,a,r(s,a),s'$ all come from replay-buffer transitions $\langle s,a,r,s'\rangle$, while $a'$ is sampled from the current policy. The first line is the action-value objective when a separate $V_\psi$ function is maintained; the sampling of $a'$ was already done when fitting the value function. The second line is the form used when only the action-value function $Q_\phi$ is maintained; compared with the TD-error target from the ordinary Bellman equation, it differs by an entropy term. Constructing the target for the action-value function requires sampling $a'$: if the policy that samples $a'$ differs from the policy that generated the transition $\langle s,a,r,s'\rangle$, the algorithm is off-policy; if they coincide, it is on-policy. Since SAC updates $Q_\phi$ with samples drawn from a replay buffer, the behavior data come from various past policies while the target is built with the current policy, so SAC is off-policy.
- Another potentially confusing point is the policy update, whose objective is to minimize the following KL divergence:
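To make the sampling distinction above concrete, here is a hypothetical tabular sketch of the two targets: $(s,a,r,s')$ are stand-ins for replay-buffer transitions, while the action inside the $V_\psi$ target is taken from the current policy (here via the exact expectation over a discrete $\pi_\theta$). All tables and the batch are randomly generated placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical tabular critics and a softmax policy over nA discrete actions.
nS, nA, gamma = 4, 3, 0.99
Q = rng.normal(size=(nS, nA))
V = rng.normal(size=nS)
logits = rng.normal(size=(nS, nA))
pi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Stand-in for a replay-buffer batch of transitions <s, a, r, s'>.
batch_s  = np.array([0, 1, 2])
batch_a  = np.array([1, 0, 2])      # actions stored by *past* policies
batch_r  = np.array([0.5, -0.1, 1.0])
batch_s2 = np.array([1, 2, 3])

# J_V: the action inside the target is resampled from the *current* policy,
# not read from the buffer -- here as an exact expectation over a ~ pi.
v_target = (pi[batch_s] * (Q[batch_s] - np.log(pi[batch_s]))).sum(axis=1)
J_V = 0.5 * np.mean((V[batch_s] - v_target) ** 2)

# J_Q: (s, a, r, s') come from the buffer; only the bootstrap uses V(s').
q_target = batch_r + gamma * V[batch_s2]
J_Q = 0.5 * np.mean((Q[batch_s, batch_a] - q_target) ** 2)
```

The mismatch between the buffer actions `batch_a` and the current-policy expectation inside `v_target` is exactly what makes the scheme off-policy.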
$$\begin{aligned} J_\pi(\theta)&=\mathbb{E}_{s\sim\mathcal{D}}\left[D_{KL}\left(\pi_\theta(\cdot|s)\,\Big\Vert\,\frac{\exp(Q_\phi(s,\cdot))}{Z_\phi(s)}\right)\right]\\ &=\mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_\theta}\left[\log\pi_\theta(a|s)-Q_\phi(s,a)\right] \end{aligned}$$

A Gaussian policy is typically used, which brings in the reparameterization trick $a=f_\theta(s,\epsilon)=\mu_\theta(s)+\epsilon\cdot\sigma_\theta(s)$, while $\pi_\theta(a|s)$ denotes a probability value that also depends on $\theta$. Substituting into the objective above gives:
$$J_\pi(\theta)=\mathbb{E}_{s\sim\mathcal{D},\epsilon\sim\mathcal{N}}\left[\log\pi_\theta(f_\theta(s,\epsilon)|s)-Q_\phi(s,f_\theta(s,\epsilon))\right]$$

When differentiating, be careful to distinguish $f_\theta$ from $\pi_\theta$. The first term $\log\pi_\theta(f_\theta(s,\epsilon)|s)$ has two gradient paths with respect to $\theta$:
a. directly through the probability function, $\pi_\theta(a|s)\rightarrow\theta$: $\nabla_\theta\log\pi_\theta(a|s)$;
b. through the action, $\pi_\theta(a|s)\rightarrow a\rightarrow f_\theta(s,\epsilon)\rightarrow\theta$: $\nabla_a\log\pi_\theta(a|s)\,\nabla_\theta f_\theta(s,\epsilon)$;
The second term $Q_\phi(s,f_\theta(s,\epsilon))$ has only one path, $Q_\phi(s,a)\rightarrow a\rightarrow f_\theta(s,\epsilon)\rightarrow\theta$: $\nabla_a Q_\phi(s,a)\,\nabla_\theta f_\theta(s,\epsilon)$. Summing the three contributions gives:

$$\nabla_\theta J_\pi(\theta)=\mathbb{E}_{s\sim\mathcal{D},\epsilon\sim\mathcal{N}}\left[\nabla_\theta\log\pi_\theta(a|s)+\left(\nabla_a\log\pi_\theta(a|s)-\nabla_a Q_\phi(s,a)\right)\nabla_\theta f_\theta(s,\epsilon)\right]$$
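The three paths can be checked numerically in a hypothetical 1-D example with $a=f_\theta(\epsilon)=\mu+\sigma\epsilon$ and an illustrative critic $Q(s,a)=-a^2$. For a Gaussian, $\log\pi$ evaluated at its own sample does not depend on $\mu$, so paths (a) and (b) cancel sample-by-sample and the exact gradient is $\partial J/\partial\mu=\mathbb{E}[2a]=2\mu$:

```python
import numpy as np

rng = np.random.default_rng(2)

# 1-D Gaussian policy a = mu + sigma * eps; hypothetical critic Q(s,a) = -a^2.
mu, sigma = 0.7, 0.5
eps = rng.standard_normal(100_000)
a = mu + sigma * eps

# The three gradient paths from the text, each w.r.t. mu:
g_direct = (a - mu) / sigma**2            # path (a): grad_mu log pi, a held fixed
g_via_a  = -(a - mu) / sigma**2 * 1.0     # path (b): grad_a log pi * grad_mu f
g_q      = -(-2.0 * a) * 1.0              # -grad_a Q * grad_mu f
grad_mu = np.mean(g_direct + g_via_a + g_q)
```

Here `g_direct + g_via_a` is identically zero, so the Monte Carlo estimate `grad_mu` approaches $2\mu$, matching the analytic gradient of $J(\mu)=\mathrm{const}+\mu^2+\sigma^2$.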