Reinforcement Learning: Policy Gradient with a Baseline (REINFORCE and A2C)
1. Derivation of the baseline
- Policy network: $\pi(a|s;\theta)$
- State-value function: $V_\pi(s)=E_{A\sim\pi}[Q_\pi(s,A)]=\sum_a\pi(a|s;\theta)\cdot Q_\pi(s,a)$
- Policy gradient: $\frac{\partial V_\pi(s)}{\partial \theta}=E_{A\sim\pi}\left[Q_\pi(s,A)\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]$
- Let $b$ be anything that does not depend on the action $A$. Then
$$E_{A\sim\pi}\left[b\cdot \frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]=b\cdot E_{A\sim\pi}\left[\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]=b\cdot \sum_a \pi(a|s;\theta)\cdot \frac{\partial \log \pi(a|s;\theta)}{\partial \theta}\\=b\cdot \sum_a \pi(a|s;\theta)\cdot \frac{1}{\pi(a|s;\theta)}\cdot \frac{\partial \pi(a|s;\theta)}{\partial \theta}=b\cdot \frac{\partial \sum_a \pi(a|s;\theta)}{\partial \theta}=b\cdot\frac{\partial 1}{\partial \theta}=0$$
Therefore, if $b$ is independent of the action $A$, then $E_{A\sim\pi}\left[b\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]=0$.
- The policy gradient with a baseline is then
$$\frac{\partial V_\pi(s)}{\partial \theta}=E_{A\sim\pi}\left[Q_\pi(s,A)\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]-E_{A\sim\pi}\left[b\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]=E_{A\sim\pi}\left[\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\cdot\big(Q_\pi(s,A)-b\big)\right]$$
The baseline $b$ does not change the expectation, but a well-chosen $b$ reduces the variance of the Monte Carlo approximation and speeds up convergence.
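This identity can be checked numerically for a small discrete softmax policy. Below is a minimal sketch (the number of actions, the logits, and the value of $b$ are all illustrative assumptions, not from the notes): it forms the exact expectation over actions and shows with autograd that the gradient is numerically zero for any constant $b$.

```python
import torch

torch.manual_seed(0)
num_actions = 4
theta = torch.randn(num_actions, requires_grad=True)   # logits play the role of theta
b = 3.7                                                 # arbitrary baseline, independent of A

probs = torch.softmax(theta, dim=0)                     # pi(a|s; theta)
log_probs = torch.log(probs)

# Exact expectation: sum_a pi(a|s;theta) * b * d log pi(a|s;theta) / d theta
expectation = (probs.detach() * b * log_probs).sum()
grad = torch.autograd.grad(expectation, theta)[0]
print(grad)   # ~tensor([0., 0., 0., 0.]) up to floating-point error
```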
2. Monte Carlo approximation of the policy gradient
- The policy gradient with a baseline is
$$\frac{\partial V_\pi(s_t)}{\partial \theta}=E_{A_t\sim\pi}\left[\frac{\partial \log \pi(A_t|s_t;\theta)}{\partial \theta}\cdot\big(Q_\pi(s_t,A_t)-b\big)\right],\qquad g(A_t)=\frac{\partial \log \pi(A_t|s_t;\theta)}{\partial \theta}\cdot\big(Q_\pi(s_t,A_t)-b\big)$$
- Sample the action at time $t$ from the policy: $a_t\sim\pi(\cdot|s_t;\theta)$
- Then $g(a_t)$ is an unbiased estimate of the policy gradient.
- Stochastic policy gradient: $g(a_t)=\big(Q_\pi(s_t,a_t)-b\big)\cdot\frac{\partial \log \pi(a_t|s_t;\theta)}{\partial \theta}$
- Gradient ascent: $\theta\gets\theta+\beta\cdot g(a_t)$
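A minimal sketch of one such gradient-ascent step for a discrete softmax policy. Everything here is a placeholder: `policy_logits` stands for the policy network $\pi(\cdot|s_t;\theta)$, and `q_value` and `baseline` are assumed to be given estimates of $Q_\pi(s_t,a_t)$ and $b$.

```python
import torch

def policy_gradient_step(policy_logits, params, state, q_value, baseline, beta=1e-3):
    """One step of theta <- theta + beta * (Q_pi(s_t,a_t) - b) * d log pi(a_t|s_t;theta) / d theta."""
    params = list(params)
    dist = torch.distributions.Categorical(logits=policy_logits(state))
    action = dist.sample()                        # a_t ~ pi(.|s_t; theta)
    log_prob = dist.log_prob(action)              # log pi(a_t|s_t; theta)

    advantage = q_value - baseline                # (Q_pi(s_t,a_t) - b), a plain number here
    loss = -advantage * log_prob                  # descending this loss ascends the policy gradient
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= beta * g                         # theta <- theta + beta * g(a_t)
    return action.item()
```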
3. Choosing the baseline
- The standard policy gradient corresponds to $b=0$.
- A better choice is the state-value function, since it does not depend on the action $A_t$ and is close to the action-value function: $b=V_{\pi}(s_t)$, where $V_\pi(s_t)=E_{A_t}[Q_\pi(s_t,A_t)]$.
4. The REINFORCE algorithm
4.1 Basic concepts
- Discounted return: $U_t=R_t+\gamma\cdot R_{t+1}+\gamma^2\cdot R_{t+2}+\dots$
- Action-value function: $Q_\pi(s_t,a_t)=E[U_t|s_t,a_t]$
- State-value function: $V_\pi(s_t)=E_{A_t}[Q_\pi(s_t,A_t)|s_t]$
- Policy gradient with baseline:
$$\frac{\partial V_\pi(s_t)}{\partial\theta}=E_{A_t\sim\pi}[g(A_t)]=E_{A_t\sim\pi}\left[\frac{\partial \log \pi(A_t|s_t;\theta)}{\partial \theta}\cdot\big(Q_\pi(s_t,A_t)-V_\pi(s_t)\big)\right]$$
- Sample an action and form the Monte Carlo approximation, which is unbiased:
$$a_t\sim\pi(\cdot|s_t;\theta),\qquad g(a_t)=\big(Q_\pi(s_t,a_t)-b\big)\cdot\frac{\partial \log \pi(a_t|s_t;\theta)}{\partial \theta}$$
- Approximate the action-value function by Monte Carlo (the key step of REINFORCE; see the sketch after this list):
$$Q_\pi(s_t,a_t)=E[U_t|s_t,a_t]\approx u_t$$
where the observed trajectory is $s_t,a_t,r_t,s_{t+1},a_{t+1},r_{t+1},\dots,s_{t+n},a_{t+n},r_{t+n}$ and $u_t=\sum_{i=t}^n\gamma^{i-t}r_i$.
- Approximate the state-value function with a neural network: $v(s_t;W)\approx V_\pi(s_t)$
- The approximated policy gradient is then
$$\frac{\partial V_\pi(s_t)}{\partial\theta}\approx\frac{\partial \log \pi(a_t|s_t;\theta)}{\partial \theta}\cdot\big(u_t-v(s_t;W)\big)$$
The derivation above uses three approximations:
- sampling the action is a Monte Carlo approximation;
- the action-value function is approximated by the observed Monte Carlo return;
- the state-value function is approximated by a neural network.
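A minimal sketch of the return approximation $u_t=\sum_{i=t}^n\gamma^{i-t}r_i$, computed by accumulating rewards backwards over one observed episode (the reward list in the example is made up):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute u_t = sum_{i=t}^{n} gamma^(i-t) * r_i for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the last step backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a short 4-step episode
print(discounted_returns([1.0, 0.0, 0.0, 1.0], gamma=0.9))
# ~[1.729, 0.81, 0.9, 1.0]
```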
4.2 Training procedure
- Policy network: $\pi(a_t|s_t;\theta)$
- Value network: $v(s_t;W)$
- The two networks can share parameters.
- Play one complete episode to obtain a trajectory: $\{(s_1,a_1,r_1),(s_2,a_2,r_2),\dots,(s_n,a_n,r_n)\}$
- Compute the Monte Carlo approximation of the action value and the error term:
$$u_t=\sum_{i=t}^n \gamma^{i-t}\cdot r_i,\qquad \delta_t=v(s_t;W)-u_t$$
- Update the policy network by gradient ascent on the policy gradient (the sign is negative here because $\delta_t$ is defined as $v(s_t;W)-u_t$):
$$\theta\gets\theta-\beta\cdot\delta_t\cdot \frac{\partial \log \pi(a_t|s_t;\theta)}{\partial\theta}$$
- Update the value network by gradient descent: $W\gets W-\alpha\cdot \delta_t\cdot\frac{\partial v(s_t;W)}{\partial W}$
- Since the trajectory has length $n$, the networks can be updated $n$ times per episode.
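A minimal end-to-end sketch of this procedure in PyTorch, assuming a Gymnasium-style environment with discrete actions; the network sizes, learning rates, and environment interface are illustrative assumptions, not part of the original notes. It reuses the `discounted_returns` helper sketched above.

```python
import numpy as np
import torch
import torch.nn as nn

# Placeholder networks for a 4-dimensional state and 2 discrete actions.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # logits of pi(a|s;theta)
value_net  = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))   # v(s;W)
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)             # step size beta
value_opt  = torch.optim.Adam(value_net.parameters(),  lr=1e-3)             # step size alpha

def run_episode(env):
    """Collect one full trajectory {(s_t, a_t, r_t)} by sampling a_t ~ pi(.|s_t;theta)."""
    states, actions, rewards = [], [], []
    s, _ = env.reset()
    done = False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy_net(torch.as_tensor(s, dtype=torch.float32)))
        a = dist.sample().item()
        s_next, r, terminated, truncated, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s, done = s_next, terminated or truncated
    return states, actions, rewards

def reinforce_with_baseline_update(states, actions, rewards, gamma=0.99):
    u = discounted_returns(rewards, gamma)                     # u_t from the sketch above

    S = torch.as_tensor(np.array(states), dtype=torch.float32)
    A = torch.as_tensor(actions)
    U = torch.as_tensor(u, dtype=torch.float32)

    V = value_net(S).squeeze(-1)                               # v(s_t; W)
    delta = (V - U).detach()                                   # delta_t = v(s_t;W) - u_t

    # Policy: theta <- theta - beta * delta_t * d log pi(a_t|s_t;theta) / d theta
    log_prob = torch.distributions.Categorical(logits=policy_net(S)).log_prob(A)
    policy_loss = (delta * log_prob).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

    # Value: W <- W - alpha * delta_t * d v(s_t;W) / d W  (gradient of 0.5 * delta_t^2)
    value_loss = 0.5 * ((V - U) ** 2).mean()
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()
```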
5. The A2C algorithm (Advantage Actor-Critic)
5.1 Network structure and training procedure
- Policy network (actor): $\pi(a_t|s_t;\theta)$
- Value network (critic): $v(s_t;W)$
- The two networks can share parameters.
- Observe one transition: $(s_t,a_t,r_t,s_{t+1})$
- TD target: $y_t=r_t+\gamma\cdot v(s_{t+1};W)$
- TD error: $\delta_t = v(s_t;W)-y_t$
- Update the policy network: $\theta\gets\theta-\beta\cdot\delta_t\cdot\frac{\partial \log \pi(a_t|s_t;\theta)}{\partial\theta}$
- Update the value network: $W\gets W-\alpha\cdot\delta_t\cdot\frac{\partial v(s_t;W)}{\partial W}$
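A minimal sketch of one A2C update from a single transition, reusing the placeholder `policy_net`, `value_net`, and optimizers from the REINFORCE sketch above; the `done` handling (no bootstrapping on terminal states) is an extra practical detail not discussed in the notes.

```python
def a2c_update(s_t, a_t, r_t, s_next, done, gamma=0.99):
    s_t    = torch.as_tensor(s_t,    dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # TD target: y_t = r_t + gamma * v(s_{t+1}; W)
    with torch.no_grad():
        y_t = r_t + gamma * value_net(s_next).squeeze(-1) * (1.0 - float(done))

    v_t = value_net(s_t).squeeze(-1)
    delta_t = (v_t - y_t).detach()                  # TD error: delta_t = v(s_t;W) - y_t

    # Actor: theta <- theta - beta * delta_t * d log pi(a_t|s_t;theta) / d theta
    log_prob = torch.distributions.Categorical(
        logits=policy_net(s_t)).log_prob(torch.as_tensor(a_t))
    policy_loss = delta_t * log_prob
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

    # Critic: W <- W - alpha * delta_t * d v(s_t;W) / d W  (gradient of 0.5 * delta_t^2)
    value_loss = 0.5 * (v_t - y_t) ** 2
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()
```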
5.2 Mathematical derivation
5.2.1 Definitions
- Discounted return: $U_t=R_t+\gamma R_{t+1}+\gamma^2 R_{t+2}+\dots$
- Action-value function: $Q_\pi(s_t,a_t)=E[U_t|s_t,a_t]$
- State-value function: $V_\pi(s_t)=E_{A_t}[Q_\pi(s_t,A_t)|s_t]$
5.2.2 Theorem 1 (relation between action value and state value)
$$Q_\pi(s_t,a_t)=E[U_t|s_t,a_t]=E_{A_{t+1},S_{t+1}}[R_t+\gamma\cdot Q_\pi(S_{t+1},A_{t+1})]\\=E_{S_{t+1}}\big[R_t+\gamma\cdot E_{A_{t+1}}[Q_\pi(S_{t+1},A_{t+1})]\big]=E_{S_{t+1}}[R_t+\gamma\cdot V_\pi(S_{t+1})]$$
Its Monte Carlo approximation,
$$Q_\pi(s_t,a_t)\approx r_t+\gamma\cdot V_\pi(s_{t+1}),$$
is used to train the policy network.
5.2.3 Theorem 2 (relation between state values at consecutive time steps)
$$V_\pi(s_t)=E_{A_t}[Q_\pi(s_t,A_t)]=E_{A_t}\big[E_{S_{t+1}}[R_t+\gamma\cdot V_\pi(S_{t+1})]\big]=E_{A_t,S_{t+1}}[R_t+\gamma\cdot V_\pi(S_{t+1})]$$
Its Monte Carlo approximation,
$$V_\pi(s_t)\approx r_t+\gamma\cdot V_\pi(s_{t+1}),$$
is used to train the value network.
5.2.4 Updating the policy network
- Stochastic policy gradient:
$$g(a_t)=\frac{\partial \log \pi(a_t|s_t;\theta)}{\partial \theta}\cdot\big(Q_\pi(s_t,a_t)-V_\pi(s_t)\big)$$
Using $Q_\pi(s_t,a_t)\approx r_t+\gamma\cdot V_\pi(s_{t+1})=y_t$ and the approximation $v(s_t;W)\approx V_\pi(s_t)$, the update becomes
$$\theta\gets \theta +\beta\cdot\big(y_t-v(s_t;W)\big)\cdot \frac{\partial \log \pi(a_t|s_t;\theta)}{\partial \theta}$$
5.2.5 Updating the value network
- From Theorem 2, $V_\pi(s_t)\approx r_t+\gamma\cdot V_\pi(s_{t+1})$, so the value network should satisfy $v(s_t;W)\approx r_t+\gamma\cdot v(s_{t+1};W)=y_t$ (the TD target).
- TD error: $\delta_t=v(s_t;W)-y_t$
- Gradient of the squared TD error (with $y_t$ treated as a constant): $\frac{\partial\,\frac{1}{2}\delta_t^2}{\partial W}=\delta_t\cdot \frac{\partial v(s_t;W)}{\partial W}$
- Gradient update: $W\gets W-\alpha\cdot \delta_t\cdot \frac{\partial v(s_t;W)}{\partial W}$
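A minimal autograd sketch of this step; the key practical point is that the TD target $y_t$ is held constant (computed under `no_grad`) so that differentiating $\frac{1}{2}\delta_t^2$ yields exactly $\delta_t\cdot\partial v(s_t;W)/\partial W$. The network and inputs below are placeholders.

```python
import torch
import torch.nn as nn

value_net = nn.Linear(4, 1)                # placeholder v(s; W)
s_t, s_next = torch.randn(4), torch.randn(4)
r_t, gamma, alpha = 1.0, 0.99, 1e-2

with torch.no_grad():                      # y_t is a fixed target, not differentiated
    y_t = r_t + gamma * value_net(s_next)

v_t = value_net(s_t)
delta_t = v_t - y_t                        # TD error
loss = 0.5 * delta_t ** 2
loss.backward()                            # p.grad now holds delta_t * d v(s_t;W) / d W

with torch.no_grad():
    for p in value_net.parameters():
        p -= alpha * p.grad                # W <- W - alpha * delta_t * d v(s_t;W) / d W
        p.grad = None
```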
5.3 Interpreting the policy gradient
$$g(a_t)=\big(r_t+\gamma\cdot v(s_{t+1};W)-v(s_t;W)\big)\cdot\frac{\partial \log \pi(a_t|s_t;\theta)}{\partial \theta}$$
The difference $r_t+\gamma\cdot v(s_{t+1};W)-v(s_t;W)$ measures the advantage of the chosen action, i.e., how much better the observed outcome is than the critic's current estimate, which is why the method is called Advantage Actor-Critic.
6. REINFORCE vs. A2C
6.1 A2C
6.1.1 One-step TD target
- Observe one transition: $(s_t,a_t,r_t,s_{t+1})$
- $y_t=r_t+\gamma\cdot v(s_{t+1};W)$
6.1.2 Multi-step TD target
- Observe $m$ transitions: $\{(s_{t+i},a_{t+i},r_{t+i},s_{t+i+1})\}_{i=0}^{m-1}$
- $y_t = \sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot v(s_{t+m};W)$
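A minimal sketch of the multi-step TD target, assuming the $m$ rewards $r_t,\dots,r_{t+m-1}$ and a bootstrap value $v(s_{t+m};W)$ are already available (the numbers in the example are made up):

```python
def multi_step_td_target(rewards_m, bootstrap_value, gamma=0.99):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * v(s_{t+m}; W)."""
    m = len(rewards_m)
    y_t = sum(gamma ** i * r for i, r in enumerate(rewards_m))
    return y_t + gamma ** m * bootstrap_value

# Example with m = 3 rewards and a bootstrapped value of 0.5 for s_{t+3}
print(multi_step_td_target([1.0, 0.0, 1.0], bootstrap_value=0.5, gamma=0.9))
# 1.0 + 0.0 + 0.81 + 0.729 * 0.5 = 2.1745
```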
6.2 REINFORCE
- Return: $u_t = \sum_{i=t}^n \gamma^{i-t}\cdot r_i$
- Error: $\delta_t=v(s_t;W)-u_t$
6.3 REINFORCE as a special case of A2C
The multi-step A2C TD target is
$$y_t = \sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot v(s_{t+m};W)$$
When all remaining rewards of the episode are used (and the bootstrap value is dropped, since there is no state left to bootstrap from), the target becomes the full Monte Carlo return:
$$y_t=u_t=\sum_{i=t}^n \gamma^{i-t}\cdot r_i$$
Therefore the REINFORCE algorithm is a special case of A2C.
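This correspondence can be checked with the two helpers sketched earlier: applying the multi-step target to all remaining rewards with a zero bootstrap value reproduces the REINFORCE return $u_t$ (the reward list is illustrative).

```python
rewards = [1.0, 0.0, 0.0, 1.0]
gamma = 0.9

u = discounted_returns(rewards, gamma)                                  # REINFORCE returns u_t
y0 = multi_step_td_target(rewards, bootstrap_value=0.0, gamma=gamma)    # m spans the whole episode
print(u[0], y0)   # both 1.729: with all rewards and no bootstrap, y_t equals u_t
```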
These are study notes written while following a tutorial video series on Bilibili.
by CyrusMay 2022 04 11