1.Sarsa Algorithm
Each update uses the quintuple $(s_t,a_t,r_t,s_{t+1},a_{t+1})$ to update the parameters, hence the name State-Action-Reward-State-Action (SARSA).
1.0.Derive TD Target
Discounted return: the reward $R_t$ depends on $(S_t,A_t,S_{t+1})$, and

$$\begin{aligned} U_t&=R_t+\gamma·R_{t+1}+\gamma^2·R_{t+2}+\cdots \\&=R_t+\gamma·(R_{t+1}+\gamma·R_{t+2}+\cdots) \\&=R_t+\gamma·U_{t+1} \end{aligned}$$
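A quick numeric sanity check of the recursion $U_t=R_t+\gamma·U_{t+1}$, using an arbitrary made-up reward sequence and discount factor (all values below are illustrative):

```python
# Toy check that the recursive and direct forms of the discounted return agree.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 3.0]  # arbitrary R_t, R_{t+1}, ...

# Direct form: U_t = sum_i gamma^i * R_{t+i}
direct = sum(gamma ** i * r for i, r in enumerate(rewards))

# Recursive form: U_t = R_t + gamma * U_{t+1}, computed backwards
u = 0.0
for r in reversed(rewards):
    u = r + gamma * u

print(direct, u)  # both print the same value
```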
$$\begin{aligned} Q_\pi(s_t,a_t)&=E[U_t|s_t,a_t] \\&=E[R_t+\gamma·U_{t+1}|s_t,a_t] \\&=E[R_t|s_t,a_t]+\gamma·E[U_{t+1}|s_t,a_t] \\&=E[R_t|s_t,a_t]+\gamma·E[Q_\pi(S_{t+1},A_{t+1})|s_t,a_t] \end{aligned}$$
Identity: $Q_\pi(s_t,a_t)=E[R_t+\gamma·Q_\pi(S_{t+1},A_{t+1})]$, for all $\pi$.
Monte Carlo approximation: $Q_\pi(s_t,a_t)\approx r_t+\gamma Q_\pi(s_{t+1},a_{t+1})=y_t$, where $y_t$ is the TD target.
1.1.Tabular Version
Suitable for small-scale problems where the table is small: the states and actions form a $Q$ table, which is updated with the Sarsa algorithm (a minimal sketch follows the list below).
- Observe a transition $s_t,a_t,r_t,s_{t+1}$
- Sample the action $a_{t+1}$ from the policy $\pi(·|s_{t+1})$
- TD target: $y_t=r_t+\gamma·Q_\pi(s_{t+1},a_{t+1})$, where $Q_\pi(s_{t+1},a_{t+1})$ is looked up in the table
- TD error: $\delta_t=Q_\pi(s_t,a_t)-y_t$
- Update the $Q$ table: $Q_\pi(s_t,a_t)\leftarrow Q_\pi(s_t,a_t)-\alpha·\delta_t$
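A minimal sketch of one tabular Sarsa update, assuming a `Q` array indexed by discrete state and action ids and an epsilon-greedy behavior policy; all names and hyperparameters here are illustrative:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """Sample an action from an epsilon-greedy policy pi(.|s)."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s_t, a_t, r_t, s_next, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One tabular Sarsa step: sample a_{t+1}, form the TD target, update Q[s_t, a_t]."""
    a_next = epsilon_greedy(Q, s_next, epsilon)   # a_{t+1} ~ pi(.|s_{t+1})
    y_t = r_t + gamma * Q[s_next, a_next]         # TD target
    delta_t = Q[s_t, a_t] - y_t                   # TD error
    Q[s_t, a_t] -= alpha * delta_t                # table update
    return a_next
```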
1.2.Neural Network Version
Use a value network $q(s,a;w)$ to approximate $Q_\pi(s,a)$; one update step (sketched in code after the list) is:
- TD target: $y_t=r_t+\gamma·q(s_{t+1},a_{t+1};w)$
- TD error: $\delta_t=q(s_t,a_t;w)-y_t$
- Loss: $\delta_t^2/2$
- Gradient: $\frac{\partial\,\delta^2_t/2}{\partial w}=\delta_t·\frac{\partial q(s_t,a_t;w)}{\partial w}$
- Gradient descent: $w\leftarrow w-\alpha·\delta_t·\frac{\partial q(s_t,a_t;w)}{\partial w}$
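A minimal PyTorch sketch of one Sarsa update for a value network $q(s,a;w)$ that outputs one value per discrete action; the network shape, learning rate, and variable names are illustrative assumptions. Here `s_t` and `s_next` are 1-D state tensors and `a_t`, `a_next` are integer action indices:

```python
import torch
import torch.nn as nn

# q(s, a; w): maps a state vector to one value per discrete action.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.9

def sarsa_step(s_t, a_t, r_t, s_next, a_next):
    """One gradient-descent step on the loss delta_t^2 / 2."""
    with torch.no_grad():                      # TD target is treated as a constant
        y_t = r_t + gamma * q_net(s_next)[a_next]
    q_t = q_net(s_t)[a_t]                      # q(s_t, a_t; w)
    loss = 0.5 * (q_t - y_t) ** 2              # delta_t^2 / 2
    optimizer.zero_grad()
    loss.backward()                            # gives delta_t * dq/dw
    optimizer.step()                           # w <- w - alpha * gradient
```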
2.Q-Learning
Comparison of Sarsa and Q-Learning:
| | Sarsa | Q-Learning |
|---|---|---|
| Target function | $Q_\pi(s,a)$ | $Q^*(s,a)$ |
| TD target | $y_t=r_t+\gamma·Q_\pi(s_{t+1},a_{t+1})$ | $y_t=r_t+\gamma·\max_a Q^*(s_{t+1},a)$ |
| Updates | value network (critic) | DQN |
2.0.TD Target
As derived in 1.0, for a policy $\pi$:

$$Q_\pi(s_t,a_t)=E[R_t+\gamma·Q_\pi(S_{t+1},A_{t+1})]$$
For the optimal policy $\pi^*$:

$$Q^*(s_t,a_t)=E[R_t+\gamma·Q^*(S_{t+1},A_{t+1})]$$
Taking the action $A_{t+1}=\arg\max_a Q^*(S_{t+1},a)$, we have $Q^*(S_{t+1},A_{t+1})=\max_a Q^*(S_{t+1},a)$, so

$$Q^*(s_t,a_t)=E[R_t+\gamma·\max_a Q^*(S_{t+1},a)]$$
Using a Monte Carlo approximation gives the TD target $y_t$:

$$Q^*(s_t,a_t)\approx r_t+\gamma·\max_a Q^*(s_{t+1},a)=y_t$$
2.1.Tabular Version
Suitable for small-scale problems where the table is small: the states and actions form a $Q^*$ table, which is updated with the Q-Learning algorithm (a minimal sketch follows the list below).
- Observe a transition $s_t,a_t,r_t,s_{t+1}$
- TD target: $y_t=r_t+\gamma·\max_a Q^*(s_{t+1},a)$, i.e. the largest table entry in the row for $s_{t+1}$
- TD error: $\delta_t=Q^*(s_t,a_t)-y_t$
- Update the $Q^*$ table: $Q^*(s_t,a_t)\leftarrow Q^*(s_t,a_t)-\alpha·\delta_t$
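A minimal sketch of one tabular Q-Learning update under the same assumptions as the Sarsa sketch above (a `Q` array indexed by discrete state and action ids; hyperparameters are illustrative):

```python
import numpy as np

def q_learning_update(Q, s_t, a_t, r_t, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-Learning step: bootstrap with the max over actions at s_{t+1}."""
    y_t = r_t + gamma * np.max(Q[s_next])   # TD target: max entry in the row for s_{t+1}
    delta_t = Q[s_t, a_t] - y_t             # TD error
    Q[s_t, a_t] -= alpha * delta_t          # table update
```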
2.2.DQN Version
Use a DQN $Q(s,a;w)$ to approximate $Q^*(s,a)$; the agent takes the action $a_t=\arg\max_a Q(s_t,a;w)$.
The DQN can be trained with the Q-Learning algorithm (a training sketch follows the list):
- Observe a transition $s_t,a_t,r_t,s_{t+1}$
- TD target: $y_t=r_t+\gamma·\max_a Q(s_{t+1},a;w)$
- TD error: $\delta_t=Q(s_t,a_t;w)-y_t$
- Loss: $\delta_t^2/2$
- Gradient: $\frac{\partial\,\delta^2_t/2}{\partial w}=\delta_t·\frac{\partial Q(s_t,a_t;w)}{\partial w}$
- Gradient descent: $w\leftarrow w-\alpha·\delta_t·\frac{\partial Q(s_t,a_t;w)}{\partial w}$
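A minimal PyTorch sketch of one Q-Learning step on a DQN, mirroring the Sarsa network sketch above; the network shape, learning rate, and names are illustrative assumptions, and no replay buffer or target network is shown:

```python
import torch
import torch.nn as nn

# Q(s, a; w): maps a state vector to one value per discrete action.
dqn = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(dqn.parameters(), lr=1e-3)
gamma = 0.9

def q_learning_step(s_t, a_t, r_t, s_next):
    """One gradient-descent step on the loss delta_t^2 / 2."""
    with torch.no_grad():                       # TD target is a constant w.r.t. w
        y_t = r_t + gamma * dqn(s_next).max()   # max_a Q(s_{t+1}, a; w)
    q_t = dqn(s_t)[a_t]                         # Q(s_t, a_t; w)
    loss = 0.5 * (q_t - y_t) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```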
3.Multi-Step TD Target
3.0
The algorithms above use only a single step of reward for training; using multiple steps of reward can give better results.
3.1.Multi-Step Return
$$U_t=R_t+\gamma·U_{t+1}$$
Recursively expanding the equation above gives:
$$\begin{aligned} U_t&=R_t+\gamma·(R_{t+1}+\gamma·U_{t+2}) \\&=R_t+\gamma·R_{t+1}+\gamma^2·U_{t+2} \end{aligned}$$
Continuing the recursion:
$$U_t=\sum_{i=0}^{m-1}\gamma^i·R_{t+i}+\gamma^m·U_{t+m}$$
3.2.Multi-Step TD Target
- m-step TD target for Sarsa: $y_t=\sum_{i=0}^{m-1}\gamma^i·r_{t+i}+\gamma^m·Q_\pi(s_{t+m},a_{t+m})$
- m-step TD target for Q-Learning: $y_t=\sum_{i=0}^{m-1}\gamma^i·r_{t+i}+\gamma^m·\max_a Q^*(s_{t+m},a)$
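A minimal sketch of computing the m-step TD target from a list of observed rewards and a bootstrap value; whether the bootstrap value is $Q_\pi(s_{t+m},a_{t+m})$ (Sarsa) or $\max_a Q^*(s_{t+m},a)$ (Q-Learning) is supplied by the caller, and all names are illustrative:

```python
def multi_step_td_target(rewards, bootstrap_value, gamma=0.9):
    """m-step TD target: sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * bootstrap_value.

    rewards: [r_t, r_{t+1}, ..., r_{t+m-1}]
    bootstrap_value: Q_pi(s_{t+m}, a_{t+m}) for Sarsa, or max_a Q*(s_{t+m}, a) for Q-Learning.
    """
    m = len(rewards)
    y_t = sum(gamma ** i * r for i, r in enumerate(rewards))
    return y_t + gamma ** m * bootstrap_value
```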