Problem
- The reinforcement learning setting is a quadruple $E=\langle X,A,P,R\rangle$, where $x\in X$ is a state, $a\in A$ is an action, $P:X\times A\times X\rightarrow \Bbb R$ is the state-transition probability, and $R:X\times A\times X\rightarrow \Bbb R$ is the reward function.
- $\pi$ is a policy; $\pi(x,a)$ denotes the probability that policy $\pi$ selects action $a$ in state $x$, with $\sum_a\pi(x,a)=1$.
- The reinforcement learning task is to learn a policy $\pi$, from which the action to execute is $a=\pi(x)$. The learning objective is to maximize the cumulative reward. Common choices are the $T$-step cumulative reward $\Bbb E(\frac{1}{T}\sum_{t=1}^{T}r_t)$ and the $\gamma$-discounted cumulative reward $\Bbb E(\sum_{t=0}^{\infty}\gamma^t r_{t+1})$, where $r_t$ is the reward received at step $t$.
Multi-Armed Bandit
$\epsilon$-Greedy Algorithm
Input: number of arms $K$, reward function $R$, number of trials $T$, exploration probability $\epsilon$
Process:
$r=0,\ Q(i)=0,\ cnt(i)=0$
for $t=1,2,\cdots,T$:
$\quad$ if $rand()<\epsilon$:
$\qquad$ $k=randint(1,K)$ # explore only
$\quad$ else:
$\qquad$ $k=\arg\max_i Q(i)$ # exploit only
$\quad$ $v=R(k)$
$\quad$ $r=r+v$
$\quad$ $Q(k)=\frac{Q(k)\cdot cnt(k)+v}{cnt(k)+1}$
$\quad$ $cnt(k)=cnt(k)+1$
Output: $r$
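As a concrete sketch, here is a minimal Python implementation of the pseudocode above; the three Bernoulli arms with made-up success probabilities stand in for the unknown reward function $R$:

```python
import random

def epsilon_greedy(K, R, T, eps):
    """Epsilon-greedy bandit: explore a random arm with probability eps,
    otherwise pull the arm with the highest estimated value."""
    r = 0.0
    Q = [0.0] * K        # running mean reward per arm
    cnt = [0] * K        # pull count per arm
    for _ in range(T):
        if random.random() < eps:
            k = random.randrange(K)                    # explore only
        else:
            k = max(range(K), key=lambda i: Q[i])      # exploit only
        v = R(k)
        r += v
        Q[k] = (Q[k] * cnt[k] + v) / (cnt[k] + 1)      # incremental mean
        cnt[k] += 1
    return r

# Example: three Bernoulli arms with hypothetical success probabilities.
probs = [0.2, 0.5, 0.8]
print(epsilon_greedy(K=3, R=lambda k: float(random.random() < probs[k]),
                     T=10000, eps=0.1))
```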
As time goes on the policy gets better and better, so the exploration probability $\epsilon$ can decrease over time, e.g. $\epsilon=\frac{1}{\sqrt{t}}$. Alternatively, one can sample arms with probabilities derived directly from $Q(k)$, i.e. the softmax algorithm, where arm $k$ is drawn with probability:
$$P(k)=\frac{e^{Q(k)/\tau}}{\sum_{i=1}^K e^{Q(i)/\tau}}$$
Softmax Algorithm
Input: number of arms $K$, reward function $R$, number of trials $T$, temperature $\tau$
Process:
$r=0,\ Q(i)=0,\ cnt(i)=0$
for $t=1,2,\cdots,T$:
$\quad$ sample $k$ according to $P(k)$
$\quad$ $v=R(k)$
$\quad$ $r=r+v$
$\quad$ $Q(k)=\frac{Q(k)\cdot cnt(k)+v}{cnt(k)+1}$
$\quad$ $cnt(k)=cnt(k)+1$
Output: $r$
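Only the action selection changes relative to $\epsilon$-greedy; a minimal sketch of the softmax draw over a list of value estimates `Q`:

```python
import math
import random

def softmax_pick(Q, tau):
    """Sample an arm index with probability proportional to exp(Q[i]/tau)."""
    m = max(Q)  # subtract the max for numerical stability
    w = [math.exp((q - m) / tau) for q in Q]
    return random.choices(range(len(Q)), weights=w)[0]

print(softmax_pick([0.2, 0.5, 0.8], tau=0.1))
```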
Model-Based Learning
When the quadruple $E=\langle X,A,P,R\rangle$ is known, the setting is model-based learning. State-value functions:
$$\left\{\begin{array}{l} V_T^\pi(x)=\Bbb E_\pi[\frac{1}{T}\sum_{t=1}^T r_t\mid x_0=x]\\ V_\gamma^\pi(x)=\Bbb E_\pi[\sum_{t=0}^\infty \gamma^t r_{t+1}\mid x_0=x] \end{array}\right.$$
State-action value functions:
$$\left\{\begin{array}{l} Q_T^\pi(x,a)=\Bbb E_\pi[\frac{1}{T}\sum_{t=1}^T r_t\mid x_0=x,a_0=a]\\ Q_\gamma^\pi(x,a)=\Bbb E_\pi[\sum_{t=0}^\infty \gamma^t r_{t+1}\mid x_0=x,a_0=a] \end{array}\right.$$
Bellman equations (for the $\gamma$-discounted case):
$$\left\{\begin{array}{l} Q(x,a)=\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma V(x'))\\ V(x)=\sum_a\pi(x,a)Q(x,a) \end{array}\right.$$
Policy Evaluation Algorithm
Input: $E=\langle X,A,P,R\rangle$; the policy $\pi$ to be evaluated
Process:
$\forall x,\ V(x)=0$
for $t=1,2,\cdots$:
$\quad$ $\forall x,\ V'(x)=\sum_a\pi(x,a)\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma V(x'))$
$\quad$ if $\max_x|V'(x)-V(x)|<thr$:
$\qquad$ break
$\quad$ $V=V'$
Output: state-value function $V$
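A minimal NumPy sketch of the iteration above, assuming tabular model arrays `P[x, a, x2]` and `Rw[x, a, x2]` (the names and layout are illustrative assumptions):

```python
import numpy as np

def policy_evaluation(P, Rw, pi, gamma=0.9, thr=1e-6):
    """Bellman expectation backups until V stops changing.
    P[x, a, x2]: transition probability; Rw[x, a, x2]: reward; pi[x, a]: policy."""
    V = np.zeros(P.shape[0])
    while True:
        # V'(x) = sum_a pi(x,a) sum_x' P * (R + gamma * V(x'))
        V2 = np.einsum('xa,xay,xay->x', pi, P, Rw + gamma * V[None, None, :])
        if np.max(np.abs(V2 - V)) < thr:
            return V2
        V = V2
```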
Optimal Bellman equations (for the $\gamma$-discounted case):
$$\left\{\begin{array}{l} V(x)=\max_a Q(x,a)\\ Q(x,a)=\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma \max_{a'}Q(x',a')) \end{array}\right.$$
Policy improvement:
$$\pi'(x)=\arg\max_a Q(x,a)$$
Policy Iteration Algorithm
Input: $E=\langle X,A,P,R\rangle$
Process:
$\forall x,\ V(x)=0,\ \pi(x,a)=1/|A|$
while True:
$\quad$ for $t=1,2,\cdots$:
$\qquad$ $\forall x,\ V'(x)=\sum_a\pi(x,a)\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma V(x'))$
$\qquad$ if $\max_x|V'(x)-V(x)|<thr$:
$\quad\qquad$ break
$\qquad$ $V=V'$
$\quad$ $\forall x,\ \pi'(x)=\arg\max_a Q(x,a)$ # compute $Q$ via the Bellman equation
$\quad$ if $\pi'(x)=\pi(x)\ \forall x$:
$\qquad$ break
$\quad$ $\pi=\pi'$
Output: optimal policy $\pi$
In policy iteration the policy is updated too slowly; the policy update can instead be merged with the value-function iteration:
Value Iteration Algorithm
Input: $E=\langle X,A,P,R\rangle$
Process:
$\forall x,\ V(x)=0$
for $t=1,2,\cdots$:
$\quad$ $\forall x,\ V'(x)=\max_a\sum_{x'}P_{x\to x'}^a(R_{x\to x'}^a+\gamma V(x'))$
$\quad$ if $\max_x|V'(x)-V(x)|<thr$:
$\qquad$ break
$\quad$ $V=V'$
Output: optimal policy $\pi(x)=\arg\max_a Q(x,a)$
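A matching NumPy sketch of value iteration, with the same assumed table layout as the policy-evaluation sketch above:

```python
import numpy as np

def value_iteration(P, Rw, gamma=0.9, thr=1e-6):
    """Bellman optimality backups; same assumed table layout as above."""
    V = np.zeros(P.shape[0])
    while True:
        Q = np.einsum('xay,xay->xa', P, Rw + gamma * V[None, None, :])
        V2 = Q.max(axis=1)
        if np.max(np.abs(V2 - V)) < thr:
            return Q.argmax(axis=1)   # greedy policy from the converged Q
        V = V2
```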
Model-Free Learning
In practice, $P$ and $R$ are hard to know, and even the number of states may be unknown. Learning algorithms that do not rely on a model of the environment are called model-free. With the model unknown, we start from the initial state and sample with some policy, obtaining a trajectory: $\langle x_0,a_0,r_1,x_1,a_1,r_2,\cdots,x_{T-1},a_{T-1},r_T,x_T\rangle$
Monte Carlo Reinforcement Learning
On-Policy Monte Carlo Algorithm
Input: $A,x_0,T$
Process:
$Q(x,a)=0,\ cnt(x,a)=0,\ \pi(x,a)=\frac{1}{|A|}$
for $s=1,2,\cdots$:
$\quad$ execute policy $\pi^\epsilon$, obtaining a trajectory $\langle x_0,a_0,r_1,x_1,a_1,r_2,\cdots,x_{T-1},a_{T-1},r_T,x_T\rangle$
$\quad$ for $t=0,\cdots,T-1$:
$\qquad$ $R=\frac{1}{T-t}\sum_{i=t+1}^{T}r_i$
$\qquad$ $Q(x_t,a_t)=\frac{Q(x_t,a_t)\cdot cnt(x_t,a_t)+R}{cnt(x_t,a_t)+1}$
$\qquad$ $cnt(x_t,a_t)=cnt(x_t,a_t)+1$
$\quad$ $\pi^\epsilon(x)=\left\{\begin{array}{ll} \arg\max_a Q(x,a) & \text{with probability } 1-\epsilon\\ \text{uniformly random } a\in A & \text{with probability } \epsilon \end{array}\right.$
Output: policy $\pi^\epsilon$
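A sketch of the loop above, assuming a hypothetical environment object with `reset() -> x` and `step(a) -> (r, x')` methods (this interface is an assumption, not from the text):

```python
import random
from collections import defaultdict

def mc_on_policy(env, actions, episodes, T, eps=0.1):
    """On-policy MC with an epsilon-greedy behaviour policy.
    env is assumed to expose reset() -> x and step(a) -> (r, x')."""
    Q = defaultdict(float)
    cnt = defaultdict(int)

    def pi_eps(x):
        if random.random() < eps:
            return random.choice(actions)                  # explore
        return max(actions, key=lambda a: Q[(x, a)])       # greedy

    for _ in range(episodes):
        x = env.reset()
        traj = []                                          # (x_t, a_t, r_{t+1})
        for _ in range(T):
            a = pi_eps(x)
            r, x2 = env.step(a)
            traj.append((x, a, r))
            x = x2
        tail = 0.0
        for t in range(T - 1, -1, -1):                     # R = mean of r_{t+1..T}
            xt, at, rt1 = traj[t]
            tail += rt1
            R = tail / (T - t)
            Q[(xt, at)] = (Q[(xt, at)] * cnt[(xt, at)] + R) / (cnt[(xt, at)] + 1)
            cnt[(xt, at)] += 1
    return Q
```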
On-policy Monte Carlo produces an $\epsilon$-greedy policy. What we actually want is to use the $\epsilon$-greedy policy only during evaluation (sampling), while improving the original policy itself in the improvement step.
The expectation of a function $f$ under a distribution $p$:
$$\Bbb E(f)=\int_x f(x)p(x)dx=\int_x f(x)\frac{p(x)}{q(x)}q(x)dx$$
Sampling $(x_1,x_2,\cdots,x_m)$ from $p$, the expectation of $f$ under $p$ can be estimated as:
$$\hat{\Bbb E}(f)=\frac{1}{m}\sum_{i=1}^m f(x_i)$$
Sampling $(x_1',x_2',\cdots,x_m')$ from $q$ instead, the expectation of $f$ under $p$ can still be estimated (importance sampling):
$$\hat{\Bbb E}(f)=\frac{1}{m}\sum_{i=1}^m f(x_i')\frac{p(x_i')}{q(x_i')}$$
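A quick numerical check of this identity: estimating $\Bbb E_p[x^2]$ with $p=N(0,1)$ both directly and by sampling from $q=N(1,2)$ with importance weights (the distributions are made up purely for illustration):

```python
import math
import random

random.seed(0)
f = lambda x: x * x                      # estimate E_p[f] = 1 for p = N(0, 1)
pdf = lambda x, mu, s: math.exp(-(x - mu) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

m = 100000
direct = sum(f(random.gauss(0, 1)) for _ in range(m)) / m
xs = [random.gauss(1, 2) for _ in range(m)]              # sample from q = N(1, 2)
weighted = sum(f(x) * pdf(x, 0, 1) / pdf(x, 1, 2) for x in xs) / m
print(direct, weighted)                  # both estimates should be close to 1.0
```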
By the same reasoning, we can sample with $\pi^\epsilon$ to estimate the expectation of $Q$ under $\pi$:
$$Q(x,a)=\frac{1}{m}\sum_{i=1}^m R_i\frac{P_i^\pi}{P_i^{\pi^\epsilon}}$$
where
$$P^\pi=\prod_{i=0}^{T-1}\pi(x_i,a_i)P_{x_i\to x_{i+1}}^{a_i},\qquad P^{\pi^\epsilon}=\prod_{i=0}^{T-1}\pi^\epsilon(x_i,a_i)P_{x_i\to x_{i+1}}^{a_i}$$
so:
$$\frac{P^\pi}{P^{\pi^\epsilon}}=\prod_{i=0}^{T-1}\frac{\pi(x_i,a_i)}{\pi^\epsilon(x_i,a_i)}$$
Here $\pi(x_i,a_i)=\Bbb I(a_i=\pi(x_i))$ and $\pi^\epsilon(x_i,a_i)=\left\{\begin{array}{ll} 1-\epsilon+\frac{\epsilon}{|A|} & a_i=\pi(x_i)\\ \frac{\epsilon}{|A|} & a_i\ne\pi(x_i) \end{array}\right.$, so the product of ratios very easily becomes 0 (any exploratory action makes $\pi(x_i,a_i)=0$). The off-policy Monte Carlo algorithm below is therefore for reference only; it cannot actually be computed this way in practice.
Off-Policy Monte Carlo Algorithm
Input: $A,x_0,T$
Process:
$Q(x,a)=0,\ cnt(x,a)=0,\ \pi(x,a)=\frac{1}{|A|}$
for $s=1,2,\cdots$:
$\quad$ execute policy $\pi^\epsilon$, obtaining a trajectory $\langle x_0,a_0,r_1,x_1,a_1,r_2,\cdots,x_{T-1},a_{T-1},r_T,x_T\rangle$
$\quad$ for $t=0,\cdots,T-1$:
$\qquad$ $R=\left(\frac{1}{T-t}\sum_{i=t+1}^{T}r_i\right)\left(\prod_{i=t}^{T-1}\frac{\pi(x_i,a_i)}{\pi^\epsilon(x_i,a_i)}\right)$
$\qquad$ $Q(x_t,a_t)=\frac{Q(x_t,a_t)\cdot cnt(x_t,a_t)+R}{cnt(x_t,a_t)+1}$
$\qquad$ $cnt(x_t,a_t)=cnt(x_t,a_t)+1$
$\quad$ $\pi(x)=\arg\max_a Q(x,a)$
Output: policy $\pi$
Temporal-Difference Learning
Monte Carlo algorithms do not exploit the MDP structure and are rather inefficient. Temporal-difference (TD) learning combines ideas from dynamic programming and Monte Carlo and is more efficient. The Monte Carlo update of $Q$ can be rewritten as:
$$Q(x,a)=Q(x,a)+\frac{1}{c+1}(R-Q(x,a))=Q(x,a)+\alpha_c(R-Q(x,a))$$
We can set $\alpha_c=\alpha$ and sample $\langle x,a,r,x',a'\rangle$, which gives:
$$Q(x,a)=Q(x,a)+\alpha(r+\gamma Q(x',a')-Q(x,a))$$
Sarsa (On-Policy) Algorithm
Input: $A,x_0,\gamma,\alpha$
Process:
$Q(x,a)=0,\ \pi(x,a)=\frac{1}{|A|},\ x=x_0,\ a=\pi(x)$
for $t=1,2,\cdots$:
$\quad$ execute $a \Rightarrow r,x'$
$\quad$ $a'=\pi^\epsilon(x')$
$\quad$ $Q(x,a)=Q(x,a)+\alpha(r+\gamma Q(x',a')-Q(x,a))$
$\quad$ $\pi(x)=\arg\max_{a''}Q(x,a'')$
$\quad$ $x=x',\ a=a'$
Output: policy $\pi$
Q-Learning (Off-Policy) Algorithm
Input: $A,x_0,\gamma,\alpha$
Process:
$Q(x,a)=0,\ \pi(x,a)=\frac{1}{|A|},\ x=x_0,\ a=\pi(x)$
for $t=1,2,\cdots$:
$\quad$ execute $a \Rightarrow r,x'$
$\quad$ $a'=\pi(x')$
$\quad$ $Q(x,a)=Q(x,a)+\alpha(r+\gamma Q(x',a')-Q(x,a))$
$\quad$ $\pi(x)=\arg\max_{a''}Q(x,a'')$
$\quad$ $x=x',\ a=\pi^\epsilon(x')$
Output: policy $\pi$
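A tabular sketch covering both updates, using the same hypothetical `reset()`/`step()` environment interface as in the Monte Carlo sketch; swapping the bootstrap action as noted in the comments turns Q-learning into Sarsa:

```python
import random
from collections import defaultdict

def q_learning(env, actions, steps, gamma=0.9, alpha=0.1, eps=0.1):
    """Tabular Q-learning; env is assumed to expose reset() and step(a) -> (r, x')."""
    Q = defaultdict(float)
    greedy = lambda x: max(actions, key=lambda a: Q[(x, a)])
    x = env.reset()
    a = random.choice(actions)
    for _ in range(steps):
        r, x2 = env.step(a)
        a2 = greedy(x2)        # Q-learning bootstraps with the greedy action;
                               # Sarsa would bootstrap with the action taken below
        Q[(x, a)] += alpha * (r + gamma * Q[(x2, a2)] - Q[(x, a)])
        x = x2
        a = a2 if random.random() > eps else random.choice(actions)   # pi^eps
    return Q
```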
Policy Gradient
The algorithms below are deep RL (PPT). The actor network computes $a=\pi_\theta(x)$, where $a$ is viewed as a probability-distribution vector over actions. After interacting with the environment for one trajectory, we obtain the training data:
$$\{\{x_t,a_t\},A_t\mid t=0,\cdots,T-1\}$$
where $A_t=\sum_{i=t}^{T-1}\gamma^{i-t}r_{i+1}-b$, i.e. the cumulative reward is used as the weight of each sample.
Viewing $a_t$ in one-hot form, with cross-entropy $e_t=CE(\pi_\theta(x_t),a_t)$, the loss is:
$$L=\sum_{t=0}^{T-1}A_t e_t$$
Taking the gradient:
$$\nabla_\theta L=-\sum_{t=0}^{T-1}A_t \nabla_\theta\ln(\pi_\theta(x_t,a_t))$$
Policy Gradient (On-Policy) Algorithm
Process:
initialize $\theta=\theta_0$
for $i=1,2,\cdots,N$:
$\quad$ training data: $\pi_\theta$ interacts with the environment to obtain $\{\{x_t,a_t\},A_t\mid t=0,\cdots,T-1\}$
$\quad$ compute the loss: $L=\sum_{t=0}^{T-1}A_t e_t$
$\quad$ update the parameters: $\theta=\theta-\eta\nabla_\theta L=\theta+\eta\sum_{t=0}^{T-1}A_t \nabla_\theta\ln(\pi_\theta(x_t,a_t))$
Output: network parameters $\theta$
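A sketch of this update in PyTorch, with a toy two-action policy network (the dimensions and names are illustrative):

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # toy dims
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def pg_update(states, actions, advantages):
    """states: (T, 4) float; actions: (T,) long; advantages: (T,) float.
    L = sum_t A_t * CE(pi_theta(x_t), a_t) = -sum_t A_t * log pi_theta(a_t|x_t)."""
    logp = torch.log_softmax(policy(states), dim=-1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(advantages * logp_a).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```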
Proximal Policy Optimization
PPO = the off-policy form of Policy Gradient + a constraint on the parameters
off-policy PG
Let $p_\theta(a|x)=\pi_\theta(x,a)$, where $\theta$ denotes the policy parameters being updated and $\theta'$ the policy parameters used for sampling. Then:
$$\begin{aligned}-\nabla_\theta L&=\Bbb E_{x,a\sim\pi_\theta}[A(x,a)\nabla_\theta \ln(\pi_\theta(x,a))]\\&=\Bbb E_{x,a\sim\pi_{\theta'}}\left[\frac{p_\theta(x,a)}{p_{\theta'}(x,a)}A(x,a)\nabla_\theta \ln(\pi_\theta(x,a))\right]\\&=\Bbb E_{x,a\sim\pi_{\theta'}}\left[\frac{p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a)\nabla_\theta \ln(p_\theta(a|x))\right]\\&=\Bbb E_{x,a\sim\pi_{\theta'}}\left[\frac{\nabla_\theta p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a)\right]\\&=\nabla_\theta J^{\theta'}(\theta)\end{aligned}$$
This yields the objective function to optimize (the step from the joint ratio $\frac{p_\theta(x,a)}{p_{\theta'}(x,a)}$ to the conditional ratio assumes the state distributions under $\theta$ and $\theta'$ are close):
$$J^{\theta'}(\theta)=\Bbb E_{x,a\sim\pi_{\theta'}}\left[\frac{p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a)\right]$$
constraint
During optimization we do not want $\theta$ to drift too far from $\theta'$, so a parameter constraint can be added:
$$J_{PPO}^{\theta'}(\theta)=J^{\theta'}(\theta)-\beta KL(\theta,\theta')$$
Another way to constrain the update is clipping:
$$J_{PPO2}^{\theta'}(\theta)=\Bbb E_{x,a\sim\pi_{\theta'}}\left[\min\left(\frac{p_\theta(a|x)}{p_{\theta'}(a|x)}A(x,a),\ \text{clip}\left(\frac{p_\theta(a|x)}{p_{\theta'}(a|x)},1-\epsilon,1+\epsilon\right)A(x,a)\right)\right]$$
PPO Algorithm
PPO (Off-Policy) Algorithm
Process:
initialize $\theta=\theta_0,\ \theta'=\theta$
for $i=1,2,\cdots,N$:
$\quad$ training data: $\pi_{\theta'}$ interacts with the environment to obtain $\{\{x_t,a_t\},A_t\mid t=0,\cdots,T-1\}$
$\quad$ update the parameters: $\theta=\arg\max_\theta J_{PPO}^{\theta'}(\theta)$
$\quad$ $\theta'=\theta$
Output: network parameters $\theta$
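A sketch of the clipped ($J_{PPO2}$) variant in PyTorch, again with a toy policy network; `old_logp_a` stands for $\ln p_{\theta'}(a|x)$ recorded at sampling time:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # toy actor
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def ppo_clip_update(states, actions, advantages, old_logp_a, clip_eps=0.2, epochs=4):
    """Maximize E[min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)] over the batch."""
    for _ in range(epochs):                    # reuse the off-policy batch several times
        logp = torch.log_softmax(policy(states), dim=-1)
        logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
        ratio = torch.exp(logp_a - old_logp_a)           # p_theta / p_theta'
        obj = torch.min(ratio * advantages,
                        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
        loss = -obj.mean()                               # ascend J by descending -J
        opt.zero_grad()
        loss.backward()
        opt.step()
```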
There is a code implementation here.
Actor-Critic
State-Value Function Estimation
When the state space is discrete, the policy evaluation algorithm above computes $V$. When the state space is continuous, the value function is represented by a network $V^\phi(x)$, which can be fit with either Monte Carlo (MC) or temporal difference (TD):
- MC: sample training data $\{x_t,R_t\mid t=0,\cdots,T-1\}$ with targets $V^\phi(x_t)=R_t=\sum_{i=t}^{T-1}\gamma^{i-t}r_{i+1}$, and train the $V^\phi(x)$ network on it
- TD: train with the target $V^\phi(x_t)=r_{t+1}+\gamma V^\phi(x_{t+1})$

MC has higher variance but higher accuracy; TD has lower variance but lower accuracy.
Actor-Critic
Replace $b$ in $A_t$ with $V^\phi(x_t)$: $A_t$ then measures how much more reward the current action earns than the average. If it is above average, the action should be encouraged; if it is below average, the action should not be taken. In TD, the first term of $A_t$ (the cumulative reward) can be approximated by $r_{t+1}+\gamma V^\phi(x_{t+1})$, giving:
$$A_t=r_{t+1}+\gamma V^\phi(x_{t+1})-V^\phi(x_t)$$
$\pi^\theta$ is the actor, with the loss as above, $L^\theta=A_t e_t$; $V^\phi$ is the critic, with loss $L^\phi=\frac{1}{2}|A_t|^2$.
Actor-Critic Algorithm
Process:
initialize $\theta=\theta_0,\ \phi=\phi_0,\ x=x_0$
for $i=1,2,\cdots$:
$\quad$ choose an action $a\sim\pi^\theta(x)$
$\quad$ obtain the reward and the next state: $r,x'$
$\quad$ $A=r+\gamma V^\phi(x')-V^\phi(x)$
$\quad$ update the parameters: $\phi=\phi-\eta\nabla_\phi L^\phi=\phi+\eta A\nabla_\phi V^\phi(x)$
$\quad$ update the parameters: $\theta=\theta-\eta\nabla_\theta L^\theta=\theta+\eta A\nabla_\theta\ln(\pi^\theta(x,a))$
$\quad$ $x=x'$
Output: network parameters $\theta,\phi$
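A single-transition sketch in PyTorch with toy actor and critic networks (dimensions and names are illustrative):

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # toy dims
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam([*actor.parameters(), *critic.parameters()], lr=1e-3)

def ac_step(x, a, r, x2, gamma=0.99):
    """One transition: x, x2 are (4,) float tensors; a is an int; r is a float."""
    A = r + gamma * critic(x2).detach().squeeze() - critic(x).squeeze()  # TD error
    critic_loss = 0.5 * A.pow(2)                 # L_phi = |A|^2 / 2
    logp_a = torch.log_softmax(actor(x), dim=-1)[a]
    actor_loss = -A.detach() * logp_a            # L_theta = A_t * e_t, A held fixed
    opt.zero_grad()
    (critic_loss + actor_loss).backward()
    opt.step()
```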
There is a code implementation here.
DQN
DQN Algorithm
Process:
initialize the parameters $\theta=\theta_0$ of $Q$ and $\hat\theta=\theta$ of the target network $\hat Q$, and a replay queue $q$
for $i=1,2,\cdots$:
$\quad$ execute $a \Rightarrow r,x'$
$\quad$ $q.append((x,a,r,x'))$
$\quad$ sample a minibatch $\{(x_t,a_t,r_t,x_t')\mid t=1,\cdots,B\}$ from $q$
$\quad$ $y_t=r_t+\gamma \max_a\hat Q(x_t',a)$
$\quad$ $\theta=\theta+\alpha\sum_t\nabla_\theta Q(x_t,a_t)\cdot(y_t-Q(x_t,a_t))$
$\quad$ every $C$ steps, update $\hat\theta=\theta$
$\quad$ $\pi(x)=\arg\max_{a''}Q(x,a'')$
$\quad$ $x=x',\ a=\pi^\epsilon(x')$
Output: network parameters $\theta$
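A sketch of one DQN gradient step in PyTorch, with toy online and target networks; minimizing the squared TD error reproduces the update rule above:

```python
import copy
import torch
import torch.nn as nn

Q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # toy dims
Q_hat = copy.deepcopy(Q_net)               # target network, synced every C steps
opt = torch.optim.Adam(Q_net.parameters(), lr=1e-3)

def dqn_step(x, a, r, x2, gamma=0.99):
    """Minibatch tensors: x, x2: (B, 4) float; a: (B,) long; r: (B,) float."""
    with torch.no_grad():
        y = r + gamma * Q_hat(x2).max(dim=1).values      # TD target from Q_hat
    q_sa = Q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = 0.5 * (y - q_sa).pow(2).mean()      # gradient matches the update rule
    opt.zero_grad()
    loss.backward()
    opt.step()
```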
Tips:
- Double DQN
  - DQN tends to overestimate Q values
  - DDQN changes only one line: $y_t=r_t+\gamma \hat Q(x_t',\arg\max_a Q(x_t',a))$ (see the sketch after this list)
- Dueling DQN
  - only change the network structure
- Prioritized Replay
  - samples in the queue with larger TD error ($y_t-Q(x_t,a_t)$) are selected with higher probability
- Multi-step
  - sample $(x_t,a_t,r_t,\cdots,x_{t+N},a_{t+N})$
  - $Q(x_t,a_t)=\sum_{i=0}^{N-1}\gamma^i r_{t+i}+\gamma^N\hat Q(x_{t+N},a_{t+N})$
- Noisy Net
  - noise on the action (epsilon greedy)
  - noise on the parameters
  - $a=\arg\max_a\tilde Q(x,a)$
- Distributional DQN
- Rainbow
  - combines all of the tips above
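For the Double DQN change referenced in the list, the target computation in the DQN sketch above would become:

```python
def ddqn_target(r, x2, gamma=0.99):
    """Double DQN: the online net picks the action, the target net evaluates it."""
    with torch.no_grad():
        a_star = Q_net(x2).argmax(dim=1, keepdim=True)   # argmax_a Q(x', a)
        return r + gamma * Q_hat(x2).gather(1, a_star).squeeze(1)
```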