From Markov chains to Markov decision processes (MDP):
$\mathcal{M}=\{\mathcal{S},\mathcal{A},\mathcal{T},r\}$
$\mathcal{T}$ is now a rank-3 tensor:
$\mathcal{T}_{i,j,k}=p(s_{t+1}=i\mid s_t=j,a_t=k)$
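As a quick illustration (not part of the original notes), such a transition tensor can be stored as a 3-D numpy array; the sizes and names below (`n_states`, `n_actions`, `T`) are made up for this sketch.

```python
import numpy as np

# Hypothetical tabular MDP with 3 states and 2 actions.
n_states, n_actions = 3, 2

# T[i, j, k] = p(s_{t+1} = i | s_t = j, a_t = k)
T = np.random.rand(n_states, n_states, n_actions)
T /= T.sum(axis=0, keepdims=True)   # normalize over next states i

# Every (current state j, action k) pair must give a valid distribution.
assert np.allclose(T.sum(axis=0), 1.0)

# Probability of landing in state 2 after taking action 1 in state 0:
print(T[2, 0, 1])
```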
$r$ is the reward function, $r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$.
Partially observed Markov decision process (POMDP):
$\mathcal{M}=\{\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},\mathcal{E},r\}$
$\mathcal{E}$ is the emission probability $p(o_t\mid s_t)$.
Since $a_t$ is derived jointly from $s_t$ and $\pi_{\theta}$:
$p((s_{t+1},a_{t+1})\mid s_{t},a_{t})=p(s_{t+1}\mid s_t,a_t)\,\pi_{\theta}(a_{t+1}\mid s_{t+1})$
Stationary means the same distribution before and after a transition: $\mu=\mathcal{T}\mu$, so $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue 1!
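A minimal numpy sketch of that observation, assuming the policy-induced transition operator has already been flattened into a column-stochastic matrix `P` (the 2-state numbers are made up):

```python
import numpy as np

# Hypothetical column-stochastic transition matrix of the (s, a) chain
# under a fixed policy: P[i, j] = p(next = i | current = j).
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])

eigvals, eigvecs = np.linalg.eig(P)
idx = np.argmin(np.abs(eigvals - 1.0))   # eigenvector with eigenvalue 1
mu = np.real(eigvecs[:, idx])
mu /= mu.sum()                           # normalize into a distribution

print(mu)                                # stationary distribution
assert np.allclose(P @ mu, mu)           # mu = T mu
```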
For the stationary distribution, $\mu=p_{\theta}(s,a)$.
Taking an expectation turns the discontinuous reward into a continuous quantity, which makes it suitable for gradient-based algorithms such as gradient descent.
Value Functions
Q-function
$Q^{\pi}(s_t,a_t)=\sum^T_{t'=t}E_{\pi_{\theta}}[r(s_{t'},a_{t'})\mid s_t,a_t]$: the total reward obtained by executing action $a_t$ in state $s_t$ (and then following $\pi$).
value function
$V^{\pi}(s_t)=\sum^T_{t'=t}E_{\pi_{\theta}}[r(s_{t'},a_{t'})\mid s_t]$: the total reward from state $s_t$ onward (note the argument, and what the expectation is taken over).
So $V^{\pi}(s_t)=E_{a_t\sim\pi(a_t\mid s_t)}[Q^{\pi}(s_{t},a_{t})]$. What if we take one more expectation, over the state?
$E_{s_1\sim p(s_1)}[V^{\pi}(s_1)]$ is the RL objective (taking the expectation over all possible starting states is exactly the goal described above).
Using $Q^\pi$ and $V^\pi$
- if we have policy $\pi$, and we know $Q^\pi(s,a)$, then we can improve $\pi$:
  - set $\pi'(a|s)=1$ if $a=\arg\max_aQ^\pi(s,a)$ (see the sketch after this list)
  - this policy is better than $\pi$
- compute the gradient to increase the probability of good actions $a$:
  - if $Q^\pi(s,a)>V^\pi(s)$, then $a$ is better than average, because $V^\pi$ averages the rewards of the actions taken under the policy
  - modify $\pi(a|s)$ to increase the probability of $a$ if $Q^\pi(s,a)>V^\pi(s)$
Both the Q-value and the V-value are used to evaluate how good a policy is.
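A toy numpy sketch of the two ideas in the list above (the Q-table values and names are made up): the argmax step is the greedy improvement, and the `advantage` check is the $Q^\pi>V^\pi$ test.

```python
import numpy as np

# Hypothetical tabular Q^pi with 3 states and 2 actions: Q[s, a].
Q = np.array([[1.0, 2.0],
              [0.5, 0.1],
              [3.0, 3.5]])

# Idea 1: greedy improvement, pi'(a|s) = 1 for a = argmax_a Q^pi(s, a).
greedy_actions = Q.argmax(axis=1)
pi_new = np.zeros_like(Q)
pi_new[np.arange(Q.shape[0]), greedy_actions] = 1.0

# Idea 2: compare against V^pi(s); here the old policy is assumed uniform,
# so V^pi is just the mean of Q over actions.
V = Q.mean(axis=1)
advantage = Q - V[:, None]   # positive entries: better than average
print(pi_new)
print(advantage)
```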
Types of RL algorithms
Our goal:
$\theta^*=\arg\max_\theta E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$
[!NOTE] Types of RL algorithms
- Policy gradients: directly compute the gradient of the objective with respect to $\theta$, then do gradient descent/ascent
- Value-based: fit (with a neural network) the value function or Q-function of the optimal policy (no explicit policy)
- Actor-critic: a combination of the two; fit a value/Q-function, then use it to update the policy
- Model-based: estimate the transition model $\mathcal{T}$
  - use it for planning (no explicit policy)
  - use it to update the policy
Examples of algorithms
- Value function fitting
  - Q-learning, DQN
  - Temporal difference learning
  - Fitted value iteration
- Policy gradient methods
  - REINFORCE
  - Natural policy gradient
  - Trust region policy optimization
- Actor-critic
  - A3C
  - SAC
  - DDPG
- Model-based RL
  - Dyna
  - Guided policy search
Policy gradient
Let's discuss this with reference to the slide:
First, $J(\theta)$ is the trajectory reward of a particular policy, i.e. the sum of all rewards collected by running that policy over an entire episode.
Since the ultimate goal is to find the best policy, the one that maximizes the expectation of $J(\theta)$, it is exactly our objective function, so we need to differentiate $J(\theta)$.
The integral and the gradient can be swapped. Evaluating the expectation would require the distribution $p_\theta(\tau)$, which we do not know. So we use the identity in the blue box: the gradient of $p_\theta(\tau)$ is rewritten in terms of the gradient of $\log p_\theta(\tau)$, which gives the following:
The terms that do not depend on the policy parameters $\theta$ vanish after differentiation, and the log turns the product in the expansion of $p_\theta(\tau)$ into a sum. This yields the last line, in which everything that remains is known ($\pi$ and $r$).
Since $J(\theta)$ is an expectation, its gradient can be estimated by sampling several rollouts and averaging:
$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^Tr(s_{i,t},a_{i,t})\right)$
Now let's match this against the colors of the three boxes:
What is $\pi_\theta(a_t|s_t)$ here? It can be the probability output by a neural network (discrete actions), or a probability distribution over actions (continuous actions):
$\pi_\theta(a_t|s_t)=\mathcal{N}(f_{\text{neural network}}(s_t);\Sigma)$ (Gaussian example)
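For the continuous case, a hedged sketch of such a Gaussian policy using `torch.distributions` (the mean network `f` and the fixed `sigma` are assumptions made for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical mean network f_theta(s_t) for a 1-D continuous action.
f = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 1))
sigma = torch.tensor(0.5)   # fixed standard deviation for the sketch

def log_pi(state, action):
    """log pi_theta(a_t | s_t) = log N(a_t; f(s_t), sigma^2)."""
    dist = torch.distributions.Normal(loc=f(state), scale=sigma)
    return dist.log_prob(action).sum(dim=-1)   # sum over action dimensions

state, action = torch.randn(3), torch.randn(1)
print(log_pi(state, action))   # differentiable w.r.t. f's parameters
```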
Simplifying the expression above slightly:
$$\begin{aligned}\nabla_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^Tr(s_{i,t},a_{i,t})\right)\\&=\frac{1}{N}\sum_{i=1}^N\nabla_\theta \log\pi_\theta(\tau_i)\,r(\tau_i)\end{aligned}$$
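A minimal PyTorch sketch of this estimator for a discrete-action policy; the network, the dimensions, and the `trajectories` format are assumptions for illustration, not a prescribed implementation:

```python
import torch
import torch.nn as nn

# Hypothetical policy network for a discrete action space (4-D states, 2 actions).
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def pg_surrogate_loss(trajectories):
    """trajectories: list of (states [T,4], actions [T], rewards [T]) tensors."""
    losses = []
    for states, actions, rewards in trajectories:
        log_probs = torch.log_softmax(policy(states), dim=-1)
        log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t|s_t)
        total_reward = rewards.sum()                   # sum_t r(s_t, a_t), a constant
        losses.append(-(log_pi.sum() * total_reward))  # minus sign: we maximize J
    return torch.stack(losses).mean()                  # average over the N rollouts

# Calling .backward() on this loss gives the Monte Carlo estimate of -grad_theta J(theta).
```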
Taking causality into account (the current decision cannot affect past rewards), the term in the green box can be rewritten:
$\hat{Q}^\pi(x_t,u_t)=\sum_{t'=t}^Tr(x_{t'},u_{t'})$
Writing the trajectory $\tau$ out and using the causality-aware $\hat{Q}$:
$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\,\hat{Q}^\pi_{i,t}$
$\hat{Q}^\pi_{i,t}$ is an estimate of the total reward obtained if you take action $a_{i,t}$ in state $s_{i,t}$ and then follow policy $\pi$ for the rest of the trajectory.
The key point is the lower limit of the inner sum (it starts at $t'=t$); the biggest benefit of this is that it reduces the variance caused by having only finitely many samples.
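A small sketch of computing the reward-to-go $\hat{Q}_{i,t}$ for one sampled trajectory (the helper name is made up):

```python
import numpy as np

def reward_to_go(rewards):
    """Q_hat[t] = sum_{t'=t}^{T} r_{t'} for a single trajectory."""
    return np.cumsum(rewards[::-1])[::-1]   # reverse, cumulative sum, reverse back

rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(reward_to_go(rewards))   # [4., 3., 3., 1.]
```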
In essence, the policy gradient is maximum likelihood estimation weighted by $\hat{Q}^\pi(x_t,u_t)$:
$\tilde{J}(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \log\pi_\theta(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t}$
There are many interesting mathematical tricks here, including adding constraints to the gradient step; see L5, the last part of the policy gradient lecture, on Trust Region Policy Optimization.
Actor-critic
Let's add a baseline. Continuing from the above: since $\hat{Q}$ is only a sampled approximation of an expectation, suppose we had the ideal expected reward:
$Q(s_t,a_t)=\sum_{t'=t}^TE_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_t,a_t]$
Then consider, in a particular state $s_t$, the expected reward averaged over all actions. That is exactly the definition of the value function (again, note the argument and what the expectation is taken over):
$V(s_t)=E_{a_t\sim\pi_\theta(a_t|s_t)}[Q(s_t,a_t)]$
This can serve as our baseline: it represents the average return over actions (replacing the original constant $b$).
[!NOTE] The difference between actor-critic and policy gradient
So the earlier $\sum_{t=1}^Tr(s_{i,t},a_{i,t})$ was really a Monte Carlo estimate: unbiased, but high variance. We trade a little bias for a large reduction in variance by directly fitting $Q^\pi$, $V^\pi$, or $A^\pi$. Because these are expectations by definition, even a fitted approximation of them is better than the sampled $r(\cdot)$.
The first term is actually deterministic rather than a random variable, because it is the current state and action.
$Q^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})$. A small approximation is made here: the transition from $t$ to $t+1$ is approximated with a single sample, since in principle the state transition should also be averaged over in expectation. From this we immediately get:
$A^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_{t})$
$A^\pi(s_t,a_t)$ measures how much better action $a_t$ is than the average action produced by policy $\pi$: the advantage.
Both $A$ and $Q$ depend on two variables (state and action), while $V$ depends only on the state, so it is easier to fit; from here on we use a critic based on the V-function, as sketched below.
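A hedged numpy sketch of this one-sample advantage estimate, assuming we already have a fitted value function evaluated along one trajectory (the numbers are made up):

```python
import numpy as np

# Hypothetical fitted values V_hat(s_t) along a trajectory of length 4,
# plus V_hat(s_5) for the final bootstrap (set to 0 if terminal).
V_hat = np.array([2.0, 1.5, 1.0, 0.5, 0.0])
rewards = np.array([1.0, 0.5, 0.5, 0.5])

# A^pi(s_t, a_t) ~= r(s_t, a_t) + V^pi(s_{t+1}) - V^pi(s_t)
advantages = rewards + V_hat[1:] - V_hat[:-1]
print(advantages)   # positive entries: a_t was better than the policy's average
```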
Estimate V
$V^\pi(s_t)\approx\sum_{t'=t}^T r(s_{t'},a_{t'})$
not as good as:
$V^\pi(s_t)\approx\frac{1}{N}\sum_{n=1}^N\sum_{t'=t}^T r(s_{t'},a_{t'})$, but still pretty good
So, our training data:
$\left\{\left(s_{i,t},\;\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)\right\}$, where the second element is the target $y_{i,t}$
supervised learning:
$\mathcal{L}(\phi)=\frac{1}{2}\sum_i\left\|\hat{V}^\pi_\phi(s_i)-y_i\right\|^2$
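A minimal PyTorch sketch of this supervised regression; the network size, optimizer, and the random placeholder data are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical value network V_hat_phi: state -> scalar.
V_hat = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(V_hat.parameters(), lr=1e-3)

# states: visited s_{i,t}; targets: the Monte Carlo returns y_{i,t}.
states = torch.randn(128, 4)    # placeholder data
targets = torch.randn(128)      # placeholder targets

for _ in range(100):
    pred = V_hat(states).squeeze(-1)             # V_hat_phi(s_i)
    loss = 0.5 * ((pred - targets) ** 2).sum()   # L(phi) from above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```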
Algorithms with discount
The method above requires two neural networks to fit two functions: $s\rightarrow\hat{V}^\pi_\phi(s)$ and $s\rightarrow\pi_\theta(a|s)$.
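A hedged end-to-end sketch of how the two networks could be combined in one batch actor-critic update (discount omitted; all names, sizes, and the data format are assumptions, not the notes' prescribed implementation):

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))  # s -> pi_theta(a|s)
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))        # s -> V_hat_phi(s)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
v_opt = torch.optim.Adam(value_fn.parameters(), lr=1e-3)

def actor_critic_update(states, actions, rewards, next_states, returns):
    # 1. Fit the critic to the Monte Carlo returns (the regression above).
    v_loss = 0.5 * ((value_fn(states).squeeze(-1) - returns) ** 2).mean()
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # 2. Advantage estimate: A ~= r + V(s') - V(s).
    with torch.no_grad():
        adv = rewards + value_fn(next_states).squeeze(-1) - value_fn(states).squeeze(-1)

    # 3. Policy gradient step weighted by the advantage.
    log_probs = torch.log_softmax(policy(states), dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pi_loss = -(log_pi * adv).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```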