lec1
Differences between ML and RL
| ML | RL |
| --- | --- |
| i.i.d. data | data is not i.i.d.; earlier data affects future inputs |
| a definite ground truth is available at training time | only success/failure is known, not a concrete label |
| supervised learning requires humans to provide labels | the "label" can be as coarse as success or fail |
For a long time RL was held back by feature engineering: it was unclear which features best suit the policy or value function. Deep RL addresses this by learning the features end to end.
Some categories of RL-related learning
inverse reinforcement learning: learning reward functions from examples
unsupervised learning: learning from observing the world
meta-learning / transfer learning: learning to learn, i.e. using past experience to learn new tasks faster
current challenges
- humans learn very quickly, but deep RL is usually slow
- humans reuse past knowledge; in RL this is the role of transfer learning
- it is not always clear how to design the reward function
- it is not clear what the role of prediction should be
lec4
markov chain
Definition:
$$M = \{S, T\}$$
where:
- $S$ is the state space
- $T$ is the transition operator. Let $\mu_t$ be a probability vector with $\mu_{t,i} = p(s_t = i)$; since $T_{i,j} = p(s_{t+1} = i \mid s_t = j)$, the state distribution evolves as $\mu_{t+1} = T\mu_t$
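A minimal numpy sketch of this update (the 3-state chain and its numbers below are made-up values, only to show the $\mu_{t+1} = T\mu_t$ recursion):

```python
import numpy as np

# Hypothetical 3-state Markov chain: column j holds p(s_{t+1} = i | s_t = j),
# so every column sums to 1.
T = np.array([
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.3],
    [0.0, 0.1, 0.7],
])

mu = np.array([1.0, 0.0, 0.0])  # mu_0: start in state 0 with probability 1
for _ in range(5):
    mu = T @ mu                  # mu_{t+1} = T mu_t
print(mu)                        # state distribution after 5 steps
```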
markov decision process
$$M = \{S, A, T, r\}$$
where:
- $S$ is the state space
- $T$ is the transition operator
- $A$ is the action space; with actions added, the transition operator becomes a tensor: $T_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$
- $r: S \times A \rightarrow \mathbb{R}$ is the reward function
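A hedged sketch of how these objects could be represented in code, following the indexing above (the random numbers and the uniform policy are illustrative assumptions, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# T[i, j, k] = p(s_{t+1} = i | s_t = j, a_t = k); normalize over the next state i
T = rng.random((n_states, n_states, n_actions))
T /= T.sum(axis=0, keepdims=True)

# r[j, k] = r(s = j, a = k)
r = rng.random((n_states, n_actions))

def rollout(horizon=10):
    """Sample one trajectory under a uniformly random policy; return total reward."""
    s, total = 0, 0.0
    for _ in range(horizon):
        a = rng.integers(n_actions)             # a_t ~ pi(a_t | s_t), here uniform
        total += r[s, a]
        s = rng.choice(n_states, p=T[:, s, a])  # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
    return total

print(rollout())
```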
partially observed markov decision process
Similar to a Markov decision process, but the agent only receives observations rather than the full state:
$$M = \{S, A, O, T, E, r\}$$
where:
- $S$ is the state space
- $T$ is the transition operator
- $A$ is the action space; with actions added, the transition operator becomes a tensor: $T_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$
- $O$ is the observation space
- $r: S \times A \rightarrow \mathbb{R}$ is the reward function
- $E$ is the emission probability, i.e. $p(o_t \mid s_t)$
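To make the emission probability concrete, a small illustrative sketch (the observation count and matrix values are assumptions): the agent receives $o_t \sim p(o_t \mid s_t)$ rather than $s_t$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_obs = 3, 2

# E[o, i] = p(o_t = o | s_t = i): each column is a distribution over observations
E = rng.random((n_obs, n_states))
E /= E.sum(axis=0, keepdims=True)

def observe(s):
    """The agent never sees s_t directly; it only receives o_t ~ p(o_t | s_t)."""
    return rng.choice(n_obs, p=E[:, s])

print(observe(0))
```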
RL’s goal
The reinforcement learning objective is:
$$\theta^* = \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$
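Here $\tau = (s_1, a_1, \dots, s_T, a_T)$ and $p_\theta(\tau)$ is the trajectory distribution induced by the policy $\pi_\theta$; under the Markov assumption it factorizes as

$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$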
Transitions follow a Markov process.
In the finite-horizon case, the objective can be rewritten as:
$$\begin{aligned} \theta^* &= \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] \\ &= \argmax_{\theta} \sum_{t=1}^T E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)] \end{aligned}$$
i.e., by linearity of expectation, the objective becomes a sum of expectations under the marginal distributions of $(s_t, a_t)$.
In the infinite-horizon case, $p(s_t, a_t)$ converges to a stationary distribution, so the objective can be further rewritten as:
$$\begin{aligned} \theta^* &= \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] \\ &= \argmax_{\theta} \sum_{t=1}^T E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)] \\ &= \argmax_{\theta} \frac{1}{T} \sum_{t=1}^T E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)] \rightarrow E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)] \end{aligned}$$
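As a sanity check on the stationary-distribution claim: $\mu = T\mu$ means $\mu$ is the eigenvector of $T$ with eigenvalue 1. A small numpy sketch (the matrix is an assumed example; for simplicity it is the state transition matrix under some fixed policy rather than the full state-action marginal):

```python
import numpy as np

# Assumed example: state transition matrix under a fixed policy (columns sum to 1)
T = np.array([
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.3],
    [0.0, 0.1, 0.7],
])

# The stationary distribution satisfies mu = T mu, i.e. it is the eigenvector
# of T for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu /= mu.sum()
print(mu)      # long-run state distribution
print(T @ mu)  # equals mu up to numerical error
```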
The core of the RL objective is an expectation. Even when the underlying distributions are discrete, the expectation is a smooth function of $\theta$, so gradient-based methods such as gradient descent/ascent can be applied.
algorithms
Most RL algorithms follow roughly the same high-level loop (a toy sketch of the loop appears after this list):
- generate samples: run the current policy to sample trajectories from its trajectory distribution
- fit a model
- improve policy
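A minimal self-contained sketch of this loop on a made-up tabular MDP, using crude Monte Carlo returns for the "fit" step and ε-greedy improvement for the "improve" step; this only illustrates the anatomy of the loop, not any specific algorithm from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 3, 2, 10

# Assumed toy MDP, same conventions as above: T[i, j, k] = p(s'=i | s=j, a=k)
T = rng.random((n_states, n_states, n_actions))
T /= T.sum(axis=0, keepdims=True)
r = rng.random((n_states, n_actions))

policy = np.full((n_states, n_actions), 1.0 / n_actions)  # pi(a|s), start uniform

for _ in range(20):
    # 1. generate samples: roll out the current policy
    trajectories = []
    for _ in range(50):
        s, traj = 0, []
        for _ in range(horizon):
            a = rng.choice(n_actions, p=policy[s])
            traj.append((s, a, r[s, a]))
            s = rng.choice(n_states, p=T[:, s, a])
        trajectories.append(traj)

    # 2. fit / estimate returns: Monte Carlo estimate of Q(s, a)
    q_sum = np.zeros((n_states, n_actions))
    q_cnt = np.full((n_states, n_actions), 1e-6)  # avoid division by zero
    for traj in trajectories:
        rewards = [rew for _, _, rew in traj]
        for t, (s, a, _) in enumerate(traj):
            q_sum[s, a] += sum(rewards[t:])       # reward-to-go from (s_t, a_t)
            q_cnt[s, a] += 1
    q = q_sum / q_cnt

    # 3. improve the policy: epsilon-greedy with respect to the estimated Q
    eps = 0.1
    policy = np.full((n_states, n_actions), eps / n_actions)
    policy[np.arange(n_states), q.argmax(axis=1)] += 1 - eps

print(policy)
```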
types of algorithms
value-based: estimate the value function or Q-function (e.g. Q-learning, DQN)
policy gradients: directly optimize $\theta^* = \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t, a_t)]$ (e.g. REINFORCE, PPO / proximal policy optimization)
actor-critic: a combination of the two (e.g. A3C, SAC)
model-based: learn the transition model, then use it for planning or for improving the policy (e.g. Dyna)
model-based algorithms
Options for using the learned model include:
- use model to plan
- backpropagate gradients into policy
- learn a value function
value-based algorithms
policy-based
actor-critic
trade-offs
Things to consider:
- sample efficiency (off-policy vs. on-policy); stability & ease of use (convergence: many RL algorithms do not come with strict convergence guarantees)
- assumptions: stochastic or deterministic, continuous or discrete, episodic or infinite horizon
- whether the policy or the model is easier to represent/learn
Sample efficiency in detail
Common assumptions in RL: full observability, episodic learning, continuity or smoothness
value functions
Q-function: the expected total reward after taking action $a_t$ in state $s_t$ and following $\pi$ thereafter: $Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi}[r(s_{t'}, a_{t'}) \mid s_t, a_t]$
Value function: the expected total reward from state $s_t$ under $\pi$: $V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi}[r(s_{t'}, a_{t'}) \mid s_t]$
The RL objective can then be written as $E_{s_1 \sim p(s_1)}[V^\pi(s_1)]$.
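The two are related by the standard identities (condition on the action for $V$, and on the next state for $Q$):

$$V^\pi(s_t) = E_{a_t \sim \pi(a_t \mid s_t)}\left[Q^\pi(s_t, a_t)\right], \qquad Q^\pi(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[V^\pi(s_{t+1})\right]$$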
lec5 - policy gradient
See here for details.
lec6 - actor-critic
See here for details.
lec7 - value-based methods
See here for details.
Q & A
What is the relationship between RL and MDPs (Markov decision processes)?
RL is a framework for solving MDP problems.
If a problem can be formulated as an MDP (i.e. transition probabilities and a reward distribution can be specified), then RL is likely a good fit. Conversely, if the problem cannot be formulated as an MDP, RL is not guaranteed to find a useful solution.
A key factor for RL is whether the states have the Markov property (given the present state and all past states, the conditional distribution of the future state depends only on the current state).
Why can the infinite-horizon RL objective be written as a single expectation? / How is the objective derived? See the derivation in lec4 above: as $T \rightarrow \infty$, the marginal $p_\theta(s_t, a_t)$ converges to a stationary distribution, so the time-averaged reward converges to an expectation under that distribution.