CS285 study notes

lec1

Differences between ML and RL

| ML | RL |
| --- | --- |
| data is i.i.d. | data is not i.i.d.; earlier data influences future inputs |
| training has a definite ground truth | only success/fail is known, not the exact label |
| supervised learning needs humans to provide labels | the "label" can be as coarse as success or fail |

For a long time RL was held back by feature engineering: it was unclear how to choose features that suit the policy or value function. Deep RL removes the need for hand-designed features.

Related flavors of learning

inverse reinforcement learning: learning reward functions from examples
unsupervised learning: learning by observing the world
meta-learning / transfer learning: learning to learn, i.e. using past experience to learn new tasks

current challenges

  1. humans learn very quickly, but deep RL is slow
  2. humans reuse past knowledge; in RL this corresponds to transfer learning
  3. it is unclear how the reward function should be designed
  4. it is unclear what the role of prediction should be

lec4

Markov chain

Definition:
$M = \{S, T\}$
where:

  1. $S$ is the state space
  2. $T$ is the transition operator: let $\mu_t$ be a probability vector with $\mu_{t,i} = p(s_t = i)$; since $T_{i,j} = p(s_{t+1} = i \mid s_t = j)$, we have $\mu_{t+1} = T \mu_t$ (see the sketch below)
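As a concrete illustration of $\mu_{t+1} = T \mu_t$, here is a minimal sketch with a hypothetical 3-state chain (the transition matrix is made up for the example):

```python
import numpy as np

# Hypothetical 3-state Markov chain; column j holds p(s_{t+1} = i | s_t = j),
# so every column sums to 1 and mu_{t+1} = T @ mu_t.
T = np.array([[0.9, 0.2, 0.0],
              [0.1, 0.7, 0.3],
              [0.0, 0.1, 0.7]])

mu = np.array([1.0, 0.0, 0.0])   # start deterministically in state 0
for t in range(5):
    mu = T @ mu                  # propagate the state distribution one step
    print(t + 1, mu)
```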

Markov decision process

$M = \{S, A, T, r\}$
where:

  1. $S$ is the state space
  2. $T$ is the transition operator
  3. $A$ is the action space; with actions added, the transition operator becomes a tensor: $T_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$ (see the sketch after this list)
  4. $r: S \times A \rightarrow \mathbb{R}$ is the reward function
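A minimal sketch that samples one trajectory from a small tabular MDP; the transition tensor, reward table, and policy below are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# T[i, j, k] = p(s_{t+1} = i | s_t = j, a_t = k); for each (j, k), T[:, j, k] sums to 1.
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (j, k, i)
T = np.moveaxis(T, -1, 0)                                         # reorder to (i, j, k)

r = rng.normal(size=(n_states, n_actions))   # r(s, a), arbitrary rewards
pi = np.full((n_states, n_actions), 0.5)     # pi[s, a] = p(a | s), uniform policy

s, total_reward = 0, 0.0
for t in range(10):
    a = rng.choice(n_actions, p=pi[s])       # a_t ~ pi(a | s_t)
    total_reward += r[s, a]                  # collect r(s_t, a_t)
    s = rng.choice(n_states, p=T[:, s, a])   # s_{t+1} ~ T[:, s_t, a_t]
print("return of the sampled trajectory:", total_reward)
```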

partially observed Markov decision process

Similar to a Markov decision process, but with an observation constraint: the agent receives observations rather than states.
$M = \{S, A, O, T, E, r\}$
where:

  1. $S$ is the state space
  2. $T$ is the transition operator
  3. $A$ is the action space; as above, $T_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$
  4. $r: S \times A \rightarrow \mathbb{R}$ is the reward function
  5. $O$ is the observation space and $E$ is the emission probability, i.e. $p(o_t \mid s_t)$ (sketched below)
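Extending the MDP sketch, a POMDP only adds an emission step: the agent sees $o_t \sim p(o_t \mid s_t)$ rather than $s_t$ itself. The emission matrix below is again made up:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, n_obs = 3, 2, 2

# Same tabular ingredients as the MDP sketch, plus an emission matrix
# E[o, s] = p(o_t = o | s_t = s); every column sums to 1.
T = np.moveaxis(rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)), -1, 0)
E = np.array([[0.8, 0.5, 0.1],
              [0.2, 0.5, 0.9]])
pi = np.full((n_states, n_actions), 0.5)   # uniform policy; a real POMDP policy would condition on o_t / history

s = 0
for t in range(5):
    o = rng.choice(n_obs, p=E[:, s])       # the agent observes o_t, not s_t
    a = rng.choice(n_actions, p=pi[s])
    s = rng.choice(n_states, p=T[:, s, a])
```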

RL’s goal


The RL objective is defined over the trajectory distribution induced by the policy, $p_\theta(\tau) = p(s_1) \prod_{t=1}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$, where the transitions follow the Markov process above:
$$\theta^* = \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big]$$

With a finite horizon, the objective can be rewritten as:
$$\begin{aligned} \theta^* &= \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big] \\ &= \argmax_{\theta} \sum_{t=1}^T E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)] \end{aligned}$$
i.e. the expectation is computed under the marginal distribution of $(s_t, a_t)$ at each time step.
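A minimal sketch of this marginal form: in a tabular MDP the marginals $p_\theta(s_t, a_t)$ can be propagated forward exactly, and summing the expected reward under them gives the objective. The MDP and policy below are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 3, 2, 10

T = np.moveaxis(rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)), -1, 0)  # T[s', s, a]
r = rng.normal(size=(n_states, n_actions))                                            # r(s, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)                                 # pi[s, a] = pi(a | s)
p1 = np.array([1.0, 0.0, 0.0])                                                        # p(s_1)

# Propagate the state marginal forward and accumulate sum_t E_{(s_t, a_t)}[r(s_t, a_t)].
p_s, J = p1.copy(), 0.0
for t in range(horizon):
    p_sa = p_s[:, None] * pi                 # p(s_t, a_t) = p(s_t) * pi(a_t | s_t)
    J += np.sum(p_sa * r)                    # E_{(s_t, a_t) ~ p(s_t, a_t)}[r(s_t, a_t)]
    p_s = np.einsum('isa,sa->i', T, p_sa)    # p(s_{t+1}) = sum_{s, a} p(s' | s, a) p(s_t, a_t)
print("J(theta) =", J)
```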

With an infinite horizon, $p_\theta(s_t, a_t)$ converges to a stationary distribution (under ergodicity assumptions), so after dividing by $T$ (which does not change the maximizer) the objective becomes:
$$\begin{aligned} \theta^* &= \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big] \\ &= \argmax_{\theta} \sum_{t=1}^T E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)] \\ &= \argmax_{\theta} \frac{1}{T} \sum_{t=1}^T E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)] \rightarrow E_{(s, a) \sim p_\theta(s, a)}[r(s, a)] \end{aligned}$$
where, as $T \to \infty$, the time average converges to the expectation under the stationary state-action distribution $p_\theta(s, a)$.
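A minimal sketch of the stationary-distribution claim: for a fixed policy, $(s_t, a_t)$ is itself a Markov chain, and its distribution converges to the eigenvector (with eigenvalue 1) of the induced transition matrix. The MDP and policy below are made up; convergence requires an ergodic chain:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

T = np.moveaxis(rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)), -1, 0)  # T[s', s, a]
pi = rng.dirichlet(np.ones(n_actions), size=n_states)                                 # pi[s, a]

# Under a fixed policy, (s_t, a_t) is itself a Markov chain with (column-stochastic)
# transition matrix P[(s', a'), (s, a)] = pi(a' | s') * p(s' | s, a).
P = np.einsum('ib,isa->ibsa', pi, T).reshape(n_states * n_actions, n_states * n_actions)

# Power iteration: mu <- P mu converges to the stationary distribution (if the chain is ergodic).
mu = np.ones(n_states * n_actions) / (n_states * n_actions)
for _ in range(1000):
    mu = P @ mu
print("stationary p(s, a):\n", mu.reshape(n_states, n_actions))

# Cross-check: the stationary distribution is the eigenvector of P with eigenvalue 1.
vals, vecs = np.linalg.eig(P)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
print("eigenvector check:\n", (v / v.sum()).reshape(n_states, n_actions))
```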

The core RL objective is an expectation. Even when the underlying distribution is over discrete outcomes, the expectation itself is a smooth function of the policy parameters, so gradient-based optimization (e.g. gradient descent) can be applied.
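A minimal sketch of that smoothness, assuming a softmax policy over two discrete actions with made-up rewards: $E_{a \sim \pi_\theta}[r(a)] = \sum_a \pi_\theta(a)\, r(a)$ is differentiable in $\theta$, so plain gradient ascent works:

```python
import numpy as np

r = np.array([1.0, 0.0])            # reward of each discrete action (made up)

def expected_reward(theta):
    """J(theta) = sum_a softmax(theta)_a * r_a, a smooth function of theta."""
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    return pi @ r, pi

theta = np.zeros(2)
for step in range(100):
    J, pi = expected_reward(theta)
    grad = pi * (r - J)             # analytic gradient: dJ/dtheta_a = pi_a * (r_a - J)
    theta += 1.0 * grad             # gradient ascent on the smooth objective
print(expected_reward(theta)[0])    # approaches 1.0, the reward of the better action
```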

algorithms

Most RL algorithms share the same high-level anatomy, a loop over three steps (a runnable miniature of the loop follows the list):

  1. generate samples: run the current policy to sample trajectories from its trajectory distribution
  2. fit a model (or estimate the return)
  3. improve the policy
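A runnable miniature of this loop on a made-up tabular MDP, using a softmax policy, trajectory returns as the simplest possible "fit" step, and a REINFORCE-style update as one concrete choice for the "improve" step (the algorithm families are listed in the next subsection):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 3, 2, 10

# A small made-up tabular MDP (same conventions as the sketches above).
T = np.moveaxis(rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)), -1, 0)  # T[s', s, a]
r = rng.normal(size=(n_states, n_actions))

theta = np.zeros((n_states, n_actions))          # softmax policy parameters

def policy_probs(theta):
    p = np.exp(theta - theta.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)      # pi(a | s), shape (s, a)

for iteration in range(200):
    pi = policy_probs(theta)

    # 1. generate samples: roll out the current policy
    trajs = []
    for _ in range(20):
        s, traj = 0, []
        for t in range(horizon):
            a = rng.choice(n_actions, p=pi[s])
            traj.append((s, a, r[s, a]))
            s = rng.choice(n_states, p=T[:, s, a])
        trajs.append(traj)

    # 2. "fit": here simply estimate each trajectory's return
    returns = [sum(rew for _, _, rew in traj) for traj in trajs]
    baseline = np.mean(returns)

    # 3. improve the policy: a REINFORCE-style gradient step
    grad = np.zeros_like(theta)
    for traj, R in zip(trajs, returns):
        for s, a, _ in traj:
            g = -pi[s]                           # d log pi(a|s) / d theta[s] = e_a - pi(.|s)
            g[a] += 1.0
            grad[s] += (R - baseline) * g
    theta += 0.01 * grad / len(trajs)

print("average return of last batch:", np.mean(returns))
```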

types of algorithms

value-based: estimate the value function or Q-function, e.g. Q-learning, DQN
policy gradients: directly optimize $\theta^* = \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t, a_t)]$, e.g. REINFORCE, PPO (proximal policy optimization)
actor-critic: a combination of the two, e.g. A3C, SAC
model-based: learn the transition model, then use it for planning or to improve the policy, e.g. Dyna

model-based algorithms

In the "fit a model" step, the transition dynamics $p(s_{t+1} \mid s_t, a_t)$ are learned from the samples; the options for the "improve the policy" step include:

  1. use the model to plan (a minimal random-shooting sketch follows this list)
  2. backpropagate gradients into policy
  3. learn a value function
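For option 1, a minimal random-shooting planner: sample candidate action sequences, roll each out through the model, and execute the first action of the best sequence. The "learned" model here is just a made-up tabular one, so this is an illustrative sketch rather than the course's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, plan_horizon, n_candidates = 3, 2, 5, 64

# Made-up "learned" model and reward (in practice these would be fitted from samples).
T = np.moveaxis(rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)), -1, 0)  # T[s', s, a]
r = rng.normal(size=(n_states, n_actions))

def plan(s0):
    """Random shooting: return the first action of the best sampled action sequence."""
    best_return, best_first_action = -np.inf, 0
    for _ in range(n_candidates):
        actions = rng.integers(n_actions, size=plan_horizon)
        s, total = s0, 0.0
        for a in actions:
            total += r[s, a]
            s = rng.choice(n_states, p=T[:, s, a])   # roll out through the model
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

print("planned action from state 0:", plan(0))
```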

value-based algorithms

Fit $V(s)$ or $Q(s, a)$ from the samples, then improve the policy implicitly, e.g. by setting $\pi(s) = \arg\max_a Q(s, a)$.

policy-based

Evaluate the returns $R_\tau = \sum_t r(s_t, a_t)$ of the sampled trajectories, then improve the policy with a gradient step on the objective: $\theta \leftarrow \theta + \alpha \nabla_\theta E_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t, a_t)]$.

actor-critic

Fit $V(s)$ or $Q(s, a)$ (the critic) and use it inside the gradient step that improves the policy (the actor).

trade-offs

Points to consider:

  1. sample efficiency (off-policy vs. on-policy); stability & ease of use (convergence: many RL algorithms are not guaranteed to strictly converge)
  2. assumptions: stochastic or deterministic, continuous or discrete, episodic or infinite horizon
  3. whether the policy or the model is easier to fit

Sample efficiency, roughly from least to most efficient: gradient-free methods, on-policy policy gradient methods, off-policy (replay-buffer) value-based methods such as Q-learning, model-based RL.

Common assumptions in RL: full observability, episodic learning, continuity or smoothness.

value functions

Q-function: the expected total reward obtained after taking action $a_t$ in state $s_t$ and then following $\pi$: $Q^\pi(s_t, a_t) = \sum_{t'=t}^T E_{\pi}[r(s_{t'}, a_{t'}) \mid s_t, a_t]$

Value function: the expected total reward obtainable from state $s_t$ onward: $V^\pi(s_t) = \sum_{t'=t}^T E_{\pi}[r(s_{t'}, a_{t'}) \mid s_t]$

The RL objective can then be written as $E_{s_1 \sim p(s_1)}[V^\pi(s_1)]$ (a small sketch of these quantities follows).
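A minimal sketch that estimates $Q^\pi$ and $V^\pi$ by Monte Carlo rollouts on a made-up tabular MDP (finite horizon, so each rollout simply runs for a fixed number of steps) and checks the consistency relation $V^\pi(s) = E_{a \sim \pi(\cdot \mid s)}[Q^\pi(s, a)]$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon, n_rollouts = 3, 2, 10, 2000

T = np.moveaxis(rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)), -1, 0)  # T[s', s, a]
r = rng.normal(size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)                                 # pi[s, a]

def rollout_return(s, a, steps):
    """Sum of rewards for `steps` steps, starting by taking a in s, then following pi."""
    total = 0.0
    for t in range(steps):
        total += r[s, a]
        s = rng.choice(n_states, p=T[:, s, a])
        a = rng.choice(n_actions, p=pi[s])
    return total

def Q(s, a):   # Monte Carlo estimate of Q^pi(s, a)
    return np.mean([rollout_return(s, a, horizon) for _ in range(n_rollouts)])

def V(s):      # Monte Carlo estimate of V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
    return np.mean([rollout_return(s, rng.choice(n_actions, p=pi[s]), horizon)
                    for _ in range(n_rollouts)])

print("V(0) =", V(0), " vs  E_a[Q(0, a)] =", sum(pi[0, a] * Q(0, a) for a in range(n_actions)))
```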

lec5 - policy gradient

See the separate note for details.

lec6 - actor-critic

See the separate note for details.

lec7 - value function methods

See the separate note for details.

Q & A

What is the relationship between RL and MDPs (Markov decision processes)?
RL is a framework for solving MDP problems.

If a problem can be formulated as an MDP (i.e. a transition probability and a reward distribution can be given), then RL is likely a good fit for it. Conversely, if the problem cannot be formulated as an MDP, RL is not guaranteed to find a useful solution.
A key factor for RL is whether the states have the Markov property: given the current state and all past states, the conditional distribution of the future state depends only on the current state (an illustration follows).
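A minimal illustration: a made-up process whose next value depends on the last two values is not Markov in $x_t$ alone, but augmenting the state to $(x_{t-1}, x_t)$ restores the Markov property, so an MDP formulation applies again:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up process: x_{t+1} depends on the last TWO values,
# so x_t by itself is not a Markov state.
def step(x_prev, x_curr):
    return 0.5 * x_curr + 0.4 * x_prev + 0.1 * rng.normal()

x_prev, x_curr = 0.0, 0.0
for t in range(5):
    x_prev, x_curr = x_curr, step(x_prev, x_curr)

# Augmented state s_t = (x_{t-1}, x_t): the distribution of s_{t+1}
# depends only on s_t, so the process is Markov again.
s = (0.0, 0.0)
for t in range(5):
    s = (s[1], step(s[0], s[1]))
```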

Why can the objective in infinite-horizon RL be written as a single expectation? See the derivation of the objective function above.
