lec1
Differences between ML and RL
| ML | RL |
| --- | --- |
| i.i.d. data | data is not i.i.d.; earlier data affects future inputs |
| a definite ground truth is available at training time | only success/failure is known, not a concrete label |
| supervised learning requires humans to provide labels | the "label" can be as coarse as success or fail |
For a long time RL was held back by feature engineering: it was unclear which features best suit the policy or value function. Deep RL addresses this by learning the features end to end.
Some categories of RL-related learning
inverse reinforcement learning: learning reward functions from examples
unsupervised learning: learning from observing the world
meta-learning / transfer learning: learning to learn, i.e. using past experience to learn new tasks faster
current challenges
- humans learn very quickly, but deep RL is usually slow
- humans reuse past knowledge; in RL this is the role of transfer learning
- it is not always clear how to design the reward function
- it is not clear what the role of prediction should be
lec4
markov chain
Definition:
$$M = \{S, T\}$$
where:
- $S$ is the state space
- $T$ is the transition operator. Let $\mu_t$ be a probability vector with $\mu_{t,i} = p(s_t = i)$; since $T_{i,j} = p(s_{t+1} = i \mid s_t = j)$, the state distribution evolves as $\mu_{t+1} = T\mu_t$
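A minimal numpy sketch of this update (the 3-state chain and its numbers below are made-up values, only to show the $\mu_{t+1} = T\mu_t$ recursion):

```python
import numpy as np

# Hypothetical 3-state Markov chain: column j holds p(s_{t+1} = i | s_t = j),
# so every column sums to 1.
T = np.array([
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.3],
    [0.0, 0.1, 0.7],
])

mu = np.array([1.0, 0.0, 0.0])  # mu_0: start in state 0 with probability 1
for _ in range(5):
    mu = T @ mu                  # mu_{t+1} = T mu_t
print(mu)                        # state distribution after 5 steps
```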
markov decision process
$$M = \{S, A, T, r\}$$
where:
- $S$ is the state space
- $T$ is the transition operator
- $A$ is the action space; with actions added, the transition operator becomes a tensor: $T_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$
- $r: S \times A \rightarrow \mathbb{R}$ is the reward function
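A hedged sketch of how these objects could be represented in code, following the indexing above (the random numbers and the uniform policy are illustrative assumptions, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# T[i, j, k] = p(s_{t+1} = i | s_t = j, a_t = k); normalize over the next state i
T = rng.random((n_states, n_states, n_actions))
T /= T.sum(axis=0, keepdims=True)

# r[j, k] = r(s = j, a = k)
r = rng.random((n_states, n_actions))

def rollout(horizon=10):
    """Sample one trajectory under a uniformly random policy; return total reward."""
    s, total = 0, 0.0
    for _ in range(horizon):
        a = rng.integers(n_actions)             # a_t ~ pi(a_t | s_t), here uniform
        total += r[s, a]
        s = rng.choice(n_states, p=T[:, s, a])  # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
    return total

print(rollout())
```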
partially observed markov decision process
Similar to a Markov decision process, but the agent only receives observations rather than the full state:
$$M = \{S, A, O, T, E, r\}$$
where:
- $S$ is the state space
- $T$ is the transition operator
- $A$ is the action space; with actions added, the transition operator becomes a tensor: $T_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$
- $O$ is the observation space
- $r: S \times A \rightarrow \mathbb{R}$ is the reward function
- $E$ is the emission probability, i.e. $p(o_t \mid s_t)$
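To make the emission probability concrete, a small illustrative sketch (the observation count and matrix values are assumptions): the agent receives $o_t \sim p(o_t \mid s_t)$ rather than $s_t$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_obs = 3, 2

# E[o, i] = p(o_t = o | s_t = i): each column is a distribution over observations
E = rng.random((n_obs, n_states))
E /= E.sum(axis=0, keepdims=True)

def observe(s):
    """The agent never sees s_t directly; it only receives o_t ~ p(o_t | s_t)."""
    return rng.choice(n_obs, p=E[:, s])

print(observe(0))
```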
RL’s goal
The reinforcement learning objective is:
$$\theta^* = \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$
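Here $\tau = (s_1, a_1, \dots, s_T, a_T)$ and $p_\theta(\tau)$ is the trajectory distribution induced by the policy $\pi_\theta$; under the Markov assumption it factorizes as

$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$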
Transitions follow a Markov process.
In the finite-horizon case, the objective can be rewritten as:
$$\begin{aligned} \theta^* &= \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] \\ &= \argmax_{\theta} \sum_{t=1}^T E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)] \end{aligned}$$
i.e., by linearity of expectation, the objective becomes a sum of expectations under the marginal distributions of $(s_t, a_t)$.
In the infinite-horizon case, $p(s_t, a_t)$ converges to a stationary distribution, so the objective can be further rewritten as:
$$\begin{aligned} \theta^* &= \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] \\ &= \argmax_{\theta} \sum_{t=1}^T E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)] \\ &= \argmax_{\theta} \frac{1}{T} \sum_{t=1}^T E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)] \rightarrow E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)] \end{aligned}$$
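As a sanity check on the stationary-distribution claim: $\mu = T\mu$ means $\mu$ is the eigenvector of $T$ with eigenvalue 1. A small numpy sketch (the matrix is an assumed example; for simplicity it is the state transition matrix under some fixed policy rather than the full state-action marginal):

```python
import numpy as np

# Assumed example: state transition matrix under a fixed policy (columns sum to 1)
T = np.array([
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.3],
    [0.0, 0.1, 0.7],
])

# The stationary distribution satisfies mu = T mu, i.e. it is the eigenvector
# of T for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu /= mu.sum()
print(mu)      # long-run state distribution
print(T @ mu)  # equals mu up to numerical error
```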
The core of the RL objective is an expectation. Even when the underlying distributions are discrete, the expectation is a smooth function of $\theta$, so gradient-based methods such as gradient descent/ascent can be applied.
algorithms
Most RL algorithms follow roughly the same high-level loop (a toy sketch of the loop appears after this list):
- generate samples: run the current policy to sample trajectories from its trajectory distribution
- fit a model
- improve policy
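A minimal self-contained sketch of this loop on a made-up tabular MDP, using crude Monte Carlo returns for the "fit" step and ε-greedy improvement for the "improve" step; this only illustrates the anatomy of the loop, not any specific algorithm from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 3, 2, 10

# Assumed toy MDP, same conventions as above: T[i, j, k] = p(s'=i | s=j, a=k)
T = rng.random((n_states, n_states, n_actions))
T /= T.sum(axis=0, keepdims=True)
r = rng.random((n_states, n_actions))

policy = np.full((n_states, n_actions), 1.0 / n_actions)  # pi(a|s), start uniform

for _ in range(20):
    # 1. generate samples: roll out the current policy
    trajectories = []
    for _ in range(50):
        s, traj = 0, []
        for _ in range(horizon):
            a = rng.choice(n_actions, p=policy[s])
            traj.append((s, a, r[s, a]))
            s = rng.choice(n_states, p=T[:, s, a])
        trajectories.append(traj)

    # 2. fit / estimate returns: Monte Carlo estimate of Q(s, a)
    q_sum = np.zeros((n_states, n_actions))
    q_cnt = np.full((n_states, n_actions), 1e-6)  # avoid division by zero
    for traj in trajectories:
        rewards = [rew for _, _, rew in traj]
        for t, (s, a, _) in enumerate(traj):
            q_sum[s, a] += sum(rewards[t:])       # reward-to-go from (s_t, a_t)
            q_cnt[s, a] += 1
    q = q_sum / q_cnt

    # 3. improve the policy: epsilon-greedy with respect to the estimated Q
    eps = 0.1
    policy = np.full((n_states, n_actions), eps / n_actions)
    policy[np.arange(n_states), q.argmax(axis=1)] += 1 - eps

print(policy)
```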
types of algorithms
value-based: estimate the value function or Q-function (e.g. Q-learning, DQN)
policy gradients: directly optimize $\theta^* = \argmax_{\theta} E_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t, a_t)]$ (e.g. REINFORCE, PPO / proximal policy optimization)
actor-critic: a combination of the two (e.g. A3C, SAC)
model-based: learn the transition model, then use it for planning or for improving the policy (e.g. Dyna)
model-based algorithms
Options for using the learned model include:
- use model to plan
- backpropagate gradients into policy
- learn a value function
value-based algorithms
policy-based
actor-critic
trade-offs
Things to consider:
- sample efficiency (off-policy vs. on-policy); stability & ease of use (convergence: many RL algorithms do not come with strict convergence guarantees)
- assumptions: stochastic or deterministic, continuous or discrete, episodic or infinite horizon
- whether the policy or the model is easier to represent/learn
Sample efficiency in detail
Common assumptions in RL: full observability, episodic learning, continuity or smoothness
value functions
Q-function: the expected total reward after taking action $a_t$ in state $s_t$ and following $\pi$ thereafter: $Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi}[r(s_{t'}, a_{t'}) \mid s_t, a_t]$
Value function: the expected total reward from state $s_t$ under $\pi$: $V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi}[r(s_{t'}, a_{t'}) \mid s_t]$
The RL objective can then be written as $E_{s_1 \sim p(s_1)}[V^\pi(s_1)]$.
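The two are related by the standard identities (condition on the action for $V$, and on the next state for $Q$):

$$V^\pi(s_t) = E_{a_t \sim \pi(a_t \mid s_t)}\left[Q^\pi(s_t, a_t)\right], \qquad Q^\pi(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[V^\pi(s_{t+1})\right]$$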
lec5 - policy gradient
See here for details.
lec6 - actor-critic
See here for details.
lec7 - value-based methods
See here for details.
Q & A
What is the relationship between RL and MDPs (Markov decision processes)?
RL is a framework for solving MDP problems.
If a problem can be formulated as an MDP (i.e. transition probabilities and a reward distribution can be specified), then RL is likely a good fit. Conversely, if the problem cannot be formulated as an MDP, RL is not guaranteed to find a useful solution.
A key factor for RL is whether the states have the Markov property (given the present state and all past states, the conditional distribution of the future state depends only on the current state).
Why can the infinite-horizon RL objective be written as a single expectation? / How is the objective derived? See the derivation in lec4 above: as $T \rightarrow \infty$, the marginal $p_\theta(s_t, a_t)$ converges to a stationary distribution, so the time-averaged reward converges to an expectation under that distribution.