Basic Concepts
- trial and error
- DRL = RL + deep learning
- on-policy: all training data is produced by the current agent interacting with the environment; old data generated by earlier versions of the agent is not reused during training
  - Con: these algorithms are weaker in sample efficiency
  - Pro: they directly optimize the objective you care about (policy performance), and it works out mathematically that you need on-policy data to calculate the updates
  - Summary: sample efficiency is sacrificed in exchange for stability and reliability during training
- off-policy: old data can be reused; a Q-function is trained via the Bellman equations
  - Cons:
    - there are no guarantees that doing a good job of satisfying the Bellman equations leads to great policy performance
    - this absence of guarantees makes algorithms in this class potentially brittle and unstable
  - Pro: high sample efficiency
Basic Math Formulas
- log-derivative trick: $\nabla_\theta P(\tau|\theta) = P(\tau|\theta)\,\nabla_\theta \log P(\tau|\theta)$
- Expected Grad-Log-Prob (EGLP) lemma: suppose $P_\theta$ is a parameterized probability distribution over a random variable $x$; then the expectation of the gradient of $\log P_\theta(x)$ is 0, i.e.
  $$E_{x\sim P_\theta}[\nabla_\theta \log P_\theta(x)] = 0$$
- law of total expectation (law of iterated expectations): let $X, Y$ be random variables with distributions $P_X, P_Y$; then
  $$E_{x\sim P_X}[x] = E_{y \sim P_Y}\big[E_{x \sim P_X}[x \mid y]\big]$$
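The EGLP lemma is easy to verify numerically. A minimal sketch, assuming (as a made-up example) that $P_\theta$ is a 1-D Gaussian $\mathcal{N}(\mu, \sigma^2)$, for which $\nabla_\mu \log P_\theta(x) = (x-\mu)/\sigma^2$:

```python
# Numerical sanity check of the EGLP lemma: E_{x~P_theta}[grad log P_theta(x)] = 0.
# Example distribution (an assumption, not from the notes): a 1-D Gaussian
# N(mu, sigma^2), whose log-prob gradient w.r.t. mu is (x - mu) / sigma^2.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
x = rng.normal(mu, sigma, size=1_000_000)   # x ~ P_theta

grad_log_prob = (x - mu) / sigma**2         # grad_mu log N(x; mu, sigma^2)
print(grad_log_prob.mean())                 # close to 0, as the lemma predicts
```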
Core Concepts
In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.
- states: a state is a complete description of the world. There is no information about the world which is hidden from the state.
- observations: an observation is a partial description of a state, which may omit information.
- environment: fully observed (the agent can see the complete state) vs. partially observed (the agent only gets partial observations)
- action spaces: the set of all valid actions in a given environment; either discrete or continuous
- policy: a rule used by an agent to decide what actions to take
  - deterministic: denoted by $\mu$, with $a_t = \mu(s_t)$
  - stochastic: denoted by $\pi$, with $a_t \sim \pi(\cdot|s_t)$
    - categorical policies: used in discrete action spaces. The input is a state; the output is a probability distribution over the actions
    - diagonal Gaussian policies: used in continuous action spaces. Actions are sampled as $a = \mu_\theta(s) + \sigma_\theta(s) \circ z$, where $\mu_\theta(\cdot)$ is the mean action (a vector), $\sigma_\theta(\cdot)$ gives the diagonal of the covariance matrix (all off-diagonal entries are 0), and $z \sim \mathcal{N}(0, I)$ is a noise vector
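Both sampling schemes can be sketched in a few lines. The "network outputs" below are fixed placeholder arrays (assumptions for illustration, not from the notes); in practice $\sigma_\theta$ is usually produced as log standard deviations for numerical stability:

```python
# Toy sketches of the two stochastic policy types: a categorical policy for
# discrete actions and a diagonal Gaussian policy for continuous actions.
import numpy as np

rng = np.random.default_rng(0)

def categorical_sample(logits):
    """Sample a discrete action from softmax(logits)."""
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

def diag_gaussian_sample(mu, log_sigma):
    """a = mu_theta(s) + sigma_theta(s) * z elementwise, with z ~ N(0, I)."""
    z = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * z       # elementwise product = diagonal covariance

a_disc = categorical_sample(np.array([2.0, 0.5, -1.0]))   # one of {0, 1, 2}
a_cont = diag_gaussian_sample(np.array([0.1, -0.3]),
                              np.array([-1.0, -0.5]))     # a 2-D action vector
```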
- the term "policy" is often used in place of "agent"
- parameterized policies: the policy is modeled by a parameterized function; an optimization algorithm adjusts the parameters (usually denoted $\theta$) to change the policy's behavior:
  $$a_t = \mu_\theta(s_t) \qquad\text{or}\qquad a_t \sim \pi_\theta(\cdot|s_t)$$
- trajectories: state-action sequences $\tau = (s_0, a_0, s_1, a_1, \dots)$, where $s_{t+1} \sim P(\cdot|s_t, a_t)$
- reward: the feedback received at each decision step, $r_t = R(s_t, a_t, s_{t+1})$; often simplified to $r_t = R(s_t)$ or $r_t = R(s_t, a_t)$
- return: the cumulative long-term reward
  - finite-horizon undiscounted return: $R(\tau) = \sum_{t=0}^T r_t$
  - infinite-horizon discounted return: $R(\tau) = \sum_{t=0}^\infty \gamma^t r_t$. There are two main reasons for using the discount factor $\gamma$:
    - it guarantees that the return converges to a finite value
    - rewards received sooner count for more
  - Note: while the line between these two formulations of return is quite stark in the RL formalism, deep RL practice tends to blur it; for example, algorithms are often set up to optimize the undiscounted return, but discount factors are used in estimating value functions
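For a finite (truncated) trajectory, the discounted return can be computed with a backward fold that avoids recomputing powers of $\gamma$; the reward values here are made up:

```python
# Discounted return R(tau) = sum_t gamma^t * r_t over a finite reward list.
def discounted_return(rewards, gamma=0.99):
    ret = 0.0
    for r in reversed(rewards):   # fold from the back: ret = r_t + gamma * ret
        ret = r + gamma * ret
    return ret

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.99*0.0 + 0.99^2 * 2.0 = 2.9602
```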
- RL problem: the goal of RL is to adjust the policy so as to maximize the expected return. Given a policy $\pi$, the probability of a length-$T$ trajectory $\tau$ is
  $$P(\tau|\pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)\,\pi(a_t|s_t),$$
  the expected return of $\pi$ is $J(\pi) = \int_\tau P(\tau|\pi)\,R(\tau)$, and the RL optimization objective is $\pi^* = \arg\max_\pi J(\pi)$
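The trajectory-probability factorization can be checked on a tiny example. All numbers below (start distribution, policy, transition probabilities) are hypothetical:

```python
# P(tau|pi) = rho0(s0) * prod_t [ pi(a_t|s_t) * P(s_{t+1}|s_t, a_t) ]
# evaluated for one short trajectory in an assumed 2-state, 2-action MDP.
rho0 = {0: 1.0}                      # deterministic start state s_0 = 0
pi = {0: [0.5, 0.5], 1: [0.9, 0.1]}  # pi[s][a]: action probabilities per state
P = {(0, 0): {0: 0.8, 1: 0.2},       # P[(s, a)][s']: transition probabilities
     (0, 1): {0: 0.1, 1: 0.9},
     (1, 0): {0: 1.0},
     (1, 1): {1: 1.0}}

tau = [(0, 1), (1, 0)]               # (s_t, a_t) pairs; final state s_2 = 0
states = [s for s, _ in tau] + [0]   # state sequence including the final state
prob = rho0[states[0]]
for t, (s, a) in enumerate(tau):
    prob *= pi[s][a] * P[(s, a)][states[t + 1]]

print(prob)  # 1.0 * (0.5 * 0.9) * (0.9 * 1.0) = 0.405
```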
- value functions
  - on-policy value function: $V^\pi(s) = E_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
  - on-policy action-value function: $Q^\pi(s,a) = E_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
  - optimal value function: $V^*(s) = \max_\pi E_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
  - optimal action-value function: $Q^*(s,a) = \max_\pi E_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
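Since these are all expectations over trajectories, $V^\pi(s)$ can be estimated by simply averaging rollout returns. A sketch on a made-up 2-state MDP with a uniform-random policy (all dynamics and rewards hypothetical):

```python
# Monte Carlo estimate of V^pi(s) = E_{tau~pi}[R(tau) | s_0 = s] on a toy MDP.
import numpy as np

rng = np.random.default_rng(0)
# P[s][a] = (next_state, reward): deterministic assumed dynamics.
P = {0: {0: (0, 0.0), 1: (1, 1.0)},
     1: {0: (0, 0.0), 1: (1, 1.0)}}

def rollout_return(s, gamma=0.9, horizon=50):
    """Discounted return of one rollout from state s under a random policy."""
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.integers(2)       # uniform-random policy pi(a|s)
        s, r = P[s][a]
        ret += discount * r
        discount *= gamma
    return ret

V0 = np.mean([rollout_return(0) for _ in range(5000)])
print(V0)  # near 0.5 * (1 - 0.9^50) / (1 - 0.9) ~= 4.97
```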