DRL Notes, Part 1

References

Basic Concepts

  1. trial and error

  2. DRL = RL + deep learning

  3. on-policy: all training data is produced by the current agent interacting with the environment; old data, i.e. data generated by earlier versions of the agent, is not used during training

    • Disadvantage: these algorithms are weaker in sample efficiency
    • Advantage: these algorithms directly optimize the objective you care about (policy performance), and it works out mathematically that you need on-policy data to calculate the updates
    • Summary: they trade sample efficiency for stability and reliability during training
  4. off-policy: old data can be reused. A Q-function is trained by (approximately) satisfying the Bellman equations; see the equations at the end of this item

    • Disadvantages:
      • there are no guarantees that doing a good job of satisfying Bellman’s equations leads to having great policy performance.
      • the absence of guarantees makes algorithms in this class potentially brittle and unstable.
    • Advantage: high sample efficiency
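
    A minimal statement of the Bellman equations referred to above, written with this document's notation ($P$ for transitions, $R$ for rewards, discount factor $\gamma$ as defined later under returns):
    $Q^\pi(s,a) = E_{s'\sim P(\cdot|s,a)}\big[R(s,a,s') + \gamma\, E_{a'\sim\pi(\cdot|s')}[Q^\pi(s',a')]\big]$
    $Q^*(s,a) = E_{s'\sim P(\cdot|s,a)}\big[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\big]$
    Because these equations only involve single transitions $(s, a, r, s')$, they can be fit on data collected by any past policy, which is why old data can be reused.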

Basic Math Formulas

  • log-derivative trick
    $\nabla_\theta P(\tau|\theta) = P(\tau|\theta)\,\nabla_\theta \log P(\tau|\theta)$
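    A one-line derivation (just the chain rule applied to $\log$):
    $\nabla_\theta \log P(\tau|\theta) = \dfrac{\nabla_\theta P(\tau|\theta)}{P(\tau|\theta)} \;\Longrightarrow\; \nabla_\theta P(\tau|\theta) = P(\tau|\theta)\,\nabla_\theta \log P(\tau|\theta)$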

  • Expected Grad-Log-Prob (EGLP) lemma: suppose $P_\theta$ is a parameterized probability distribution over a random variable $X$; then the expected gradient of $\log P_\theta(x)$ is zero, i.e.
    $E_{x\sim P_\theta}[\nabla_\theta \log P_\theta(x)] = 0$

    Proof
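    A short proof sketch, using the normalization $\int_x P_\theta(x) = 1$ and the log-derivative trick:
    $0 = \nabla_\theta \int_x P_\theta(x) = \int_x \nabla_\theta P_\theta(x) = \int_x P_\theta(x)\,\nabla_\theta \log P_\theta(x) = E_{x\sim P_\theta}[\nabla_\theta \log P_\theta(x)]$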
  • Law of total expectation (law of iterated expectations): let $X, Y$ be random variables with distributions $P_X, P_Y$; then
    $E_{x\sim P_X}[x] = E_{y\sim P_Y}\big[E_{x\sim P_{X|Y}}[x \mid y]\big]$

    Proof
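    A short proof sketch for the discrete case, writing $P_{X,Y}$ for the joint distribution:
    $E_{y\sim P_Y}\big[E[x \mid y]\big] = \sum_y P_Y(y) \sum_x x\, P_{X|Y}(x \mid y) = \sum_x x \sum_y P_{X,Y}(x, y) = \sum_x x\, P_X(x) = E_{x\sim P_X}[x]$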

Core Concepts

In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.

  1. states: a state is a complete description of the state of the world. There is no information about the world which is hidden from the state.

    observations: an observation is a partial description of a state, which may omit information.

  2. environment: fully observed, partially observed

  3. action spaces: the set of all valid actions in a given environment. discrete action spaces, continuous action space

  4. policy: a policy is a rule used by an agent to decide what actions to take

    • deterministic: denoted by $\mu$, $a_t = \mu(s_t)$

    • stochastic: denoted by $\pi$, $a_t \sim \pi(\cdot|s_t)$

      • categorical policies: used in discrete action spaces. The input is a state and the output is a probability distribution over the discrete actions (see the code sketch at the end of this item)

      • diagonal Gaussian policies: used in continuous action spaces

        Sampling is done as $a = \mu_\theta(s) + \sigma_\theta(s) \circ z$, where $\mu_\theta(\cdot)$ is the mean action (a vector), $\sigma_\theta(\cdot)$ is the vector of standard deviations (the covariance matrix is diagonal, with all off-diagonal entries equal to 0), and $z \sim \mathcal{N}(0, I)$ is a noise vector

    The word "policy" is often used in place of "agent"

    parameterized policies: the policy is modeled by a parameterized function, and an optimization algorithm adjusts the parameters to change the policy's behavior. The parameters are usually denoted by $\theta$:
    $a_t = \mu_\theta(s_t)$ or $a_t \sim \pi_\theta(\cdot|s_t)$
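
    A minimal PyTorch sketch of the two stochastic policy types; the network sizes and the dimensions `obs_dim`, `n_actions`, `act_dim` are illustrative assumptions, not from the original notes:

    ```python
    import torch
    import torch.nn as nn
    from torch.distributions import Categorical, Normal

    obs_dim, n_actions, act_dim = 4, 2, 3  # hypothetical dimensions

    # Categorical policy: state in, distribution over discrete actions out
    logits_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def sample_categorical(obs):
        dist = Categorical(logits=logits_net(obs))   # softmax over the logits
        a = dist.sample()
        return a, dist.log_prob(a)

    # Diagonal Gaussian policy: mean from a network, per-dimension log-std as a free parameter
    mu_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
    log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

    def sample_gaussian(obs):
        mu, std = mu_net(obs), torch.exp(log_std)
        a = mu + std * torch.randn_like(mu)          # a = mu_theta(s) + sigma_theta(s) ∘ z
        return a, Normal(mu, std).log_prob(a).sum(-1)

    obs = torch.randn(obs_dim)                       # a dummy observation
    print(sample_categorical(obs), sample_gaussian(obs))
    ```

    Here `log_std` is a state-independent parameter, a common simplification; it can also be produced by the network as another output.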

  5. trajectories: a sequence of states and actions, $\tau = (s_0, a_0, s_1, a_1, \dots)$, where $s_{t+1} \sim P(\cdot|s_t, a_t)$

  6. reward: the feedback received at each step, $r_t = R(s_t, a_t, s_{t+1})$, often simplified to $r_t = R(s_t)$ or $r_t = R(s_t, a_t)$

    return: the cumulative long-term reward

    • finite-horizon undiscounted return: $R(\tau) = \sum_{t=0}^{T} r_t$

    • infinite-horizon discounted return: $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$. There are two main reasons for using the discount factor $\gamma$:

      • it guarantees that the return converges to a finite value
      • rewards obtained sooner have a larger influence

      Note: while the line between these two formulations of return is quite stark in the RL formalism, deep RL practice tends to blur it; for example, algorithms are frequently set up to optimize the undiscounted return, while discount factors are used when estimating value functions (both returns are sketched in code below)
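
    A minimal sketch of the two return definitions; the reward list `rews` and the function names are illustrative assumptions:

    ```python
    def undiscounted_return(rews):
        """Finite-horizon undiscounted return: plain sum of rewards over the trajectory."""
        return sum(rews)

    def discounted_return(rews, gamma=0.99):
        """Discounted return: sum of gamma^t * r_t, truncated at the trajectory length."""
        return sum((gamma ** t) * r for t, r in enumerate(rews))

    print(undiscounted_return([1.0, 1.0, 1.0]))     # 3.0
    print(discounted_return([1.0, 1.0, 1.0], 0.9))  # 1 + 0.9 + 0.81 = 2.71
    ```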

  7. RL problem: the goal of RL is to adjust the policy so as to maximize the expected cumulative return. Given the current policy $\pi$, the probability of a trajectory $\tau$ of length $T$ is $P(\tau|\pi) = \rho(s_0) \displaystyle\prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)\,\pi(a_t|s_t)$, where $\rho$ is the initial-state distribution. The expected return of $\pi$ is then $J(\pi) = \int_\tau P(\tau|\pi)\,R(\tau) = E_{\tau\sim\pi}[R(\tau)]$, so the RL optimization objective is $\pi^* = \displaystyle\arg\max_\pi J(\pi)$ (a Monte Carlo estimate of $J(\pi)$ is sketched below)
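
    A minimal Monte Carlo sketch of estimating $J(\pi)$ by sampling trajectories. The `env` object with its `reset()`/`step(action)` methods and the `policy(obs)` sampling function are hypothetical, named here only for illustration:

    ```python
    def estimate_J(env, policy, num_trajectories=100, horizon=200):
        """Estimate J(pi) = E_{tau~pi}[R(tau)] with finite-horizon undiscounted returns."""
        returns = []
        for _ in range(num_trajectories):
            obs, ep_return = env.reset(), 0.0
            for _ in range(horizon):
                obs, reward, done = env.step(policy(obs))  # hypothetical 3-tuple step interface
                ep_return += reward
                if done:
                    break
            returns.append(ep_return)
        return sum(returns) / len(returns)  # sample mean approximates the expectation
    ```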

  8. value functions

    • on-policy value function: $V^\pi(s) = E_{\tau\sim\pi}[R(\tau) \mid s_0 = s]$

    • on-policy action-value function: $Q^\pi(s,a) = E_{\tau\sim\pi}[R(\tau) \mid s_0 = s, a_0 = a]$

    • optimal value function: $V^*(s) = \max_\pi E_{\tau\sim\pi}[R(\tau) \mid s_0 = s]$

    • optimal action-value function: $Q^*(s,a) = \max_\pi E_{\tau\sim\pi}[R(\tau) \mid s_0 = s, a_0 = a]$
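
    Two standard identities connecting these functions (they follow directly from the definitions above):
    $V^\pi(s) = E_{a\sim\pi}[Q^\pi(s,a)]$ and $V^*(s) = \max_a Q^*(s,a)$
    In particular, an optimal action in state $s$ can be obtained as $a^*(s) = \arg\max_a Q^*(s,a)$.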
