Basic Concepts
- trial and error
- DRL = RL + deep learning
- on-policy: all training data is produced by the current agent interacting with the environment; old data generated by earlier versions of the agent is not reused during training
  - Con: these algorithms are weaker in sample efficiency
  - Pro: they directly optimize the objective you care about (policy performance), and it works out mathematically that you need on-policy data to calculate the updates
  - Summary: sample efficiency is sacrificed in exchange for stability and reliability during training
- off-policy: old data can be reused; a Q-function is trained via the Bellman equations
  - Cons:
    - there are no guarantees that doing a good job of satisfying the Bellman equations leads to great policy performance
    - this absence of guarantees makes algorithms in this class potentially brittle and unstable
  - Pro: high sample efficiency
Basic Math Formulas
- log-derivative trick: $\nabla_\theta P(\tau|\theta) = P(\tau|\theta)\,\nabla_\theta \log P(\tau|\theta)$
- Expected Grad-Log-Prob (EGLP) lemma: suppose $P_\theta$ is a parameterized probability distribution over a random variable $x$; then the expectation of the gradient of $\log P_\theta(x)$ is 0, i.e.
  $$E_{x\sim P_\theta}[\nabla_\theta \log P_\theta(x)] = 0$$
- law of total expectation (law of iterated expectations): let $X, Y$ be random variables with distributions $P_X, P_Y$; then
  $$E_{x\sim P_X}[x] = E_{y \sim P_Y}\big[E_{x \sim P_X}[x \mid y]\big]$$
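The EGLP lemma is easy to verify numerically. A minimal sketch, assuming (as a made-up example) that $P_\theta$ is a 1-D Gaussian $\mathcal{N}(\mu, \sigma^2)$, for which $\nabla_\mu \log P_\theta(x) = (x-\mu)/\sigma^2$:

```python
# Numerical sanity check of the EGLP lemma: E_{x~P_theta}[grad log P_theta(x)] = 0.
# Example distribution (an assumption, not from the notes): a 1-D Gaussian
# N(mu, sigma^2), whose log-prob gradient w.r.t. mu is (x - mu) / sigma^2.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
x = rng.normal(mu, sigma, size=1_000_000)   # x ~ P_theta

grad_log_prob = (x - mu) / sigma**2         # grad_mu log N(x; mu, sigma^2)
print(grad_log_prob.mean())                 # close to 0, as the lemma predicts
```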
Core Concepts
In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.
- states: a state is a complete description of the world. There is no information about the world which is hidden from the state.
- observations: an observation is a partial description of a state, which may omit information.
- environment: fully observed (the agent can see the complete state) vs. partially observed (the agent only gets partial observations)
- action spaces: the set of all valid actions in a given environment; either discrete or continuous
- policy: a rule used by an agent to decide what actions to take
  - deterministic: denoted by $\mu$, with $a_t = \mu(s_t)$
  - stochastic: denoted by $\pi$, with $a_t \sim \pi(\cdot|s_t)$
    - categorical policies: used in discrete action spaces. The input is a state; the output is a probability distribution over the actions
    - diagonal Gaussian policies: used in continuous action spaces. Actions are sampled as $a = \mu_\theta(s) + \sigma_\theta(s) \circ z$, where $\mu_\theta(\cdot)$ is the mean action (a vector), $\sigma_\theta(\cdot)$ gives the diagonal of the covariance matrix (all off-diagonal entries are 0), and $z \sim \mathcal{N}(0, I)$ is a noise vector
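Both sampling schemes can be sketched in a few lines. The "network outputs" below are fixed placeholder arrays (assumptions for illustration, not from the notes); in practice $\sigma_\theta$ is usually produced as log standard deviations for numerical stability:

```python
# Toy sketches of the two stochastic policy types: a categorical policy for
# discrete actions and a diagonal Gaussian policy for continuous actions.
import numpy as np

rng = np.random.default_rng(0)

def categorical_sample(logits):
    """Sample a discrete action from softmax(logits)."""
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

def diag_gaussian_sample(mu, log_sigma):
    """a = mu_theta(s) + sigma_theta(s) * z elementwise, with z ~ N(0, I)."""
    z = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * z       # elementwise product = diagonal covariance

a_disc = categorical_sample(np.array([2.0, 0.5, -1.0]))   # one of {0, 1, 2}
a_cont = diag_gaussian_sample(np.array([0.1, -0.3]),
                              np.array([-1.0, -0.5]))     # a 2-D action vector
```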
- the term "policy" is often used in place of "agent"
- parameterized policies: the policy is modeled by a parameterized function; an optimization algorithm adjusts the parameters (usually denoted $\theta$) to change the policy's behavior:
  $$a_t = \mu_\theta(s_t) \qquad\text{or}\qquad a_t \sim \pi_\theta(\cdot|s_t)$$
- trajectories: state-action sequences $\tau = (s_0, a_0, s_1, a_1, \dots)$, where $s_{t+1} \sim P(\cdot|s_t, a_t)$
- reward: the feedback received at each decision step, $r_t = R(s_t, a_t, s_{t+1})$; often simplified to $r_t = R(s_t)$ or $r_t = R(s_t, a_t)$
- return: the cumulative long-term reward
  - finite-horizon undiscounted return: $R(\tau) = \sum_{t=0}^T r_t$
  - infinite-horizon discounted return: $R(\tau) = \sum_{t=0}^\infty \gamma^t r_t$. There are two main reasons for using the discount factor $\gamma$:
    - it guarantees that the return converges to a finite value
    - rewards received sooner count for more
  - Note: while the line between these two formulations of return is quite stark in the RL formalism, deep RL practice tends to blur it; for example, algorithms are often set up to optimize the undiscounted return, but discount factors are used in estimating value functions
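For a finite (truncated) trajectory, the discounted return can be computed with a backward fold that avoids recomputing powers of $\gamma$; the reward values here are made up:

```python
# Discounted return R(tau) = sum_t gamma^t * r_t over a finite reward list.
def discounted_return(rewards, gamma=0.99):
    ret = 0.0
    for r in reversed(rewards):   # fold from the back: ret = r_t + gamma * ret
        ret = r + gamma * ret
    return ret

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.99*0.0 + 0.99^2 * 2.0 = 2.9602
```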
- RL problem: the goal of RL is to adjust the policy so as to maximize the expected return. Given a policy $\pi$, the probability of a length-$T$ trajectory $\tau$ is
  $$P(\tau|\pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)\,\pi(a_t|s_t),$$
  the expected return of $\pi$ is $J(\pi) = \int_\tau P(\tau|\pi)\,R(\tau)$, and the RL optimization objective is $\pi^* = \arg\max_\pi J(\pi)$
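The trajectory-probability factorization can be checked on a tiny example. All numbers below (start distribution, policy, transition probabilities) are hypothetical:

```python
# P(tau|pi) = rho0(s0) * prod_t [ pi(a_t|s_t) * P(s_{t+1}|s_t, a_t) ]
# evaluated for one short trajectory in an assumed 2-state, 2-action MDP.
rho0 = {0: 1.0}                      # deterministic start state s_0 = 0
pi = {0: [0.5, 0.5], 1: [0.9, 0.1]}  # pi[s][a]: action probabilities per state
P = {(0, 0): {0: 0.8, 1: 0.2},       # P[(s, a)][s']: transition probabilities
     (0, 1): {0: 0.1, 1: 0.9},
     (1, 0): {0: 1.0},
     (1, 1): {1: 1.0}}

tau = [(0, 1), (1, 0)]               # (s_t, a_t) pairs; final state s_2 = 0
states = [s for s, _ in tau] + [0]   # state sequence including the final state
prob = rho0[states[0]]
for t, (s, a) in enumerate(tau):
    prob *= pi[s][a] * P[(s, a)][states[t + 1]]

print(prob)  # 1.0 * (0.5 * 0.9) * (0.9 * 1.0) = 0.405
```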
- value functions
  - on-policy value function: $V^\pi(s) = E_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
  - on-policy action-value function: $Q^\pi(s,a) = E_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
  - optimal value function: $V^*(s) = \max_\pi E_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
  - optimal action-value function: $Q^*(s,a) = \max_\pi E_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
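Since these are all expectations over trajectories, $V^\pi(s)$ can be estimated by simply averaging rollout returns. A sketch on a made-up 2-state MDP with a uniform-random policy (all dynamics and rewards hypothetical):

```python
# Monte Carlo estimate of V^pi(s) = E_{tau~pi}[R(tau) | s_0 = s] on a toy MDP.
import numpy as np

rng = np.random.default_rng(0)
# P[s][a] = (next_state, reward): deterministic assumed dynamics.
P = {0: {0: (0, 0.0), 1: (1, 1.0)},
     1: {0: (0, 0.0), 1: (1, 1.0)}}

def rollout_return(s, gamma=0.9, horizon=50):
    """Discounted return of one rollout from state s under a random policy."""
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.integers(2)       # uniform-random policy pi(a|s)
        s, r = P[s][a]
        ret += discount * r
        discount *= gamma
    return ret

V0 = np.mean([rollout_return(0) for _ in range(5000)])
print(V0)  # near 0.5 * (1 - 0.9^50) / (1 - 0.9) ~= 4.97
```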