DRL P0: Mathematical Notation (Basic Notation, Reinforcement Learning Notation, Reinforcement Learning Terminology)

yuri_yagn

Last modified 2024-07-16 16:39:39

Tags: Artificial Intelligence

First published 2024-07-16 16:38:44

Article link: https://blog.csdn.net/yuri_sospiro/article/details/140470430

Mathematical Notation

Basic Notation

$x$ scalar

$\boldsymbol{x}$ vector

$\mathbf{X}$ matrix

$\mathbb{R}$ the set of real numbers

$\frac{\mathrm{d}y}{\mathrm{d}x}$ derivative of $y$ with respect to $x$ (derivative with respect to a scalar)

$\frac{\partial y}{\partial x}$ partial derivative of $y$ with respect to $x$ (partial derivative with respect to a scalar)

$\nabla_{\boldsymbol{x}}y$ gradient of $y$ with respect to the vector $\boldsymbol{x}$

$\nabla_{\mathbf{X}}y$ matrix derivative of $y$ with respect to the matrix $\mathbf{X}$

$P(X)$ a probability distribution over a discrete variable

$p(X)$ a probability distribution over a continuous variable, or over a variable whose type has not been specified

$X\sim p$ the random variable $X$ has distribution $p$

$\mathbb{E}[X]$ expectation of a random variable

$\mathrm{Var}[X]$ variance of a random variable

$\mathrm{Cov}(X,Y)$ covariance of two random variables

$D_{\mathrm{KL}}(P\|Q)$ Kullback-Leibler (KL) divergence between the probability distributions $P$ and $Q$

$\mathcal{N}(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{\Sigma})$ multivariate Gaussian distribution over $\boldsymbol{x}$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$
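To make the probability notation concrete, here is a minimal Python sketch that evaluates $\mathbb{E}[X]$, $\mathrm{Var}[X]$ and $D_{\mathrm{KL}}(P\|Q)$ for a discrete variable. The function names and the two-outcome distributions `P` and `Q` are hypothetical, chosen only for illustration, and the KL divergence uses the natural logarithm.

```python
import math

def expectation(values, probs):
    """E[X] = sum_x P(x) * x for a discrete random variable."""
    return sum(p * x for x, p in zip(values, probs))

def variance(values, probs):
    """Var[X] = E[(X - E[X])^2]."""
    mean = expectation(values, probs)
    return sum(p * (x - mean) ** 2 for x, p in zip(values, probs))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), in nats.

    Terms with P(x) = 0 contribute 0 by convention; Q(x) must be
    positive wherever P(x) is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example: a variable with two outcomes, 0 and 1.
values = [0.0, 1.0]
P = [0.5, 0.5]
Q = [0.9, 0.1]

print(expectation(values, P))   # 0.5
print(variance(values, P))      # 0.25
print(kl_divergence(P, P))      # 0.0
print(kl_divergence(P, Q), kl_divergence(Q, P))  # two different positive values
```

The last line illustrates that the KL divergence is not symmetric: swapping $P$ and $Q$ generally changes the result.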

Reinforcement Learning Notation

$s,s^{\prime}$ states

$a$ action

$r$ reward

$R$ reward function

$\mathcal{S}$ set of all non-terminal states

$\mathcal{S}^{+}$ set of all states, including the terminal state

$\mathcal{A}$ set of actions

$\mathcal{R}$ set of all possible rewards

$\boldsymbol{P}$ transition matrix

$t$ discrete time step

$T$ final time step of an episode

$S_{t}$ state at time $t$

$A_t$ action at time $t$

$R_t$ reward at time $t$, typically a random quantity determined by $A_t$ and $S_{t}$

$G_t$ return following time $t$

$G_t^{(n)}$ $n$-step return following time $t$

$\pi$ policy (decision-making rule)

$\pi(s)$ action taken in state $s$ under the deterministic policy $\pi$

$\pi(a \mid s)$ probability of taking action $a$ in state $s$ under the stochastic policy $\pi$

$p(s^{\prime}, r \mid s, a)$ probability of transitioning to state $s^{\prime}$ with reward $r$, from state $s$ and action $a$

$p(s^{\prime} \mid s, a)$ probability of transitioning to state $s^{\prime}$, from state $s$ taking action $a$

$v_{\pi}(s)$ value of state $s$ under policy $\pi$ (expected return)

$v_{*}(s)$ value of state $s$ under the optimal policy

$q_{\pi}(s,a)$ value of taking action $a$ in state $s$ under policy $\pi$

$q_{*}(s,a)$ value of taking action $a$ in state $s$ under the optimal policy

$V,V_{t}$ estimates of the state-value function $v_{\pi}(s)$ or $v_{*}(s)$

$Q,Q_{t}$ estimates of the action-value function $q_{\pi}(s,a)$ or $q_{*}(s,a)$

$\tau$ trajectory, a sequence of states, actions, and rewards: $\tau = (S_0, A_0, R_0, S_1, A_1, R_1, \dots)$

$\gamma$ reward discount factor, $\gamma \in [0,1]$

$\epsilon$ probability of taking a random action under an $\epsilon$-greedy policy

$\alpha,\beta$ step-size parameters

$\lambda$ decay-rate parameter for eligibility traces
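Two of the symbols above, $\epsilon$ and $G_t^{(n)}$, are easier to read next to code. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the $n$-step return follows this article's indexing, in which $R_t$ is the reward for $(S_t, A_t)$, so that $G_t^{(n)} = R_t + \gamma R_{t+1} + \dots + \gamma^{n-1} R_{t+n-1} + \gamma^{n} V(S_{t+n})$. Other texts shift the reward index by one.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from the estimates Q(s, .) with an epsilon-greedy rule.

    With probability epsilon a uniformly random action is taken;
    otherwise the action with the largest estimated value is taken.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_t^(n), assuming rewards[k] is R_k for (S_k, A_k).

    values[k] is the estimate V(S_k). The bootstrap term
    gamma^n * V(S_{t+n}) is dropped when the episode ends at or
    before step t + n, because the terminal state has value 0.
    """
    T = len(rewards)                      # number of rewards in the episode
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:                         # S_{t+n} is non-terminal
        g += gamma ** n * values[t + n]
    return g

# Hypothetical three-step episode.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]

print(epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0))    # 1 (pure greedy choice)
print(n_step_return(rewards, values, t=0, n=2, gamma=0.9))  # about 1.243
print(n_step_return(rewards, values, t=0, n=3, gamma=0.9))  # about 2.62
```

With `n=3` the episode ends before a bootstrap state is reached, so the result is simply the discounted sum of the three rewards.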

Summary of Reinforcement Learning Terminology

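$R$ is the reward function: $R_t = R(S_t)$ is the reward of state $S_t$ in a Markov reward process (MRP), and $R_t = R(S_t, A_t)$ is the reward in a Markov decision process (MDP), where $S_t \in \mathcal{S}$.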

$R(\tau)$ is the $\gamma$-discounted return of the trajectory $\tau$: $R(\tau)=\sum_{t=0}^{\infty}\gamma^{t}R_t$
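As a small illustration of the discounted return, the sketch below evaluates $R(\tau)$ for a finite episode, treating all rewards after the last step as zero so that the infinite sum is truncated. The function names are hypothetical; `returns_to_go` additionally lists the return $G_t$ from every time step, assuming $G_t=\sum_{k \ge 0}\gamma^{k}R_{t+k}$.

```python
def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t * R_t over a finite list of rewards."""
    g = 0.0
    for r in reversed(rewards):   # backward recursion: G_t = R_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

def returns_to_go(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] using the same backward recursion."""
    out = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    out.reverse()
    return out

rewards = [1.0, 1.0, 1.0]                  # hypothetical episode
print(discounted_return(rewards, 0.5))     # 1.75
print(returns_to_go(rewards, 0.5))         # [1.75, 1.5, 1.0]
```

The backward loop avoids recomputing powers of $\gamma$ and makes the relation between $G_t$ and $G_{t+1}$ explicit.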

$p(\tau)$ is the probability of a trajectory:

$p(\tau)=\rho_0(S_0)\prod_{t=0}^{T-1}p(S_{t+1}\mid S_t)$ for a Markov process (MP) and an MRP, where $\rho_0(S_0)$ is the start-state distribution.

$p(\tau\mid\pi)=\rho_0(S_0)\prod_{t=0}^{T-1}p(S_{t+1}\mid S_t,A_t)\,\pi(A_t\mid S_t)$ for an MDP, where $\rho_0(S_0)$ is the start-state distribution.
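The MDP trajectory probability can be checked numerically on a tiny tabular example. The sketch below is illustrative only: the dictionary layout, the state and action names, and all probabilities are hypothetical and not taken from any particular environment.

```python
def trajectory_probability(states, actions, rho0, policy, transition):
    """p(tau | pi) = rho0(S_0) * prod_t p(S_{t+1} | S_t, A_t) * pi(A_t | S_t).

    states:     [S_0, ..., S_T]
    actions:    [A_0, ..., A_{T-1}]
    rho0:       dict state -> start-state probability
    policy:     dict state -> dict action -> probability
    transition: dict (state, action) -> dict next_state -> probability
    """
    prob = rho0.get(states[0], 0.0)
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= policy[s].get(a, 0.0) * transition[(s, a)].get(s_next, 0.0)
    return prob

# Hypothetical two-state MDP.
rho0 = {"s0": 1.0}
policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 1.0},
}
transition = {
    ("s0", "left"): {"s0": 1.0},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"): {"s0": 1.0},
}

# tau = (s0, right, s1, left, s0): 1.0 * (0.5 * 0.8) * (1.0 * 1.0)
p = trajectory_probability(["s0", "s1", "s0"], ["right", "left"],
                           rho0, policy, transition)
print(p)  # approximately 0.4
```

For an MP or MRP there is no policy term, so the same product reduces to the start-state probability times the transition probabilities.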