[Mathematical Theory of Reinforcement Learning: Basic Concepts]

小翔很开心

Last modified: 2023-04-25 12:57:49

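Column: Mathematical Principles of Reinforcement Learning (Shiyu Zhao, Westlake University). Tags: machine learning, artificial intelligence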

First published: 2023-04-25 12:54:58

Article link: https://blog.csdn.net/qq_26930625/article/details/130335016

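Study notes: basic concepts

Lecture notes from a course watched on Bilibili:
a basic overview of the principles of reinforcement learning, to be studied in more depth later.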

State

The status of the agent with respect to the environment.

State space

The set of all states, i.e., the state space.

Action

For each state, there is a set of actions the agent may take. In the example used here, each state has five possible actions: $a_1, a_2, a_3, a_4, a_5$.

Action space of a state

The set of all possible actions of a state.

State transition

When taking an action, the agent may move from one state to another. This move is called a state transition.
State transition defines how the agent interacts with the environment.

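Representations

Tabular representation
Intuitive, but limited: a table can only represent deterministic cases.
State transition probability
Uses conditional probabilities to describe the transition mathematically:
$p(s_2|s_1,a_2)=1 \\ p(s_i|s_1,a_2)=0 \quad \forall i \neq 2$
The example above is deterministic, but the same conditional-probability form can also describe stochastic transitions.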
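
As a minimal sketch of the conditional-probability form, the table below stores $p(s'|s,a)$ as a nested dictionary and samples a next state from it. The state and action names, the `sample_next_state` helper, and the stochastic entry for `a3` are illustrative assumptions, not values from the lecture.

```python
import random

# p(s' | s, a) stored as {(state, action): {next_state: probability}}.
# The entries are hypothetical and only illustrate the format.
transition_prob = {
    ("s1", "a2"): {"s2": 1.0},             # deterministic: always s1 -> s2
    ("s1", "a3"): {"s4": 0.5, "s1": 0.5},  # stochastic: assumed example
}


def sample_next_state(state, action, rng=random):
    """Draw s' according to p(s' | state, action)."""
    dist = transition_prob[(state, action)]
    next_states = list(dist.keys())
    probs = list(dist.values())
    return rng.choices(next_states, weights=probs, k=1)[0]


# Every conditional distribution must sum to 1.
for key, dist in transition_prob.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12, key
```

A deterministic transition is simply the special case where one next state carries probability 1, which is why this form is more general than a table of single successors.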

Policy

Tells the agent what actions to take at a state.
When the agent is at a given state, the policy specifies which action to take next.

Representations

Intuitive representation
A policy can be drawn as arrows, one per state.
Following the policy from a starting state yields a complete path.
Mathematical representation
A policy is expressed as a conditional probability.
For example, for state $s_1$:
A deterministic case:
$\pi (a_1|s_1)=0 \\ \pi (a_2|s_1)=1\\ \pi (a_3|s_1)=0\\ \pi (a_4|s_1)=0\\ \pi (a_5|s_1)=0$
A stochastic case:
$\pi (a_1|s_1)=0 \\ \pi (a_2|s_1)=0.5\\ \pi (a_3|s_1)=0.5\\ \pi (a_4|s_1)=0\\ \pi (a_5|s_1)=0$
[Note] How can the stochastic case be implemented in a program?
Generate a random number x uniformly in [0, 1). If x falls in [0, 0.5), take action $a_2$; if x falls in [0.5, 1), take action $a_3$.
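
The sampling rule for the stochastic case can be sketched as follows, assuming half-open intervals so that every x maps to exactly one action. The `policy` dictionary and the `sample_action` function are hypothetical names chosen for illustration.

```python
import random

# pi(a | s1) for the stochastic example: a2 and a3 each with probability 0.5.
policy = {"s1": {"a1": 0.0, "a2": 0.5, "a3": 0.5, "a4": 0.0, "a5": 0.0}}


def sample_action(state, x=None):
    """Pick an action by comparing a uniform x in [0, 1) with cumulative probabilities."""
    if x is None:
        x = random.random()  # uniform in [0, 1)
    cumulative = 0.0
    for action, prob in policy[state].items():
        cumulative += prob
        if x < cumulative:
            return action
    raise ValueError("policy probabilities must sum to 1")
```

Passing `x` explicitly makes the interval logic easy to check: `sample_action("s1", 0.25)` falls in [0, 0.5) and `sample_action("s1", 0.5)` falls in [0.5, 1).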

Reward

A real number the agent gets after taking an action.

A positive reward: encouragement.
A negative reward: punishment.
A zero reward: no punishment.
[Note] The sign is only a convention: a positive number can also be used to mean punishment.

The reward can be viewed as a human-machine interface:
through rewards, people guide the agent toward the behavior they want.

The reward depends on the current state and action, not on the next state.

Even when an event is deterministic, the reward transition can be stochastic:
a reward is obtained, but exactly how much is uncertain.
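
A stochastic reward can be written as a conditional distribution $p(r|s,a)$ that depends only on the current state and action. The sketch below uses made-up numbers and hypothetical helper names (`sample_reward`, `expected_reward`) to show a case where a reward is certain but its amount is not.

```python
import random

# p(r | s, a) as {(state, action): {reward: probability}}; values are made up.
reward_prob = {
    ("s1", "a2"): {1.0: 0.7, 2.0: 0.3},  # a reward is certain, its amount is not
    ("s1", "a1"): {-1.0: 1.0},           # deterministic punishment
}


def sample_reward(state, action, rng=random):
    """Draw r according to p(r | state, action); the next state plays no role."""
    dist = reward_prob[(state, action)]
    return rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]


def expected_reward(state, action):
    """Mean of the reward distribution for one state-action pair."""
    return sum(r * p for r, p in reward_prob[(state, action)].items())
```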

Trajectory and return

Trajectory

A state-action-reward chain.

Return

Defined for a trajectory: the sum of all rewards collected along it.

Purpose of the return:
to evaluate how good a policy is.
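
As an illustration of using the return to compare policies, the sketch below stores two trajectories as lists of (state, action, reward) steps and sums their rewards. The trajectories and the labels "policy A" and "policy B" are invented for this example.

```python
# A trajectory as a list of (state, action, reward) steps; values are made up.
trajectory_a = [("s1", "a2", 0), ("s2", "a3", 0), ("s5", "a3", 0), ("s8", "a2", 1)]
trajectory_b = [("s1", "a3", 0), ("s4", "a3", -1), ("s7", "a2", 0), ("s8", "a2", 1)]


def trajectory_return(trajectory):
    """Undiscounted return: the sum of all rewards along the trajectory."""
    return sum(reward for _state, _action, reward in trajectory)


returns = {
    "policy A": trajectory_return(trajectory_a),
    "policy B": trajectory_return(trajectory_b),
}
better = max(returns, key=returns.get)  # the policy whose trajectory has the larger return
```

Here the second trajectory passes through a step with reward -1, so its return is lower and the first policy is judged better.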

Discounted return

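When the return of a trajectory diverges (for example, on an infinitely long trajectory), a discount rate $\gamma \in [0,1)$ can be introduced:
$\text{discounted return} = \sum_{i=0}^{n} \gamma^{i} r_i$
where $r_i$ is the $i$-th reward along the trajectory, counting from $i=0$, and $n \to \infty$ for an infinitely long trajectory.
Purposes of the discounted return:

It makes an otherwise divergent return converge.
It balances near-future and far-future rewards:
$\gamma$ controls the policy that the agent learns.
A smaller $\gamma$ makes the agent more short-sighted: it emphasizes rewards in the near future.
A larger $\gamma$ makes the agent more far-sighted: it emphasizes long-term rewards.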
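
The effect of $\gamma$ can be checked numerically. This sketch assumes rewards are indexed from 0, so the first reward is not discounted; the reward sequence and the two values of $\gamma$ are arbitrary choices for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * r_i for i = 0, 1, ..., len(rewards) - 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return sum((gamma ** i) * r for i, r in enumerate(rewards))


# Hypothetical reward sequence: nothing early, a large reward late.
rewards = [0, 0, 0, 0, 10]
near_sighted = discounted_return(rewards, 0.1)  # late reward is almost ignored
far_sighted = discounted_return(rewards, 0.9)   # late reward still matters
```

With $\gamma = 0.1$ the late reward is scaled by $0.1^4$, while with $\gamma = 0.9$ it is scaled by $0.9^4$, so the same trajectory looks far more valuable to the far-sighted agent.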

Episode

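Also called a trial.
An episode is a trajectory that ends in a terminal state, i.e., a state at which the agent stops.

An episode usually has a finite number of steps. Tasks with episodes are called episodic tasks.

A task that has no terminal states is called a continuing task.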

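Episodic and continuing tasks are usually treated in a unified way, since an episodic task can be converted into a continuing one.
Two ways to convert an episodic task into a continuing task:

Treat the target state as a special absorbing state. Once the agent reaches the target state, it never leaves: the action space of that state is restricted so that no action moves the agent away. All rewards received afterwards are 0, i.e., $r = 0$.
Treat the target state as a normal state. While the agent stays at the target state, it keeps receiving $r = +1$. This approach does not treat the target specially, so it is more general.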
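
To see why the second option still gives a finite return, consider the assumed case where the agent stays at the target state forever and receives a reward of $+1$ at every step. With $\gamma \in [0,1)$, the discounted return is a geometric series:

$\text{discounted return} = 1 + \gamma + \gamma^2 + \dots = \sum_{i=0}^{\infty} \gamma^i = \frac{1}{1-\gamma}$

For example, $\gamma = 0.9$ gives a return of 10, whereas without discounting the sum would grow without bound.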

Markov decision process (MDP)

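The key elements of an MDP:

Sets:
State: the set of states $S$
Action: the set of actions $A(s)$ available at state $s \in S$
Reward: the set of rewards $R(s, a)$
Probability distribution:
State transition probability: the probability of moving to $s'$ after taking action $a$ at state $s$: $p(s' | s, a)$
Reward probability: the probability of receiving reward $r$ after taking action $a$ at state $s$: $p(r | s, a)$
Policy:
The probability of taking action $a$ at state $s$: $\pi (a|s)$
Markov property: the memoryless property, i.e., the next state and reward do not depend on the history.
$p(s_{t+1}|a_{t+1},s_t,\dots,a_1,s_0) = p(s_{t+1}|a_{t+1},s_t), \\ p(r_{t+1}|a_{t+1},s_t,\dots,a_1,s_0) = p(r_{t+1}|a_{t+1},s_t).$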
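
The elements listed above can be collected in one container. The `MDP` dataclass below is a hypothetical layout for illustration (not an API from the course), and the two-state example uses made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list      # S
    actions: dict     # A(s):        state -> list of actions
    transition: dict  # p(s' | s, a): (s, a) -> {s': prob}
    reward: dict      # p(r | s, a):  (s, a) -> {r: prob}
    policy: dict      # pi(a | s):    s -> {a: prob}

    def is_valid(self, tol=1e-9):
        """Check that every conditional distribution sums to 1."""
        dists = (
            list(self.transition.values())
            + list(self.reward.values())
            + list(self.policy.values())
        )
        return all(abs(sum(d.values()) - 1.0) < tol for d in dists)


# Tiny two-state example with made-up numbers.
toy = MDP(
    states=["s1", "s2"],
    actions={"s1": ["stay", "go"], "s2": ["stay"]},
    transition={
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"): {"s2": 1.0},
        ("s2", "stay"): {"s2": 1.0},
    },
    reward={
        ("s1", "stay"): {0: 1.0},
        ("s1", "go"): {1: 1.0},
        ("s2", "stay"): {1: 1.0},
    },
    policy={"s1": {"stay": 0.5, "go": 0.5}, "s2": {"stay": 1.0}},
)
```

Because every distribution is keyed only by the current state (and action), the structure itself encodes the memoryless property: nothing in it can refer to earlier steps.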

[Mnemonic: MDP]

M: Markov property,
D: Policy (the decision),
P: Sets + Probability distribution (the process).

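Markov process

An MDP contains everything that makes up the process: the sets and the probability distributions.
Once the policy of an MDP is fixed, the MDP reduces to a Markov process.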
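
One way to see this reduction: averaging the state transition probability over the policy gives a state-to-state transition probability, $p_\pi(s'|s) = \sum_a \pi(a|s)\, p(s'|s,a)$, which no longer involves actions. The sketch below computes it for a made-up two-state example; `induced_markov_chain` is a hypothetical helper name.

```python
def induced_markov_chain(states, transition, policy):
    """Return P_pi as {s: {s': prob}} with P_pi(s'|s) = sum_a pi(a|s) * p(s'|s,a)."""
    chain = {s: {s_next: 0.0 for s_next in states} for s in states}
    for s in states:
        for a, pi_a in policy[s].items():
            for s_next, p in transition[(s, a)].items():
                chain[s][s_next] += pi_a * p
    return chain


# Made-up example: from s1 the policy stays or goes with equal probability.
states = ["s1", "s2"]
transition = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s2": 1.0},
    ("s2", "stay"): {"s2": 1.0},
}
policy = {"s1": {"stay": 0.5, "go": 0.5}, "s2": {"stay": 1.0}}

P_pi = induced_markov_chain(states, transition, policy)
```

Each row of `P_pi` sums to 1, so it is an ordinary Markov chain over states: the decision part has been absorbed into the probabilities.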