Some Key Points
Machine learning = supervised learning + unsupervised learning + reinforcement learning
What makes RL different:
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non-i.i.d. data)
- Agent's actions affect the subsequent data it receives
Reward in RL
- A reward $R_t$ is a scalar feedback signal
- Indicates how well agent is doing at step t
- The agent's job is to maximise cumulative reward
- All goals can be described by the maximisation of expected cumulative reward

Sequential decision making
- Goal: select actions to maximise total future reward
- Actions may have long term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward
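To make the trade-off concrete, here is a minimal Python sketch: the discount factor γ and the two reward sequences are assumptions for illustration, not something defined above.

```python
# Compare a "grab reward now" sequence against a "sacrifice now, gain later"
# sequence under discounting (gamma < 1 shrinks the value of delayed rewards).

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^k * reward over the sequence, starting from step 0."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

greedy  = [1.0, 0.0, 0.0, 0.0]   # take a small reward immediately
patient = [0.0, 0.0, 0.0, 5.0]   # wait for a larger delayed reward

print(discounted_return(greedy))   # 1.0
print(discounted_return(patient))  # 0.9**3 * 5 = 3.645, so patience wins here
```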
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning. The agent should discover a good policy from its experiences of the environment without losing too much reward along the way.
- Exploration finds more information about the environment
- Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
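One common way to balance the two is ε-greedy action selection; below is a minimal sketch (the three-action value table `q` is made up for illustration).

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """With probability epsilon try a random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))          # explore
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])  # exploit

q = [0.2, 0.5, 0.1]          # current value estimates for 3 actions
action = epsilon_greedy(q)   # usually action 1, occasionally something else
```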
Elements of RL
References: Silver's slides and https://zhuanlan.zhihu.com/p/26608059
Agent
The protagonist of the story. It consists of three components:
Policy, Value Function, Model.
Policy:
The agent's behaviour guide: a mapping from states (s) to actions (a). It comes in two kinds, a deterministic policy and a stochastic policy. The former means that a given state determines a single action, $a = \pi(s)$; the latter means that in a given state each action has its own probability, $\pi(a|s) = P[A_t = a \mid S_t = s]$. Which kind to use depends on the actual problem; a small sketch of both follows below.
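A minimal sketch of the two kinds of policy (the toy states, actions, and probabilities are invented for illustration):

```python
import random

def deterministic_policy(state):
    """a = pi(s): each state maps to exactly one action."""
    return {"s0": "left", "s1": "right"}[state]

def stochastic_policy(state):
    """pi(a|s) = P[A_t = a | S_t = s]: sample from a state-conditioned distribution."""
    probs = {"s0": {"left": 0.8, "right": 0.2},
             "s1": {"left": 0.3, "right": 0.7}}[state]
    actions = list(probs)
    return random.choices(actions, weights=[probs[a] for a in actions])[0]
```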
Value Function:
The value function is a prediction of total future reward, used to evaluate the goodness/badness of states, and therefore to select between actions.
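For reference, the standard definition of the state-value function under a policy $\pi$, with discount factor $\gamma$ (as in Silver's slides):

$$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$$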
Model:
A model is the agent's own reconstruction of how the environment behaves, built from its reading of the environment's state. It can be used to predict what the environment will do next: for example, what the next state will be if I take a particular action, or what reward doing so will bring. Note, though, that in some settings there is no model at all.
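A minimal sketch of a tabular model, with invented transition and reward tables: the model answers "if I take this action in this state, what next state and reward do I predict?"

```python
# Toy model: what the agent *believes* the environment will do.
transition = {("s0", "right"): "s1", ("s1", "right"): "s2"}  # predicts next state
reward     = {("s0", "right"): 0.0, ("s1", "right"): 1.0}    # predicts next reward

def model(state, action):
    """Predicted (next state, reward) for taking `action` in `state`."""
    return transition[(state, action)], reward[(state, action)]

next_state, r = model("s0", "right")  # ("s1", 0.0)
```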
The problems an agent faces in sequential decision making can therefore be divided into two kinds: the Reinforcement Learning problem and the Planning problem. In the former there is no model of the environment, and the agent can only improve its policy step by step through interaction: the environment is initially unknown, the agent interacts with the environment, and the agent improves its policy.
In the latter, a model of the environment is already given, so the consequences of any action are determined; the agent only needs to compute with the model which actions are best in order to improve its policy. A model of the environment is known; the agent performs computations with its model (without any external interaction) and improves its policy. This is also known as deliberation, reasoning, introspection, pondering, thought, or search.
For example: in the Reinforcement Learning problem, the rules of the game are unknown; the agent acts via the joystick and reads its reward off the score. In the Planning problem, the rules of the game are known; the agent can query a simulator and find the optimal policy by planning ahead, e.g. by tree search (sketched below).
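A toy sketch of planning by depth-limited tree search over a known model (the tables, γ, and search depth are assumptions for illustration):

```python
# With a known model, the agent can evaluate actions by pure computation,
# without ever touching the real environment.

transition = {("s0", "right"): "s1", ("s1", "right"): "s2"}
reward     = {("s0", "right"): 0.0, ("s1", "right"): 1.0}

def plan(state, actions, gamma=0.9, depth=2):
    """Return (best value, best first action) by searching the model only."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = 0.0, None
    for a in actions:
        if (state, a) not in transition:
            continue                      # no defined transition in this toy model
        s_next = transition[(state, a)]
        r = reward[(state, a)]
        v, _ = plan(s_next, actions, gamma, depth - 1)
        if best_action is None or r + gamma * v > best_value:
            best_value, best_action = r + gamma * v, a
    return best_value, best_action

print(plan("s0", ["left", "right"]))  # (0.9, 'right'): looks two steps ahead
```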
Categories of agents:
By method, agents can be divided into Value Based, Policy Based, and Actor Critic. The first explores based on a value function, the second based on a policy, and the third combines the two. By whether they contain a model, agents can also be divided into Model Free and Model Based.
Environment
The scene where the story takes place. It comes in two kinds:
Fully Observable Environment: the agent can observe all of the environment's information, so agent state = environment state = information state. Formally, this is a Markov decision process (MDP).
Partially Observable Environment: the agent can observe only part of the environment's information, so agent state ≠ environment state. Formally, this is a partially observable Markov decision process (POMDP). The agent must construct its own state representation, for example:
- Complete history: $S_t^a = H_t$
- Beliefs of environment state: $S_t^a = (P[S_t^e = s^1], \ldots, P[S_t^e = s^n])$
- Recurrent neural network: $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$ (a numeric sketch follows below)
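A numeric sketch of the recurrent update $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$, with placeholder dimensions and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_obs = 4, 3                       # made-up sizes for the example
W_s = rng.normal(size=(d_state, d_state))   # recurrent weights (placeholder)
W_o = rng.normal(size=(d_obs, d_state))     # observation weights (placeholder)

def update_state(s_prev, obs):
    """S_t^a = sigma(S_{t-1}^a W_s + O_t W_o), sigma = elementwise sigmoid."""
    return 1.0 / (1.0 + np.exp(-(s_prev @ W_s + obs @ W_o)))

s = np.zeros(d_state)
for obs in rng.normal(size=(5, d_obs)):     # fold in a stream of 5 observations
    s = update_state(s, obs)
```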
State
States come in three kinds: Environment State, Agent State, and Information State, the last also called the Markov state.
Environment State:
All the information the environment uses to pick the next observation/reward; it is the information the true environment contains. The agent usually cannot see it, or cannot fully obtain it on its own. Even when the environment's information is entirely visible, it may contain a lot that is irrelevant.
Agent State:
All the information the agent uses to pick its next action, and the information our algorithms actually run on. My own reading is that it is the agent's interpretation and translation of the Environment State: it may be incomplete, but it is what we count on to make decisions.
Information State/Markov state:
Contains all the useful information in the history. This feels like an objective property rather than something parallel to the two states above; it is a property a state may have.
Its core idea is that "given the present, the past is useless for predicting the future": the current state already contains all the information useful for predicting the future, and once you have it, everything that came before can be thrown away.
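Formally (as in Silver's slides), a state $S_t$ is Markov if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$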
The environment state $S_t^e$ is Markov. The history $H_t$ is Markov.
Related to State is the History:
The history is the sequence of observations, actions, rewards: $H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$
It contains all the variables observable up to time t: observations, actions, and rewards. So what happens next, such as the agent's action or the environment's observation/reward, depends on the history.
State is then defined as a function of the history: $S_t = f(H_t)$. The two correspond because the state is also an observation and aggregation of the relevant information in the environment, and it is exactly this information that determines what happens next.
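A minimal sketch of a history and a state extractor $S_t = f(H_t)$; storing (observation, reward, action) tuples and using "latest observation" as f are simplifying assumptions for illustration:

```python
history = []  # H_t grows as the interaction unfolds

def record(observation, reward, action):
    """Append one (O, R, A) step to the history."""
    history.append((observation, reward, action))

def state_from_history(h):
    """One possible f(H_t): keep only the most recent observation."""
    return h[-1][0] if h else None

record("o1", 0.0, "a1")
record("o2", 1.0, None)            # no action taken yet at the latest step
print(state_from_history(history))  # "o2"
```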
What happens next depends on the history:
- The agent selects actions
- The environment selects observations/rewards
Observation
Action
Reward
A scalar measure of how good or bad things are going. The agent's ultimate goal is to maximise the cumulative reward over the whole process, so it often pays to look further ahead rather than grab a small immediate gain and lose a much larger one later. A reward $R_t$ is a scalar feedback signal. It indicates how well the agent is doing at step t. The agent's job is to maximise cumulative reward.