Some Key Points
Machine learning = supervised learning + unsupervised learning + reinforcement learning
What makes RL different:
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non-i.i.d. data)
- Agent's actions affect the subsequent data it receives
Reward in RL
- A reward $R_t$ is a scalar feedback signal
- Indicates how well agent is doing at step t
- The agent's job is to maximise cumulative reward
- All goals can be described by the maximisation of expected cumulative reward

Sequential decision making
- Goal: select actions to maximise total future reward
- Actions may have long term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward
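To make the trade-off concrete, here is a minimal Python sketch: the discount factor γ and the two reward sequences are assumptions for illustration, not something defined above.

```python
# Compare a "grab reward now" sequence against a "sacrifice now, gain later"
# sequence under discounting (gamma < 1 shrinks the value of delayed rewards).

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^k * reward over the sequence, starting from step 0."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

greedy  = [1.0, 0.0, 0.0, 0.0]   # take a small reward immediately
patient = [0.0, 0.0, 0.0, 5.0]   # wait for a larger delayed reward

print(discounted_return(greedy))   # 1.0
print(discounted_return(patient))  # 0.9**3 * 5 = 3.645, so patience wins here
```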
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning. The agent should discover a good policy from its experiences of the environment without losing too much reward along the way.
- Exploration finds more information about the environment
- Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
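One common way to balance the two is ε-greedy action selection; below is a minimal sketch (the three-action value table `q` is made up for illustration).

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """With probability epsilon try a random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))          # explore
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])  # exploit

q = [0.2, 0.5, 0.1]          # current value estimates for 3 actions
action = epsilon_greedy(q)   # usually action 1, occasionally something else
```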
Elements of RL
References: Silver's slides and https://zhuanlan.zhihu.com/p/26608059
Agent
The protagonist of the story. It consists of three components:
Policy, Value Function, Model.
Policy:
The agent's behaviour guide: a mapping from states (s) to actions (a). It comes in two kinds, a deterministic policy and a stochastic policy. The former means that a given state determines a single action, $a = \pi(s)$; the latter means that in a given state each action has its own probability, $\pi(a|s) = P[A_t = a \mid S_t = s]$. Which kind to use depends on the actual problem; a small sketch of both follows below.
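A minimal sketch of the two kinds of policy (the toy states, actions, and probabilities are invented for illustration):

```python
import random

def deterministic_policy(state):
    """a = pi(s): each state maps to exactly one action."""
    return {"s0": "left", "s1": "right"}[state]

def stochastic_policy(state):
    """pi(a|s) = P[A_t = a | S_t = s]: sample from a state-conditioned distribution."""
    probs = {"s0": {"left": 0.8, "right": 0.2},
             "s1": {"left": 0.3, "right": 0.7}}[state]
    actions = list(probs)
    return random.choices(actions, weights=[probs[a] for a in actions])[0]
```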
Value Function:
The value function is a prediction of total future reward, used to evaluate the goodness/badness of states, and therefore to select between actions.
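For reference, the standard definition of the state-value function under a policy $\pi$, with discount factor $\gamma$ (as in Silver's slides):

$$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$$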
Model:
A model is the agent's own reconstruction of how the environment behaves, built from its reading of the environment's state. It can be used to predict what the environment will do next: for example, what the next state will be if I take a particular action, or what reward doing so will bring. Note, though, that in some settings there is no model at all.
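A minimal sketch of a tabular model, with invented transition and reward tables: the model answers "if I take this action in this state, what next state and reward do I predict?"

```python
# Toy model: what the agent *believes* the environment will do.
transition = {("s0", "right"): "s1", ("s1", "right"): "s2"}  # predicts next state
reward     = {("s0", "right"): 0.0, ("s1", "right"): 1.0}    # predicts next reward

def model(state, action):
    """Predicted (next state, reward) for taking `action` in `state`."""
    return transition[(state, action)], reward[(state, action)]

next_state, r = model("s0", "right")  # ("s1", 0.0)
```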
The problems an agent faces in sequential decision making can therefore be divided into two kinds: the Reinforcement Learning problem and the Planning problem. In the former there is no model of the environment, and the agent can only improve its policy step by step through interaction: the environment is initially unknown, the agent interacts with the environment, and the agent improves its policy.
In the latter, a model of the environment is already given, so the consequences of any action are determined; the agent only needs to compute with the model which actions are best in order to improve its policy. A model of the environment is known; the agent performs computations with its model (without any external interaction) and improves its policy. This is also known as deliberation, reasoning, introspection, pondering, thought, or search.
For example: in the Reinforcement Learning problem, the rules of the game are unknown; the agent acts via the joystick and reads its reward off the score. In the Planning problem, the rules of the game are known; the agent can query a simulator and find the optimal policy by planning ahead, e.g. by tree search (sketched below).
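A toy sketch of planning by depth-limited tree search over a known model (the tables, γ, and search depth are assumptions for illustration):

```python
# With a known model, the agent can evaluate actions by pure computation,
# without ever touching the real environment.

transition = {("s0", "right"): "s1", ("s1", "right"): "s2"}
reward     = {("s0", "right"): 0.0, ("s1", "right"): 1.0}

def plan(state, actions, gamma=0.9, depth=2):
    """Return (best value, best first action) by searching the model only."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = 0.0, None
    for a in actions:
        if (state, a) not in transition:
            continue                      # no defined transition in this toy model
        s_next = transition[(state, a)]
        r = reward[(state, a)]
        v, _ = plan(s_next, actions, gamma, depth - 1)
        if best_action is None or r + gamma * v > best_value:
            best_value, best_action = r + gamma * v, a
    return best_value, best_action

print(plan("s0", ["left", "right"]))  # (0.9, 'right'): looks two steps ahead
```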
Categories of agents:
By method, agents can be divided into Value Based, Policy Based, and Actor Critic. The first explores based on a value function, the second based on a policy, and the third combines the two. By whether they contain a model, agents can also be divided into Model Free and Model Based.
Environment
The scene where the story takes place. It comes in two kinds:
Fully Observable Environment: the agent can observe all of the environment's information, so agent state = environment state = information state. Formally, this is a Markov decision process (MDP).
Partially Observable Environment: the agent can observe only part of the environment's information, so agent state ≠ environment state. Formally, this is a partially observable Markov decision process (POMDP). The agent must construct its own state representation, for example:
- Complete history: $S_t^a = H_t$
- Beliefs of environment state: $S_t^a = (P[S_t^e = s^1], \ldots, P[S_t^e = s^n])$
- Recurrent neural network: $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$ (a numeric sketch follows below)
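A numeric sketch of the recurrent update $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$, with placeholder dimensions and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_obs = 4, 3                       # made-up sizes for the example
W_s = rng.normal(size=(d_state, d_state))   # recurrent weights (placeholder)
W_o = rng.normal(size=(d_obs, d_state))     # observation weights (placeholder)

def update_state(s_prev, obs):
    """S_t^a = sigma(S_{t-1}^a W_s + O_t W_o), sigma = elementwise sigmoid."""
    return 1.0 / (1.0 + np.exp(-(s_prev @ W_s + obs @ W_o)))

s = np.zeros(d_state)
for obs in rng.normal(size=(5, d_obs)):     # fold in a stream of 5 observations
    s = update_state(s, obs)
```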
State
States come in three kinds: Environment State, Agent State, and Information State, the last also called the Markov state.
Environment State:
All the information the environment uses to pick the next observation/reward; it is the information the true environment contains. The agent usually cannot see it, or cannot fully obtain it on its own. Even when the environment's information is entirely visible, it may contain a lot that is irrelevant.
Agent State:
All the information the agent uses to pick its next action, and the information our algorithms actually run on. My own reading is that it is the agent's interpretation and translation of the Environment State: it may be incomplete, but it is what we count on to make decisions.
Information State/Markov state:
Contains all the useful information in the history. This feels like an objective property rather than something parallel to the two states above; it is a property a state may have.
Its core idea is that "given the present, the past is useless for predicting the future": the current state already contains all the information useful for predicting the future, and once you have it, everything that came before can be thrown away.
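Formally (as in Silver's slides), a state $S_t$ is Markov if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$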
The environment state $S_t^e$ is Markov. The history $H_t$ is Markov.
Related to State is the History:
The history is the sequence of observations, actions, rewards: $H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$
It contains all the variables observable up to time t: observations, actions, and rewards. So what happens next, such as the agent's action or the environment's observation/reward, depends on the history.
State is then defined as a function of the history: $S_t = f(H_t)$. The two correspond because the state is also an observation and aggregation of the relevant information in the environment, and it is exactly this information that determines what happens next.
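A minimal sketch of a history and a state extractor $S_t = f(H_t)$; storing (observation, reward, action) tuples and using "latest observation" as f are simplifying assumptions for illustration:

```python
history = []  # H_t grows as the interaction unfolds

def record(observation, reward, action):
    """Append one (O, R, A) step to the history."""
    history.append((observation, reward, action))

def state_from_history(h):
    """One possible f(H_t): keep only the most recent observation."""
    return h[-1][0] if h else None

record("o1", 0.0, "a1")
record("o2", 1.0, None)            # no action taken yet at the latest step
print(state_from_history(history))  # "o2"
```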
What happens next depends on the history:
- The agent selects actions
- The environment selects observations/rewards
Observation
Action
Reward
A scalar measure of how good or bad things are going. The agent's ultimate goal is to maximise the cumulative reward over the whole process, so it often pays to look further ahead rather than grab a small immediate gain and lose a much larger one later. A reward $R_t$ is a scalar feedback signal. It indicates how well the agent is doing at step t. The agent's job is to maximise cumulative reward.