Reinforcement Learning - Chapter 1 Introduction

Introduction

Reinforcement learning is an approach to goal-directed learning acquired directly from interaction with the environment.

1.1 Reinforcement Learning

Reinforcement learning has two main distinguishing features: trial-and-error search and delayed reward.
The term reinforcement learning refers to three things at once: a problem, the class of solution methods that work well on the problem, and the field that studies this problem and its solution methods.

Several reasons for using reinforcement learning:

  1. All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments
  2. Of all the forms of machine learning, reinforcement learning is the closest to the kind of learning that humans and other animals do, and many of the core algorithms of reinforcement learning were originally inspired by biological learning systems.
  3. Reinforcement learning research is certainly part of the swing back toward simpler and fewer general principles of artificial intelligence.

1.3 Elements of Reinforcement Learning

In a reinforcement learning system there are typically four main elements: a policy, a reward signal, a value function, and, optionally, a model of the environment (a small sketch of how these fit together follows the list below).

  1. policy: defines the learning agent’s way of behaving at a given time. Put simply, the policy decides which action to take in response to the current state of the environment (it acts as a rule of behavior). It may be stochastic, assigning a probability to each available action.
  2. reward signal: defines the goal of a reinforcement learning problem. At every step of interaction, the environment sends the agent a single number (the reward) for the action it took, and the agent's sole objective is to maximize the total reward it receives in the long run; the reward thus determines which actions the policy favors. The reward signal is the primary basis for altering the policy; if an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future.
  3. value function: the total amount of reward an agent can expect to accumulate over the future, i.e., the cumulative return expected from the current state until the end.
    In terms of importance, reward matters more than value: reward is what is actually received now, whereas value is a prediction. Reward is also easier to obtain than value, since reward comes directly from the environment while value must be estimated. In essence, the most important part of reinforcement learning is the estimation of values.
  4. model of the environment: something that mimics the behavior of the environment.
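
To see how the first three elements fit together, here is a minimal sketch of an agent-environment interaction loop. The toy `Environment` dynamics, the class and method names, and the epsilon-greedy action choice are all assumptions made purely for illustration; the optional model of the environment is omitted.

```python
import random

class Environment:
    """A toy two-state environment with stochastic rewards."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Reward signal: a single number handed back for the action taken.
        reward = random.gauss(1.0 if action == self.state else 0.0, 0.1)
        self.state = random.choice([0, 1])   # next state
        return self.state, reward

class Agent:
    def __init__(self, n_states=2, n_actions=2, alpha=0.1, epsilon=0.1):
        # Value estimates: expected reward for each state-action pair.
        self.values = [[0.0] * n_actions for _ in range(n_states)]
        self.alpha, self.epsilon = alpha, epsilon

    def policy(self, state):
        # Policy: decides which action to take in the current state
        # (epsilon-greedy: usually the highest-valued action, sometimes random).
        if random.random() < self.epsilon:
            return random.randrange(len(self.values[state]))
        return max(range(len(self.values[state])),
                   key=lambda a: self.values[state][a])

    def update(self, state, action, reward):
        # Nudge the value estimate toward the reward actually received.
        self.values[state][action] += self.alpha * (reward - self.values[state][action])

env, agent = Environment(), Agent()
state = env.state
for _ in range(1000):
    action = agent.policy(state)
    next_state, reward = env.step(action)
    agent.update(state, action, reward)
    state = next_state
```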

1.4 Limitations and Scope

state: the input to the policy and value function; in plain terms, whatever represents the current situation of the environment. In other words, think of the state as whatever information is available to the agent about its environment (the state may be produced by some preprocessing system).

Our focus is on how to estimate the value function, rather than on solving the problem by other means.
To solve the problem without estimating a value function, one can use evolutionary methods such as genetic algorithms.

Evolutionary methods carry the policies that obtain the most reward (with some random variation) over to the next generation and repeat the process, but they do not learn during their individual lifetimes. If the parameter space of the policies is sufficiently small, or good policies are easy to find, evolutionary methods can be used. In addition, evolutionary methods have advantages on problems in which the learning agent cannot sense the complete state of its environment.

Drawbacks of evolutionary methods (a toy sketch of such a method follows this list):

  1. they do not use the fact that the policy they are searching for is a function from states to actions (the search ignores that a policy maps states to actions)
  2. they do not notice which states an individual passes through during its lifetime, or which actions it selects (the per-state, per-action details of a lifetime are not exploited)
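
To make the contrast concrete, here is a minimal sketch of evolutionary policy search over a hypothetical toy problem. The environment dynamics, population size, and mutation scheme are all illustrative assumptions; the key point is that the only feedback used is each policy's total lifetime reward, never the individual states visited or actions taken.

```python
import random

N_STATES, N_ACTIONS = 5, 2

def lifetime_reward(policy, episodes=20, steps=30):
    """Total reward a fixed policy earns over its whole 'lifetime'.
    This single number is the only feedback the evolutionary search uses."""
    total = 0.0
    state = 0
    for _ in range(episodes):
        for _ in range(steps):
            # Toy dynamics: action 1 pays off in odd states, action 0 in even ones.
            total += 1.0 if policy[state] == state % 2 else 0.0
            state = random.randrange(N_STATES)
    return total

def evolve(pop_size=20, generations=50):
    # A policy is a plain lookup table: state index -> action.
    population = [[random.randrange(N_ACTIONS) for _ in range(N_STATES)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the policies that obtained the most reward ...
        population.sort(key=lifetime_reward, reverse=True)
        survivors = population[:pop_size // 2]
        # ... and carry random variations of them over to the next generation.
        children = []
        for parent in survivors:
            child = parent[:]
            child[random.randrange(N_STATES)] = random.randrange(N_ACTIONS)
            children.append(child)
        population = survivors + children
    return population[0]

best_policy = evolve()
```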

1.5 An Extended Example

At time $t$, suppose the agent's estimate from the value function is $V(S_t)$, and at time $t+1$ the estimated value is $V(S_{t+1})$. Then, once the value at time $t+1$ is available, the earlier estimate can be improved according to:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right]$$

In this update, $\alpha$ is a small positive fraction called the step-size parameter, which plays the role of a learning rate. This kind of update is called a temporal-difference learning method.
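
A minimal sketch of this update rule in code, assuming the value function is stored as a plain dictionary from states to estimated values (the state keys below are placeholders):

```python
def td_update(values, state, next_state, alpha=0.1):
    """Move V(S_t) a fraction alpha of the way toward V(S_{t+1})."""
    values[state] += alpha * (values[next_state] - values[state])

# Usage: `values` maps each state to its current estimated value.
values = {"s_t": 0.5, "s_t_plus_1": 1.0}
td_update(values, "s_t", "s_t_plus_1", alpha=0.1)  # values["s_t"] is now 0.55
```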

Suppose you are going to play the following game: tic-tac-toe (two players take turns placing "X" or "O" marks on a nine-square board; whoever first lines up three of the same mark in a row, column, or diagonal wins).
How, then, do we build an agent that maximizes the chance of winning?
Basic idea:

  1. First we set up a table in which each number is an estimate, starting from the corresponding state, of the probability of winning. This estimate (the number in the table) is the value of that state, and the whole table is the value function.
  2. State A has higher value than state B, or is considered “better” than state B, if the current estimate of the probability of our winning from A is higher than it is from B.
  3. Assuming we play "X": every state in which "X" has met the winning condition gets probability 1, and every state in which the opponent "O" has won gets the correct probability 0; we set the initial values of all the other states to 0.5, representing a guess that we have a 50% chance of winning.
  4. While learning, we consider each position where we could place a piece and look up the value of the resulting state in the table. Most of the time we act greedily, selecting the move that leads to the state with greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from among the other moves instead. These are called exploratory moves because they cause us to experience states that we might otherwise never see. (See the sketch after this list.)
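
Here is a minimal sketch of steps 3 and 4, assuming the board is encoded as a 9-character string of "X", "O", and spaces; the helper names and the exploration rate are illustrative choices, not the book's code.

```python
import random

values = {}  # board state (9-char string) -> estimated probability that X wins

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def wins(board, mark):
    """True if `mark` has three in a row, column, or diagonal."""
    return any(all(board[i] == mark for i in line) for line in WIN_LINES)

def value(board):
    # Step 3: 1 if X has already won, 0 if O has won, 0.5 (a guess) otherwise.
    if board not in values:
        values[board] = 1.0 if wins(board, "X") else 0.0 if wins(board, "O") else 0.5
    return values[board]

def choose_move(board, epsilon=0.1):
    # Step 4: look up the value of every state we could move to.
    empties = [i for i, c in enumerate(board) if c == " "]
    candidates = [(board[:i] + "X" + board[i + 1:], i) for i in empties]
    if random.random() < epsilon:
        return random.choice(candidates)                 # exploratory move
    return max(candidates, key=lambda c: value(c[0]))    # greedy move

next_board, square = choose_move("XOX O    ")
```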