RL (Chapter 1): The Reinforcement Learning Problem

These are my notes on reinforcement learning, based mainly on the following references:

There are also two public courses that are reportedly quite good, which I have not watched yet:

The Gym Library

  • The most common tool for hands-on reinforcement learning practice today is the gym library released by OpenAI.
  • A major feature of gym is visualization: it can render an agent's interaction with its environment as an animation, as in the sketch below.
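
Below is a minimal random-agent loop as a concrete starting point. This is only a sketch: it assumes the classic gym API (pre-0.26, where `step()` returns a 4-tuple) and the built-in CartPole-v1 environment; newer gym and gymnasium releases changed the `reset()` and `step()` signatures.

```python
import gym

# Minimal random-agent loop, assuming the classic (pre-0.26) gym API.
env = gym.make("CartPole-v1")
obs = env.reset()
for _ in range(200):
    env.render()                         # draw the current frame as an animation
    action = env.action_space.sample()   # random policy: sample a legal action
    obs, reward, done, info = env.step(action)
    if done:                             # episode over: pole fell or time limit hit
        obs = env.reset()
env.close()
```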

Features of RL

Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal.

  • trial-and-error search: The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
    • One of the challenges that arise in reinforcement learning is the trade-off between exploration and exploitation. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future (a minimal illustration follows this list).
  • delayed reward: Actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.
    • A further difficulty of reinforcement learning is that not every action produces a reward.
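
To make the exploration-exploitation trade-off concrete, here is a small ε-greedy sketch on a 3-armed bandit. The reward means and the value of ε are made-up numbers for illustration, not taken from the book:

```python
import numpy as np

# Epsilon-greedy on a 3-armed bandit with hypothetical reward means.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # unknown to the agent
q_est = np.zeros(3)                      # estimated action values
counts = np.zeros(3)
epsilon = 0.1

for t in range(1000):
    if rng.random() < epsilon:
        a = rng.integers(3)              # explore: try a random arm
    else:
        a = int(np.argmax(q_est))        # exploit: best arm so far
    reward = rng.normal(true_means[a], 1.0)
    counts[a] += 1
    q_est[a] += (reward - q_est[a]) / counts[a]  # incremental sample average

print(q_est)  # estimates approach true_means as exploration continues
```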

Markov decision processes

  • We formalize the problem of reinforcement learning using ideas from dynamical systems theory, specifically, as the optimal control of incompletely-known Markov decision processes.
  • The basic idea of this formalization is simply to capture the most important aspects of the real problem facing a learning agent interacting over time with its environment to achieve a goal: sensation, action, and goal.
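
As a forward pointer only (the formal treatment comes in Chapter 3 of Sutton and Barto), a finite MDP is summarized by its one-step dynamics function, written in the book's notation as:

$$
p(s', r \mid s, a) \doteq \Pr\{S_t = s',\, R_t = r \mid S_{t-1} = s,\, A_{t-1} = a\}
$$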

Elements of RL

  • Agent & Environment
    • The learner and decision maker is called the agent.
    • The thing it interacts with, comprising everything outside the agent, is called the environment.
  • Policy
    • Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.
    • In general, policies may be stochastic.
  • State
    • We can think of the state as whatever information is available to the agent about its environment.
  • Reward signal
    • A reward signal defines the goal of a reinforcement learning problem. On each time step, the environment sends to the reinforcement learning agent a single number called the reward. The agent’s sole objective is to maximize the total reward it receives over the long run.
    • In general, reward signals may be stochastic functions of the state of the environment and the actions taken.
  • Value function
    • Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run.
    • Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state (a formal statement is sketched after this list).
    • Action choices are made based on value judgments. We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run. Unfortunately, it is much harder to determine values than it is to determine rewards. In fact, the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values.
  • Model of the environment (Optional)
    • This is something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave.
    • Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.
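
To pin down "total amount of reward over the future", the standard discounted formulation from Sutton and Barto is reproduced below; note that the discount rate γ belongs to the later formal setup rather than to this chapter:

$$
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad
v_\pi(s) \doteq \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right]
$$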

Categories of RL Methods

Model-based methods

  • Methods that use models and planning

Model-free methods (trial-and-error learners)

  • Policy-based (policy gradient, …): Given the environment state, the method outputs a probability for each possible action and acts by sampling from that distribution. Every action can therefore be selected, just with different probabilities.
  • Value-based (DQN, Q-learning, Sarsa, …): Given the environment state, the method outputs a value for every action and picks the action with the highest value. If the action space is continuous, however, value-based methods break down, while policy-based methods can still choose actions from a probability distribution.
  • Actor-Critic: The actor chooses actions according to a probability distribution, and the critic evaluates each chosen action by assigning it a value. A minimal sketch contrasting the two selection rules follows this list.
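
The sketch below contrasts the two selection rules; the Q-values and action probabilities are hypothetical numbers, not the output of any trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

# Value-based selection: the method estimates a value per action
# (hypothetical numbers here) and acts greedily on the maximum.
q_values = np.array([0.1, 0.5, 0.2, 0.2])
greedy_action = int(np.argmax(q_values))

# Policy-based selection: the method outputs a probability per action
# (hypothetical numbers here) and samples, so any action can be chosen.
action_probs = np.array([0.1, 0.6, 0.2, 0.1])
sampled_action = int(rng.choice(n_actions, p=action_probs))

print(greedy_action, sampled_action)
```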
  • 2
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值