Chapter 1 – 5 The RL Framework: The Problem

1.5.2 The Setting, Revisited

Specifically, you’ll learn how to take a real-world problem and formulate it so it can be solved through reinforcement learning.

Think of a puppy as a reinforcement learning agent: an agent that learns from trial and error how to behave in an environment in order to maximize reward. In particular, the RL framework is characterized by an agent learning to interact with its environment. We assume that time evolves in discrete time steps. At the initial time step, the agent observes the environment. You can think of this observation as a situation that the environment presents to the agent. Then, it must select an appropriate action in response. At the next time step, in response to the agent’s action, the environment presents a new situation to the agent. At the same time, the environment gives the agent a reward, which provides some indication of whether the agent has responded appropriately to the environment. The process then continues: at each time step, the environment sends the agent an observation and a reward, and in response, the agent must choose an action. In general, we don’t need to assume that the environment shows the agent everything it needs to make well-informed decisions.
We’ll make the assumption that the agent is able to fully observe whatever state the environment is in. So instead of referring to the agent as receiving an observation, we’ll say it receives the environment’s state.

The agent first receives the environment’s state, which we denote by S0. Then, based on that observation, the agent chooses an action A0. At the next time step, as a direct consequence of the agent’s choice of action A0 and the environment’s previous state S0, the environment transitions to a new state S1 and gives some reward R1 to the agent.

As the agent interacts with the environment, this interaction manifests as a sequence of states, actions, and rewards. The reward will always be the most relevant quantity to the agent.
To be specific, every agent has the goal of maximizing expected cumulative reward, or the sum of the rewards attained over all time steps. In other words, it seeks to find the strategy for choosing actions such that the cumulative reward is likely to be quite high. The agent can only accomplish this by interacting with the environment, because at every time step the environment decides how much reward the agent receives. In other words, the agent must play by the rules of the environment. But through interaction, the agent can learn those rules and choose appropriate actions to accomplish its goal.
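To make the interaction loop concrete, here is a minimal Python sketch of the state–action–reward cycle described above. The `Environment` and `Agent` classes and their methods (`reset`, `step`, `select_action`) are hypothetical placeholders rather than any particular library’s API, and the values they return are dummies.

```python
# A minimal sketch of the agent-environment interaction loop.
# Environment and Agent are hypothetical stand-ins used only to
# illustrate the state -> action -> (reward, next state) cycle.

class Environment:
    def reset(self):
        """Return the initial state S0."""
        return 0

    def step(self, action):
        """Return (next_state, reward, done) in response to the action."""
        return 0, 1.0, False


class Agent:
    def select_action(self, state):
        """Choose an action based on the current state."""
        return 0


env, agent = Environment(), Agent()
state = env.reset()                         # the agent observes S0
cumulative_reward = 0.0

for t in range(100):                        # discrete time steps t = 0, 1, 2, ...
    action = agent.select_action(state)     # the agent chooses A_t
    state, reward, done = env.step(action)  # the environment returns S_{t+1}, R_{t+1}
    cumulative_reward += reward             # the agent's goal: maximize this sum
    if done:                                # in an episodic task, stop at a terminal state
        break
```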

1.5.3 Episodic vs. Continuing Tasks

Many of the real-world situations we’ll consider will have a well-defined ending point. For instance, we might be running a simulation to teach a car to drive. Then, the interaction ends if the car crashes.
 Of course, not all reinforcement learning tasks have a well-defined ending point but those that do are called episodic tasks. We’ll refer to a complete sequence of interaction from start to finish as an episode.
When the episode ends, the agent looks at the total amount of reward it received to figure out how well it did. It’s then able to start from scratch, as if it has been completely reborn into the same environment, but now with the added knowledge of what happened in its past life. In this way, as time passes over its many lives, the agent makes better and better decisions.
 Once your agents have spent enough time getting to know the environment, they should be able to pick a strategy where the cumulative reward is quite high. So episodic tasks are tasks with a well-defined ending point.
We’ll also look at tasks that go on forever, without end. Those are called continuing tasks. In this case, the agent lives forever, so it has to learn the best way to choose actions while simultaneously interacting with the environment.

1.5.4 Quiz: Test Your Intuition

Playing a game of chess is an episodic task: the agent receives a reward only at the very end of the game, based on whether it wins or loses, and no reward at the intermediate moves.

 When the reward signal is largely uninformative in this way, we say that the task suffers the problem of sparse rewards.

1.5.5 Quiz: Episodic or Continuing?

 Remember:

  1. A task is an instance of the reinforcement learning (RL) problem.
  2. Continuing tasks are tasks that continue forever, without end.
  3. Episodic tasks are tasks with a well-defined starting and ending point.
    1. In this case, we refer to a complete sequence of interaction, from start to finish, as an episode.
    2. Episodic tasks come to an end whenever the agent reaches a terminal state.

1.5.6 The Reward Hypothesis

It’s important to note that the word “reinforcement” in “reinforcement learning” is a term that originally comes from behavioral science. It refers to a stimulus that’s delivered immediately after a behavior to make the behavior more likely to occur in the future. The fact that this name is borrowed is no coincidence.

Reward Hypothesis: In fact, it’s an important defining hypothesis in reinforcement learning that we can always formulate an agent’s goal along the lines of maximizing expected cumulative reward. (All goals can be framed as the maximization of expected cumulative reward.)

1.5.8 Goals and Rewards, Part 2

So far, we’ve been trying to frame the idea of a humanoid learning to walk in the context of reinforcement learning. We’ve detailed the states and actions, and we still need to specify the rewards.
 And the reward structure from the DeepMind paper is surprisingly intuitive. Each term communicates to the agent some part of what we’d like it to accomplish.

The DeepMind humanoid training model uses the following reward function:

r = \min(v_x, v_{max}) - 0.005(v_y^2 + v_z^2) - 0.05y^2 - 0.02
They frame the problem as an episodic task where, if the humanoid falls, the episode is terminated.
These are four somewhat competing requirements that the agent has to balance across all time steps, toward its goal of maximizing expected cumulative reward.
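As a concrete illustration, here is a small Python sketch of that reward term. The variable names (`v_x`, `v_y`, `v_z`, `y`, `v_max`) simply follow the formula above, the default value of `v_max` is made up, and the interpretation in the comments is a reading of each term rather than DeepMind’s own description.

```python
def walker_reward(v_x, v_y, v_z, y, v_max=10.0):
    """Reward term from the formula above:
    r = min(v_x, v_max) - 0.005*(v_y**2 + v_z**2) - 0.05*y**2 - 0.02

    - min(v_x, v_max): rewards forward velocity, capped at v_max
    - 0.005*(v_y**2 + v_z**2): penalizes sideways and vertical motion
    - 0.05*y**2: penalizes drifting away laterally
    - 0.02: a small constant cost subtracted at every time step
    """
    return min(v_x, v_max) - 0.005 * (v_y**2 + v_z**2) - 0.05 * y**2 - 0.02


# Example: walking forward at 2 m/s with slight sideways drift
print(walker_reward(v_x=2.0, v_y=0.1, v_z=0.0, y=0.05))
```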

1.5.10 Cumulative Reward

The reinforcement learning framework gives us a way to study how an agent can learn to accomplish a goal from interacting with its environment. This framework works for many real-world applications and simplifies the interaction into three signals that are passed between agent and environment. The state signal is the environment’s way of presenting a situation to the agent. The agent then responds with an action, which influences the environment. And the environment responds with a reward, which gives some indication of whether the agent has responded appropriately to the environment. Also built into the framework is the agent’s goal, which is to maximize cumulative reward.

Actions have short- and long-term consequences, and the agent needs to gain some understanding of the complex effects its actions have on the environment.

 How exactly does it keep all time steps in mind?

 It’s important to note that the rewards for all previous time steps have already been decided as they’re in the past. Only future rewards are inside the agent’s control.
We refer to the sum of rewards from the next time step onward as the return and denote it with a capital G. At an arbitrary time step, the agent will always choose an action with the goal of maximizing the return. But it’s actually more accurate to say that the agent seeks to maximize expected return.
G_t := R_{t+1} + R_{t+2} + R_{t+3} + \ldots

 This is because it’s generally the case that the agent can’t predict with complete certainty what the future reward is likely to be. So it has to rely on a prediction or an estimate.

1.5.11 Discounted Return

 Should present reward carry the same weight as future reward?

 Maybe it makes more sense to value rewards that come sooner more highly, since those rewards are more predictable.

We’ll maximize a different sum, in which rewards that are farther along in time are multiplied by smaller values. We refer to this sum as the discounted return.
G_t := R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots
 By discounted, we mean that we’ll change the goal to care more about immediate rewards rather than rewards that are received further in the future.
We define a discount rate \gamma, which must satisfy 0 \leq \gamma \leq 1.
 It’s important to note that this gamma is not something that’s learned by the agent. It’s something that you set to refine the goal that you have for the agent.

The larger you make gamma, the more the agent cares about the distant future. And as gamma gets smaller and smaller, we get increasingly extreme discounting.

 So we use discounting to avoid having to look too far into the limitless future.
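As a quick worked example, the sketch below computes the discounted return G_t from a list of future rewards for a few values of gamma; the reward values themselves are made up purely for illustration. With gamma = 1 the sum reduces to the undiscounted return, and with gamma = 0 only the most immediate reward counts.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    given the sequence of future rewards [R_{t+1}, R_{t+2}, ...]."""
    return sum(gamma**k * r for k, r in enumerate(rewards))


future_rewards = [1.0, 1.0, 1.0, 1.0, 1.0]   # made-up rewards, for illustration only

print(discounted_return(future_rewards, gamma=1.0))   # 5.0  -> undiscounted return
print(discounted_return(future_rewards, gamma=0.9))   # ~4.1 -> later rewards count less
print(discounted_return(future_rewards, gamma=0.0))   # 1.0  -> only the immediate reward
```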

1.5.13 MDPs, Part 1

 For context, we’ll work with the example of a recycling robot from the Sutton textbook.
[Figure: the recycling robot’s states, actions, and rewards, from the Sutton textbook.]

1.5.14 MDPs, Part 2

[Figure: the one-step dynamics of the recycling robot’s environment.]

1.5.15 Quiz: One-Step Dynamics, Part 1

Say that, at an arbitrary time step t, the state of the robot’s battery is high (S_t = high). Then, in response, the agent decides to search (A_t = search). You learned in the previous concept that in this case, the environment responds to the agent by flipping a theoretical coin with 70% probability of landing heads.

If the coin lands heads, the environment decides that the next state is high (S_{t+1} = high), and the reward is 4 (R_{t+1} = 4).

If the coin lands tails, the environment decides that the next state is low (S_{t+1} = low), and the reward is 4 (R_{t+1} = 4).
 This is depicted in the figure below.
[Figure: the environment’s possible responses to state high and action search.]

1.5.16 Quiz: One-Step Dynamics, Part 2

At an arbitrary time step t, the agent-environment interaction has evolved as a sequence of states, actions, and rewards (S_0, A_0, R_1, S_1, A_1, \ldots, R_{t-1}, S_{t-1}, A_{t-1}, R_t, S_t, A_t).

When the environment responds to the agent at time step t+1, it considers only the state and action at the previous time step (S_t, A_t).

In particular, it does not care what state was presented to the agent more than one step prior. (In other words, the environment does not consider any of \{S_0, \ldots, S_{t-1}\}.)

And it does not look at the actions that the agent took prior to the last one. (In other words, the environment does not consider any of \{A_0, \ldots, A_{t-1}\}.)

Furthermore, how well the agent is doing, or how much reward it is collecting, has no effect on how the environment chooses to respond to the agent. (In other words, the environment does not consider any of \{R_0, \ldots, R_t\}.)

Because of this, we can completely define how the environment decides the state and reward by specifying
p(s',r|s,a) \doteq \mathbb{P}(S_{t+1}=s', R_{t+1}=r|S_t=s, A_t=a)
for each possible s', r, s, and a. These conditional probabilities are said to specify the one-step dynamics of the environment.
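To make this concrete, the sketch below encodes the one-step dynamics from the quiz above as a Python dictionary mapping each (state, action) pair to a list of (probability, next state, reward) outcomes, together with a tiny sampler. Only the (high, search) entry given in this section is filled in; a full specification would list every state–action pair.

```python
import random

# One-step dynamics p(s', r | s, a) as a dictionary:
# (state, action) -> list of (probability, next_state, reward).
dynamics = {
    ("high", "search"): [
        (0.7, "high", 4),   # coin lands heads: next state high, reward 4
        (0.3, "low", 4),    # coin lands tails: next state low, reward 4
    ],
}


def step(state, action):
    """Sample (next_state, reward) according to p(s', r | s, a)."""
    outcomes = dynamics[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs)[0]
    return next_state, reward


print(step("high", "search"))  # e.g. ('high', 4) about 70% of the time
```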

1.5.17 MDPs, Part 3

So, formally, a Markov decision process (MDP) is defined by the set of states, the set of actions, and the set of rewards, along with the one-step dynamics of the environment and the discount rate.
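As a minimal illustration, the sketch below bundles those five ingredients into a single Python object. The class and field names are ours; the recycling-robot values fill in only the fragment given in this section, with the action names (search, wait, recharge) taken from the Sutton textbook example and gamma chosen arbitrarily.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    """A (finite) Markov decision process: states, actions, rewards,
    one-step dynamics p(s', r | s, a), and the discount rate gamma."""
    states: set
    actions: set
    rewards: set
    dynamics: dict   # (state, action) -> list of (probability, next_state, reward)
    gamma: float     # discount rate, 0 <= gamma <= 1


# A fragment of the recycling robot MDP, using only the dynamics given above.
recycling_robot = MDP(
    states={"high", "low"},
    actions={"search", "wait", "recharge"},
    rewards={4},
    dynamics={("high", "search"): [(0.7, "high", 4), (0.3, "low", 4)]},
    gamma=0.9,       # an example value; gamma is something you choose
)
```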

1.5.19 Summary


The Setting, Revisited
 The reinforcement learning (RL) framework is characterized by an agent learning to interact with its environment.

 At each time step, the agent receives the environment’s state (the environment presents a situation to the agent), and the agent must choose an appropriate action in response. One time step later, the agent receives a reward (the environment indicates whether the agent has responded appropriately to the state) and a new state.

 All agents have the goal to maximize expected cumulative reward, or the expected sum of rewards attained over all time steps.

Episodic vs. Continuing Tasks

  • A task is an instance of the reinforcement learning (RL) problem.
  • Continuing tasks are tasks that continue forever, without end.
  • Episodic tasks are tasks with a well-defined starting and ending point.
    • In this case, we refer to a complete sequence of interaction, from start to finish, as an episode.
    • Episodic tasks come to an end whenever the agent reaches a terminal state.

The Reward Hypothesis

  • Reward Hypothesis: All goals can be framed as the maximization of (expected) cumulative reward.

Goals and Rewards

Cumulative Reward

  • The return at time step t is G_t := R_{t+1} + R_{t+2} + R_{t+3} + \ldots
  • The agent selects actions with the goal of maximizing expected (discounted) return.

Discounted Return

  • The discounted return at time step t is G_t := R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots
  • The discount rate \gamma is something that you set, to refine the goal that you have for the agent.
    • It must satisfy 0 \leq \gamma \leq 1.
    • If \gamma = 0, the agent only cares about the most immediate reward.
    • If \gamma = 1, the return is not discounted.
    • For larger values of \gamma, the agent cares more about the distant future. Smaller values of \gamma result in more extreme discounting, where, in the most extreme case, the agent only cares about the most immediate reward.

MDPs and One-Step Dynamics

  • The state space \mathcal{S} is the set of all (nonterminal) states.
  • In episodic tasks, we use \mathcal{S}^+ to refer to the set of all states, including terminal states.
  • The action space \mathcal{A} is the set of possible actions. (Alternatively, \mathcal{A}(s) refers to the set of possible actions available in state s \in \mathcal{S}.)
  • The one-step dynamics of the environment determine how the environment decides the state and reward at every time step. The dynamics can be defined by specifying p(s',r|s,a) \doteq \mathbb{P}(S_{t+1}=s', R_{t+1}=r|S_t=s, A_t=a) for each possible s', r, s, and a.

A (finite) Markov Decision Process (MDP) is defined by:

  • a (finite) set of states \mathcal{S} (or \mathcal{S}^+, in the case of an episodic task)
  • a (finite) set of actions \mathcal{A}
  • a set of rewards \mathcal{R}
  • the one-step dynamics of the environment
  • the discount rate \gamma \in [0,1]