Chapter 1 - 6: The RL Framework: The Solution

1.6.2 Policies

  We've seen that a Markov decision process (MDP) gives a formal definition of the problem we'd like to solve with reinforcement learning.
 The solution is a series of actions that the agent needs to learn in pursuit of a goal.
  Reward is always decided in the context of the state in which the action was taken, along with the state that follows. So it's enough for the agent to learn an appropriate action response to any environment state it can observe; such a rule for choosing actions is called a policy. The simplest kind of policy is a mapping from the set of environment states to the set of possible actions. We call this kind of policy a deterministic policy.
  Another type of policy that we'll examine is a stochastic policy. A stochastic policy allows the agent to choose actions randomly. We define a stochastic policy as a mapping that accepts an environment state S and an action A, and returns the probability that the agent takes action A while in state S.
 It’s important to note that any deterministic policy can be expressed using the same notation that we generally reserve for a stochastic policy.
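 To make the distinction concrete, here is a minimal sketch in Python; the state names, actions, and probabilities are illustrative placeholders rather than anything from the lecture:

```python
import random

# A deterministic policy: exactly one action per state (illustrative states).
deterministic_policy = {
    "s1": "up",
    "s2": "right",
    "s3": "down",
}

# A stochastic policy: pi(a|s), a probability for each action in each state.
stochastic_policy = {
    "s1": {"up": 0.7, "right": 0.3},
    "s2": {"up": 0.1, "right": 0.9},
    "s3": {"down": 1.0},  # a deterministic policy is a special case
}

def sample_action(policy, state):
    """Draw an action from a stochastic policy pi(.|state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])              # -> "up"
print(sample_action(stochastic_policy, "s1"))  # -> "up" (prob 0.7) or "right" (prob 0.3)
```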

1.6.3 Quiz: Interpret the Policy

 A policy determines how an agent chooses an action in response to the current state. In other words, it specifies how the agent responds to situations that the environment has presented.

1.6.4 Gridworld Example


1.6.5 State-Value Functions

  After all, if the agent starts at the goal, the episode ends immediately and no reward is received.
  Let's attach a bit of notation and terminology to the process we just followed. You can think of this grid of numbers as a function of the environment state: for each state, it has a corresponding number. We refer to this function as the state-value function. For each state, it yields the return that's likely to follow if the agent starts in that state and then follows the policy for all time steps. It's more common, though, to see this expressed equivalently with a bit more notation.
 The state-value function for a policy π is a function of the environment state. For each state s, it tells us the expected discounted return if the agent starts in state s and then uses the policy to choose its actions for all time steps. The state-value function always corresponds to a particular policy, so if we change the policy, we change the state-value function.
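 As a rough sketch, the return from a state is just the (discounted) sum of the rewards collected from that point on, and for a small finite MDP the state-value function can be stored as a table with one number per state; the rewards, discount factor, and values below are illustrative only:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards collected along one hypothetical episode that starts in some state s.
episode_rewards = [-1, -1, -1, 5]
print(discounted_return(episode_rewards, gamma=1.0))  # 2: the undiscounted return

# A state-value function v_pi for a tiny MDP, stored as a table (made-up numbers).
v_pi = {"s1": 1.0, "s2": 2.0, "goal": 0.0}  # the goal has value 0: the episode ends there
```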

1.6.6 Bellman Equations

 We can express the value of any state as the sum of the immediate reward and the value of the state that follows.
 Strictly speaking, we need to use the discounted value of the state that follows. We can express this idea with what's known as the Bellman expectation equation, where for a general MDP we have to calculate the expected value of the sum. This is because, in general, in more complicated worlds, the immediate reward and next state cannot be known with certainty.
 We can express the value of any state in the MDP in terms of the immediate reward and the discounted value of the state that follows.

In this gridworld example, once the agent selects an action,

  • it always moves in the chosen direction (contrasting general MDPs where the agent doesn’t always have complete control over what the next state will be), and
  • the reward can be predicted with complete certainty (contrasting general MDPs where the reward is a random draw from a probability distribution).

 In the event that the agent's policy $\pi$ is deterministic, the agent selects action $\pi(s)$ when in state $s$, and the Bellman expectation equation can be rewritten as a sum over two variables ($s'$ and $r$):

$$v_\pi(s) = \sum_{s'\in\mathcal{S}^+,\, r\in\mathcal{R}} p(s',r|s,\pi(s))\,\big(r+\gamma v_\pi(s')\big)$$

 In this case, we multiply the sum of the reward and the discounted value of the next state, $(r+\gamma v_\pi(s'))$, by its corresponding probability $p(s',r|s,\pi(s))$, and sum over all possibilities to yield the expected value.
 If the agent's policy $\pi$ is stochastic, the agent selects action $a$ with probability $\pi(a|s)$ when in state $s$, and the Bellman expectation equation can be rewritten as a sum over three variables ($s'$, $r$, and $a$):

$$v_\pi(s) = \sum_{s'\in\mathcal{S}^+,\, r\in\mathcal{R},\, a\in\mathcal{A}(s)} \pi(a|s)\,p(s',r|s,a)\,\big(r+\gamma v_\pi(s')\big)$$

 In this case, we multiply the sum of the reward and the discounted value of the next state, $(r+\gamma v_\pi(s'))$, by its corresponding probability $\pi(a|s)\,p(s',r|s,a)$, and sum over all possibilities to yield the expected value.
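 To make this concrete, here is a minimal sketch of iterative policy evaluation, which repeatedly applies the Bellman expectation equation as an update rule until the value estimates settle; the two-state MDP, its dynamics, and the policy below are made up for illustration and are not the gridworld from the lecture:

```python
# Iterative policy evaluation: sweep the Bellman expectation equation as an
# update until the values converge (illustrative two-state MDP, gamma = 0.9).
gamma = 0.9

# Dynamics p(s', r | s, a), stored as {(s, a): [(prob, next_state, reward), ...]}.
dynamics = {
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(0.8, "s2", 1.0), (0.2, "s1", 0.0)],
    ("s2", "stay"): [(1.0, "s2", 2.0)],
    ("s2", "go"):   [(1.0, "s1", 0.0)],
}

# A stochastic policy pi(a|s).
policy = {
    "s1": {"stay": 0.5, "go": 0.5},
    "s2": {"stay": 0.9, "go": 0.1},
}

v = {s: 0.0 for s in ("s1", "s2")}   # initial guess: v_pi(s) = 0 everywhere
for _ in range(1000):                # enough sweeps to converge for this toy MDP
    v = {
        s: sum(
            policy[s][a] * prob * (r + gamma * v[s_next])
            for a in policy[s]
            for prob, s_next, r in dynamics[(s, a)]
        )
        for s in v
    }

print(v)  # approximate v_pi for this toy MDP
```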

1.6.7 Optimality

  The goal of the agent is to maximize return.
 By definition, we say that a policy π' is better than or equal to a policy π if its state-value function is greater than or equal to that of policy π for all states.
  An optimal policy is what the agent is searching for: it's the solution to the MDP and the best strategy for accomplishing its goal.
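 As a small sketch, the "better than or equal to" relation between two policies is just an elementwise comparison of their state-value functions; the value tables below are invented for illustration:

```python
def policy_is_at_least_as_good(v_pi_prime, v_pi):
    """pi' >= pi  iff  v_pi'(s) >= v_pi(s) for every state s."""
    return all(v_pi_prime[s] >= v_pi[s] for s in v_pi)

# Illustrative state-value tables for two policies over the same states.
v_pi       = {"s1": 1.0, "s2": 0.5, "s3": -1.0}
v_pi_prime = {"s1": 1.2, "s2": 0.5, "s3": -0.5}

print(policy_is_at_least_as_good(v_pi_prime, v_pi))  # True: pi' >= pi
```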

1.6.9 Action-Value Functions

State-value function for a policy: for each state s, it yields the expected discounted return if the agent starts in state s and then uses the policy to choose its actions for all time steps.
Action-value function
 While the state values are a function of the environment state, the action values are a function of the environment state and the agent's action.
 For each state s and action a, the action-value function yields the expected discounted return if the agent starts in state s, takes action a, and then uses the policy to choose its actions for all future time steps.
  Before the search for an optimal policy can begin, we first need some starting policy (for example, a randomly chosen one).
 Suppose the current state is the red-circle position and the action taken is "up". The reward for this step is -1. After the agent arrives at the yellow-dot position, following the existing policy (shown by the yellow arrows) from that state yields rewards of -1 - 1 - 1 + 5 = 2. So at the red-circle state, the action value of moving up is -1 + 2 = 1. (The action value in the current state is directly tied to the cumulative reward, under the current policy, from the next state that the action leads to.)
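 In code, that calculation is a one-step lookahead: the action value is the immediate reward plus the (here undiscounted) value of the state the action leads to. This is a sketch that assumes the deterministic, undiscounted gridworld described above, with made-up state names:

```python
# q_pi(s, a) = r + gamma * v_pi(s') in a deterministic, undiscounted gridworld.
gamma = 1.0

# Value of the next state ("yellow dot") under the current policy:
# following the yellow arrows collects rewards -1 - 1 - 1 + 5 = 2.
v_pi = {"yellow_dot": -1 - 1 - 1 + 5}

step_reward = -1  # reward for moving up from the red circle
q_red_circle_up = step_reward + gamma * v_pi["yellow_dot"]
print(q_red_circle_up)  # 1
```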

1.6.10 Quiz: Action-Value Functions

 For a deterministic policy $\pi$,
$$v_\pi(s) = q_\pi(s, \pi(s)) \quad \text{for all } s \in \mathcal{S}.$$
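 As a quick sanity check of that identity, assuming the toy tables below were computed for the same deterministic policy:

```python
# For a deterministic policy, v_pi(s) equals q_pi(s, pi(s)) in every state.
pi   = {"s1": "up", "s2": "right"}
q_pi = {("s1", "up"): 1.0, ("s1", "right"): 0.4,
        ("s2", "up"): -0.2, ("s2", "right"): 2.0}
v_pi = {"s1": 1.0, "s2": 2.0}

assert all(v_pi[s] == q_pi[(s, pi[s])] for s in pi)
```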

1.6.11 Optimal Policies

 If the agent has the optimal action-value function, it can quickly obtain an optimal policy.

1.6.12 Quiz: Optimal Policies

 If the state space $\mathcal{S}$ and action space $\mathcal{A}$ are finite, we can represent the optimal action-value function $q_*$ in a table, where we have one entry for each possible environment state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$.
 The value for a particular state-action pair $s$, $a$ is the expected return if the agent starts in state $s$, takes action $a$, and then henceforth follows the optimal policy $\pi_*$.
 Once the agent has determined the optimal action-value function $q_*$, it can quickly obtain an optimal policy $\pi_*$ by setting $\pi_*(s) = \arg\max_{a\in\mathcal{A}(s)} q_*(s,a)$ for all $s\in\mathcal{S}$.
 To see why this should be the case, note that it must hold that $v_*(s) = \max_{a\in\mathcal{A}(s)} q_*(s,a)$.
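 A short sketch of that extraction step; the q-table entries below are invented, and in practice $q_*$ would come from a learning algorithm:

```python
# Extract a greedy (optimal) policy from an optimal action-value table:
# pi_*(s) = argmax_a q_*(s, a).
q_star = {
    "s1": {"up": 1.0, "right": 2.0},
    "s2": {"up": 0.5, "right": 0.0},
}

pi_star = {s: max(actions, key=actions.get) for s, actions in q_star.items()}
print(pi_star)  # {'s1': 'right', 's2': 'up'}

# And v_*(s) = max_a q_*(s, a):
v_star = {s: max(actions.values()) for s, actions in q_star.items()}
print(v_star)   # {'s1': 2.0, 's2': 0.5}
```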

1.6.13 Summary

Policies

  • A deterministic policy is a mapping $\pi: \mathcal{S}\to\mathcal{A}$. For each state $s\in\mathcal{S}$, it yields the action $a\in\mathcal{A}$ that the agent will choose while in state $s$.
  • A stochastic policy is a mapping $\pi: \mathcal{S}\times\mathcal{A}\to[0,1]$. For each state $s\in\mathcal{S}$ and action $a\in\mathcal{A}$, it yields the probability $\pi(a|s)$ that the agent chooses action $a$ while in state $s$.

State-Value Functions

  • The state-value function for a policy $\pi$ is denoted $v_\pi$. For each state $s\in\mathcal{S}$, it yields the expected return if the agent starts in state $s$ and then uses the policy to choose its actions for all time steps. That is, $v_\pi(s) \doteq \mathbb{E}_\pi[G_t|S_t=s]$. We refer to $v_\pi(s)$ as the value of state $s$ under policy $\pi$.
  • The notation $\mathbb{E}_\pi[\cdot]$ is borrowed from the suggested textbook, where $\mathbb{E}_\pi[\cdot]$ is defined as the expected value of a random variable, given that the agent follows policy $\pi$.

Bellman Equations

  The Bellman expectation equation for $v_\pi$ is: $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t=s]$.

Optimality

  • A policy $\pi'$ is defined to be better than or equal to a policy $\pi$ if and only if $v_{\pi'}(s) \geq v_\pi(s)$ for all $s\in\mathcal{S}$.
  • An optimal policy $\pi_*$ satisfies $\pi_* \geq \pi$ for all policies $\pi$. An optimal policy is guaranteed to exist but may not be unique.
  • All optimal policies have the same state-value function $v_*$, called the optimal state-value function.

Action-Value Functions

  • The action-value function for a policy $\pi$ is denoted $q_\pi$. For each state $s\in\mathcal{S}$ and action $a\in\mathcal{A}$, it yields the expected return if the agent starts in state $s$, takes action $a$, and then follows the policy for all future time steps. That is, $q_\pi(s,a) \doteq \mathbb{E}_\pi[G_t|S_t=s, A_t=a]$. We refer to $q_\pi(s,a)$ as the value of taking action $a$ in state $s$ under policy $\pi$ (or alternatively as the value of the state-action pair $s$, $a$).
  • All optimal policies have the same action-value function $q_*$, called the optimal action-value function.

Optimal Policies

  Once the agent determines the optimal action-value function $q_*$, it can quickly obtain an optimal policy $\pi_*$ by setting $\pi_*(s) = \arg\max_{a\in\mathcal{A}(s)} q_*(s,a)$.