Chapter 1 - 6: The RL Framework: The solution
1.6.2 Policies
We’ve seen that we use a Markov decision process or MDP as a formal definition of the problem that we’d like to solve with reinforcement learning.
The solution is a series of actions that need to be learned by the agent towards the pursuit of a goal.
Reward is always decided in the context of the state in which the action was taken, along with the state that follows. As long as the agent learns an appropriate action response to any environment state that it can observe, it has learned what we call a policy. The simplest kind of policy is a mapping from the set of environment states to the set of possible actions. We call this kind of policy a deterministic policy.
Another type of policy that we’ll examine is a stochastic policy. A stochastic policy allows the agent to choose actions randomly. We define a stochastic policy as a mapping that accepts an environment state $s$ and action $a$, and returns the probability that the agent takes action $a$ while in state $s$.
It’s important to note that any deterministic policy can be expressed using the same notation that we generally reserve for a stochastic policy.
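As a sketch, both kinds of policy can be represented with ordinary dictionaries (the states, actions, and probabilities below are invented for illustration): a deterministic policy maps each state to a single action, a stochastic policy maps each state to a distribution over actions, and a deterministic policy can always be rewritten in the stochastic notation by putting probability 1 on its chosen action.

```python
import random

# Hypothetical gridworld states and actions (illustrative only).
actions = ["up", "down", "left", "right"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s1": "right", "s2": "up", "s3": "down"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s1": {"up": 0.1, "down": 0.1, "left": 0.0, "right": 0.8},
    "s2": {"up": 0.7, "down": 0.1, "left": 0.1, "right": 0.1},
    "s3": {"up": 0.0, "down": 0.9, "left": 0.05, "right": 0.05},
}

def act_deterministic(policy, state):
    """Look up the single action the policy prescribes for this state."""
    return policy[state]

def act_stochastic(policy, state):
    """Sample an action from the policy's distribution pi(a|s)."""
    dist = policy[state]
    acts = list(dist)
    return random.choices(acts, weights=[dist[a] for a in acts])[0]

def as_stochastic(policy):
    """Express a deterministic policy in stochastic notation:
    probability 1 on the chosen action, 0 on every other action."""
    return {s: {a: 1.0 if a == policy[s] else 0.0 for a in actions}
            for s in policy}
```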
1.6.3 Quiz: Interpret the Policy
A policy determines how an agent chooses an action in response to the current state. In other words, it specifies how the agent responds to situations that the environment has presented.
1.6.4 Gridworld Example
1.6.5 State-Value Functions
After all, if the agent starts at the goal, the episode ends immediately and no reward is received.
Let’s attach a bit of notation and terminology to this process we just followed. You can think of this grid of numbers as a function of the environment state. For each state, it has a corresponding number, and we refer to this function as the state-value function. For each state, it yields the return that’s likely to follow if the agent starts in that state and then follows the policy for all time steps. It’s more common, though, to see this expressed equivalently with a bit more notation.
The state-value function for a policy $\pi$ is a function of the environment state. For each state $s$, it tells us the expected discounted return if the agent starts in that state $s$ and then uses the policy to choose its actions for all time steps. The state-value function always corresponds to a particular policy, so if we change the policy, we change the state-value function.
1.6.6 Bellman Equations
We can express the value of any state as the sum of the immediate reward plus the value of the state that follows.
We need to use the discounted value of the state that follows. We can express this idea in terms of what’s known as the Bellman expectation equation, where for a general MDP we have to calculate the expected value of the sum. This is because, in general, with more complicated worlds, the immediate reward and next state cannot be known with certainty.
We can express the value of any state in the MDP in terms of the immediate reward and the discounted value of the state that follows.
In this gridworld example, once the agent selects an action,
- it always moves in the chosen direction (contrasting general MDPs where the agent doesn’t always have complete control over what the next state will be), and
- the reward can be predicted with complete certainty (contrasting general MDPs where the reward is a random draw from a probability distribution).
In the event that the agent’s policy $\pi$ is deterministic, the agent selects action $\pi(s)$ when in state $s$, and the Bellman Expectation Equation can be rewritten as the sum over two variables ($s'$ and $r$):

$$v_\pi(s) = \sum_{s'\in\mathcal{S}^+,\, r\in\mathcal{R}} p(s',r|s,\pi(s))\,(r+\gamma v_\pi(s'))$$
In this case, we multiply the sum of the reward and discounted value of the next state $(r+\gamma v_\pi(s'))$ by its corresponding probability $p(s',r|s,\pi(s))$ and sum over all possibilities to yield the expected value.
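The deterministic form of the equation can be evaluated directly when the one-step dynamics are known. A minimal sketch, assuming a tiny made-up MDP whose dynamics are stored as lists of (probability, next state, reward) triples:

```python
# Hypothetical known dynamics: p[(s, a)] is a list of
# (probability, next_state, reward) triples (invented for illustration).
p = {
    ("s1", "right"): [(1.0, "s2", -1.0)],
    ("s2", "up"):    [(1.0, "terminal", 5.0)],
}
policy = {"s1": "right", "s2": "up"}  # deterministic policy pi(s)
gamma = 1.0

def bellman_backup(v, s):
    """One application of the Bellman expectation equation for a
    deterministic policy: sum over (s', r) of
    p(s', r | s, pi(s)) * (r + gamma * v[s'])."""
    a = policy[s]
    return sum(prob * (r + gamma * v[next_s])
               for prob, next_s, r in p[(s, a)])

v = {"s1": 0.0, "s2": 0.0, "terminal": 0.0}
v["s2"] = bellman_backup(v, "s2")  # 1.0 * (5.0 + 0) = 5.0
v["s1"] = bellman_backup(v, "s1")  # 1.0 * (-1.0 + 5.0) = 4.0
```

Here the dynamics are deterministic too, so each sum has a single term; with several `(probability, next_state, reward)` triples per state-action pair, the same function computes the full expectation.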
If the agent’s policy $\pi$ is stochastic, the agent selects action $a$ with probability $\pi(a|s)$ when in state $s$, and the Bellman Expectation Equation can be rewritten as the sum over three variables ($s'$, $r$, and $a$):

$$v_\pi(s) = \sum_{s'\in\mathcal{S}^+,\, r\in\mathcal{R},\, a\in\mathcal{A}(s)} \pi(a|s)\,p(s',r|s,a)\,(r+\gamma v_\pi(s'))$$
In this case, we multiply the sum of the reward and discounted value of the next state $(r+\gamma v_\pi(s'))$ by its corresponding probability $\pi(a|s)\,p(s',r|s,a)$ and sum over all possibilities to yield the expected value.
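For the stochastic form, the only change is an extra sum over actions, weighted by $\pi(a|s)$. A minimal sketch, again with invented dynamics:

```python
# Hypothetical dynamics and stochastic policy (invented for illustration).
p = {
    ("s1", "right"): [(1.0, "goal", 5.0)],
    ("s1", "left"):  [(1.0, "pit", -5.0)],
}
pi = {"s1": {"right": 0.8, "left": 0.2}}  # pi(a|s)
gamma = 0.9
v = {"goal": 0.0, "pit": 0.0}  # terminal states have value 0

def bellman_backup(s):
    """Sum over (a, s', r) of pi(a|s) * p(s',r|s,a) * (r + gamma * v[s'])."""
    return sum(prob_a * prob * (r + gamma * v[next_s])
               for a, prob_a in pi[s].items()
               for prob, next_s, r in p[(s, a)])

value = bellman_backup("s1")  # 0.8 * 5.0 + 0.2 * (-5.0) = 3.0
```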
1.6.7 Optimality
The goal of the agent is to maximize return.
By definition, we say that a policy $\pi'$ is better than or equal to a policy $\pi$ if its state-value function is greater than or equal to that of policy $\pi$ for all states.
An optimal policy is what the agent is searching for. It’s the solution to the MDP and the best strategy to accomplish its goal.
1.6.9 Action-Value Functions
State-value function for a policy: for each state $s$, it yields the expected discounted return if the agent starts in state $s$ and then uses the policy to choose its actions for all time steps.
Action-value function
While the state values are a function of the environment state, the action values are a function of the environment state and the agent’s action.
For each state $s$ and action $a$, the action-value function yields the expected discounted return if the agent starts in state $s$, then chooses action $a$, and then uses the policy to choose its actions for all future time steps.
Before the search for an optimal policy can begin, we first need some starting policy (for example, a random one).
Suppose the current state is the red-circle position and the action taken is “up”: this step yields a reward of -1 and moves the agent to the yellow-dot position. From the yellow-dot state, following the existing policy (shown by the yellow arrows) yields rewards of -1, -1, -1, and 5, which sum to 2. So the action value of “up” at the red-circle state is -1 + 2 = 1. (The action value of an action in the current state is directly tied to the cumulative reward, under the current policy, from the next state that the action reaches.)
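The arithmetic in this walkthrough can be sketched directly (the rewards are the ones from the example; no discounting is assumed):

```python
gamma = 1.0  # no discounting in this example

# Return accumulated from the next (yellow-dot) state under the current policy:
# rewards -1, -1, -1, 5 along the remaining steps.
rewards_from_next_state = [-1, -1, -1, 5]
return_from_next_state = sum(r * gamma**t
                             for t, r in enumerate(rewards_from_next_state))  # 2

# Action value at the red-circle state for action "up":
# immediate reward of -1, plus the discounted return from the next state.
q_red_up = -1 + gamma * return_from_next_state  # 1
```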
1.6.10 Quiz: Action-Value Functions
For a deterministic policy $\pi$,

$$v_\pi(s) = q_\pi(s, \pi(s)) \text{ for all } s\in\mathcal{S}$$
1.6.11 Optimal Policies
If the agent has the optimal action-value function, it can quickly obtain an optimal policy.
1.6.12 Quiz: Optimal Policies
If the state space $\mathcal{S}$ and action space $\mathcal{A}$ are finite, we can represent the optimal action-value function $q_*$ in a table, where we have one entry for each possible environment state $s \in \mathcal{S}$ and action $a\in\mathcal{A}$.
The value for a particular state-action pair $s$, $a$ is the expected return if the agent starts in state $s$, takes action $a$, and then henceforth follows the optimal policy $\pi_*$.
Once the agent has determined the optimal action-value function $q_*$, it can quickly obtain an optimal policy $\pi_*$ by setting

$$\pi_*(s) = \arg\max_{a\in\mathcal{A}(s)} q_*(s,a) \text{ for all } s\in\mathcal{S}.$$

To see why this should be the case, note that it must hold that

$$v_*(s) = \max_{a\in\mathcal{A}(s)} q_*(s,a)$$
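A minimal sketch of this extraction step, using an invented $q_*$ table: the optimal policy is a per-state argmax over actions, and taking the max instead recovers $v_*$.

```python
# Hypothetical optimal action-value table q*(s, a) (values invented for illustration).
q_star = {
    "s1": {"up": 1.0, "down": 0.5, "left": -0.2, "right": 2.0},
    "s2": {"up": 3.0, "down": 1.0, "left": 0.0, "right": 2.5},
}

def greedy_policy(q):
    """pi*(s) = argmax_a q*(s, a): pick the highest-valued action per state."""
    return {s: max(acts, key=acts.get) for s, acts in q.items()}

def optimal_state_values(q):
    """v*(s) = max_a q*(s, a)."""
    return {s: max(acts.values()) for s, acts in q.items()}

pi_star = greedy_policy(q_star)        # per-state argmax
v_star = optimal_state_values(q_star)  # per-state max
```

If several actions tie for the maximum, any of them (or any mixture over them) is optimal; this sketch simply picks one.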
1.6.13 Summary
Policies
- A deterministic policy is a mapping $\pi: \mathcal{S}\to\mathcal{A}$. For each state $s\in\mathcal{S}$, it yields the action $a\in\mathcal{A}$ that the agent will choose while in state $s$.
- A stochastic policy is a mapping $\pi: \mathcal{S}\times\mathcal{A}\to [0,1]$. For each state $s\in\mathcal{S}$ and action $a\in\mathcal{A}$, it yields the probability $\pi(a|s)$ that the agent chooses action $a$ while in state $s$.
State-Value Functions
- The state-value function for a policy $\pi$ is denoted $v_\pi$. For each state $s \in\mathcal{S}$, it yields the expected return if the agent starts in state $s$ and then uses the policy to choose its actions for all time steps. That is, $v_\pi(s) \doteq \mathbb{E}_\pi[G_t|S_t=s]$. We refer to $v_\pi(s)$ as the value of state $s$ under policy $\pi$.
- The notation $\mathbb{E}_\pi[\cdot]$ is borrowed from the suggested textbook, where $\mathbb{E}_\pi[\cdot]$ is defined as the expected value of a random variable, given that the agent follows policy $\pi$.
Bellman Equations
- The Bellman expectation equation for $v_\pi$ is: $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t = s]$.
Optimality
- A policy $\pi'$ is defined to be better than or equal to a policy $\pi$ if and only if $v_{\pi'}(s) \geq v_\pi(s)$ for all $s\in\mathcal{S}$.
- An optimal policy $\pi_*$ satisfies $\pi_* \geq \pi$ for all policies $\pi$. An optimal policy is guaranteed to exist but may not be unique.
- All optimal policies have the same state-value function $v_*$, called the optimal state-value function.
Action-Value Functions
- The action-value function for a policy $\pi$ is denoted $q_\pi$. For each state $s \in\mathcal{S}$ and action $a \in\mathcal{A}$, it yields the expected return if the agent starts in state $s$, takes action $a$, and then follows the policy for all future time steps. That is, $q_\pi(s,a) \doteq \mathbb{E}_\pi[G_t|S_t=s, A_t=a]$. We refer to $q_\pi(s,a)$ as the value of taking action $a$ in state $s$ under policy $\pi$ (or alternatively as the value of the state-action pair $s$, $a$).
- All optimal policies have the same action-value function $q_*$, called the optimal action-value function.
Optimal Policies
- Once the agent determines the optimal action-value function $q_*$, it can quickly obtain an optimal policy $\pi_*$ by setting $\pi_*(s) = \arg\max_{a\in\mathcal{A}(s)} q_*(s,a)$.