Chapter 1 - 6: The RL Framework: The solution
1.6.2 Policies
We’ve seen that we use a Markov decision process or MDP as a formal definition of the problem that we’d like to solve with reinforcement learning.
The solution is a series of actions that need to be learned by the agent towards the pursuit of a goal.
Reward is always decided in the context of the state in which the action was taken, along with the state that follows. As long as the agent learns an appropriate action response to any environment state that it can observe, it has learned what we call a policy. The simplest kind of policy is a mapping from the set of environment states to the set of possible actions. We call this kind of policy a deterministic policy.
Another type of policy that we’ll examine is a stochastic policy. A stochastic policy allows the agent to choose actions randomly. We define a stochastic policy as a mapping that accepts an environment state $s$ and action $a$, and returns the probability that the agent takes action $a$ while in state $s$.
It’s important to note that any deterministic policy can be expressed using the same notation that we generally reserve for a stochastic policy.
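As a sketch, both kinds of policy can be represented with ordinary dictionaries (the states, actions, and probabilities below are invented for illustration): a deterministic policy maps each state to a single action, a stochastic policy maps each state to a distribution over actions, and a deterministic policy can always be rewritten in the stochastic notation by putting probability 1 on its chosen action.

```python
import random

# Hypothetical gridworld states and actions (illustrative only).
actions = ["up", "down", "left", "right"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s1": "right", "s2": "up", "s3": "down"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s1": {"up": 0.1, "down": 0.1, "left": 0.0, "right": 0.8},
    "s2": {"up": 0.7, "down": 0.1, "left": 0.1, "right": 0.1},
    "s3": {"up": 0.0, "down": 0.9, "left": 0.05, "right": 0.05},
}

def act_deterministic(policy, state):
    """Look up the single action the policy prescribes for this state."""
    return policy[state]

def act_stochastic(policy, state):
    """Sample an action from the policy's distribution pi(a|s)."""
    dist = policy[state]
    acts = list(dist)
    return random.choices(acts, weights=[dist[a] for a in acts])[0]

def as_stochastic(policy):
    """Express a deterministic policy in stochastic notation:
    probability 1 on the chosen action, 0 on every other action."""
    return {s: {a: 1.0 if a == policy[s] else 0.0 for a in actions}
            for s in policy}
```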
1.6.3 Quiz: Interpret the Policy
A policy determines how an agent chooses an action in response to the current state. In other words, it specifies how the agent responds to situations that the environment has presented.
1.6.4 Gridworld Example
1.6.5 State-Value Functions
After all, if the agent starts at the goal, the episode ends immediately and no reward is received.
Let’s attach a bit of notation and terminology to this process we just followed. You can think of this grid of numbers as a function of the environment state. For each state, it has a corresponding number, and we refer to this function as the state-value function. For each state, it yields the return that’s likely to follow if the agent starts in that state and then follows the policy for all time steps. It’s more common, though, to see this expressed equivalently with a bit more notation.
The state-value function for a policy $\pi$ is a function of the environment state. For each state $s$, it tells us the expected discounted return if the agent starts in that state $s$ and then uses the policy to choose its actions for all time steps. The state-value function always corresponds to a particular policy, so if we change the policy, we change the state-value function.
1.6.6 Bellman Equations
We can express the value of any state as the sum of the immediate reward plus the value of the state that follows.
We need to use the discounted value of the state that follows. We can express this idea in terms of what’s known as the Bellman expectation equation, where for a general MDP we have to calculate the expected value of the sum. This is because, in general, with more complicated worlds, the immediate reward and next state cannot be known with certainty.
We can express the value of any state in the MDP in terms of the immediate reward and the discounted value of the state that follows.
In this gridworld example, once the agent selects an action,
- it always moves in the chosen direction (contrasting general MDPs where the agent doesn’t always have complete control over what the next state will be), and
- the reward can be predicted with complete certainty (contrasting general MDPs where the reward is a random draw from a probability distribution).
In the event that the agent’s policy $\pi$ is deterministic, the agent selects action $\pi(s)$ when in state $s$, and the Bellman Expectation Equation can be rewritten as the sum over two variables ($s'$ and $r$):

$$v_\pi(s) = \sum_{s'\in\mathcal{S}^+,\, r\in\mathcal{R}} p(s',r|s,\pi(s))\,(r+\gamma v_\pi(s'))$$
In this case, we multiply the sum of the reward and discounted value of the next state $(r+\gamma v_\pi(s'))$ by its corresponding probability $p(s',r|s,\pi(s))$ and sum over all possibilities to yield the expected value.
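The deterministic form of the equation can be evaluated directly when the one-step dynamics are known. A minimal sketch, assuming a tiny made-up MDP whose dynamics are stored as lists of (probability, next state, reward) triples:

```python
# Hypothetical known dynamics: p[(s, a)] is a list of
# (probability, next_state, reward) triples (invented for illustration).
p = {
    ("s1", "right"): [(1.0, "s2", -1.0)],
    ("s2", "up"):    [(1.0, "terminal", 5.0)],
}
policy = {"s1": "right", "s2": "up"}  # deterministic policy pi(s)
gamma = 1.0

def bellman_backup(v, s):
    """One application of the Bellman expectation equation for a
    deterministic policy: sum over (s', r) of
    p(s', r | s, pi(s)) * (r + gamma * v[s'])."""
    a = policy[s]
    return sum(prob * (r + gamma * v[next_s])
               for prob, next_s, r in p[(s, a)])

v = {"s1": 0.0, "s2": 0.0, "terminal": 0.0}
v["s2"] = bellman_backup(v, "s2")  # 1.0 * (5.0 + 0) = 5.0
v["s1"] = bellman_backup(v, "s1")  # 1.0 * (-1.0 + 5.0) = 4.0
```

Here the dynamics are deterministic too, so each sum has a single term; with several `(probability, next_state, reward)` triples per state-action pair, the same function computes the full expectation.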
If the agent’s policy $\pi$ is stochastic, the agent selects action $a$ with probability $\pi(a|s)$ when in state $s$, and the Bellman Expectation Equation can be rewritten as the sum over three variables ($s'$, $r$, and $a$):

$$v_\pi(s) = \sum_{s'\in\mathcal{S}^+,\, r\in\mathcal{R},\, a\in\mathcal{A}(s)} \pi(a|s)\,p(s',r|s,a)\,(r+\gamma v_\pi(s'))$$
In this case, we multiply the sum of the reward and discounted value of the next state $(r+\gamma v_\pi(s'))$ by its corresponding probability $\pi(a|s)\,p(s',r|s,a)$ and sum over all possibilities to yield the expected value.
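For the stochastic form, the only change is an extra sum over actions, weighted by $\pi(a|s)$. A minimal sketch, again with invented dynamics:

```python
# Hypothetical dynamics and stochastic policy (invented for illustration).
p = {
    ("s1", "right"): [(1.0, "goal", 5.0)],
    ("s1", "left"):  [(1.0, "pit", -5.0)],
}
pi = {"s1": {"right": 0.8, "left": 0.2}}  # pi(a|s)
gamma = 0.9
v = {"goal": 0.0, "pit": 0.0}  # terminal states have value 0

def bellman_backup(s):
    """Sum over (a, s', r) of pi(a|s) * p(s',r|s,a) * (r + gamma * v[s'])."""
    return sum(prob_a * prob * (r + gamma * v[next_s])
               for a, prob_a in pi[s].items()
               for prob, next_s, r in p[(s, a)])

value = bellman_backup("s1")  # 0.8 * 5.0 + 0.2 * (-5.0) = 3.0
```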
1.6.7 Optimality
The goal of the agent is to maximize return.
By definition, we say that a policy $\pi'$ is better than or equal to a policy $\pi$ if its state-value function is greater than or equal to that of policy $\pi$ for all states.
An optimal policy is what the agent is searching for. It’s the solution to the MDP and the best strategy to accomplish its goal.
1.6.9 Action-Value Functions
State-value function for a policy: for each state $s$, it yields the expected discounted return if the agent starts in state $s$ and then uses the policy to choose its actions for all time steps.
Action-value function
While the state values are a function of the environment state, the action values are a function of the environment state and the agent’s action.
For each state $s$ and action $a$, the action-value function yields the expected discounted return if the agent starts in state $s$, then chooses action $a$, and then uses the policy to choose its actions for all future time steps.
Before the search for an optimal policy can begin, we first need some starting policy (for example, a random one).
Suppose the current state is the red-circle position and the action taken is “up”: this step yields a reward of -1 and moves the agent to the yellow-dot position. From the yellow-dot state, following the existing policy (shown by the yellow arrows) yields rewards of -1, -1, -1, and 5, which sum to 2. So the action value of “up” at the red-circle state is -1 + 2 = 1. (The action value of an action in the current state is directly tied to the cumulative reward, under the current policy, from the next state that the action reaches.)
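The arithmetic in this walkthrough can be sketched directly (the rewards are the ones from the example; no discounting is assumed):

```python
gamma = 1.0  # no discounting in this example

# Return accumulated from the next (yellow-dot) state under the current policy:
# rewards -1, -1, -1, 5 along the remaining steps.
rewards_from_next_state = [-1, -1, -1, 5]
return_from_next_state = sum(r * gamma**t
                             for t, r in enumerate(rewards_from_next_state))  # 2

# Action value at the red-circle state for action "up":
# immediate reward of -1, plus the discounted return from the next state.
q_red_up = -1 + gamma * return_from_next_state  # 1
```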
1.6.10 Quiz: Action-Value Functions
For a deterministic policy $\pi$,

$$v_\pi(s) = q_\pi(s, \pi(s)) \text{ for all } s\in\mathcal{S}$$
1.6.11 Optimal Policies
If the agent has the optimal action-value function, it can quickly obtain an optimal policy.
1.6.12 Quiz: Optimal Policies
If the state space $\mathcal{S}$ and action space $\mathcal{A}$ are finite, we can represent the optimal action-value function $q_*$ in a table, where we have one entry for each possible environment state $s \in \mathcal{S}$ and action $a\in\mathcal{A}$.
The value for a particular state-action pair $s$, $a$ is the expected return if the agent starts in state $s$, takes action $a$, and then henceforth follows the optimal policy $\pi_*$.
Once the agent has determined the optimal action-value function $q_*$, it can quickly obtain an optimal policy $\pi_*$ by setting

$$\pi_*(s) = \arg\max_{a\in\mathcal{A}(s)} q_*(s,a) \text{ for all } s\in\mathcal{S}.$$

To see why this should be the case, note that it must hold that

$$v_*(s) = \max_{a\in\mathcal{A}(s)} q_*(s,a)$$
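A minimal sketch of this extraction step, using an invented $q_*$ table: the optimal policy is a per-state argmax over actions, and taking the max instead recovers $v_*$.

```python
# Hypothetical optimal action-value table q*(s, a) (values invented for illustration).
q_star = {
    "s1": {"up": 1.0, "down": 0.5, "left": -0.2, "right": 2.0},
    "s2": {"up": 3.0, "down": 1.0, "left": 0.0, "right": 2.5},
}

def greedy_policy(q):
    """pi*(s) = argmax_a q*(s, a): pick the highest-valued action per state."""
    return {s: max(acts, key=acts.get) for s, acts in q.items()}

def optimal_state_values(q):
    """v*(s) = max_a q*(s, a)."""
    return {s: max(acts.values()) for s, acts in q.items()}

pi_star = greedy_policy(q_star)        # per-state argmax
v_star = optimal_state_values(q_star)  # per-state max
```

If several actions tie for the maximum, any of them (or any mixture over them) is optimal; this sketch simply picks one.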
1.6.13 Summary
Policies
- A deterministic policy is a mapping $\pi: \mathcal{S}\to\mathcal{A}$. For each state $s\in\mathcal{S}$, it yields the action $a\in\mathcal{A}$ that the agent will choose while in state $s$.
- A stochastic policy is a mapping $\pi: \mathcal{S}\times\mathcal{A}\to [0,1]$. For each state $s\in\mathcal{S}$ and action $a\in\mathcal{A}$, it yields the probability $\pi(a|s)$ that the agent chooses action $a$ while in state $s$.
State-Value Functions
- The state-value function for a policy $\pi$ is denoted $v_\pi$. For each state $s \in\mathcal{S}$, it yields the expected return if the agent starts in state $s$ and then uses the policy to choose its actions for all time steps. That is, $v_\pi(s) \doteq \mathbb{E}_\pi[G_t|S_t=s]$. We refer to $v_\pi(s)$ as the value of state $s$ under policy $\pi$.
- The notation $\mathbb{E}_\pi[\cdot]$ is borrowed from the suggested textbook, where $\mathbb{E}_\pi[\cdot]$ is defined as the expected value of a random variable, given that the agent follows policy $\pi$.
Bellman Equations
- The Bellman expectation equation for $v_\pi$ is: $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t = s]$.
Optimality
- A policy $\pi'$ is defined to be better than or equal to a policy $\pi$ if and only if $v_{\pi'}(s) \geq v_\pi(s)$ for all $s\in\mathcal{S}$.
- An optimal policy $\pi_*$ satisfies $\pi_* \geq \pi$ for all policies $\pi$. An optimal policy is guaranteed to exist but may not be unique.
- All optimal policies have the same state-value function $v_*$, called the optimal state-value function.
Action-Value Functions
- The action-value function for a policy $\pi$ is denoted $q_\pi$. For each state $s \in\mathcal{S}$ and action $a \in\mathcal{A}$, it yields the expected return if the agent starts in state $s$, takes action $a$, and then follows the policy for all future time steps. That is, $q_\pi(s,a) \doteq \mathbb{E}_\pi[G_t|S_t=s, A_t=a]$. We refer to $q_\pi(s,a)$ as the value of taking action $a$ in state $s$ under policy $\pi$ (or alternatively as the value of the state-action pair $s$, $a$).
- All optimal policies have the same action-value function $q_*$, called the optimal action-value function.
Optimal Policies
- Once the agent determines the optimal action-value function $q_*$, it can quickly obtain an optimal policy $\pi_*$ by setting $\pi_*(s) = \arg\max_{a\in\mathcal{A}(s)} q_*(s,a)$.