Reinforcement Learning 2: State Value and Bellman Equation

Contents

0.Outline

1.Motivating examples 

Motivating example 1: Why is return important?

Motivating example 2: How to calculate return?

2.State value

3.Bellman equation: Derivation 

Deriving the Bellman equation

An illustrative example

Exercise

4.Bellman equation: Matrix-vector form 

Matrix-vector form of the Bellman equation

Illustrative examples 

5.Bellman equation: Solve the state values 

6.Action value 

Action value

Illustrative example for action value

7.Summary


0.Outline

In this lecture:

• A core concept: state value

• A fundamental tool: the Bellman equation

1 Motivating examples

2 State value

3 Bellman equation: Derivation

4 Bellman equation: Matrix-vector form

5 Bellman equation: Solve the state values

6 Action value

7 Summary

1.Motivating examples 

Motivating example 1: Why is return important?

• What is return? The (discounted) sum of the rewards obtained along a trajectory.

• Why is return important? See the following examples.

• Question: From the starting point s1, which policy is the “best”? Which is the “worst”?

• Intuition: the first is the best and the second is the worst, because of the forbidden area.

• Math: can we use mathematics to describe such intuition?

The return can be used to evaluate policies. See the following.

Based on policy 1 (left figure), starting from s1, the discounted return is

 Based on policy 2 (middle figure), starting from s1, what is the discounted return? Answer:

Policy 3 is stochastic! 

Based on policy 3 (right figure), starting from s1, the discounted return is

Answer: 

In short, the return is the discounted sum of the rewards obtained along a trajectory under each policy; it is used to evaluate how good different policies are.

In summary, starting from s1,

return1 > return3 > return2

The above inequality suggests that the first policy is the best and the second policy is the worst, which is exactly the same as our intuition.

Calculating return is important to evaluate a policy.
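As a quick numerical illustration with hypothetical rewards (not necessarily those in the figures): suppose a trajectory collects reward 0 at the first step and reward 1 at every step afterwards. Then the discounted return is

$$
\text{return} = 0 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \gamma^3 \cdot 1 + \cdots = \gamma\,(1 + \gamma + \gamma^2 + \cdots) = \frac{\gamma}{1-\gamma}.
$$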

Motivating example 2: How to calculate return?

While the return is important, how do we calculate it?

Method 1: by definition

Let vi denote the return obtained starting from si (i = 1, 2, 3, 4)

Method 2:

 The returns rely on each other. Bootstrapping!

How to solve these equations? Write in the following matrix-vector form:

which can be rewritten as

v = r + γPv 

This is the Bellman equation (for this specific deterministic problem)!!

• Though simple, it demonstrates the core idea: the value of one state relies on the values of other states.

• The matrix-vector form makes it clearer how to solve for the state values (see the sketch below).
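As a sketch of that matrix-vector form (assuming, as in the figure, that each state si deterministically receives a reward ri and moves to a single next state), stacking the four equations vi = ri + γ vnext(i) gives

$$
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix}
+ \gamma P
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix},
\qquad \text{i.e.,} \qquad v = r + \gamma P v,
$$

where row i of P has a single 1 in the column of the state that follows si and zeros elsewhere.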

Exercise: Consider the policy shown in the figure. Please write out the relations among the returns (that is, write out the Bellman equation).

Answer:

v1 = 0 + γv3

v2 = 1 + γv4

v3 = 1 + γv4

v4 = 1 + γv4

Exercise: How do we solve these equations? We can first calculate v4, and then v3, v2, and v1, as worked out below.
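For reference, solving from the last equation to the first gives

$$
v_4 = \frac{1}{1-\gamma}, \qquad
v_3 = 1 + \gamma v_4 = \frac{1}{1-\gamma}, \qquad
v_2 = 1 + \gamma v_4 = \frac{1}{1-\gamma}, \qquad
v_1 = \gamma v_3 = \frac{\gamma}{1-\gamma}.
$$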

2.State value

Consider the following single-step process:

Note that St, At, Rt+1 are all random variables.

This step is governed by the following probability distributions:

• St → At is governed by π(At = a|St = s), i.e., it is determined by the policy.

• St, At → Rt+1 is governed by p(Rt+1 = r|St = s, At = a), i.e., it is determined by the reward probability.

• St, At → St+1 is governed by p(St+1 = s′|St = s, At = a), i.e., it is determined by the state transition probability.

At this moment, we assume we know the model (i.e., the probability distributions)!

Consider the following multi-step trajectory:

 The discounted return is
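In standard notation:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
$$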

• γ ∈ (0, 1) is a discount rate.

• Gt is also a random variable since Rt+1, Rt+2, . . . are random variables.

In short, the state value is the expected discounted return obtained when starting from state s and following a given policy.

The expectation (also called the expected value or mean) of Gt is defined as the state-value function, or simply the state value:
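That is,

$$
v_\pi(s) = \mathbb{E}[\,G_t \mid S_t = s\,].
$$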

 Remarks:

• It is a function of s. It is a conditional expectation with the condition that the state starts from s.

• It is based on the policy π. For a different policy, the state value may be different.

Q: What is the relationship between return and state value?

A: The state value is the mean (expectation) of all possible returns that can be obtained starting from a state. If the policy, rewards, and state transitions are all deterministic, the state value equals the return.

3.Bellman equation: Derivation 

• While the state value is important, how do we calculate it? The answer lies in the Bellman equation.

• In a word, the Bellman equation describes the relationship among the values of all states.

• Next, we derive the Bellman equation.

• There is some math involved, but we already have the intuition from the motivating examples.

Deriving the Bellman equation

Consider a random trajectory:

The return Gt can be written as

 Then, it follows from the definition of the state value that
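In symbols:

$$
G_t = R_{t+1} + \gamma\left( R_{t+2} + \gamma R_{t+3} + \cdots \right) = R_{t+1} + \gamma G_{t+1},
$$

$$
v_\pi(s) = \mathbb{E}[\,G_t \mid S_t = s\,]
= \mathbb{E}[\,R_{t+1} \mid S_t = s\,] + \gamma\,\mathbb{E}[\,G_{t+1} \mid S_t = s\,].
$$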

Next, calculate the two terms, respectively.

First, calculate the first term E[Rt+1|St = s]:

The expectation of Rt+1 given state s equals the sum over actions of the probability of taking action a in state s, multiplied by the expectation of Rt+1 given s and a (the law of total expectation).

The inner expectation in turn equals the sum over rewards of the probability of receiving reward r given s and a, multiplied by r (the definition of expectation).
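In symbols, these two steps give

$$
\mathbb{E}[\,R_{t+1} \mid S_t = s\,]
= \sum_a \pi(a \mid s)\, \mathbb{E}[\,R_{t+1} \mid S_t = s, A_t = a\,]
= \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r.
$$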

Note that

 • This is the mean of immediate rewards

Second, calculate the second term E[Gt+1|St = s]:
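In symbols (the Markov property used in the last step is noted below):

$$
\mathbb{E}[\,G_{t+1} \mid S_t = s\,]
= \sum_{s'} \mathbb{E}[\,G_{t+1} \mid S_t = s, S_{t+1} = s'\,]\, p(s' \mid s)
= \sum_{s'} \mathbb{E}[\,G_{t+1} \mid S_{t+1} = s'\,]\, p(s' \mid s)
= \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s).
$$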

Note that

• This is the mean of future rewards

• E[Gt+1|St = s, St+1 = s′] = E[Gt+1|St+1 = s′] due to the memoryless Markov property. 

Therefore, we have
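Writing both terms out:

$$
v_\pi(s)
= \mathbb{E}[\,R_{t+1} \mid S_t = s\,] + \gamma\,\mathbb{E}[\,G_{t+1} \mid S_t = s\,]
= \underbrace{\sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r}_{\text{mean of immediate rewards}}
\;+\;
\underbrace{\gamma \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s')}_{\text{mean of future rewards}},
\qquad \forall s \in \mathcal{S}.
$$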

Highlights:

• The above equation is called the Bellman equation, which characterizes the relationship among the state-value functions of different states.

• It consists of two terms: the immediate reward term and the future reward term.

• A set of equations: every state has an equation like this!!! 

Highlights: symbols in this equation

• vπ(s) and vπ(s′) are state values to be calculated. Bootstrapping!

• π(a|s) is a given policy. Solving the equation is called policy evaluation.

• p(r|s, a) and p(s′|s, a) represent the dynamic model. What if the model is known or unknown?

An illustrative example

Write out the Bellman equation according to the general expression:

This example is simple because the policy is deterministic.

First, consider the state value of s1:

• π(a = a3|s1) = 1 and π(a ≠ a3|s1) = 0.

• p(s′ = s3|s1, a3) = 1 and p(s′ ≠ s3|s1, a3) = 0.

• p(r = 0|s1, a3) = 1 and p(r ≠ 0|s1, a3) = 0.

Substituting them into the Bellman equation gives 
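With these probabilities, the general expression reduces to

$$
v_\pi(s_1) = 0 + \gamma\, v_\pi(s_3).
$$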


Similarly, it can be obtained that:
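Assuming the same transitions as in the earlier exercise (s2, s3, and s4 each receive reward 1 and move to, or stay at, s4), the remaining equations and their solution would be

$$
v_\pi(s_2) = 1 + \gamma v_\pi(s_4), \qquad
v_\pi(s_3) = 1 + \gamma v_\pi(s_4), \qquad
v_\pi(s_4) = 1 + \gamma v_\pi(s_4),
$$

$$
v_\pi(s_4) = v_\pi(s_3) = v_\pi(s_2) = \frac{1}{1-\gamma}, \qquad
v_\pi(s_1) = \frac{\gamma}{1-\gamma}.
$$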

 

What do we do after we have calculated the state values? Be patient: we will calculate action values and improve the policy.

Exercise

• Write out the Bellman equations for each state.

• Solve the state values from the Bellman equations.

• Compare with the policy in the last example.

Solve the above equations one by one from the last to the first.

 Substituting γ = 0.9 yields

Compare with the previous policy. This one is worse.

4.Bellman equation: Matrix-vector form 

Matrix-vector form of the Bellman equation

Why consider the matrix-vector form? Because we need to solve the state values from it!

• One unknown relies on another unknown. How to solve the unknowns?

 • Elementwise form: The above elementwise form is valid for every state s ∈ S. That means there are |S| equations like this! 

• Matrix-vector form: If we put all the equations together, we have a set of linear equations, which can be concisely written in a matrix-vector form. The matrix-vector form is very elegant and important.

Recall that:

Rewrite the Bellman equation as

 where

Suppose the states are indexed as si (i = 1, . . . , n).

For state si, the Bellman equation is

Put all these equations for all the states together and rewrite them in matrix-vector form, as shown below.
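Writing these steps out: the Bellman equation can be rewritten as

$$
v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, v_\pi(s'),
$$

where

$$
r_\pi(s) := \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r,
\qquad
p_\pi(s' \mid s) := \sum_a \pi(a \mid s)\, p(s' \mid s, a).
$$

For state $s_i$ this reads $v_\pi(s_i) = r_\pi(s_i) + \gamma \sum_{j} p_\pi(s_j \mid s_i)\, v_\pi(s_j)$, and stacking all $n$ equations gives

$$
v_\pi = r_\pi + \gamma P_\pi v_\pi,
\qquad
[P_\pi]_{ij} = p_\pi(s_j \mid s_i),
$$

where $v_\pi, r_\pi \in \mathbb{R}^n$ and $P_\pi \in \mathbb{R}^{n \times n}$ is the state transition matrix.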

Illustrative examples 

For this specific example:

5.Bellman equation: Solve the state values 

Why solve state values?

• Given a policy, finding the corresponding state values is called policy evaluation! It is a fundamental problem in RL and the foundation for finding better policies.

• It is important to understand how to solve the Bellman equation.

The Bellman equation in matrix-vector form is

• The closed-form solution is:

In practice, we still need to use numerical tools to calculate the matrix inverse.

Can we avoid the matrix inverse operation? Yes, by iterative algorithms. 

• An iterative solution is:

This algorithm generates a sequence {v0, v1, v2, . . . }, which can be shown to converge to vπ. The expressions are written out below.
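In standard notation, the matrix-vector equation, its closed-form solution, and the iterative update are

$$
v_\pi = r_\pi + \gamma P_\pi v_\pi,
\qquad
v_\pi = (I - \gamma P_\pi)^{-1} r_\pi,
\qquad
v_{k+1} = r_\pi + \gamma P_\pi v_k, \quad k = 0, 1, 2, \ldots,
$$

and $v_k \to v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ as $k \to \infty$.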

The following are two “good” policies and their state values. The two policies differ at the top two states in the fourth column.

The following are two “bad” policies and their state values. The state values are smaller than those of the good policies.
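As a concrete illustration of the two solution methods above, here is a minimal Python sketch. The 4-state rewards and transition matrix are placeholders chosen to mirror the earlier exercise (s1 → s3 with reward 0; s2, s3 → s4 and s4 → s4 with reward 1), not the exact grid world in the figures.

```python
import numpy as np

# Placeholder model under a fixed policy:
# r_pi[i]   = expected immediate reward in state s_{i+1}
# P_pi[i,j] = probability of moving from s_{i+1} to s_{j+1}
gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([
    [0.0, 0.0, 1.0, 0.0],  # s1 -> s3
    [0.0, 0.0, 0.0, 1.0],  # s2 -> s4
    [0.0, 0.0, 0.0, 1.0],  # s3 -> s4
    [0.0, 0.0, 0.0, 1.0],  # s4 -> s4
])

# Closed-form solution: solve (I - gamma * P_pi) v = r_pi
v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Iterative solution: v_{k+1} = r_pi + gamma * P_pi v_k
v = np.zeros(4)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed)  # [9. 10. 10. 10.] for these placeholder numbers
print(v)         # the iteration converges to the same values
```

The iterative version avoids the matrix inverse entirely, which is what makes it practical when the state space is large.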

6.Action value 

Action value

From state value to action value:

• State value: the average return the agent can get starting from a state.

• Action value: the average return the agent can get starting from a state and taking an action.

Why do we care about action values? Because we want to know which action is better. This point will become clearer in the following lectures, where we will use action values frequently.

Definition:
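In symbols:

$$
q_\pi(s, a) := \mathbb{E}[\,G_t \mid S_t = s, A_t = a\,].
$$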

 

It follows from the properties of conditional expectation that

Hence,
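In symbols (the labels (2) and (4) match the references below):

$$
v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a), \qquad (2)
$$

$$
q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s'). \qquad (4)
$$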

(2) and (4) are the two sides of the same coin:

• (2) shows how to obtain state values from action values.

• (4) shows how to obtain action values from state values. 

Illustrative example for action value

Write out the action values for state s1.

Questions:

Highlights:

• Action value is important since we care about which action to take.

• We can first calculate all the state values and then calculate the action values.

• We can also directly calculate the action values with or without models. 

7.Summary

Key concepts and results:

• State value: vπ(s) = E[Gt | St = s], the expected return obtained when starting from state s and following policy π.

• Action value: qπ(s, a) = E[Gt | St = s, At = a].

• The Bellman equation describes the relationships among state values, in both elementwise and matrix-vector forms.

• Solving the Bellman equation for a given policy, either in closed form or iteratively, is policy evaluation.
