[Chapter 2] Value Iteration and Policy Iteration

This post explains that the key to computing an optimal policy in infinite-horizon problems is solving for the value function, and introduces two main methods: value iteration and policy iteration. Value iteration repeatedly applies the Bellman update to the value function until it converges. Policy iteration initializes a policy and then alternates policy evaluation and policy improvement until the policy no longer changes, yielding the optimal policy. More accurate policy iteration variants perform several iterations of policy evaluation within each improvement cycle.

We now know that the most important step in computing an optimal policy is computing the value function. But how? (Everything below assumes infinite-horizon problems.) The approaches to this problem fall roughly into two categories: Value Iteration and Policy Iteration.

Value Iteration

Based on the formula for expected utility with a discount factor:

$$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t)\right]$$

and the transition function $P(s' \mid s, a)$ defined in the MDP model, there is an equation that the value function intuitively satisfies:

$$V(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, V(s')$$

which is called the Bellman equation.

Unfortunately, the Bellman equation is very difficult to solve directly because of the $\max$ operator, which makes the system of equations nonlinear; that is why we need the value iteration method.

Based on the Bellman equation, we can derive the Bellman update:

$$U_{t+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_t(s')$$

where $t$ denotes the iteration step.

The value iteration algorithm can be described as follows:

We initialize the utilities of all states to 0 and apply the Bellman update repeatedly until the values converge (every state's value reaches its unique fixed point). This is far cheaper than solving the Bellman equations directly.
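To make this concrete, here is a minimal Python sketch of value iteration. The representation it assumes (a list of `states`, a function `actions(s)` returning the available actions, nested dictionaries `P[s][a]` mapping next states to probabilities, a reward dictionary `R`, and a tolerance `theta`) is chosen purely for illustration and is not part of the original text:

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Value iteration sketch.

    P[s][a] is a dict {s_next: probability}; R[s] is the reward of state s.
    Returns a dict of converged utilities U[s].
    """
    U = {s: 0.0 for s in states}          # initialize all utilities to 0
    while True:
        delta = 0.0
        U_new = {}
        for s in states:
            # Bellman update: max over actions of the expected next-state utility
            best = max(
                sum(p * U[s2] for s2, p in P[s][a].items())
                for a in actions(s)
            )
            U_new[s] = R[s] + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < theta:                 # stop when updates become tiny
            return U
```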

Policy Iteration

Now consider a different idea. In the first method we update the values/utilities of each state; in policy iteration, we instead initialize and update a policy. The motivation is that to find the optimal policy we often do not need a highly accurate value function, e.g. when one action is clearly better than the others. So instead of computing values with the Bellman update, we initialize a policy $\pi_0$ and then improve it. To do this, we alternate the following two main steps:

  • Policy evaluation: given the current policy $\pi_t$, calculate the next-step utilities $U_{t+1}$:

$$U_{t+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_t(s))\, U_t(s')$$

This update is similar to, but simpler than, the Bellman update: there is no maximization over actions, because we use the action prescribed by the policy at step $t$.

  • Policy improvement: use the one-step look-ahead utilities $U_{t+1}(s)$ to compute a new policy $\pi_{t+1}$.

To improve the policy, we need to find a better policy to replace the current one. To do so, we introduce the action-value function, or Q-function, of a policy $\pi$:

$$Q^{\pi}(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, U^{\pi}(s')$$

The main difference between the Q-function and the value function is that the Q-function is the expected utility after committing to a particular action $a$. If the state space has size $|S|$ and the action space has size $|A|$, then for each policy $\pi$ there are $|S|$ values, one per state, and $|S| \times |A|$ Q-values, one per state-action pair.
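As a small illustration, a sketch of the Q-value computation, reusing the assumed dictionary-based MDP representation from the value iteration example above:

```python
def q_value(s, a, P, R, U, gamma=0.9):
    """Q(s, a): expected utility of taking action a in state s and
    thereafter following the policy whose utilities are given by U."""
    return R[s] + gamma * sum(p * U[s2] for s2, p in P[s][a].items())
```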

The policy improvement theorem is then easy to state: suppose a new policy $\pi'$ satisfies, for all $s \in S$,

$$Q^{\pi}(s, \pi'(s)) \geq U^{\pi}(s)$$

Then for all $s \in S$:

$$U^{\pi'}(s) \geq U^{\pi}(s)$$

In this case, $\pi'$ is at least as good as $\pi$, so we can improve the policy from $\pi$ to $\pi'$. In practice, the improvement step simply chooses the greedy policy, $\pi_{t+1}(s) = \arg\max_{a \in A(s)} Q^{\pi_t}(s, a)$, which satisfies this condition by construction.

Alternating the above two steps until the policy no longer changes yields the optimal policy; this is the policy iteration algorithm, described as follows:
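Below is a minimal sketch of this loop in Python, again under the assumed MDP representation and reusing the `q_value` helper defined above; it illustrates the idea rather than reproducing any particular reference implementation:

```python
import random

def policy_iteration(states, actions, P, R, gamma=0.9):
    """Alternate one sweep of policy evaluation with greedy policy improvement."""
    pi = {s: random.choice(actions(s)) for s in states}   # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: one backup per state under the current policy
        U = {
            s: R[s] + gamma * sum(p * U[s2] for s2, p in P[s][pi[s]].items())
            for s in states
        }
        # Policy improvement: act greedily with respect to the new utilities
        new_pi = {
            s: max(actions(s), key=lambda a: q_value(s, a, P, R, U, gamma))
            for s in states
        }
        if new_pi == pi:   # stop when the policy no longer changes
            return pi, U
        pi = new_pi
```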

More Accurate Policy Iteration Methods

The method described above, which performs a single step of policy evaluation and a single step of policy improvement at a time, is called generalized policy iteration; it is a very general and common scheme. However, there are more accurate variants built on more accurate policy evaluation.

  • In each policy evaluation step, instead of updating the utilities for just one time step, solve the following system of equations for the exact utilities (see the first sketch after this list):

$$U_t(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_t(s))\, U_t(s')$$

For $N$ states there are $N$ linear equations, and they can be solved in $O(N^3)$ time with basic linear algebra. This is much more expensive but much more accurate; however, it is overkill for almost all problems, since we usually do not need the most accurate utilities.

  • Another method runs $k$ sweeps of the evaluation update in the policy evaluation step, rather than iterating to convergence or stopping after a single step; this is called modified policy iteration (see the second sketch after this list). The value of $k$ can be chosen according to the environment and the problem.
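For the first variant, the linear system for exact policy evaluation can be solved with a standard routine. The dense-matrix construction below (states mapped to indices, then `numpy.linalg.solve` applied to $(I - \gamma T_{\pi})U = R$) is an assumed setup for illustration:

```python
import numpy as np

def exact_policy_evaluation(states, pi, P, R, gamma=0.9):
    """Solve U = R + gamma * T_pi U exactly, i.e. (I - gamma * T_pi) U = R."""
    idx = {s: i for i, s in enumerate(states)}     # fix an ordering of the states
    n = len(states)
    T = np.zeros((n, n))
    r = np.zeros(n)
    for s in states:
        r[idx[s]] = R[s]
        for s2, p in P[s][pi[s]].items():          # transitions under action pi(s)
            T[idx[s], idx[s2]] = p
    u = np.linalg.solve(np.eye(n) - gamma * T, r)  # O(N^3) linear solve
    return {s: float(u[idx[s]]) for s in states}
```

For modified policy iteration, the evaluation step simply performs $k$ sweeps of the fixed-policy update before improving the policy; a hedged sketch under the same assumed representation:

```python
def k_step_policy_evaluation(states, pi, P, R, U, gamma=0.9, k=5):
    """Run k sweeps of the fixed-policy Bellman update, starting from U."""
    for _ in range(k):
        U = {
            s: R[s] + gamma * sum(p * U[s2] for s2, p in P[s][pi[s]].items())
            for s in states
        }
    return U
```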