[Chapter 2] Value Iteration and Policy Iteration

This post explains that the key to computing an optimal policy in infinite-horizon problems is solving for the value function, and introduces two main methods: value iteration and policy iteration. Value iteration repeatedly applies the Bellman update to the value function until it converges. Policy iteration initializes a policy and then alternates policy evaluation and policy improvement until the policy no longer changes, yielding the optimal policy. More accurate policy iteration variants perform several iterations of policy evaluation within each improvement cycle.

We now know that the most important step in computing an optimal policy is computing the value function. But how? (Everything below assumes infinite-horizon problems.) The approaches to this problem fall roughly into two categories: Value Iteration and Policy Iteration.

Value Iteration

Based on the formula for expected utility with a discount factor:

$$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t)\right]$$

and the transition function $P(s' \mid s, a)$ defined in the MDP model, there is an equation that the value function intuitively satisfies:

$$V(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, V(s')$$

which is called the Bellman equation.

Unfortunately, the Bellman equation is very difficult to solve directly because of the $\max$ operator, which makes the system of equations nonlinear; that is why we need the value iteration method.

Based on the Bellman equation, we can derive the Bellman update:

$$U_{t+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_t(s')$$

where $t$ denotes the iteration step.

The value iteration algorithm can be described as follows:

We initialize the utilities of all states to 0 and apply the Bellman update repeatedly until the values converge (every state's value reaches its unique fixed point). This is far cheaper than solving the Bellman equations directly.
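To make this concrete, here is a minimal Python sketch of value iteration. The representation it assumes (a list of `states`, a function `actions(s)` returning the available actions, nested dictionaries `P[s][a]` mapping next states to probabilities, a reward dictionary `R`, and a tolerance `theta`) is chosen purely for illustration and is not part of the original text:

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Value iteration sketch.

    P[s][a] is a dict {s_next: probability}; R[s] is the reward of state s.
    Returns a dict of converged utilities U[s].
    """
    U = {s: 0.0 for s in states}          # initialize all utilities to 0
    while True:
        delta = 0.0
        U_new = {}
        for s in states:
            # Bellman update: max over actions of the expected next-state utility
            best = max(
                sum(p * U[s2] for s2, p in P[s][a].items())
                for a in actions(s)
            )
            U_new[s] = R[s] + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < theta:                 # stop when updates become tiny
            return U
```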

Policy Iteration

Now consider a different idea. In the first method we update the values/utilities of each state; in policy iteration, we instead initialize and update a policy. The motivation is that to find the optimal policy we often do not need a highly accurate value function, e.g. when one action is clearly better than the others. So instead of computing values with the Bellman update, we initialize a policy $\pi_0$ and then improve it. To do this, we alternate the following two main steps:

  • Policy evaluation: given the current policy $\pi_t$, calculate the next-step utilities $U_{t+1}$:

$$U_{t+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_t(s))\, U_t(s')$$

This update is similar to, but simpler than, the Bellman update: there is no maximization over actions, because we use the action prescribed by the policy at step $t$.

  • Policy improvement: use the one-step look-ahead utilities $U_{t+1}(s)$ to compute a new policy $\pi_{t+1}$.

To improve the policy, we need to find a better policy to replace the current one. To do so, we introduce the action-value function, or Q-function, of a policy $\pi$:

$$Q^{\pi}(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, U^{\pi}(s')$$

The main difference between the Q-function and the value function is that the Q-function is the expected utility after committing to a particular action $a$. If the state space has size $|S|$ and the action space has size $|A|$, then for each policy $\pi$ there are $|S|$ values, one per state, and $|S| \times |A|$ Q-values, one per state-action pair.
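As a small illustration, a sketch of the Q-value computation, reusing the assumed dictionary-based MDP representation from the value iteration example above:

```python
def q_value(s, a, P, R, U, gamma=0.9):
    """Q(s, a): expected utility of taking action a in state s and
    thereafter following the policy whose utilities are given by U."""
    return R[s] + gamma * sum(p * U[s2] for s2, p in P[s][a].items())
```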

The policy improvement theorem is then easy to state: suppose a new policy $\pi'$ satisfies, for all $s \in S$,

$$Q^{\pi}(s, \pi'(s)) \geq U^{\pi}(s)$$

Then for all $s \in S$:

$$U^{\pi'}(s) \geq U^{\pi}(s)$$

In this case, $\pi'$ is at least as good as $\pi$, so we can improve the policy from $\pi$ to $\pi'$. In practice, the improvement step simply chooses the greedy policy, $\pi_{t+1}(s) = \arg\max_{a \in A(s)} Q^{\pi_t}(s, a)$, which satisfies this condition by construction.

Alternating the above two steps until the policy no longer changes yields the optimal policy; this is the policy iteration algorithm, described as follows:
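Below is a minimal sketch of this loop in Python, again under the assumed MDP representation and reusing the `q_value` helper defined above; it illustrates the idea rather than reproducing any particular reference implementation:

```python
import random

def policy_iteration(states, actions, P, R, gamma=0.9):
    """Alternate one sweep of policy evaluation with greedy policy improvement."""
    pi = {s: random.choice(actions(s)) for s in states}   # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: one backup per state under the current policy
        U = {
            s: R[s] + gamma * sum(p * U[s2] for s2, p in P[s][pi[s]].items())
            for s in states
        }
        # Policy improvement: act greedily with respect to the new utilities
        new_pi = {
            s: max(actions(s), key=lambda a: q_value(s, a, P, R, U, gamma))
            for s in states
        }
        if new_pi == pi:   # stop when the policy no longer changes
            return pi, U
        pi = new_pi
```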

More Accurate Policy Iteration Methods

The method described above, which performs a single step of policy evaluation and a single step of policy improvement at a time, is called generalized policy iteration; it is a very general and common scheme. However, there are more accurate variants built on more accurate policy evaluation.

  • In each policy evaluation step, instead of updating the utilities for just one time step, solve the following system of equations for the exact utilities (see the first sketch after this list):

$$U_t(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_t(s))\, U_t(s')$$

For $N$ states there are $N$ linear equations, and they can be solved in $O(N^3)$ time with basic linear algebra. This is much more expensive but much more accurate; however, it is overkill for almost all problems, since we usually do not need the most accurate utilities.

  • Another method runs $k$ sweeps of the evaluation update in the policy evaluation step, rather than iterating to convergence or stopping after a single step; this is called modified policy iteration (see the second sketch after this list). The value of $k$ can be chosen according to the environment and the problem.
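For the first variant, the linear system for exact policy evaluation can be solved with a standard routine. The dense-matrix construction below (states mapped to indices, then `numpy.linalg.solve` applied to $(I - \gamma T_{\pi})U = R$) is an assumed setup for illustration:

```python
import numpy as np

def exact_policy_evaluation(states, pi, P, R, gamma=0.9):
    """Solve U = R + gamma * T_pi U exactly, i.e. (I - gamma * T_pi) U = R."""
    idx = {s: i for i, s in enumerate(states)}     # fix an ordering of the states
    n = len(states)
    T = np.zeros((n, n))
    r = np.zeros(n)
    for s in states:
        r[idx[s]] = R[s]
        for s2, p in P[s][pi[s]].items():          # transitions under action pi(s)
            T[idx[s], idx[s2]] = p
    u = np.linalg.solve(np.eye(n) - gamma * T, r)  # O(N^3) linear solve
    return {s: float(u[idx[s]]) for s in states}
```

For modified policy iteration, the evaluation step simply performs $k$ sweeps of the fixed-policy update before improving the policy; a hedged sketch under the same assumed representation:

```python
def k_step_policy_evaluation(states, pi, P, R, U, gamma=0.9, k=5):
    """Run k sweeps of the fixed-policy Bellman update, starting from U."""
    for _ in range(k):
        U = {
            s: R[s] + gamma * sum(p * U[s2] for s2, p in P[s][pi[s]].items())
            for s in states
        }
    return U
```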