Reinforcement Learning: An Introduction读书笔记第四章动态规划

最新推荐文章于 2022-07-19 23:04:01 发布

TZX世界第一可爱

最新推荐文章于 2022-07-19 23:04:01 发布

阅读量227

点赞数

分类专栏： Reinforcement Learning 文章标签：动态规划机器学习

本文链接：https://blog.csdn.net/joycethoreau/article/details/107323494

版权

Reinforcement Learning 专栏收录该内容

4 篇文章 0 订阅

订阅专栏

Reinforcement Learning: An Introduction读书笔记第四章动态规划）

Reinforcement Learning: An Introduction读书笔记第四章动态规划

Reinforcement Learning: An Introduction读书笔记第四章动态规划

The term dynamic programming (DP) refers to a collection of algorithms that
can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).

The environment is assumed as a finit MDP in this chapter.

Bellman optimality equations:
在这里插入图片描述

DP algorithms are obtained by turning Bellman equations into update rules for improving approximations of the desired value functions.

4.1 Policy Evaluation

Suppose that the environment’s dynamics are completely known.
Consider a sequence of approximate value functions v0, v1, v2, . . ., each mapping S⁺ to R.
在这里插入图片描述
This algorithm is called iterative policy evaluation.

All the backups done in DP algorithms are called full backups because they are based on all possible next states rather than on a sample next state.

4.2 Policy Improvement

For some state s we would like to know whether or not we should change the policy to deterministically choose an action a ≠ π(s).

Policy improvement theorem:
Let π and π’ be any pair of deterministic policies such that, for all s ∈ S,

Then the policy π’ must be as good as, or better than, π. That is, it must obtain greater or equal expected return from all states s ∈ S:

Proof:

the new greedy policy, π’ is given by:
在这里插入图片描述

4.3 Policy Iteration

在这里插入图片描述
where E→ denotes a policy evaluation and I→ denotes a policy improvement.

4.4 Value Iteration

One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set.
One important special case is when policy evaluation is stopped after just one sweep (one backup of each state). This algorithm is called value iteration.

在这里插入图片描述

4.5 Asynchronous Dynamic Programming

A major drawback to the DP methods that we have discussed so far is that they involve operations over the entire state set of the MDP, that is, they require sweeps of the state set. If the state set is very large, then even a single sweep can be prohibitively expensive.
Asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. These algorithms back up the values of states in any order whatsoever, using whatever values of other states happen to be available. The values of some states may be backed up several times before the values of others are backed up once.

4.6 Generalized Policy Iteration(GPI)

We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.

Almost all reinforcement learning methods are well described as GPI. That is, all have identiﬁable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy.

The evaluation and improvement processes in GPI can be viewed as both competing and cooperating.
One might also think of the interaction between the evaluation and improvement processes in GPI in terms of two constraints or goals|for example, as two lines in two-dimensional space:

总结

在这里插入图片描述

TZX世界第一可爱

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Reinforcement Learning: An Introduction读书笔记第四章动态规划

Reinforcement Learning: An Introduction读书笔记第四章动态规划）Reinforcement Learning: An Introduction读书笔记第四章动态规划4.1 Policy Evaluation4.2 Policy Improvement4.3 Policy Iteration4.4 Value Iteration4.5 Asynchronous Dynamic Programming4.6 Generalized Policy Iteration(
复制链接

扫一扫