Chapter 8: Planning and Learning with Tabular Methods

1 Introduction

  • In this chapter we develop a unified view of reinforcement learning methods that require a model of the environment, such as dynamic programming and heuristic search, and methods that can be used without a model, such as Monte Carlo and temporal-difference methods. These are respectively called model-based and model-free reinforcement learning methods. Model-based methods rely on planning as their primary component, while model-free methods primarily rely on learning.

  • Model-Based: By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions. Some models produce a description of all possibilities and their probabilities; these we call distribution models. Other models produce just one of the possibilities, sampled according to the probabilities; these we call sample models.

  • Planning: We use this term to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment. The state-space planning methods we consider share a common structure: (1) they all involve computing value functions as a key intermediate step toward improving the policy, and (2) they compute those value functions by update or backup operations applied to simulated experience.

  • Planning and learning share the same basic structure but draw on different sources of experience: planning works with simulated experience generated from a model of the environment (i.e., from currently available information), while learning works with real experience from the environment itself. The heart of both is the estimation of value functions by backing-up update operations, and both aim to reach an optimal policy by computing value functions (here in tabular form). Real experience can also be used to improve the model, so planning and learning combine naturally, as in Dyna and Dyna-Q.

2 Summary of TD, DP, MC, (MCTS(HS), ES)

2.1 TD, MC, DP returns and update
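
For reference, the standard tabular state-value updates (with G_t denoting the Monte Carlo return) are:

  • MC: V(S_t) \leftarrow V(S_t) + \alpha \, [ G_t - V(S_t) ]
  • TD(0): V(S_t) \leftarrow V(S_t) + \alpha \, [ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) ]
  • DP (expected update): V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \, [ r + \gamma V(s') ]

MC uses the actual return of a completed (real or simulated) episode, TD(0) bootstraps from a single sampled next state, and DP bootstraps from all possible next states weighted by the model's probabilities.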

2.2 Expected & sample Update

Expected update Q:
Q(s,a) \leftarrow \sum_{s',r} p(s',r \mid s,a) \, \big[ r + \gamma \max_{a'} Q(s',a') \big]
The dynamics of the model, p(s',r \mid s,a), are needed to perform such an expected update.

Sample update Q:
Q(s,a) \leftarrow Q(s,a) + \alpha \, \big[ R + \gamma \max_{a'} Q(S',a') - Q(s,a) \big]
Only a single sampled transition (from real experience or a sample model) is needed for such a sample update.
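
As a minimal sketch of the contrast, assuming a tabular Q stored as a 2-D NumPy array and an illustrative distribution model p(s, a) that yields (s_next, r, prob) triples (these interfaces are assumptions, not from the text):

```python
import numpy as np

def expected_update(Q, s, a, p, gamma=0.95):
    """Expected (DP-style) update: needs the full distribution p(s', r | s, a)."""
    Q[s, a] = sum(prob * (r + gamma * np.max(Q[s_next]))
                  for s_next, r, prob in p(s, a))

def sample_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Sample (Q-learning-style) update: needs only one sampled transition (s, a, r, s')."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```

The expected update is exact with respect to the model but costs a sum over all successors, while the sample update is cheap but introduces sampling error that the step size α averages out.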

2.3 Trajectory sampling

As discussed previously, distributing updates uniformly during planning is often sub-optimal. This is because for many tasks, the majority of possible updates will be on irrelevant or low-probability trajectories.

We can instead generate experience and updates in planning by simulating the interaction of the current policy with the model and updating only the state-action pairs encountered along the simulated trajectories. We call this trajectory sampling. Naturally, trajectory sampling generates updates according to the on-policy distribution.

Focusing on the on-policy distribution can be beneficial because it causes uninteresting parts of the space to be ignored, but it can be detrimental because it causes the same parts of the space to be updated repeatedly. For larger problems, it is often the case that distributing updates according to the on-policy distribution is preferable to using the uniform distribution.
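
A sketch of trajectory sampling in planning, assuming a sample model exposing model.step(s, a) -> (r, s_next, done) and an ε-greedy policy derived from Q (these interfaces are assumptions for illustration):

```python
import numpy as np

def plan_by_trajectory_sampling(Q, model, start_state, n_trajectories=10,
                                max_steps=100, alpha=0.1, gamma=0.95, eps=0.1):
    """Simulate trajectories under the current eps-greedy policy and update
    only the state-action pairs that actually occur along them."""
    n_actions = Q.shape[1]
    for _ in range(n_trajectories):
        s = start_state
        for _ in range(max_steps):
            # eps-greedy action selection: the on-policy distribution emerges from this.
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            r, s_next, done = model.step(s, a)              # simulated experience
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])           # sample update
            if done:
                break
            s = s_next
```

Only the pairs visited along simulated trajectories are updated, so the update distribution automatically matches the on-policy distribution.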

3 Dyna, Dyna-Q+

  • Indirect and direct learning:
    Both direct and indirect methods have advantages and disadvantages. Indirect methods often make fuller use of a limited amount of experience and thus achieve a better policy with fewer environmental interactions. On the other hand, direct methods are much simpler and are not affected by biases in the design of the model.

3.1 Dyna
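
Dyna-Q interleaves three processes in a single loop: direct RL on real experience (Q-learning), model learning (recording each observed transition), and planning (n Q-learning updates on transitions replayed from the learned model). A minimal sketch for a deterministic, episodic environment, assuming an illustrative env with reset() -> s and step(a) -> (s_next, r, done):

```python
import random
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    model = {}  # (s, a) -> (r, s_next): deterministic model learned from experience

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)                          # (a) real experience
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])                  # (b) direct RL
            model[(s, a)] = (r, s_next)                            # (c) model learning
            for _ in range(n_planning):                            # (d) planning
                (ps, pa), (pr, pn) = random.choice(list(model.items()))
                Q[ps, pa] += alpha * (pr + gamma * np.max(Q[pn]) - Q[ps, pa])
            s = s_next
    return Q
```

With n_planning = 0 this reduces to plain Q-learning; increasing it lets the agent squeeze more policy improvement out of each real interaction.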



3.2 When the environment changes: Dyna-Q+

When the environment changes over time, the currently optimal policy may also change, so the agent must keep exploring. To encourage exploration, a special bonus reward can be given to long-untried actions, resulting in the Dyna-Q+ method.

In a planning context, exploration means trying actions that improve the model, whereas exploitation means behaving in the optimal way given the current model.

The Dyna-Q+ agent, which does solve the book's shortcut-maze example (where a shorter path opens up only after some time), uses one such heuristic. This agent keeps track, for each state–action pair, of how many time steps have elapsed since the pair was last tried in a real interaction with the environment. The more time that has elapsed, the greater (we might presume) the chance that the dynamics of this pair have changed and that the model of it is incorrect. To encourage behavior that tests long-untried actions, a special "bonus reward" is given on simulated experiences involving these actions.
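
Concretely, if the modeled reward for a transition is r and the pair has not been tried in τ real time steps, planning updates are done as if the transition produced r + κ√τ for some small κ. A sketch of one such planning update, reusing the Q and model tables from the Dyna-Q sketch above plus an assumed last_tried table and global step counter t_now:

```python
import math, random
import numpy as np

def dyna_q_plus_planning_step(Q, model, last_tried, t_now,
                              kappa=1e-3, alpha=0.1, gamma=0.95):
    """One simulated update with the Dyna-Q+ exploration bonus r + kappa * sqrt(tau)."""
    (s, a), (r, s_next) = random.choice(list(model.items()))
    tau = t_now - last_tried[(s, a)]        # real time steps since (s, a) was last tried
    bonus = kappa * math.sqrt(tau)          # grows the longer the pair goes untested
    Q[s, a] += alpha * (r + bonus + gamma * np.max(Q[s_next]) - Q[s, a])
```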

3.3 When the state space is large: Prioritized Sweeping

When the state space is large or the model is complex, it becomes important to focus computation on the states or state–action pairs that matter most. The ideas of backward focusing and prioritized sweeping can be used to concentrate updates on the most influential state or state–action values.

  • backward focusing
    In general, we want to work back not just from goal states but from any state whose value has changed.
  • forward focusing
    An alternative is to focus on states according to how easily they can be reached from the states that are visited frequently under the current policy; this might be called forward focusing.

It is natural to prioritize the updates according to a measure of their urgency and to perform them in order of priority. This is the idea behind prioritized sweeping: a queue is maintained of every state–action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change.
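
A sketch of the prioritized-sweeping planning loop for a deterministic model, assuming the same model[(s, a)] = (r, s_next) table as above and an assumed predecessors[s] map listing (s_pred, a_pred) pairs known to lead to s:

```python
import heapq
import numpy as np

def prioritized_sweeping(Q, model, predecessors, s, a,
                         theta=1e-4, n_updates=50, alpha=0.5, gamma=0.95):
    """Seed the queue with the just-experienced pair, then repeatedly update the
    highest-priority pair and push its predecessors (backward focusing)."""
    pqueue = []  # min-heap over negative priority

    def push(state, action):
        rew, nxt = model[(state, action)]
        priority = abs(rew + gamma * np.max(Q[nxt]) - Q[state, action])
        if priority > theta:                     # only nontrivial changes enter the queue
            heapq.heappush(pqueue, (-priority, state, action))

    push(s, a)
    for _ in range(n_updates):
        if not pqueue:
            break
        _, ps, pa = heapq.heappop(pqueue)
        pr, pn = model[(ps, pa)]
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[pn]) - Q[ps, pa])
        for pre_s, pre_a in predecessors.get(ps, ()):
            push(pre_s, pre_a)                   # propagate the change backward
```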

4 Real-time dynamic programming

Real-time dynamic programming (RTDP) is an on-policy, trajectory-sampling version of value-iteration DP: it performs DP value-iteration updates, but distributes them according to the on-policy distribution. As such, it is a form of asynchronous DP.

In general, finding an optimal policy with an on-policy trajectory-sampling control method (e.g., Sarsa) requires visiting all state–action pairs infinitely many times in the limit. This is true for RTDP as well, but there are certain types of problems for which RTDP is guaranteed to find an optimal partial policy (one that is optimal on the relevant states) without visiting all states infinitely often, or even at all. This is an advantage for problems with very large state sets.
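
A sketch of one RTDP trial, assuming a known distribution model p(s, a) yielding (s_next, r, prob) triples, an actions(s) function, a sample_next(s, a) sampler for following the trajectory, and an is_goal(s) test (all of these interfaces are illustrative assumptions):

```python
def rtdp_trial(V, p, actions, sample_next, start_state, is_goal,
               gamma=1.0, max_steps=1000):
    """One RTDP trial: full value-iteration updates applied only along a greedy trajectory."""
    s = start_state
    for _ in range(max_steps):
        if is_goal(s):
            break
        # Expected update with a max over actions, i.e. value iteration at state s only.
        q = {a: sum(prob * (r + gamma * V[s_next]) for s_next, r, prob in p(s, a))
             for a in actions(s)}
        best_a = max(q, key=q.get)
        V[s] = q[best_a]
        s = sample_next(s, best_a)   # follow the greedy action to the next state
    return V
```

Repeating such trials from (possibly randomized) start states concentrates the value-iteration updates on the states actually reachable under greedy behavior.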

5 Decision-time planning

The type of planning we have considered so far is the gradual improvement of a policy or value function based on simulated experience. Because it is not focused on selecting an action for the current state, it can run in the background and is called background planning.

An alternative type of planning, decision-time planning, is a search (sometimes many actions deep) over possible future trajectories starting from the current state, carried out in order to select the next action.

5.1 Heuristic search

In heuristic search, for each state encountered, a large tree of possible continuations is considered. The approximate value function is applied to the leaf nodes and then backed up toward the current state at the root. The backing up within the search tree is the same as in the expected updates with maxes discussed throughout this book. The backing up stops at the state–action nodes for the current state. Once the backed-up values of these nodes are computed, the best of them is chosen as the current action, and then all backed-up values are discarded.
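
A sketch of a depth-limited heuristic-search backup, assuming a distribution model p(s, a), an actions(s) function, and an approximate leaf evaluation v_hat(s) (names are illustrative):

```python
def heuristic_search_action(s0, actions, p, v_hat, depth=3, gamma=0.95):
    """Expand expected updates with maxes to a fixed depth, evaluate the leaves
    with v_hat, back the values up, and return the best root action."""
    def value(s, d):
        if d == 0:
            return v_hat(s)                      # approximate value function at the leaves
        return max(q_value(s, a, d) for a in actions(s))

    def q_value(s, a, d):
        return sum(prob * (r + gamma * value(s_next, d - 1))
                   for s_next, r, prob in p(s, a))

    # The backup stops at the root's action nodes; pick the best and discard the rest.
    return max(actions(s0), key=lambda a: q_value(s0, a, depth))
```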

5.2 Rollout Algorithm

Rollout algorithms are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state. Starting from the current state, they estimate the value of each candidate action by averaging the returns of simulated trajectories that begin with that action and then follow a given policy, called the rollout policy. The action with the highest estimated value is executed, the environment transitions, and the process is repeated from the new state. This is useful when one already has a reasonable policy and needs to average over stochasticity in the environment to pick a good immediate action.
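
A sketch of one rollout decision, assuming a sample model model.step(s, a) -> (r, s_next, done), an actions(s) function, and a given rollout_policy(s) (these names are assumptions): each candidate action is tried first, the rollout policy is followed afterward, and the returns are averaged.

```python
import numpy as np

def rollout_action(s0, actions, model, rollout_policy,
                   n_sims=50, gamma=1.0, max_steps=200):
    """Estimate each action's value by Monte Carlo rollouts and return the best action."""
    def simulate(s, a):
        ret, discount = 0.0, 1.0
        for _ in range(max_steps):
            r, s, done = model.step(s, a)
            ret += discount * r
            discount *= gamma
            if done:
                break
            a = rollout_policy(s)        # after the first action, follow the rollout policy
        return ret

    estimates = {a: np.mean([simulate(s0, a) for _ in range(n_sims)]) for a in actions(s0)}
    return max(estimates, key=estimates.get)
```

By the policy improvement theorem, acting greedily with respect to these estimates and thereafter following the rollout policy is at least as good as the rollout policy itself.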

5.3 Monte Carlo tree search (MCTS)

Monte Carlo Tree Search (MCTS) is a successful example of decision-time planning. It is a rollout algorithm that accumulates value estimates from its Monte Carlo simulations in order to guide the search. A variant of MCTS was used in AlphaGo.

A basic version of MCTS repeats the following steps, starting at the current state:

  1. Selection: Starting at the root node, a tree policy based on action-values attached to the edges of the tree (that balances exploration and exploitation) traverses the tree to select a leaf node.
  2. Expansion: On some iterations (depending on the implementation), the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions.
  3. Simulation: From the selected node, or from one of its newly added child nodes (if any), a simulation of a complete episode is run with actions selected by the rollout policy. The result is a Monte Carlo trial with actions selected first by the tree policy and, beyond the tree, by the rollout policy.
  4. Backup: The return generated by the simulated episode is backed up to update, or to initialise, the action values attached to the edges of the tree traversed by the tree policy in this iteration of MCTS. No values are saved for the states and actions visited by the rollout policy beyond the tree.

MCTS executes this process iteratively, starting at the current state, until no more time is left or computational resources are exhausted. An action is then selected according to some statistic of the root node (e.g., the largest action value or the most-visited edge). After the environment transitions to a new state, MCTS is run again, sometimes starting with a tree containing only a single root node representing the new state, but often starting with a tree containing any descendants of this node left over from the tree constructed in the previous execution of MCTS; all remaining nodes are discarded, along with the action values associated with them.
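
A compact MCTS sketch with UCB1 as the tree policy and a uniform-random rollout policy, assuming a sample model model.step(s, a) -> (r, s_next, done) and a finite action list actions(s); for simplicity the model is treated as deterministic so each tree edge has a single child (all names are illustrative, and real implementations add tree reuse, transposition handling, etc.):

```python
import math, random

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> child Node
        self.N = {}          # action -> visit count of the edge
        self.Q = {}          # action -> mean return backed up through the edge

def mcts(root_state, actions, model, n_iter=1000, gamma=1.0, c=1.4, max_rollout=100):
    root = Node(root_state)

    def ucb(node, a):        # tree policy: balance exploitation (Q) and exploration (N)
        total = sum(node.N.values())
        return node.Q[a] + c * math.sqrt(math.log(total) / node.N[a])

    def rollout(s):          # simulation beyond the tree; nothing is saved here
        ret, discount = 0.0, 1.0
        for _ in range(max_rollout):
            r, s, done = model.step(s, random.choice(actions(s)))
            ret += discount * r
            discount *= gamma
            if done:
                break
        return ret

    def simulate(node):
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried:                                   # expansion
            a = random.choice(untried)
            r, s_next, done = model.step(node.state, a)
            node.children[a] = Node(s_next)
            node.N[a], node.Q[a] = 0, 0.0
            ret = r + (0.0 if done else gamma * rollout(s_next))
        else:                                         # selection: descend via UCB1
            a = max(node.children, key=lambda act: ucb(node, act))
            r, s_next, done = model.step(node.state, a)
            ret = r + (0.0 if done else gamma * simulate(node.children[a]))
        node.N[a] += 1                                # backup along the traversed edge
        node.Q[a] += (ret - node.Q[a]) / node.N[a]
        return ret

    for _ in range(n_iter):
        simulate(root)
    return max(root.N, key=root.N.get)                # act, e.g., by most-visited edge
```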
