Planning and Learning with Tabular Methods: Part 2

Expected vs. Sample Updates

Sample updates can in many cases get closer to the true value function with less computation, despite the variance introduced by sampling. Sample updates break the overall backing-up computation into smaller pieces, which enables the computation to be focused more narrowly on the pieces that will have the largest impact.

Expected updates certainly yield a better estimate because they are uncorrupted by sampling error, but they also require more computation, and computation is often the limiting resource in planning.

Here is an example of applying expected and sample updates to approximate $q_*$:
Expected updates
$$Q(s, a) \leftarrow \sum_{s',r} \hat{p}(s', r \mid s, a)\,\big[r + \gamma \max_{a'} Q(s', a')\big]$$
Sample updates
$$Q(s, a) \leftarrow Q(s, a) + \alpha\,\big[R + \gamma \max_{a'} Q(S', a') - Q(s, a)\big]$$

In favor of the expected update is that it is an exact computation, resulting in a new $Q(s, a)$ whose correctness is limited only by the correctness of the $Q(s', a')$ at successor states.
The sample update is in addition affected by sampling error, but it is cheaper computationally because it considers only one next state.
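
As a concrete illustration, here is a minimal sketch of the two update rules in Python. The tabular `Q` array (indexed by integer state and action) and the `model.transitions(s, a)` interface, yielding `((s_next, r), prob)` pairs, are assumptions for this sketch, not a specific library API:

```python
import numpy as np

def expected_update(Q, model, s, a, gamma=0.95):
    """Expected update: averages over every possible next state and reward
    under the estimated dynamics p_hat(s', r | s, a)."""
    Q[s, a] = sum(prob * (r + gamma * np.max(Q[s_next]))
                  for (s_next, r), prob in model.transitions(s, a))

def sample_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Sample update: uses a single sampled transition (s, a, r, s_next),
    so it is cheaper but subject to sampling error."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```
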
For a particular starting pair $s, a$, let $b$ be the branching factor (the number of possible next states $s'$ with $\hat{p}(s' \mid s, a) > 0$). Then an expected update of this pair requires roughly $b$ times as much computation as a sample update.

Given a unit of computational effort, is it better devoted to a few expected updates or to $b$ times as many sample updates?

[Figure: estimation error as a function of computation time for expected and sample updates, for several branching factors $b$]
The figure above shows the estimation error as a function of computation time for expected and sample updates for a variety of branching factors $b$, in the case where the $b$ successor states are equally likely and the error in the initial estimate is 1.

The expected update reduces the error to zero upon its completion, whereas sample updates reduce the error according to $\sqrt{\frac{b-1}{bt}}$ (assuming sample averaging, i.e., $\alpha = 1/t$), where $t$ is the number of sample updates that have been performed.
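
A quick numerical check of this comparison (a minimal sketch; the unit of computation here is one sample update, so one expected update costs $b$ units):

```python
import numpy as np

b = 100                       # branching factor
t = np.arange(1, 2 * b + 1)   # number of sample updates performed
sample_error = np.sqrt((b - 1) / (b * t))

# One expected update costs b units of computation and only then drops the
# error to zero; sample updates have already cut it sharply much earlier.
print(sample_error[b // 10 - 1])   # after b/10 sample updates: ~0.31
print(sample_error[b - 1])         # after b sample updates:    ~0.10
```
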

From the figure, it can be concluded that for moderately large $b$ the error falls dramatically with only a tiny fraction of $b$ sample updates. In other words, many state-action pairs could have their values improved dramatically, to within a few percent of the effect of an expected update, in the same time that a single state-action pair could undergo one expected update.

Sample updates are likely to be superior to expected updates on problems with large stochastic branching factors and too many states to be solved exactly.

In a real problem, the values of the successor states would be estimates that are themselves updated. By causing estimates to be more accurate sooner, sample updates will have a second advantage in that the values backed up from the successor states will be more accurate.

Trajectory Sampling

There are two ways of distributing updates:

  • Exhaustive sweeps: perform sweeps through the entire state space, updating each state once per sweep.
  • On-policy sampling: sample from the state or state-action space according to the on-policy distribution, that is, according to the distribution observed when following the current policy.

One advantage of the latter approach is that it is easily generated: one simply interacts with the model, following the current policy.
Sample state transitions and rewards are given by the model, and sample actions are given by the current policy. This way of generating experience and updates is called trajectory sampling:

One simulates explicit individual trajectories and performs updates at the state or state-action pairs encountered along the way.
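
A minimal sketch of trajectory sampling along these lines (the `model.sample(s, a)` interface returning `(s_next, reward, terminal)` and the tabular `Q` array are assumptions):

```python
import numpy as np

def trajectory_sampling(Q, model, start_state, n_episodes,
                        epsilon=0.1, alpha=0.1, gamma=0.95, rng=None):
    """Simulate individual trajectories with the current epsilon-greedy policy
    and apply sample updates only at the state-action pairs encountered."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    for _ in range(n_episodes):
        s = start_state
        terminal = False
        while not terminal:
            # sample an action from the current epsilon-greedy policy
            if rng.random() < epsilon:
                a = rng.integers(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            # sample the transition and reward from the model
            s_next, r, terminal = model.sample(s, a)   # assumed model API
            target = r if terminal else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```
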

On-policy distribution

Focusing on the on-policy distribution could be beneficial because it causes vast, uninteresting parts of the space to be ignored, or it could be detrimental because it causes the same old parts of the space to be updated over and over.

Here we conduct a small experiment to assess this.

In the uniform case, we cycled through all state-action pairs, updating each in place; in the on-policy case we simulated episodes, all starting in the same state, updating each state-action pair that occurred under the current $\epsilon$-greedy policy ($\epsilon = 0.1$). The tasks were undiscounted episodic tasks, generated randomly as follows.
From each of the $|S|$ states, two actions were possible, each of which resulted in one of $b$ next states, all equally likely, with a different random selection of $b$ next states for each state-action pair. The branching factor $b$ was the same for all state-action pairs. In addition, on all transitions there was a 0.1 probability of transition to the terminal state, ending the episode. The expected reward on each transition was selected from a Gaussian distribution with mean 0 and variance 1. At any point in the planning process one can stop and exhaustively compute $v_{\tilde{\pi}}(s_0)$, the true value of the start state under the greedy policy $\tilde{\pi}$ given the current action-value function $Q$, as an indication of how well the agent would do on a new episode on which it acted greedily.
[Figure: value of the start state under the greedy policy as a function of computation, for uniform and on-policy update distributions]
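
A sketch of how such random tasks can be generated (the class name and method signatures are hypothetical, but the construction follows the description above):

```python
import numpy as np

class RandomTask:
    """Randomly generated undiscounted episodic task: |S| states, two actions,
    branching factor b, and a 0.1 termination probability on every transition."""
    def __init__(self, n_states=1000, b=3, seed=0):
        self.rng = np.random.default_rng(seed)
        self.b = b
        # for each (s, a): b equally likely successor states, each with an
        # expected reward drawn from a Gaussian with mean 0 and variance 1
        self.successors = self.rng.integers(0, n_states, size=(n_states, 2, b))
        self.rewards = self.rng.normal(0.0, 1.0, size=(n_states, 2, b))

    def sample(self, s, a):
        """Return (next_state, reward, terminal) for one sampled transition."""
        if self.rng.random() < 0.1:          # jump to the terminal state
            return None, 0.0, True
        k = self.rng.integers(self.b)        # one of the b successors, equally likely
        return int(self.successors[s, a, k]), float(self.rewards[s, a, k]), False
```

The uniform case then cycles over all `(s, a)` pairs and applies an expected update to each, while the on-policy case generates episodes with `sample` and updates only the pairs that occur.
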
Sampling according to the on-policy distribution resulted in faster planning initially but retarded planning in the long run. The effect was stronger, and the initial period of faster planning was longer, at smaller branching factors. The effect also became stronger as the number of states increased.

In the short term, sampling according to the on-policy distribution helps by focusing on states that are near descendants of the start state. In the long run, focusing on the on-policy distribution may hurt because the commonly occurring states already have their correct values. The exhaustive, unfocused approach does better in the long run, at least in small problems.
Nevertheless, sampling according to the on-policy distribution can be a great advantage for large problems, in particular for problems in which only a small subset of the state-action space is visited under the on-policy distribution.

Real-time Dynamic Programming (RTDP)

RTDP is an on-policy trajectory-sampling version of the value iteration algorithm of DP. RTDP updates the values of states visited in actual or simulated trajectories by means of expected tabular value-iteration updates:
$$v_{k+1}(s) \doteq \max_{a} \mathbb{E}\big[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a\big] = \max_{a} \sum_{s',r} p(s', r \mid s, a)\,\big[r + \gamma v_k(s')\big]$$
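
A minimal sketch of one RTDP trajectory using this update (the `model` interface with known dynamics `transitions(s, a)`, `actions(s)`, and `is_goal(s)` is assumed for illustration):

```python
import numpy as np

def rtdp_episode(V, model, start_state, gamma=1.0, rng=None):
    """Run one RTDP trajectory: at each visited state perform a full expected
    value-iteration update, then follow the greedy action to the next state."""
    rng = rng or np.random.default_rng()
    s = start_state
    while not model.is_goal(s):
        # expected update: v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
        q = {a: sum(p * (r + gamma * V[s_next])
                    for (s_next, r), p in model.transitions(s, a))
             for a in model.actions(s)}
        V[s] = max(q.values())
        greedy_a = max(q, key=q.get)
        # sample the next state from the model under the greedy action
        outcomes = list(model.transitions(s, greedy_a))
        probs = [p for _, p in outcomes]
        (s, _reward), _ = outcomes[rng.choice(len(outcomes), p=probs)]
    return V
```

Note that goal-state values are never updated, consistent with initializing them to zero (condition 1 below).
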
RTDP is an example of an asynchronous DP algorithm.

Asynchronous DP algorithms are not organized in terms of systematic sweeps of the state set; they update the state values in any order whatsoever, using whatever values of other states happen to be available.

The most interesting part of RTDP is that for certain types of problems satisfying reasonable conditions, RTDP is guaranteed to find a policy that is optimal on the relevant states without visiting every state infinitely often, or even without visiting some states at all. This is a great advantage for problems with very large state sets, where even a single sweep may not be feasible.

There are four conditions that ensure RTDP converges with probability 1 to a policy that is optimal for all the relevant states:

  1. the initial value of every goal state is 0.
  2. there exists at least one policy that guarantees that a goal state will be reached with probability 1 from any start state.
  3. all rewards for transitions from non-goal states are strictly negative.
  4. all the initial values are equal to, or greater than, their optimal values.

RTDP can find an optimal policy in about half as many time steps as conventional DP. And as the value function approaches the optimal value function $v_*$, the policy used by the agent to generate trajectories approaches an optimal policy.

Whereas conventional value iteration continued to update the values of all the states, RTDP strongly focused on subsets of the states that were relevant to the problem's objective. This focus became increasingly narrow as learning continued.

RTDP would eventually focus only on the states making up optimal paths, and it achieved nearly optimal control with about 50% of the computation required by sweep-based value iteration.

Planning at Decision Time

Planning can be used in at least two ways: background planning and decision-time planning.

Background Planning

Like DP and Dyna, background planning uses simulated experience obtained from a model to gradually improve a policy or value function. Well before an action is selected for the current state $S_t$, planning has played a part in improving the table entries needed to select actions for many states, including $S_t$; a sketch follows below.
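
A minimal sketch of a Dyna-style background-planning step (the `model.observed_pairs()` and `model.predict(s, a)` methods are hypothetical stand-ins for a learned model):

```python
import numpy as np

def background_planning(Q, model, n_planning_updates=50,
                        alpha=0.1, gamma=0.95, rng=None):
    """Dyna-style background planning: repeatedly sample previously observed
    state-action pairs and apply sample updates using simulated experience."""
    rng = rng or np.random.default_rng()
    observed = list(model.observed_pairs())    # pairs seen in real experience
    for _ in range(n_planning_updates):
        s, a = observed[rng.integers(len(observed))]
        s_next, r = model.predict(s, a)        # simulated transition and reward
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```
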

Decision-time Planning

Decision-time planning begins and completes after encountering each new state $S_t$.
The idea is that, given state values, an action is selected by comparing the values of the model-predicted next states for each action (that is, by comparing the values of the afterstates).
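
A minimal sketch of such a one-step decision-time lookahead (the `model` interface and the state-value table `V` are assumptions):

```python
import numpy as np

def decision_time_plan(V, model, s, gamma=0.95):
    """Select an action for the current state S_t by comparing the values of
    the model-predicted next states (afterstates) for each action."""
    best_action, best_value = None, -np.inf
    for a in model.actions(s):                 # assumed model API
        value = sum(p * (r + gamma * V[s_next])
                    for (s_next, r), p in model.transitions(s, a))
        if value > best_value:
            best_action, best_value = a, value
    return best_action
```
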

Decision-time planning is most useful in applications in which fast responses are not required.
