Reinforcement Learning book: study notes, Chapter 2

Chapter 2 Multi-armed Bandits


The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions.


Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible. Purely instructive feedback, on the other hand, indicates the correct action to take, independently of the action actually taken. This kind of feedback is the basis of
supervised learning, which includes large parts of pattern classification, artificial neural networks, and system identification.


Evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken.


2.1 A k-armed Bandit Problem

You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period.


In our k-armed bandit problem, each of the k actions has an expected or mean reward given that that action is selected; let us call this the value of that action. We denote the action selected on time step t as At, and the corresponding reward as Rt. The value then of an arbitrary action a, denoted q*(a), is the expected reward given that a is selected:

$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$

If you knew the value of each action, then it would be trivial to solve the k-armed bandit problem: you would always select the action with highest value. We assume that you do not know the action values with certainty, although you may have estimates. We denote the estimated value of action a at time step t as Qt(a). We would like Qt(a) to be close to q*(a).


If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions. When you select one of these actions, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring, because this enables you to improve your estimate of the nongreedy action’s value. Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run.


For example, suppose a greedy action’s value is known with certainty, while several other actions are estimated to be nearly as good but with substantial uncertainty. The uncertainty is such that at least one of these other actions probably is actually better than the greedy action, but you don’t know which one. If you have many time steps ahead on which to make action selections, then it may be better to explore the nongreedy actions and discover which of them are better than the greedy action. Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them many times. Because it is not possible both to explore and to exploit with any single action selection, one often refers to the “conflict” between exploration and exploitation.


2.2 Action-value Methods

We begin by looking more closely at methods for estimating the values of actions and for using the estimates to make action selection decisions, which we collectively call action-value methods. Recall that the true value of an action is the mean reward when that action is selected. One natural way to estimate this is by averaging the rewards actually received:


$$Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbf{1}_{A_i = a}}$$

where $\mathbf{1}_{\text{predicate}}$ denotes the random variable that is 1 if predicate is true and 0 if it is not. The numerator is the sum of rewards received when a was taken prior to t, and the denominator is the number of times a was taken prior to t. If the denominator is zero, then we instead define Qt(a) as some default value, such as 0. As the denominator goes to infinity, by the law of large numbers, Qt(a) converges to q*(a). We call this the sample-average method for estimating action values because each estimate is an average of the sample of relevant rewards. Of course this is just one way to estimate action values, and not necessarily the best one. Nevertheless, for now let us stay with this simple estimation method and turn to the question of how the estimates might be used to select actions.
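The sample-average estimate can be sketched in a few lines of Python. This is only an illustration: the list-based history (`actions`, `rewards`) and the function name are assumptions made here, not something from the book.

```python
def sample_average(actions, rewards, a, default=0.0):
    """Estimate Q_t(a) as the mean of the rewards received when action a was taken.

    actions[i] and rewards[i] are the action and reward at step i
    (a hypothetical list-based history used only for this sketch).
    """
    total = 0.0
    count = 0
    for action, reward in zip(actions, rewards):
        if action == a:  # plays the role of the indicator 1_{A_i = a}
            total += reward
            count += 1
    # Fall back to the default value when a has never been selected.
    return total / count if count > 0 else default
```

For example, with actions `[0, 1, 0]` and rewards `[1.0, 5.0, 3.0]`, the estimate for action 0 is the mean of 1.0 and 3.0.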


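We write this greedy action selection method as

$$A_t \doteq \operatorname*{arg\,max}_a Q_t(a)$$

where argmax over a denotes the action a for which the expression that follows is maximized (with ties broken arbitrarily).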


A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability ε, instead select randomly from among all the actions with equal probability, independently of the action-value estimates. We call methods using this near-greedy action selection rule ε-greedy methods. An advantage of these methods is that, in the limit as the number of steps increases, every action will be sampled an infinite number of times, thus ensuring that all the Qt(a) converge to q*(a). This of course implies that the probability of selecting the optimal action converges to greater than 1 − ε, that is, to near certainty. These are just asymptotic guarantees, however, and say little about the practical effectiveness of the methods.
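A minimal sketch of this selection rule, assuming the estimates are held in a plain Python list (the function name and the `rng` parameter are choices made for this example):

```python
import random

def epsilon_greedy(q_estimates, epsilon, rng=random):
    """Pick an action index: uniformly random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))  # explore: any action, equal probability
    best = max(q_estimates)
    # Exploit: break ties among the greedy actions at random.
    greedy = [a for a, q in enumerate(q_estimates) if q == best]
    return rng.choice(greedy)
```

Setting `epsilon=0.0` recovers the purely greedy method, which makes the two rules easy to compare with the same code.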


2.3 The 10-armed Testbed




For each bandit problem in the testbed, the action values, q*(a), a = 1, …, 10, were selected according to a normal (Gaussian) distribution with mean 0 and variance 1.


Then, when a learning method applied to that problem selected action At at time step t, the actual reward, Rt, was selected from a normal distribution with mean q*(At) and variance 1.


For any learning method, we can measure its performance and behavior as it improves with experience over 1000 time steps when applied to one of the bandit problems. This makes up one run. Repeating this for 2000 independent runs, each with a different bandit problem, we obtained measures of the learning algorithm’s average behavior.
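The testbed described above could be simulated roughly as follows. This is a sketch under stated assumptions, not the code behind the book's figures: the function names, the seeding scheme, and the running-average update inside `run` are choices made here for illustration.

```python
import random

def make_bandit(k=10, rng=random):
    """Draw the true action values q*(a) from a normal distribution N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(k)]

def pull(q_star, action, rng=random):
    """Return a reward drawn from N(q*(action), 1)."""
    return rng.gauss(q_star[action], 1.0)

def run(epsilon, steps=1000, k=10, seed=0):
    """One run of epsilon-greedy with sample averages; returns mean reward per step."""
    rng = random.Random(seed)
    q_star = make_bandit(k, rng)
    q = [0.0] * k  # estimates Q(a)
    n = [0] * k    # selection counts N(a)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)
        else:
            a = max(range(k), key=lambda i: q[i])  # ties go to the lowest index
        r = pull(q_star, a, rng)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # running sample average of rewards for a
        total += r
    return total / steps
```

Averaging over independent runs, for example `sum(run(0.1, seed=s) for s in range(2000)) / 2000`, gives one point of comparison between settings of ε; each seed generates a different bandit problem.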



Figure 2.2 compares a greedy method with two ε-greedy methods (ε = 0.01 and ε = 0.1), as described above, on the 10-armed testbed. All the methods formed their action-value estimates using the sample-average technique.


The upper graph shows the increase in
expected reward with experience. The greedy method improved slightly faster than the
other methods at the very beginning, but then leveled off at a lower level. It achieved a
reward-per-step of only about 1, compared with the best possible of about 1.55 on this
testbed. The greedy method performed significantly worse in the long run because it often got stuck performing suboptimal actions.


The lower graph shows that the greedy
method found the optimal action in only approximately one-third of the tasks. In the
other two-thirds, its initial samples of the optimal action were disappointing, and it never
returned to it. The ε-greedy methods eventually performed better because they continued to explore and to improve their chances of recognizing the optimal action. The ε = 0.1
method explored more, and usually found the optimal action earlier, but it never selected
that action more than 91% of the time. The ε = 0.01 method improved more slowly, but
eventually would perform better than the ε = 0.1 method on both performance measures
shown in the figure. It is also possible to reduce ε over time to try to get the best of both
high and low values.


The advantage of ε-greedy over greedy methods depends on the task. For example,
suppose the reward variance had been larger, say 10 instead of 1. With noisier rewards
it takes more exploration to find the optimal action, and ε-greedy methods should fare
even better relative to the greedy method. On the other hand, if the reward variances
were zero, then the greedy method would know the true value of each action after trying
it once. In this case the greedy method might actually perform best because it would
soon find the optimal action and then never explore.


But even in the deterministic case
there is a large advantage to exploring if we weaken some of the other assumptions. For
example, suppose the bandit task were nonstationary, that is, the true values of the
actions changed over time. In this case exploration is needed even in the deterministic
case to make sure one of the nongreedy actions has not changed to become better than
the greedy one. As we shall see in the next few chapters, nonstationarity is the case
most commonly encountered in reinforcement learning. Even if the underlying task is
stationary and deterministic, the learner faces a set of banditlike decision tasks each of
which changes over time as learning proceeds and the agent’s decision-making policy
changes. Reinforcement learning requires a balance between exploration and exploitation.


2.4 Incremental Implementation


To simplify notation we concentrate on a single action. Let Ri now denote the reward
received after the ith selection of this action, and let Qn denote the estimate of its action
value after it has been selected n − 1 times, which we can now write simply as

$$Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$$



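Given Qn and the nth reward Rn, the new average of all n rewards can be computed incrementally, without storing the past rewards:

$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$

This update rule has the general form

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\left[\text{Target} - \text{OldEstimate}\right]$$

The expression [Target − OldEstimate] is an error in the estimate. It is reduced by taking
a step toward the “Target.” The target is presumed to indicate a desirable direction in
which to move, though it may be noisy. In the case above, for example, the target is the
nth reward.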
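The incremental form, new estimate = old estimate + step size × (target − old estimate) with step size 1/n, can be written as a one-line function. The function name and the example rewards below are made up for illustration.

```python
def update_average(q_n, n, reward):
    """Return Q_{n+1} given Q_n and the n-th reward, using step size 1/n."""
    return q_n + (reward - q_n) / n

# Process rewards one at a time, keeping only the current estimate and a counter.
q = 0.0
for n, r in enumerate([2.0, 4.0, 6.0], start=1):
    q = update_average(q, n, r)
# q now equals the plain mean of the three rewards, 4.0
```

Only `q` and `n` need to be stored, so memory and per-step computation stay constant no matter how many rewards have been seen.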




2.5 Tracking a Nonstationary Problem


In a nonstationary problem, where the reward probabilities change over time, it makes sense to give more weight to recent rewards than to long-past rewards. One of the most popular ways of doing this is to use a constant step-size parameter. For example, the incremental update rule for updating an average Qn of the n − 1 past rewards is modified to be

$$Q_{n+1} \doteq Q_n + \alpha\left[R_n - Q_n\right]$$


where the step-size parameter α ∈ (0, 1] is constant. This results in Qn+1 being a weighted
average of past rewards and the initial estimate Q1:

$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i \tag{2.6}$$


We call this a weighted average because the sum of the weights is
$(1-\alpha)^n + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} = 1$, as you can check for yourself. Note that the weight $\alpha(1-\alpha)^{n-i}$ given to the reward Ri depends on how many rewards ago, n − i, it was observed. The quantity 1 − α is less than 1, and thus the weight given to Ri decreases as the number of intervening rewards increases. In fact, the weight decays exponentially according to the exponent on 1 − α. Accordingly, this is sometimes called an exponential recency-weighted average.
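The claim that the weights sum to 1, and that repeated constant-step-size updates produce exactly this weighted average, can be checked numerically. The helper names here are illustrative only.

```python
def recency_weights(alpha, n):
    """Weights placed on Q_1 and on R_1, ..., R_n after n constant-step-size updates."""
    w_q1 = (1 - alpha) ** n
    w_rewards = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
    return w_q1, w_rewards

def constant_step_update(q, reward, alpha):
    """One update Q_{n+1} = Q_n + alpha * (R_n - Q_n)."""
    return q + alpha * (reward - q)
```

With `alpha = 0.5`, an initial estimate of 5.0 and rewards 1.0, 2.0, 3.0, the updates give 3.0, 2.5, 2.75, and the weights (0.125 on the initial estimate, then 0.125, 0.25, 0.5 on the rewards) reproduce the same 2.75. The most recent reward carries the largest weight.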


Sometimes it is convenient to vary the step-size parameter from step to step. Let αn(a) denote the step-size parameter used to process the reward received after the nth selection of action a. As we have noted, the choice αn(a) = 1/n results in the sample-average method, which is guaranteed to converge to the true action values by the law of large numbers. But of course convergence is not guaranteed for all choices of the sequence {αn(a)}. A well-known result in stochastic approximation theory gives us the conditions required to assure convergence with probability 1:

$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$$


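Both convergence conditions are met for the sample-average case, αn(a) = 1/n, but not for the constant step-size case, αn(a) = α. In the latter case the second condition is not met, indicating that the estimates never completely converge but continue to vary in response to the most recently received rewards.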
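As a quick check, taking the two conditions in their standard form (the step sizes must sum to infinity while their squares must sum to a finite value), the two step-size choices discussed here work out as follows:

$$
\begin{aligned}
\alpha_n = \tfrac{1}{n}: \quad & \sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty \\
\alpha_n = \alpha: \quad & \sum_{n=1}^{\infty} \alpha = \infty, \qquad \sum_{n=1}^{\infty} \alpha^2 = \infty
\end{aligned}
$$

The harmonic series diverges while the series of inverse squares converges, so 1/n passes both tests. A constant α passes the first but fails the second, which is the property that lets it keep tracking a nonstationary target.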


2.6 Optimistic Initial Values

All the methods we have discussed so far are dependent to some extent on the initial action-value estimates, Q1(a). In the language of statistics, these methods are biased by their initial estimates. For the sample-average methods, the bias disappears once all actions have been selected at least once, but for methods with constant α, the bias is permanent, though decreasing over time as given by (2.6). In practice, this kind of bias is usually not a problem and can sometimes be very helpful. The downside is that the initial estimates become, in effect, a set of parameters that must be picked by the user, if only to set them all to zero. The upside is that they provide an easy way to supply some prior knowledge about what level of rewards can be expected.


Initial action values can also be used as a simple way to encourage exploration. Suppose that instead of setting the initial action values to zero, as we did in the 10-armed testbed, we set them all to +5. Recall that the q*(a) in this problem are selected from a normal distribution with mean 0 and variance 1. An initial estimate of +5 is thus wildly optimistic. But this optimism encourages action-value methods to explore. Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being “disappointed” with the rewards it is receiving. The result is that all actions are tried several times before the value estimates converge. The system does a fair amount of exploration even if greedy actions are selected all the time.



On the 10-armed testbed, a greedy method with Q1(a) = +5 initially performs worse than an ε-greedy method with Q1(a) = 0 because it explores more, but eventually it performs better because its exploration decreases with time. We call this technique for encouraging exploration optimistic initial values. We regard it as a simple trick that can be quite effective on stationary problems, but it is far from being a generally useful approach to encouraging exploration. For example, it is not well suited to nonstationary problems because its drive for exploration is inherently temporary.
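The effect of optimistic initial values can be seen in a small sketch. Everything here (the function name, the `noise_sd` parameter, the three-armed example) is an assumption made for illustration; the learner is purely greedy with a constant step size.

```python
import random

def greedy_arms_tried(q_star, initial_q, alpha=0.1, steps=100, noise_sd=1.0, seed=0):
    """Run a purely greedy learner and return the set of arms it ever selected."""
    rng = random.Random(seed)
    q = [initial_q] * len(q_star)
    tried = set()
    for _ in range(steps):
        a = max(range(len(q)), key=lambda i: q[i])  # greedy; ties go to the lowest index
        r = rng.gauss(q_star[a], noise_sd)          # reward around the true value
        q[a] += alpha * (r - q[a])                  # constant step-size update
        tried.add(a)
    return tried
```

With noise-free rewards (`noise_sd=0.0`) and true values `[0.1, 0.5, 0.9]`, starting from 0 the learner locks onto arm 0 after its first positive reward and never tries the others, whereas starting from +5 every pull is "disappointing" and all three arms get tried.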


2.7 Upper-Confidence-Bound Action Selection

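Exploration is needed because there is always uncertainty about the accuracy of the action-value estimates. The greedy actions are those that look best at present, but some of the other actions may actually be better.


ε-greedy action selection forces the non-greedy actions to be tried, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. One effective way of doing this is to select actions according to

$$A_t \doteq \operatorname*{arg\,max}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where ln t denotes the natural logarithm of t, Nt(a) denotes the number of times that action a has been selected prior to time t, and the number c > 0 controls the degree of exploration.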



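The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of a’s value. The quantity being max’ed over is thus a sort of upper bound on the possible true value of action a, with c determining the confidence level. Each time a is selected the uncertainty is reduced: Nt(a) increases and, as it appears in the denominator, the square-root term decreases. Each time an action other than a is selected, t increases but Nt(a) does not; because t appears in the numerator, the uncertainty estimate increases. The use of the natural logarithm means that the increases get smaller over time but are unbounded. All actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.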
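A sketch of UCB selection, scoring each action by its estimate plus c times the square root of ln t over its selection count. The function name and the default `c=2.0` are illustrative choices; an action that has never been tried is treated as maximizing, which avoids dividing by zero.

```python
import math

def ucb_action(q, counts, t, c=2.0):
    """Select the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a))."""
    # An action with N_t(a) = 0 is considered a maximizing action.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q[a] + c * math.sqrt(math.log(t) / counts[a]) for a in range(len(q))]
    return max(range(len(q)), key=lambda a: scores[a])
```

For instance, with estimates `[1.0, 0.9]` and counts `[100, 1]` at `t = 101`, the rarely tried second action wins because its uncertainty bonus is much larger, even though its estimate is slightly lower.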


Results with UCB on the 10-armed testbed are shown in Figure 2.4. UCB often performs well, as shown here, but is more difficult than ε-greedy to extend beyond bandits to the more general reinforcement learning settings considered in the rest of this book. One difficulty is in dealing with nonstationary problems; methods more complex than those presented in Section 2.5 would be needed. Another difficulty is dealing with large state spaces, particularly when using function approximation as developed in Part II of this book. In these more advanced settings the idea of UCB action selection is usually not practical.



2.8 Gradient Bandit Algorithms

In this section we consider learning a numerical preference for each action a, which we denote Ht(a). The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward.


Only the relative preference of one action over another is important; if we add 1000 to all the action preferences there is no effect on the action probabilities, which are determined according to a soft-max distribution:

$$\Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a)$$
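The invariance to adding a constant is easy to verify with a small soft-max function. Subtracting the maximum preference before exponentiating is a standard numerical-stability step added here; it relies on exactly this invariance.

```python
import math

def softmax(preferences):
    """Convert action preferences H_t(a) into selection probabilities."""
    m = max(preferences)  # shifting by a constant does not change the result
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]
```

Calling `softmax([1.0, 2.0, 3.0])` and `softmax([1001.0, 1002.0, 1003.0])` yields the same probabilities.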



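There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. On each step, after selecting action At and receiving the reward Rt, the action preferences are updated by:

$$
\begin{aligned}
H_{t+1}(A_t) &\doteq H_t(A_t) + \alpha\,(R_t - \bar{R}_t)\,(1 - \pi_t(A_t)), \\
H_{t+1}(a) &\doteq H_t(a) - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a), \quad \text{for all } a \neq A_t,
\end{aligned}
\tag{2.12}
$$

where α > 0 is a step-size parameter, πt(a) is the soft-max probability of selecting action a at time t, and R̄t is an average of the rewards received so far, which serves as a baseline against which each reward is compared.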



Figure 2.5 shows results with the gradient bandit algorithm on a variant of the 10-armed testbed in which the true expected rewards were selected according to a normal distribution with a mean of +4 instead of zero (and with unit variance as before). This shifting up of all the rewards has absolutely no effect on the gradient bandit algorithm because of the reward baseline term, which instantaneously adapts to the new level. But if the baseline were omitted (that is, if R̄t was taken to be constant zero in (2.12)), then performance would be significantly degraded, as shown in the figure.
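One step of the gradient bandit update can be sketched as follows: the preference of the selected action moves in the direction of (reward − baseline), and all other preferences move the opposite way, each scaled by its soft-max probability. The function name is hypothetical, and the caller is assumed to maintain the baseline (for example as a running average of rewards).

```python
import math

def gradient_bandit_step(h, action, reward, baseline, alpha=0.1):
    """Return updated preferences after taking `action` and receiving `reward`."""
    m = max(h)
    exps = [math.exp(x - m) for x in h]
    total = sum(exps)
    pi = [e / total for e in exps]  # soft-max probabilities
    new_h = []
    for a, (h_a, p_a) in enumerate(zip(h, pi)):
        if a == action:
            # Selected action: move up when the reward beats the baseline.
            new_h.append(h_a + alpha * (reward - baseline) * (1 - p_a))
        else:
            # Other actions: move in the opposite direction.
            new_h.append(h_a - alpha * (reward - baseline) * p_a)
    return new_h
```

Passing `baseline=0.0` on every step corresponds to the no-baseline variant. When the reward equals the baseline, the preferences are left unchanged.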



2.9 Associative Search (Contextual Bandits)

So far in this chapter we have considered only nonassociative tasks, that is, tasks in which there is no need to associate different actions with different situations. In these tasks the learner either tries to find a single best action when the task is stationary, or tries to track the best action as it changes over time when the task is nonstationary.



As an example, suppose there are several different k-armed bandit tasks, and that on each step you confront one of these chosen at random. Thus, the bandit task changes randomly from step to step. This would appear to you as a single, nonstationary k-armed bandit task whose true action values change randomly from step to step. You could try using one of the methods described in this chapter that can handle nonstationarity, but unless the true action values change slowly, these methods will not work very well.


Now suppose, however, that when a bandit task is selected for you, you are given some distinctive clue about its identity (but not its action values). Maybe you are facing an actual slot machine that changes the color of its display as it changes its action values. Now you can learn a policy associating each task, signaled by the color you see, with the best action to take when facing that task—for instance, if red, select arm 1; if green, select arm 2. With the right policy you can usually do much better than you could in the absence of any information distinguishing one bandit task from another.
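One simple way to learn such a policy is to keep a separate table of estimates for each clue. The class below is a hypothetical sketch that combines ε-greedy selection with sample averages per context; the class name, method names, and the use of color strings as contexts are assumptions for illustration.

```python
import random
from collections import defaultdict

class ContextualEpsilonGreedy:
    """Keep separate sample-average estimates for each context (e.g. a color)."""

    def __init__(self, k, epsilon=0.1, seed=0):
        self.k = k
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.q = defaultdict(lambda: [0.0] * k)  # context -> action-value estimates
        self.n = defaultdict(lambda: [0] * k)    # context -> selection counts

    def act(self, context):
        """Choose an action for this context: explore with probability epsilon."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.k)
        q = self.q[context]
        return max(range(self.k), key=lambda a: q[a])

    def learn(self, context, action, reward):
        """Update the running average for (context, action) with the new reward."""
        self.n[context][action] += 1
        step = 1.0 / self.n[context][action]
        self.q[context][action] += step * (reward - self.q[context][action])
```

After `learn("red", 0, 1.0)` and `learn("green", 1, 1.0)`, a greedy agent (`epsilon=0.0`) picks arm 0 when shown "red" and arm 1 when shown "green".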


This is an example of an associative search task, so called because it involves both trial-and-error learning to search for the best actions, and association of these actions with the situations in which they are best. Associative search tasks are often now called contextual bandits in the literature. Associative search tasks are intermediate between the k-armed bandit problem and the full reinforcement learning problem. They are like the full reinforcement learning problem in that they involve learning a policy, but like our version of the k-armed bandit problem in that each action affects only the immediate reward. If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem. We present this problem in the next chapter and consider its ramifications throughout the rest of the book.


2.10 Summary


Although the simple methods explored in this chapter may be the best we can do at present, they are far from a fully satisfactory solution to the problem of balancing exploration and exploitation.


The Gittins-index approach is an instance of Bayesian methods, which assume a known initial distribution over the action values and then update the distribution exactly after each step (assuming that the true action values are stationary). In general, the update computations can be very complex, but for certain special distributions (called conjugate priors) they are easy. One possibility is to then select actions at each step according to their posterior probability of being the best action. This method, sometimes called posterior sampling or Thompson sampling, often performs similarly to the best of the distribution-free methods we have presented in this chapter.
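To make posterior sampling concrete, here is a minimal sketch for a simplified setting that is not the chapter's Gaussian testbed: it assumes each arm pays a reward of 0 or 1 (Bernoulli rewards) with a Beta(1, 1) prior, a conjugate pair for which the posterior update is just a count. The function names are hypothetical.

```python
import random

def thompson_select(successes, failures, rng=random):
    """Sample a plausible success rate per arm from its Beta posterior; pick the largest."""
    samples = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])

def thompson_update(successes, failures, action, reward):
    """Conjugate Beta-Bernoulli update; reward is assumed to be 0 or 1."""
    if reward:
        successes[action] += 1
    else:
        failures[action] += 1
```

Drawing one sample per arm and taking the argmax selects each arm with its posterior probability of being the best, so exploration fades naturally as the posteriors sharpen.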


In the Bayesian setting it is even conceivable to compute the optimal balance between exploration and exploitation. One can compute for any possible action the probability of each possible immediate reward and the resultant posterior distributions over action values. This evolving distribution becomes the information state of the problem. Given a horizon, say of 1000 steps, one can consider all possible actions, all possible resulting rewards, all possible next actions, all next rewards, and so on for all 1000 steps. Given the assumptions, the rewards and probabilities of each possible chain of events can be determined, and one need only pick the best. But the tree of possibilities grows extremely rapidly; even if there were only two actions and two rewards, the tree would have 2^2000 leaves. It is generally not feasible to perform this immense computation exactly, but perhaps it could be approximated efficiently. This approach would effectively turn the bandit problem into an instance of the full reinforcement learning problem. In the end, we may be able to use approximate reinforcement learning methods such as those presented in Part II of this book to approach this optimal solution. But that is a topic for research and beyond the scope of this introductory book.
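The size of that tree can be worked out directly: with two actions and two possible rewards, each step branches four ways, and 1000 steps give 4 to the power 1000 leaves, which is 2 to the power 2000. The variable names below are just for this illustration.

```python
horizon = 1000
branches_per_step = 2 * 2               # two actions, each followed by two possible rewards
leaves = branches_per_step ** horizon   # 4**1000, the same number as 2**2000
digits = len(str(leaves))               # number of decimal digits in the leaf count
```

The leaf count has 603 decimal digits, which is why exact enumeration is out of the question.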

