Reinforcement Learning Book: Study Notes, Chapter 2

Chapter 2 Multi-armed Bandits

Preface

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions.

The biggest difference between reinforcement learning and other kinds of learning is that reinforcement learning uses evaluative feedback that judges how good the actions taken were, rather than instructive feedback that demonstrates the correct action.

Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible. Purely instructive feedback, on the other hand, indicates the correct action to take, independently of the action actually taken. This kind of feedback is the basis of
supervised learning, which includes large parts of pattern classification, artificial neural networks, and system identification.

Purely evaluative feedback only indicates how good the action taken was; it cannot say whether that action was the best or worst one possible in the current state. Purely instructive feedback, by contrast, only states what the correct action is, and that correct action is independent of the action actually taken.

evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken

The two are very different: evaluative feedback depends on the action taken, i.e., different actions produce different feedback, whereas instructive feedback does not depend on the current action, so different actions receive the same feedback.

2.1 A k-armed Bandit Problem

You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period.

Suppose you repeatedly face a choice among k options. After each choice you receive a reward, and your goal is to maximize the expected total reward over some period of time.

In our k-armed bandit problem, each of the k actions has an expected or mean reward given that that action is selected; let us call this the value of that action. We denote the action selected on time step t as At, and the corresponding reward as Rt. The value then of an arbitrary action a, denoted q*(a) , is the expected reward given that a is selected:
$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$
If you knew the value of each action, then it would be trivial to solve the k-armed bandit problem: you would always select the action with highest value. We assume that you do not know the action values with certainty, although you may have estimates. We denote the estimated value of action a at time step t as Qt(a). We would like Qt(a) to be close to q*(a).

In this problem, we denote the action selected at time step t as At and the corresponding reward as Rt; the value of an arbitrary action a is denoted q*(a). The expression above states their relationship. We denote the estimate of this value at time step t as Qt(a).

If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions. When you select one of these actions, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring, because this enables you to improve your estimate of the nongreedy action’s value. Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run.

If you keep estimating the action values, then at every time step there is at least one action with the highest estimated value; selecting it is called the greedy choice, i.e., exploiting. Selecting a non-greedy action instead is called exploring, because it improves your estimate of that non-greedy action's value. Exploiting maximizes the expected reward on the current step, but exploring may yield a greater total reward in the long run.

For example, suppose a greedy action’s value is known with certainty, while several other actions are estimated to be nearly as good but with substantial uncertainty. The uncertainty is such that at least one of these other actions probably is actually better than the greedy action, but you don’t know which one. If you have many time steps ahead on which to make action selections, then it may be better to explore the nongreedy actions and discover which of them are better than the greedy action. Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them many times. Because it is not possible both to explore and to exploit with any single action selection, one often refers to the “conflict” between exploration and exploitation.

For example, suppose the value of one greedy action is known with certainty, while several other actions have estimates that are nearly as good but substantially uncertain; at least one of them is probably actually better than the greedy action, but you do not know which. If you have many time steps ahead in which to select actions, it may be better to explore the non-greedy actions and discover which of them are better than the greedy one. Reward is lower in the short run, during exploration, but higher in the long run. Because exploring and exploiting cannot both happen in a single action selection, the two are often said to be in conflict.

2.2 Action-value Methods

We begin by looking more closely at methods for estimating the values of actions and for using the estimates to make action selection decisions, which we collectively call action-value methods. Recall that the true value of an action is the mean reward when that action is selected. One natural way to estimate this is by averaging the rewards actually received:

$$Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$
Using value estimates to select actions is collectively called the action-value method. The value of an action is the expected reward when that action is selected, so it can be estimated by averaging the rewards actually received.

where $\mathbb{1}_{predicate}$ denotes the random variable that is 1 if predicate is true and 0 if it is not. If the denominator is zero, then we instead define Qt(a) as some default value, such as 0. As the denominator goes to infinity, by the law of large numbers, Qt(a) converges to q*(a). We call this the sample-average method for estimating action values because each estimate is an average of the sample of relevant rewards. Of course this is just one way to estimate action values, and not necessarily the best one. Nevertheless, for now let us stay with this simple estimation method and turn to the question of how the estimates might be used to select actions.

Here $\mathbb{1}_{predicate}$ is a random variable whose value is 1 when predicate is true and 0 otherwise. When the denominator is zero, Qt(a) is defined to be some default value such as 0; as the denominator goes to infinity, Qt(a) converges to q*(a). This way of estimating values is called the sample-average method, because each estimate is an average of the relevant reward samples.

We write this greedy action selection method as
$$A_t \doteq \operatorname*{arg\,max}_a Q_t(a)$$

The greedy method described above simply selects the action with the highest current estimated value.
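To make the sample-average estimate and the greedy rule concrete, here is a minimal Python sketch (my own illustration, not from the book; the array names reward_sums and action_counts and the helpers are assumptions):

```python
import numpy as np

def sample_average_estimates(reward_sums, action_counts, default=0.0):
    """Q_t(a): average of the rewards received for each action (numpy arrays),
    or a default value (e.g. 0) for actions never taken."""
    q = np.full(len(reward_sums), default, dtype=float)
    taken = action_counts > 0
    q[taken] = reward_sums[taken] / action_counts[taken]
    return q

def greedy_action(q_estimates):
    """A_t = argmax_a Q_t(a), breaking ties at random."""
    best = np.flatnonzero(q_estimates == q_estimates.max())
    return int(np.random.choice(best))
```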

A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability ε, instead select randomly from among all the actions with equal probability, independently of the action-value estimates. We call methods using this near-greedy action selection rule ε-greedy methods. An advantage of these methods is that, in the limit as the number of steps increases, every action will be sampled an infinite number of times, thus ensuring that all the Qt(a) converge to q*(a). This of course implies that the probability of selecting the optimal action converges to greater than 1 − ε, that is, to near certainty. These are just asymptotic guarantees, however, and say little about the practical effectiveness of the methods.

The ε-greedy method behaves greedily most of the time but, with a small probability ε, selects among all actions with equal probability. Its advantage is that as the number of steps grows, every action is sampled infinitely often, ensuring that each Qt(a) converges to q*(a).
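A minimal sketch of ε-greedy selection (again my own illustration; epsilon_greedy_action is a hypothetical helper): with probability ε it picks uniformly at random, otherwise it exploits the current estimates.

```python
import numpy as np

def epsilon_greedy_action(q_estimates, epsilon):
    """With probability epsilon, explore uniformly at random;
    otherwise exploit by taking a greedy action (ties broken randomly)."""
    if np.random.random() < epsilon:
        return int(np.random.randint(len(q_estimates)))   # explore
    best = np.flatnonzero(q_estimates == q_estimates.max())
    return int(np.random.choice(best))                    # exploit
```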

2.3 The 10-armed Testbed

This section uses two examples on a ten-armed testbed to assess the effectiveness of the ε-greedy method.

Example 1:

[Figure 2.1: An example bandit problem from the 10-armed testbed, showing the reward distribution of each action.]

the action values, q*(a), a = 1,…, 10, were selected according to a normal (Gaussian) distribution with mean 0 and variance 1.

In Figure 2.1, the true value q*(a) of each action is drawn from a normal distribution with mean 0 and variance 1.

Then, when a learning method applied to that problem selected action At at time step t, the actual reward, Rt, was selected from a normal distribution with mean q*(At) and variance 1.

When such a method selects action At at time step t, the actual reward Rt is drawn from a normal distribution with mean q*(At) and variance 1, shown as the gray regions in the figure above.

For any learning method, we can measure its performance and behavior as it improves with experience over 1000 time steps when applied to one of the bandit problems. This makes up one run. Repeating this for 2000 independent runs, each with a different bandit problem, we obtained measures of the learning algorithm’s average behavior.

For any learning method, we measure its performance and behavior as it improves with experience over 1000 time steps on one bandit problem; this constitutes one run. Repeating this for 2000 independent runs, each with a different bandit problem, gives a measure of the algorithm's average behavior.
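A sketch of how one might reproduce this experiment under the assumptions stated above (2000 runs of 1000 steps, q*(a) ~ N(0, 1), Rt ~ N(q*(At), 1)); run_testbed and select_action are my own names, and the example reuses the hypothetical epsilon_greedy_action from the earlier sketch:

```python
import numpy as np

def run_testbed(select_action, n_runs=2000, n_steps=1000, k=10):
    """Average reward per step over many independent 10-armed bandit problems."""
    avg_reward = np.zeros(n_steps)
    for _ in range(n_runs):
        q_true = np.random.normal(0.0, 1.0, k)        # q*(a) ~ N(0, 1)
        q_est = np.zeros(k)
        counts = np.zeros(k)
        for t in range(n_steps):
            a = select_action(q_est)
            r = np.random.normal(q_true[a], 1.0)      # R_t ~ N(q*(A_t), 1)
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]    # sample-average estimate
            avg_reward[t] += r
    return avg_reward / n_runs

# Example: compare epsilon = 0.1 against the pure greedy method (epsilon = 0).
# curve = run_testbed(lambda q: epsilon_greedy_action(q, 0.1))
```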

Example 2:
[Figure 2.2: Average performance of ε-greedy action-value methods on the 10-armed testbed (average reward and % optimal action over 1000 steps, averaged over 2000 runs).]

Figure 2.2 compares a greedy method with two ε-greedy methods (ε = 0.01 and ε = 0.1), as described above, on the 10-armed testbed. All the methods formed their action-value estimates using the sample-average technique. The upper graph shows the increase in expected reward with experience.

Figure 2.2 compares the ε-greedy methods with the plain greedy method. All methods form their action-value estimates with the sample-average technique.

The greedy method improved slightly faster than the other methods at the very beginning, but then leveled off at a lower level. It achieved a reward-per-step of only about 1, compared with the best possible of about 1.55 on this testbed. The greedy method performed significantly worse in the long run because it often got stuck performing suboptimal actions.

The upper graph shows the expected reward growing with experience. The greedy method improved slightly faster at first but then leveled off at a lower level. In the long run the greedy method does poorly because it often gets stuck performing suboptimal actions.

The lower graph shows that the greedy method found the optimal action in only approximately one-third of the tasks. In the other two-thirds, its initial samples of the optimal action were disappointing, and it never returned to it. The ε-greedy methods eventually performed better because they continued to explore and to improve their chances of recognizing the optimal action. The ε = 0.1 method explored more, and usually found the optimal action earlier, but it never selected that action more than 91% of the time. The ε = 0.01 method improved more slowly, but eventually would perform better than the ε = 0.1 method on both performance measures shown in the figure. It is also possible to reduce ε over time to try to get the best of both high and low values.

The lower graph shows that the greedy method found the optimal action in only about one-third of the tasks; in the other two-thirds its initial samples of the optimal action were disappointing and it never returned to it. The ε-greedy methods eventually did better because they kept exploring and improving their chances of recognizing the optimal action. The ε = 0.1 method explored more and usually found the optimal action earlier, but it never selected that action more than 91% of the time; the ε = 0.01 method improved more slowly but eventually did better on both measures. It is also possible to reduce ε over time to try to get the best of both high and low values.

The advantage of ε-greedy over greedy methods depends on the task. For example, suppose the reward variance had been larger, say 10 instead of 1. With noisier rewards it takes more exploration to find the optimal action, and ε-greedy methods should fare even better relative to the greedy method. On the other hand, if the reward variances were zero, then the greedy method would know the true value of each action after trying it once. In this case the greedy method might actually perform best because it would soon find the optimal action and then never explore.

The advantage of ε-greedy over greedy methods depends on the task. For example, if the reward variance were larger, say 10 instead of 1, the noisier rewards would require more exploration to find the optimal action, and ε-greedy would fare even better relative to the greedy method. If the reward variance were zero, however, the greedy method would know the true value of each action after trying it once, and it might actually perform best because it would quickly find the optimal action and never explore.

But even in the deterministic case
there is a large advantage to exploring if we weaken some of the other assumptions. For
example, suppose the bandit task were nonstationary, that is, the true values of the
actions changed over time. In this case exploration is needed even in the deterministic
case to make sure one of the nongreedy actions has not changed to become better than
the greedy one. As we shall see in the next few chapters, nonstationarity is the case
most commonly encountered in reinforcement learning. Even if the underlying task is
stationary and deterministic, the learner faces a set of banditlike decision tasks each of
which changes over time as learning proceeds and the agent’s decision-making policy
changes. Reinforcement learning requires a balance between exploration and exploitation.

Even in the deterministic case, exploration has a large advantage if we weaken some of the other assumptions. For example, if the bandit task is nonstationary, so the true action values change over time, then exploration is needed to make sure a non-greedy action has not become better than the greedy one. Reinforcement learning requires a balance between exploration and exploitation.

2.4 Incremental Implementation

This section presents a more efficient way to compute the average of rewards, in particular one with constant memory and constant per-step computation.

To simplify notation we concentrate on a single action. Let Ri now denote the reward
received after the ith selection of this action, and let Qn denote the estimate of its action
value after it has been selected n − 1 times, which we can now write simply as
$$Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$$
Here Ri denotes the reward received after the i-th selection of this action, and Qn denotes the estimate of its value after it has been selected n − 1 times.

Maintaining a record of all the rewards and recomputing this average each time is not necessary.

Instead, we can devise an incremental formula that updates the average with a small, constant computation as each new reward arrives:
$$Q_{n+1} = Q_n + \frac{1}{n}\bigl[R_n - Q_n\bigr]$$
The general form is:
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\bigl[\text{Target} - \text{OldEstimate}\bigr]$$

The expression [Target − OldEstimate] is an error in the estimate. It is reduced by taking a step toward the "Target." The target is presumed to indicate a desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the nth reward.

[Target − OldEstimate] is the error in the estimate; it is reduced by taking a step toward the target. Although the target may be noisy, it is presumed to point in a desirable direction. In the example above, the target is the nth reward.

The step-size parameter changes from time step to time step; we denote it by α or, more generally, αt(a).

Algorithm pseudocode:
[Pseudocode box: a simple bandit algorithm using ε-greedy action selection with incrementally computed sample averages.]
bandit(a) takes an action as its argument and returns the corresponding reward.
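Written out in Python, the pseudocode box amounts to something like the sketch below (the function name simple_bandit is mine; bandit is the reward-generating environment mentioned above):

```python
import numpy as np

def simple_bandit(bandit, k, epsilon, n_steps):
    """Epsilon-greedy action selection with incrementally computed sample averages."""
    Q = np.zeros(k)                      # Q(a) <- 0
    N = np.zeros(k)                      # N(a) <- 0
    for _ in range(n_steps):
        if np.random.random() < epsilon:
            a = np.random.randint(k)                              # explore
        else:
            a = np.random.choice(np.flatnonzero(Q == Q.max()))    # A <- argmax_a Q(a)
        r = bandit(a)                    # R <- bandit(A)
        N[a] += 1                        # N(A) <- N(A) + 1
        Q[a] += (r - Q[a]) / N[a]        # Q(A) <- Q(A) + (1/N(A)) [R - Q(A)]
    return Q, N
```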

2.5 Tracking a Nonstationary Problem

A nonstationary problem is one in which the reward probabilities change over time.

In such cases it makes sense to give more weight to recent rewards than to long-past rewards. One of the most popular ways of doing this is to use a constant step-size parameter. For example, the incremental update rule for updating an average Qn of the n − 1 past rewards is modified to be
$$Q_{n+1} \doteq Q_n + \alpha\bigl[R_n - Q_n\bigr]$$

In such cases it makes sense to give more weight to recent rewards; one way to do this is to use a constant step-size parameter, so the update rule above becomes the one shown.

where the step-size parameter α ∈ (0, 1] is constant. This results in Qn+1 being a weighted average of past rewards and the initial estimate Q1:
$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i$$

With a fixed step size, Qn+1 is a weighted average of the past rewards and the initial estimate Q1.

We call this a weighted average because the sum of the weights is $(1-\alpha)^n + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} = 1$, as you can check for yourself. Note that the weight $\alpha(1-\alpha)^{n-i}$ given to the reward Ri depends on how many rewards ago, n − i, it was observed. The quantity 1 − α is less than 1, and thus the weight given to Ri decreases as the number of intervening rewards increases. In fact, the weight decays exponentially according to the exponent on 1 − α. Accordingly, this is sometimes called an exponential recency-weighted average.

We call this a weighted average because the weights sum to 1. The weight given to Ri depends on how long ago it was observed, n − i; the weight decreases as more rewards intervene, in fact decaying exponentially. This is therefore sometimes called an exponential recency-weighted average.
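The only change from the sample-average update is that the 1/n step size is replaced by a fixed α; a one-line sketch (my own naming):

```python
def constant_step_update(q, reward, alpha=0.1):
    """Q_{n+1} = Q_n + alpha * (R_n - Q_n); with a fixed alpha,
    recent rewards carry exponentially more weight than old ones."""
    return q + alpha * (reward - q)
```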

Sometimes it is convenient to vary the step-size parameter from step to step. Let αn(a) denote the step-size parameter used to process the reward received after the nth selection of action a. As we have noted, the choice αn(a) = 1/n results in the sample-average method, which is guaranteed to converge to the true action values by the law of large numbers. But of course convergence is not guaranteed for all choices of the sequence {αn(a)}. A well-known result in stochastic approximation theory gives us the conditions required to assure convergence with probability 1:

Sometimes it is convenient to vary the step-size parameter from step to step. Let αn(a) denote the step size used to process the reward received after the nth selection of action a. As noted, αn(a) = 1/n gives the sample-average method, which the law of large numbers guarantees converges to the true action values, but convergence is not guaranteed for every sequence. The conditions required to assure convergence with probability 1 are:

$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$$

Both conditions are met for αn(a) = 1/n, but for a constant step size αn(a) = α the second condition fails, indicating that the estimates never completely converge and instead continue to vary in response to the most recently received rewards.

Exercise 2.4

[Image: Exercise 2.4.]

2.6 Optimistic Initial Values

All the methods we have discussed so far are dependent to some extent on the initial action-value estimates, Q1(a). In the language of statistics, these methods are biased by their initial estimates. For the sample-average methods, the bias disappears once all actions have been selected at least once, but for methods with constant alpha, the bias is permanent, though decreasing over time as given by (2.6). In practice, this kind of bias is usually not a problem and can sometimes be very helpful. The downside is that the initial estimates become, in effect, a set of parameters that must be picked by the user, if only to set them all to zero. The upside is that they provide an easy way to supply some prior knowledge about what level of rewards can be expected.

All the methods discussed so far depend to some extent on the initial action-value estimates Q1(a); in statistical terms, they are biased by their initial estimates. For the sample-average method the bias disappears once every action has been selected at least once, but with a constant step size α the bias decreases over time yet never vanishes. The downside is that the initial estimates effectively become a set of parameters the user must choose, even if only to set them all to zero; the upside is that they provide an easy way to supply prior knowledge about what level of reward to expect.

Initial action values can also be used as a simple way to encourage exploration. Suppose that instead of setting the initial action values to zero, as we did in the 10-armed testbed, we set them all to +5. Recall that the q*(a) in this problem are selected from a normal distribution with mean 0 and variance 1. An initial estimate of +5 is thus wildly optimistic. But this optimism encourages action-value methods to explore. Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being “disappointed” with the rewards it is receiving. The result is that all actions are tried several times before the value estimates converge. The system does a fair amount of exploration even if greedy actions are selected all the time.

Initial action values also provide a simple way to encourage exploration. For example, in the 10-armed testbed we can set the initial values to +5 instead of 0. Since q*(a) is drawn from a normal distribution with mean 0 and variance 1, an initial estimate of +5 is wildly optimistic. This optimism encourages action-value methods to explore: whichever action is selected first, the reward is less than the starting estimate, so the learner is "disappointed" and switches to other actions. As a result, all actions are tried several times before the estimates converge, and the system does a fair amount of exploration even if it always selects greedily.

[Figure 2.3: The effect of optimistic initial action-value estimates on the 10-armed testbed; both methods use a constant step size, α = 0.1.]

Initially, the optimistic method performs worse because it explores more, but eventually it performs better because its exploration decreases with time. We call this technique for encouraging exploration optimistic initial values. We regard it as a simple trick that can be quite effective on stationary problems, but it is far from being a generally useful approach to encouraging exploration. For example, it is not well suited to nonstationary problems because its drive for exploration is inherently temporary. If the task changes, creating a renewed need for exploration, this method cannot help.

The figure above shows the case with initial value +5 as well as the case with 0. Initially the optimistic method does worse because it explores more, but eventually it does better because its exploration decreases with time. This technique is called optimistic initial values. It is a simple trick that can be quite effective on stationary problems, but it is far from a generally useful way to encourage exploration; for example, it is not well suited to nonstationary problems, because its drive for exploration is inherently temporary, and any method that focuses on the initial conditions is unlikely to help in the general nonstationary case.
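Optimistic initialization changes only how the estimates start; below is a hedged sketch of the greedy, constant-step-size (α = 0.1) setup described above, with my own variable names:

```python
import numpy as np

def optimistic_greedy_run(q_true, n_steps=1000, q1=5.0, alpha=0.1):
    """Greedy selection with optimistic initial estimates Q_1(a) = q1
    and a constant step size alpha."""
    k = len(q_true)
    Q = np.full(k, q1)                                    # wildly optimistic start
    rewards = np.zeros(n_steps)
    for t in range(n_steps):
        a = np.random.choice(np.flatnonzero(Q == Q.max()))
        r = np.random.normal(q_true[a], 1.0)
        Q[a] += alpha * (r - Q[a])                        # early rewards "disappoint"
        rewards[t] = r
    return rewards
```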

2.7 Upper-Confidence-Bound Action Selection

Exploration is needed because there is always uncertainty about the accuracy of the action-value estimates. The greedy actions are those that look best at present, but some of the other actions may actually be better

Because there is always uncertainty in the action-value estimates, exploration is necessary. The greedy actions are those that look best at present, but some other actions may actually be better.

epsilon-greedy action selection forces the non-greedy actions to be tried, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates

ε-greedy forces the non-greedy actions to be tried, but it does so indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to the maximum and the uncertainty in those estimates.

$$A_t \doteq \operatorname*{arg\,max}_a \left[\, Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$
The formula above embodies this idea: Nt(a) is the number of times action a has been selected prior to time t, and c > 0 controls the degree of exploration. If Nt(a) = 0, then a is considered to be a maximizing action.

The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of a’s value. The quantity being max’ed over is thus a sort of upper bound on the possible true value of action a, with c determining the confidence level.

The idea of upper confidence bound (UCB) action selection is that the square-root term measures the uncertainty, or variance, in the estimate of a's value; the quantity being maximized is thus a kind of upper bound on the possible true value of a, with c determining the confidence level. Each time a is selected its uncertainty shrinks: Nt(a) increases, so the square-root term decreases. Each time an action other than a is selected, t increases while Nt(a) stays the same, so the uncertainty grows. The natural logarithm means these increases get smaller over time but are unbounded; all actions are eventually selected, but actions with lower value estimates, or that have already been selected often, are selected with decreasing frequency.
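A sketch of UCB selection following the formula above (ucb_action and its arguments are my own names; actions with Nt(a) = 0 are taken first, as the text says):

```python
import numpy as np

def ucb_action(q_estimates, counts, t, c=2.0):
    """Select argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ], with t >= 1."""
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(np.random.choice(untried))     # N_t(a) = 0 counts as maximizing
    bounds = q_estimates + c * np.sqrt(np.log(t) / counts)
    return int(np.random.choice(np.flatnonzero(bounds == bounds.max())))
```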

[Figure 2.4: Average performance of UCB action selection (c = 2) on the 10-armed testbed, compared with ε-greedy (ε = 0.1).]

Results with UCB on the 10-armed testbed are shown in Figure 2.4. UCB often performs well, as shown here, but is more difficult than ε-greedy to extend beyond bandits to the more general reinforcement learning settings considered in the rest of this book. One difficulty is in dealing with nonstationary problems; methods more complex than those presented in Section 2.5 would be needed. Another difficulty is dealing with large state spaces, particularly when using function approximation as developed in Part II of this book. In these more advanced settings the idea of UCB action selection is usually not practical.

The figure above shows the results of the UCB algorithm, which generally performs well.

During the first 10 steps the method cycles through all the actions, because any action with Nt(a) = 0 is treated as a maximizing action. Around step 11 it begins selecting based on the estimates, which is essentially greedy; but once an action has been selected many times its upper-confidence bonus shrinks, so less-sampled actions are chosen again when the agent revisits the choice.

2.8 Gradient Bandit Algorithms

In this section we consider learning a numerical preference for each action a, which we denote Ht(a). The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward.

In this section we learn a numerical preference Ht(a) for each action a. The larger the preference, the more often the action is taken, but the preference has no interpretation in terms of reward.

Only the relative preference of one action over another is important; if we add 1000 to all the action preferences there is no effect on the action probabilities, which are determined according to a soft-max distribution

Only relative preferences matter; adding a large constant to every preference changes nothing. The action probabilities are given by the soft-max distribution below.

$$\Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a)$$
Here we introduce a new quantity, πt(a), the probability of taking action a at time t. Initially all preferences are the same, so every action has an equal probability of being selected.

There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. On each step, after selecting action At and receiving the reward Rt, the action preferences are updated by:

$$H_{t+1}(A_t) \doteq H_t(A_t) + \alpha\bigl(R_t - \bar{R}_t\bigr)\bigl(1 - \pi_t(A_t)\bigr), \qquad H_{t+1}(a) \doteq H_t(a) - \alpha\bigl(R_t - \bar{R}_t\bigr)\pi_t(a) \quad \text{for all } a \neq A_t$$
Based on the idea of stochastic gradient ascent, the preferences are updated as above, where α is the step size and R̄t is the average of the rewards so far (the baseline). If the reward is above the baseline, the probability of selecting At increases; if below, it decreases. The preferences of the non-selected actions move in the opposite direction.
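A sketch of one step of the gradient bandit algorithm under these update rules (softmax and gradient_bandit_step are my own names; the baseline R̄t is kept as a running average of the rewards):

```python
import numpy as np

def softmax(h):
    """pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b)), computed stably."""
    z = np.exp(h - h.max())
    return z / z.sum()

def gradient_bandit_step(h, baseline, t, q_true, alpha=0.1):
    """Sample A_t from pi_t, observe R_t, then update preferences and baseline."""
    pi = softmax(h)
    a = np.random.choice(len(h), p=pi)
    r = np.random.normal(q_true[a], 1.0)              # R_t ~ N(q*(A_t), 1)
    baseline += (r - baseline) / t                    # running average R-bar_t (t >= 1)
    one_hot = np.zeros(len(h))
    one_hot[a] = 1.0
    h = h + alpha * (r - baseline) * (one_hot - pi)   # H_{t+1}(a) for all a at once
    return h, baseline
```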

[Figure 2.5: Average performance of the gradient bandit algorithm with and without a reward baseline on a 10-armed testbed whose true expected rewards are near +4.]

Figure 2.5 shows results with the gradient bandit algorithm on a variant of the 10-armed testbed in which the true expected rewards were selected according to a normal distribution with a mean of +4 instead of zero (and with unit variance as before). This shifting up of all the rewards has absolutely no effect on the gradient bandit algorithm because of the reward baseline term, which instantaneously adapts to the new level. But if the baseline were omitted (that is, if R̄t was taken to be constant zero in (2.12)), then performance would be significantly degraded, as shown in the figure.

In the figure above, the true expected rewards were drawn from a normal distribution with mean +4 (and unit variance, as before). This shift has no effect on the gradient bandit algorithm, because the reward baseline term immediately adapts to the new level; but if the baseline is omitted (R̄t fixed at 0), performance is significantly degraded.


2.9 Associative Search (Contextual Bandits)

So far in this chapter we have considered only nonassociative tasks, that is, tasks in which there is no need to associate different actions with different situations. In these tasks the learner either tries to find a single best action when the task is stationary, or tries to track the best action as it changes over time when the task is nonstationary.

So far we have considered only nonassociative tasks, in which there is no need to associate different actions with different situations. In such tasks the learner either tries to find the single best action when the task is stationary, or tries to track the best action as it changes over time when the task is nonstationary.

In a general reinforcement learning task, however, the goal is usually to learn a policy: a mapping from particular situations to the actions that are best in those situations. Below we discuss the simplest way in which nonassociative tasks extend to the associative setting.

As an example, suppose there are several different k-armed bandit tasks, and that on each step you confront one of these chosen at random. Thus, the bandit task changes randomly from step to step. This would appear to you as a single, nonstationary k-armed bandit task whose true action values change randomly from step to step. You could try using one of the methods described in this chapter that can handle nonstationarity, but unless the true action values change slowly, these methods will not work very well.

For example, suppose there are several different k-armed bandit tasks, and on each step you face one of them chosen at random, so the bandit task changes randomly from step to step. To you this would appear as a single nonstationary k-armed bandit whose true action values change randomly from step to step. You could try the methods from this chapter that handle nonstationarity, but unless the true action values change slowly, these methods will not work very well.

Now suppose, however, that when a bandit task is selected for you, you are given some distinctive clue about its identity (but not its action values). Maybe you are facing an actual slot machine that changes the color of its display as it changes its action values. Now you can learn a policy associating each task, signaled by the color you see, with the best action to take when facing that task—for instance, if red, select arm 1; if green, select arm 2. With the right policy you can usually do much better than you could in the absence of any information distinguishing one bandit task from another.

Now suppose that whenever you face one of these bandit tasks you are given some distinctive clue about its identity (but not its action values). Perhaps you are facing an actual slot machine whose display color changes whenever its action values change. Now you can learn a policy associating each task, signaled by the color you see, with the best action to take when facing it: for instance, if red, select arm 1; if green, select arm 2. With the right policy you can usually do much better than you could without any information distinguishing one bandit task from another.

This is an example of an associative search task, so called because it involves both trial-and-error learning to search for the best actions, and association of these actions with the situations in which they are best. Associative search tasks are often now called contextual bandits in the literature. Associative search tasks are intermediate between the k-armed bandit problem and the full reinforcement learning problem. They are like the full reinforcement learning problem in that they involve learning a policy, but like our version of the k-armed bandit problem in that each action affects only the immediate reward. If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem. We present this problem in the next chapter and consider its ramifications throughout the rest of the book.

This is an example of an associative search task, so called because it involves both trial-and-error learning to search for the best actions and associating those actions with the situations in which they are best. Associative search tasks are now often called contextual bandits in the literature.

2.10 Summary

This chapter introduced several methods:
The ε-greedy method chooses randomly among the actions a small fraction of the time.
UCB chooses deterministically, but achieves exploration by favoring at each step the actions that have so far received fewer samples.
Gradient bandit algorithms do not estimate action values at all; instead they learn action preferences and favor the more preferred actions in a graded, probabilistic manner using a soft-max distribution.
Simply initializing the value estimates optimistically causes even a purely greedy method to explore substantially.
[Figure 2.6: A parameter study of the bandit algorithms presented in this chapter; each point is the average reward over the first 1000 steps at a particular parameter setting.]
The figure above shows the performance curves of the learning methods from this chapter; UCB performs best overall.

Although the simple methods explored in this chapter may be the best we can do at present, they are far from a fully satisfactory solution to the problem of balancing exploration and exploitation.

Although these simple methods may be the best we can do at present, they are far from a fully satisfactory solution to the problem of balancing exploration and exploitation.

The Gittins-index approach is an instance of Bayesian methods, which assume a known initial distribution over the action values and then update the distribution exactly after each step (assuming that the true action values are stationary). In general, the update computations can be very complex, but for certain special distributions (called conjugate priors) they are easy. One possibility is to then select actions at each step according to their posterior probability of being the best action. This method, sometimes called posterior sampling or Thompson sampling, often performs similarly to the best of the distribution-free methods we have presented in this chapter.

Bayesian methods assume a known initial distribution over the action values and then update the distribution exactly after each step (assuming the true action values are stationary). In general the update computation can be very complex, but for certain special distributions (conjugate priors) it is easy. One can then select actions at each step according to their posterior probability of being the best action; this is called posterior sampling or Thompson sampling, and it often performs about as well as the best of the distribution-free methods presented in this chapter.
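As an illustration of posterior (Thompson) sampling, here is a minimal sketch for Bernoulli bandits with Beta(1, 1) priors (a different reward model from the Gaussian testbed above; the names successes and failures are mine):

```python
import numpy as np

def thompson_sampling_action(successes, failures):
    """Sample one plausible value per arm from its Beta posterior
    and act greedily on the sampled values."""
    samples = np.random.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

# After pulling arm a and observing a reward r in {0, 1}:
#   successes[a] += r
#   failures[a]  += 1 - r
```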

In the Bayesian setting it is even conceivable to compute the optimal balance between exploration and exploitation. One can compute for any possible action the probability of each possible immediate reward and the resultant posterior distributions over action values. This evolving distribution becomes the information state of the problem. Given a horizon, say of 1000 steps, one can consider all possible actions, all possible resulting rewards, all possible next actions, all next rewards, and so on for all 1000 steps. Given the assumptions, the rewards and probabilities of each possible chain of events can be determined, and one need only pick the best. But the tree of possibilities grows extremely rapidly; even if there were only two actions and two rewards, the tree would have 2^2000 leaves. It is generally not feasible to perform this immense computation exactly, but perhaps it could be approximated efficiently. This approach would effectively turn the bandit problem into an instance of the full reinforcement learning problem. In the end, we may be able to use approximate reinforcement learning methods such as those presented in Part II of this book to approach this optimal solution. But that is a topic for research and beyond the scope of this introductory book.

In the Bayesian setting it is even conceivable to compute the optimal balance between exploration and exploitation. For any possible action one can compute the probability of each possible immediate reward and the resulting posterior distribution over action values; this evolving distribution becomes the information state of the problem. Given a horizon of, say, 1000 steps, one can consider all possible actions, all possible resulting rewards, all possible next actions, and so on for all 1000 steps. Under these assumptions the rewards and probabilities of every possible chain of events can be determined, and one need only pick the best, although the tree of possibilities grows far too quickly for exact computation. This approach effectively turns the bandit problem into an instance of the full reinforcement learning problem.