RL (Chapter 2): Multi-armed Bandits (多臂赌博机)

These are reinforcement learning notes, based mainly on the following:

  • In this chapter we study the evaluative aspect of reinforcement learning in a simplified setting, one that does not involve learning to act in more than one situation. The particular nonassociative, evaluative feedback problem that we explore is a simple version of the $k$-armed bandit problem.

Notation

  • $A_t$: the action selected on time step $t$
  • $R_t$: the corresponding reward after taking $A_t$
  • $q_*(a)$: the value of an arbitrary action $a$, i.e. the expected reward given that action $a$ is selected:
    $q_*(a)=\mathbb E[R_t\mid A_t=a]$
  • $Q_t(a)$: the estimated value of action $a$ at time step $t$
  • $\pi_t(a)$: the probability of taking action $a$ at time $t$
  • $\mathbb I_{predicate}$ (indicator function): denotes the random variable that is $1$ if $predicate$ is true and $0$ if it is not

Since I cannot type the blackboard-bold 1, I use $\mathbb I$ in its place for now.

  • $\alpha_t(a)$ / $\alpha$: step-size parameter (步长参数)
    • $\alpha_t(a)$ denotes the step-size parameter used to process the reward received after the $t$th selection of action $a$

Used in the incremental implementation.

  • $N_t(a)$: the number of times that action $a$ has been selected prior to time $t$
  • $H_t(a)$: preference (偏好) for action $a$

Used in the gradient bandit algorithm.

A $k$-Armed Bandit Problem

  • This is the original form of the *$k$-armed bandit problem*:
    • You are faced repeatedly with a choice among $k$ different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected.
    • Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or *time steps*.
  • In our $k$-armed bandit problem, each action has an expected or mean reward given that that action is selected; let us call this the *value* of that action.

Basically, we would like $Q_t(a)$ (our estimate) to be close to $q_*(a)$ (the true value).

Explore vs. Exploit

  • *exploit* (开发): Select the action with the largest estimated value (the *greedy* action, 贪心动作).
  • *explore* (试探): Select one of the nongreedy actions.

  • Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run.
  • Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them many times.
    • For example, suppose a greedy action’s value is known with certainty, while several other actions are estimated to be nearly as good but with substantial uncertainty. The uncertainty is such that at least one of these other actions probably is actually better than the greedy action, but you don’t know which one. If you have many time steps ahead on which to make action selections, then it may be better to explore the nongreedy actions and discover which of them are better than the greedy action.

  • Because it is not possible both to explore and to exploit with any single action selection, one often refers to the "conflict" between exploration and exploitation.

Action-Value Methods (动作-价值方法)

Action-Value Methods include

  • methods for estimating the values of actions
  • methods for using the estimates to make action selection decisions

Sample-average Method (采样平均方法)

  • Recall that the true value of an action is the mean reward received when that action is selected.
  • One natural way to estimate this is by averaging the rewards actually received:
    $$Q_t(a)=\frac{\text{sum of rewards when }a\text{ taken prior to }t}{\text{number of times }a\text{ taken prior to }t}=\frac{\sum_{i=1}^{t-1}R_i\cdot\mathbb I_{A_i=a}}{\sum_{i=1}^{t-1}\mathbb I_{A_i=a}}\qquad(2.1)$$
    • If the denominator is zero, then we define $Q_t(a)$ as some default value, such as $0$. As the denominator goes to infinity, by the law of large numbers, $Q_t(a)$ converges to $q_*(a)$ (the strong law of large numbers guarantees that the sample average converges to the true mean with probability 1). A small code sketch follows below.
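As a concrete illustration, here is a minimal Python sketch of the sample-average estimate (my own code, not from the book, assuming NumPy): store the rewards observed for each action and average them, with a default value before an action has been tried.

```python
import numpy as np

k = 10
rewards = [[] for _ in range(k)]          # rewards observed so far, per action

def sample_average(a, default=0.0):
    """Q_t(a): mean of the rewards received when action a was taken, or a default."""
    return float(np.mean(rewards[a])) if rewards[a] else default

# record a few rewards for action 3 and query the estimates
for r in (1.2, -0.4, 0.7):
    rewards[3].append(r)
print(sample_average(3))   # (1.2 - 0.4 + 0.7) / 3 = 0.5
print(sample_average(0))   # never selected, so the default value 0.0
```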

Greedy Method (贪心方法)

  • The simplest action selection rule is to select one of the actions with the highest estimated value. If there is more than one greedy action, then a selection is made among them in some arbitrary way, perhaps randomly:
    $$A_t=\argmax_a Q_t(a)\qquad(2.2)$$
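A common way to implement (2.2) with random tie-breaking is to choose uniformly among all indices that attain the maximum; a small sketch (my own code, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(Q):
    """Greedy selection per (2.2), with ties broken at random."""
    best = np.flatnonzero(Q == Q.max())   # indices of all maximal estimates
    return int(rng.choice(best))

Q = np.array([0.1, 0.5, 0.5, -0.2])
print(greedy(Q))    # 1 or 2, chosen at random
```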

$\varepsilon$-greedy Method

  • Behave greedily most of the time, but with small probability $\varepsilon$, instead select randomly from among all the actions with equal probability, independently of the action-value estimates.

  • An advantage of this method is that, in the limit as the number of plays increases, every action will be sampled an infinite number of times, thus ensuring that all the $Q_t(a)$ converge to their respective $q_*(a)$.
  • This of course implies that the probability of selecting the optimal action converges to greater than $1-\varepsilon$, that is, to near certainty.

These are just asymptotic guarantees, however, and say little about the practical effectiveness of the methods.
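A minimal sketch of $\varepsilon$-greedy selection (my own code, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, eps=0.1):
    """With probability eps choose uniformly at random; otherwise act greedily."""
    if rng.random() < eps:
        return int(rng.integers(len(Q)))                      # explore
    return int(rng.choice(np.flatnonzero(Q == Q.max())))      # exploit, ties at random

Q = np.array([0.1, 0.5, 0.5, -0.2])
print(epsilon_greedy(Q, eps=0.1))
```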

The 10-armed Testbed (10 臂测试平台)

  • To roughly assess the relative effectiveness of the greedy and $\varepsilon$-greedy methods, we compared them numerically on a set of 2000 randomly generated 10-armed bandit problems.

  • For each bandit, the action values, $q_*(a)$, $a=1,\dots,10$, were selected according to a standard normal (Gaussian) distribution.
  • Then, when a learning method applied to that problem selected action $A_t$ at time step $t$, the actual reward, $R_t$, was selected from a normal distribution with mean $q_*(A_t)$ and variance 1. These distributions are shown in gray in Figure 2.1. We call this suite of test tasks the *10-armed testbed*.
    [Figure 2.1: reward distributions of an example 10-armed bandit task, with the true values $q_*(a)$ drawn from a standard normal distribution]

  • For any learning method, we can measure its performance and behavior as it improves with experience over 1000 time steps when applied to one of the bandit problems. This makes up one *run*. Repeating this for 2000 independent runs, each with a different bandit problem, we obtained measures of the learning algorithm’s average behavior.
  • Figure 2.2 compares a greedy method with two $\varepsilon$-greedy methods ($\varepsilon=0.01$ and $\varepsilon=0.1$) on the 10-armed testbed.

All the methods formed their action-value estimates using the sample-average technique (with an initial estimate of $0$).
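A sketch of how one such testbed task can be generated (my own code; the function and variable names are not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10

q_star = rng.normal(loc=0.0, scale=1.0, size=k)   # true action values ~ N(0, 1)

def bandit(a):
    """One noisy reward for action a, drawn from N(q_*(a), 1)."""
    return rng.normal(q_star[a], 1.0)

print("optimal arm:", int(q_star.argmax()), "one reward from it:", bandit(q_star.argmax()))
```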

[Figure 2.2: average reward and % optimal action over time for the greedy and $\varepsilon$-greedy methods on the 10-armed testbed]

  • The greedy method performs significantly worse in the long run because it often gets stuck performing suboptimal actions. The $\varepsilon$-greedy methods eventually performed better because they continued to explore and to improve their chances of recognizing the optimal action.
  • The $\varepsilon=0.1$ method explored more, and usually found the optimal action earlier, but it never selected that action more than 91% of the time. The $\varepsilon=0.01$ method improved more slowly, but eventually would perform better than the $\varepsilon=0.1$ method on both performance measures shown in the figure.
  • It is also possible to reduce $\varepsilon$ over time to try to get the best of both high and low values.

  • The advantage of $\varepsilon$-greedy over greedy methods depends on the task.
    • For example, suppose the reward variance had been larger. With noisier rewards it takes more exploration to find the optimal action, and $\varepsilon$-greedy methods should fare even better relative to the greedy method.
    • On the other hand, if the reward variances were zero, then the greedy method would know the true value of each action after trying it once. In this case the greedy method might actually perform best because it would soon find the optimal action and then never explore.

  • But even in the deterministic (确定的) case, there is a large advantage to exploring if we weaken some of the other assumptions.
    • For example, suppose the bandit task were *nonstationary* (非平稳的), that is, the true values of the actions changed over time. In this case exploration is needed even in the deterministic case to make sure one of the nongreedy actions has not changed to become better than the greedy one.

Nonstationarity (非平稳性) is the case most commonly encountered in reinforcement learning.

Incremental Implementation (增量式实现)

  • The action-value methods we have discussed so far all estimate action values as sample averages of observed rewards. We now turn to the question of how these averages can be computed in a computationally efficient manner, in particular, with constant memory and constant per-time-step computation.

  • To simplify notation we concentrate on a single action. Let $R_i$ now denote the reward received after the $i$th selection of this action, and let $Q_n$ denote the estimate of its action value after it has been selected $n-1$ times, which we can now write simply as
    $$Q_n=\frac{R_1+R_2+\cdots+R_{n-1}}{n-1}$$

  • The obvious implementation is to maintain, for each action $a$, a record of all the rewards and then perform this computation whenever the estimated value is needed. A problem with this straightforward implementation is that its memory and computational requirements grow over time without bound.
  • As you might suspect, this is not really necessary. It is easy to devise incremental formulas for updating averages. Given $Q_n$ and the $n$th reward $R_n$, the new average of all $n$ rewards can be computed by
    $$\begin{aligned}Q_{n+1}&=\frac{1}{n}\sum_{i=1}^n R_i\\&=\frac{1}{n}\Big((n-1)Q_n+R_n\Big)\\&=Q_n+\frac{1}{n}\Big[R_n-Q_n\Big]\qquad(2.3)\end{aligned}$$
    which holds even for $n=1$, obtaining $Q_2=R_1$ for arbitrary $Q_1$.

  • The update rule $(2.3)$ is of a form that occurs frequently throughout this book. The general form is
    $$\text{NewEstimate}\leftarrow\text{OldEstimate}+\text{StepSize}\,\big[\text{Target}-\text{OldEstimate}\big]\qquad(2.4)$$

Pseudocode for a complete bandit algorithm

  • The function bandit(a) is assumed to take an action and return a corresponding reward.
  • The pseudocode uses incrementally computed sample averages and $\varepsilon$-greedy action selection.

[Pseudocode: a simple bandit algorithm using incrementally computed sample averages and $\varepsilon$-greedy action selection]
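A runnable Python sketch of this pseudocode (my own translation, assuming NumPy; the testbed task is generated inline so the snippet is self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps, steps = 10, 0.1, 1000

q_star = rng.normal(0.0, 1.0, size=k)          # true values of one testbed task

def bandit(a):
    return rng.normal(q_star[a], 1.0)          # reward ~ N(q_*(a), 1)

Q = np.zeros(k)                                # estimates, initialised to 0
N = np.zeros(k, dtype=int)                     # selection counts

for _ in range(steps):
    if rng.random() < eps:                     # explore with probability eps
        A = rng.integers(k)
    else:                                      # otherwise exploit, ties broken at random
        A = rng.choice(np.flatnonzero(Q == Q.max()))
    R = bandit(A)
    N[A] += 1
    Q[A] += (R - Q[A]) / N[A]                  # incremental sample average, eq. (2.3)

print("optimal arm:", int(q_star.argmax()), "most selected arm:", int(N.argmax()))
```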

Tracking a Nonstationary Problem (跟踪一个非平稳问题)

  • The averaging methods discussed so far are appropriate in a stationary environment, that is, for bandit problems in which the reward probabilities do not change over time.
  • But we often encounter reinforcement learning problems that are effectively nonstationary. In such cases it makes sense to weight recent rewards more heavily than long-past ones. One of the most popular ways of doing this is to use a constant step-size parameter $\alpha\in(0,1]$:
    $$Q_{n+1}=Q_n+\alpha\big[R_n-Q_n\big]\qquad(2.5)$$
    This results in $Q_{n+1}$ being a weighted average of past rewards and the initial estimate $Q_1$:
    $$Q_{n+1}=(1-\alpha)^nQ_1+\sum_{i=1}^n\alpha(1-\alpha)^{n-i}R_i\qquad(2.6)$$
    • We call this a weighted average because the sum of the weights is $(1-\alpha)^n+\sum_{i=1}^n\alpha(1-\alpha)^{n-i}=1$.
    • Note that the weight, $\alpha(1-\alpha)^{n-i}$, given to the reward $R_i$ depends on how many rewards ago, $n-i$, it was observed. The weight given to $R_i$ decreases exponentially as the number of intervening rewards increases. (If $1-\alpha=0$, then all the weight goes on the very last reward, $R_n$.) Accordingly, this is sometimes called an exponential recency-weighted average (指数近因加权平均). A numerical check is sketched after this list.
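A quick numerical check of (2.6) (my own sketch, assuming NumPy): the constant-$\alpha$ recursion (2.5) reproduces the closed-form weighted average exactly, and the weights sum to one.

```python
import numpy as np

alpha, n = 0.1, 20
rng = np.random.default_rng(0)
R = rng.normal(size=n)        # R[0..n-1] stands for R_1..R_n
Q1 = 0.0

Q = Q1
for r in R:                   # incremental constant-step-size update, eq. (2.5)
    Q += alpha * (r - Q)

# closed-form weighted average, eq. (2.6): weight alpha*(1-alpha)^(n-i) on R_i
weights = alpha * (1 - alpha) ** (n - 1 - np.arange(n))
closed = (1 - alpha) ** n * Q1 + weights @ R

print(np.isclose(Q, closed))                               # True
print(np.isclose((1 - alpha) ** n + weights.sum(), 1.0))   # weights sum to 1
```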

  • Sometimes it is convenient to vary the step-size parameter from step to step. As we have noted, the choice $\alpha_n(a)=\frac{1}{n}$ results in the sample-average method, which is guaranteed to converge to the true action values by the law of large numbers. But of course convergence is not guaranteed for all choices of the sequence $\{\alpha_n(a)\}$.
  • A well-known result in stochastic approximation theory (随机逼近理论) gives us the conditions required to assure convergence with probability 1:
    $$\sum_{n=1}^{\infty}\alpha_n(a)=\infty\qquad\text{and}\qquad\sum_{n=1}^{\infty}\alpha_n^2(a)<\infty\qquad(2.7)$$
    • The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations (随机波动).
    • The second condition guarantees that eventually the steps become small enough to assure convergence.
    • Note that both convergence conditions are met for the sample-average case, $\alpha_n(a)=\frac{1}{n}$, but not for the case of constant step-size parameter, $\alpha_n(a)=\alpha$. In the latter case, the second condition is not met, indicating that the estimates never completely converge but continue to vary in response to the most recently received rewards. As we mentioned above, this is actually desirable in a nonstationary environment.
    • In addition, sequences of step-size parameters that meet the conditions (2.7) often converge very slowly or need considerable tuning in order to obtain a satisfactory convergence rate. Although sequences of step-size parameters that meet these convergence conditions are often used in theoretical work, they are seldom used in applications and empirical research.
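As a quick sanity check (a standard calculation, not part of the notes): the sample-average choice satisfies both conditions, while a constant step size fails the second one:
$$\sum_{n=1}^{\infty}\frac{1}{n}=\infty,\qquad\sum_{n=1}^{\infty}\frac{1}{n^2}=\frac{\pi^2}{6}<\infty;\qquad\text{but for }\alpha_n(a)=\alpha>0:\quad\sum_{n=1}^{\infty}\alpha=\infty,\qquad\sum_{n=1}^{\infty}\alpha^2=\infty.$$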

Optimistic Initial Values (乐观初始值)

  • All the methods we have discussed so far are dependent to some extent on the initial action-value estimates, $Q_1(a)$. In the language of statistics, these methods are *biased* by their initial estimates.
  • For the sample-average methods, the bias disappears once all actions have been selected at least once, but for methods with constant $\alpha$, the bias is permanent, though decreasing over time.

In practice, this kind of bias is usually not a problem, and can sometimes be very helpful.

  • The downside is that the initial estimates become, in effect, a set of parameters that must be picked by the user, if only to set them all to zero.
  • The upside is that they provide an easy way to supply some prior knowledge (先验知识) about what level of rewards can be expected.

  • Initial action values can also be used as a simple way of encouraging exploration.
    • Suppose that instead of setting the initial action values to zero, as we did in the 10-armed testbed, we set them all to $+5$. Recall that the $q_*(a)$ in this problem are selected from a normal distribution with mean 0 and variance 1. An initial estimate of $+5$ is thus wildly optimistic. But this optimism encourages action-value methods to explore. Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being “disappointed” with the rewards it is receiving. The result is that all actions are tried several times before the value estimates converge. The system does a fair amount of exploration even if greedy actions are selected all the time. (A minimal sketch follows this list.)
      [Figure 2.3: the effect of optimistic initial action-value estimates on the 10-armed testbed, comparing an optimistic greedy method ($Q_1=+5$) with a realistic $\varepsilon$-greedy method ($Q_1=0$)]
      • Initially, the optimistic method performs worse because it explores more, but eventually it performs better because its exploration decreases with time.
  • We regard optimistic initial values as a simple trick that can be quite effective on stationary problems, but it is far from being a generally useful approach to encouraging exploration.
    • For example, it is not well suited to nonstationary problems because its drive for exploration is inherently temporary. If the task changes, creating a renewed need for exploration, this method cannot help. Indeed, any method that focuses on the initial conditions in any special way is unlikely to help with the general nonstationary case.
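A minimal sketch of the optimistic variant described above (my own code: purely greedy selection, constant step size $\alpha=0.1$, and $Q_1(a)=+5$):

```python
import numpy as np

rng = np.random.default_rng(0)
k, alpha, steps = 10, 0.1, 1000
q_star = rng.normal(0.0, 1.0, size=k)     # true values of one testbed task

Q = np.full(k, 5.0)                       # wildly optimistic initial estimates
for _ in range(steps):
    A = rng.choice(np.flatnonzero(Q == Q.max()))   # purely greedy selection
    R = rng.normal(q_star[A], 1.0)
    Q[A] += alpha * (R - Q[A])            # constant step size, eq. (2.5)

print("optimal arm:", int(q_star.argmax()), "greedy arm after learning:", int(Q.argmax()))
```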

The beginning of time occurs only once, and thus we should not focus on it too much.

Unbiased Constant-Step-Size Trick (无偏恒定步长技巧)

  • Is it possible to avoid the initial bias of constant step sizes while retaining their advantages on nonstationary problems?
  • One way is to use a step size of
    $$\beta_n=\frac{\alpha}{\bar o_n}$$
    to process the $n$th reward for a particular action, where $\alpha>0$ is a conventional constant step size and $\bar o_n$ is a trace of one that starts at 0:
    $$\bar o_n=\bar o_{n-1}+\alpha\,(1-\bar o_{n-1}),\quad\text{for }n\ge 1,\qquad\text{with }\bar o_0=0.$$

  • Proof
    $$\begin{aligned}Q_{n+1}&=Q_n+\beta_n(R_n-Q_n)\\&=\beta_nR_n+(1-\beta_n)Q_n\\&=\beta_nR_n+(1-\beta_n)\beta_{n-1}R_{n-1}+(1-\beta_n)(1-\beta_{n-1})Q_{n-1}\\&=\cdots\\&=\sum_{i=1}^n\Big(\beta_iR_i\prod_{j=i+1}^{n}(1-\beta_j)\Big)+\prod_{i=1}^{n}(1-\beta_i)\,Q_1\end{aligned}$$
    Since $\beta_1=\frac{\alpha}{\bar o_1}=\frac{\alpha}{0+\alpha}=1$, the $Q_1$ term vanishes, so
    $$Q_{n+1}=\sum_{i=1}^n\Big(\beta_iR_i\prod_{j=i+1}^{n}(1-\beta_j)\Big).$$
    Thus $Q_n$ is an exponential recency-weighted average without initial bias.
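In code, the trick amounts to keeping a trace alongside each estimate and dividing $\alpha$ by it (a sketch of my own; the names are not from the book):

```python
import numpy as np

alpha = 0.1
Q, o_bar = 0.0, 0.0        # Q_1 is arbitrary; the trace starts at o_0 = 0

def update(Q, o_bar, R):
    o_bar = o_bar + alpha * (1.0 - o_bar)   # o_n = o_{n-1} + alpha (1 - o_{n-1})
    beta = alpha / o_bar                    # step size beta_n = alpha / o_n
    Q = Q + beta * (R - Q)                  # beta_1 = 1, so the first update erases Q_1
    return Q, o_bar

rng = np.random.default_rng(0)
for R in rng.normal(1.0, 1.0, size=200):    # rewards from an arm with q_* = 1
    Q, o_bar = update(Q, o_bar, R)
print(Q)                                    # near 1, with no bias from the initial Q
```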

Upper-Confidence-Bound Action Selection (UCB, 基于置信度上界的动作选择)

  • $\varepsilon$-greedy action selection forces the non-greedy actions to be tried, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates.
  • One effective way of doing this is to select actions as
    $$A_t=\argmax_a\left[Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\,\right]\qquad(2.10)$$
    • The number $c>0$ controls the degree of exploration. If $N_t(a)=0$, then $a$ is considered to be a maximizing action.
    • The square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value. The quantity being max’ed over is thus a sort of upper bound on the possible true value of action $a$, with $c$ determining the confidence level (置信水平).
    • Each time $a$ is selected the uncertainty is presumably reduced. On the other hand, each time an action other than $a$ is selected, $t$ increases but $N_t(a)$ does not, so the uncertainty estimate increases. The use of the natural logarithm means that the increase gets smaller over time, but is unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time. (A selection sketch follows the figure below.)
      [Figure 2.4: average performance of UCB action selection versus $\varepsilon$-greedy on the 10-armed testbed]
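A sketch of UCB selection per (2.10) (my own code, assuming NumPy), treating untried actions as maximizing:

```python
import numpy as np

def ucb(Q, N, t, c=2.0):
    """Upper-confidence-bound action selection, eq. (2.10)."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:             # N_t(a) = 0: a counts as a maximizing action
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

Q = np.array([0.2, 0.5, 0.1])
N = np.array([10, 30, 5])
print(ucb(Q, N, t=45))   # the rarely tried arm 2 gets the largest uncertainty bonus
```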

  • UCB is more difficult than $\varepsilon$-greedy to extend beyond bandits to the more general reinforcement learning settings.
    • One difficulty is in dealing with nonstationary problems.
    • Another difficulty is dealing with large state spaces, particularly function approximation. In these more advanced settings the idea of UCB action selection is usually not practical.

Gradient Bandit Algorithms (梯度赌博机算法)

  • In this section we consider learning a numerical *preference* (偏好函数) for each action $a$, which we denote $H_t(a)\in\mathbb R$.
  • The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. Only the relative preference of one action over another is important; the action probabilities are determined according to a *soft-max distribution* as follows:
    $$\pi_t(a)=\Pr\{A_t=a\}=\frac{e^{H_t(a)}}{\sum_{b=1}^{k}e^{H_t(b)}}\qquad(2.11)$$
    Initially all preferences are the same (e.g., $\forall a,\ H_1(a)=0$) so that all actions have an equal probability of being selected.

  • There is a natural learning algorithm for soft-max action preferences based on the idea of stochastic gradient ascent. On each step, after selecting action $A_t$ and receiving reward $R_t$, the preferences are updated by
    $$\begin{aligned}H_{t+1}(A_t)&=H_t(A_t)+\alpha\big(R_t-\overline R_t\big)\big(1-\pi_t(A_t)\big),&&\text{and}\\H_{t+1}(a)&=H_t(a)-\alpha\big(R_t-\overline R_t\big)\pi_t(a),&&\text{for all }a\neq A_t,\end{aligned}\qquad(2.12)$$
    where $\overline R_t\in\mathbb R$ is the average of the rewards up to but not including time $t$ (with $\overline R_1=R_1$), which can be computed incrementally.
    • The $\overline R_t$ term serves as a baseline with which the reward is compared. If the reward is higher than the baseline, then the probability of taking $A_t$ in the future is increased, and if the reward is below the baseline, then the probability is decreased. The non-selected actions move in the opposite direction. (A sketch of one such loop follows this list.)
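A sketch of the gradient bandit loop, combining the soft-max (2.11) with the update (2.12) and an incrementally computed baseline (my own code, on the +4-mean testbed variant described below):

```python
import numpy as np

rng = np.random.default_rng(0)
k, alpha = 10, 0.1
q_star = rng.normal(4.0, 1.0, size=k)     # testbed variant with mean +4 (cf. Figure 2.5)

H = np.zeros(k)                           # preferences, H_1(a) = 0
R_bar = 0.0                               # incremental average of past rewards

for t in range(1, 1001):
    pi = np.exp(H - H.max()); pi /= pi.sum()    # soft-max (2.11), shifted for stability
    A = rng.choice(k, p=pi)
    R = rng.normal(q_star[A], 1.0)
    baseline = R if t == 1 else R_bar           # R_bar_1 = R_1; later, mean of R_1..R_{t-1}
    one_hot = np.zeros(k); one_hot[A] = 1.0
    H += alpha * (R - baseline) * (one_hot - pi)   # preference update (2.12)
    R_bar += (R - R_bar) / t                       # update the running mean

print("optimal arm:", int(q_star.argmax()), "most preferred arm:", int(H.argmax()))
```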

  • Figure 2.5 shows results with the gradient bandit algorithm on a variant of the 10-armed testbed in which the true expected rewards were selected according to a normal distribution with a mean of +4 instead of zero (and with unit variance as before). This shifting up of all the rewards has absolutely no effect on the gradient bandit algorithm because of the reward baseline term, which instantaneously adapts to the new level. But if the baseline were omitted (that is, if $\overline R_t$ was taken to be constant zero in (2.12)), then performance would be significantly degraded, as shown in the figure.

[Figure 2.5: average performance of the gradient bandit algorithm with and without a reward baseline on the 10-armed testbed with true expected rewards shifted up to a mean of +4]


The Bandit Gradient Algorithm as Stochastic Gradient Ascent

  • One can gain a deeper insight into the gradient bandit algorithm by understanding it as a stochastic approximation to gradient ascent.

  • In exact gradient ascent, each action preference $H_t(a)$ would be incremented in proportion to the increment’s effect on performance:
    $$H_{t+1}(a)=H_t(a)+\alpha\frac{\partial\,\mathbb E[R_t]}{\partial H_t(a)}\qquad(2.13)$$
    where the measure of performance here is the expected reward:
    $$\mathbb E[R_t]=\sum_x\pi_t(x)\,q_*(x)$$
  • Of course, it is not possible to implement gradient ascent exactly in our case, but in fact the updates of our algorithm (2.12) are equal to (2.13) in expected value.
  • First we take a closer look at the exact performance gradient:
    $$\begin{aligned}\frac{\partial\,\mathbb E[R_t]}{\partial H_t(a)}&=\frac{\partial}{\partial H_t(a)}\Big[\sum_x\pi_t(x)\,q_*(x)\Big]\\&=\sum_x q_*(x)\,\frac{\partial\pi_t(x)}{\partial H_t(a)}\\&=\sum_x\big(q_*(x)-B_t\big)\frac{\partial\pi_t(x)}{\partial H_t(a)}\end{aligned}$$
    where $B_t$, called the *baseline* (基准项), can be any scalar that does not depend on $x$. We can include a baseline here without changing the equality because the gradient sums to zero over all the actions: as $H_t(a)$ is changed, some actions’ probabilities go up and some go down, but the sum of the changes must be zero because the sum of the probabilities is always one. Next we multiply each term of the sum by $\pi_t(x)/\pi_t(x)$:
    $$\frac{\partial\,\mathbb E[R_t]}{\partial H_t(a)}=\sum_x\pi_t(x)\big(q_*(x)-B_t\big)\frac{\partial\pi_t(x)}{\partial H_t(a)}\Big/\pi_t(x)$$

Note that we did not require any properties of the reward baseline other than that it does not depend on the selected action. For example, we could have set it to zero, or to 1000, and the algorithm would still be an instance of stochastic gradient ascent. The choice of the baseline does not affect the expected update of the algorithm, but it does affect the variance of the update and thus the rate of convergence (as shown, for example, in Figure 2.5). Choosing it as the average of the rewards may not be the very best, but it is simple and works well in practice.

  • The equation is now in the form of an expectation, summing over all possible values $x$ of the random variable $A_t$. Thus:
    $$\begin{aligned}\frac{\partial\,\mathbb E[R_t]}{\partial H_t(a)}&=\mathbb E\bigg[\big(q_*(A_t)-B_t\big)\frac{\partial\pi_t(A_t)}{\partial H_t(a)}\Big/\pi_t(A_t)\bigg]\\&=\mathbb E\bigg[\big(R_t-\overline R_t\big)\frac{\partial\pi_t(A_t)}{\partial H_t(a)}\Big/\pi_t(A_t)\bigg]\end{aligned}$$
    where here we have chosen the baseline $B_t=\overline R_t$ and substituted $R_t$ for $q_*(A_t)$, which is permitted because $\mathbb E[R_t\mid A_t]=q_*(A_t)$. Shortly we will establish that $\frac{\partial\pi_t(x)}{\partial H_t(a)}=\pi_t(x)\big(\mathbb I_{a=x}-\pi_t(a)\big)$. Assuming that for now, we have
    $$\begin{aligned}\frac{\partial\,\mathbb E[R_t]}{\partial H_t(a)}&=\mathbb E\Big[\big(R_t-\overline R_t\big)\,\pi_t(A_t)\big(\mathbb I_{a=A_t}-\pi_t(a)\big)\Big/\pi_t(A_t)\Big]\\&=\mathbb E\Big[\big(R_t-\overline R_t\big)\big(\mathbb I_{a=A_t}-\pi_t(a)\big)\Big]\end{aligned}$$
    Substituting a sample of the expectation above for the performance gradient in $(2.13)$ yields
    $$H_{t+1}(a)=H_t(a)+\alpha\big(R_t-\overline R_t\big)\big(\mathbb I_{a=A_t}-\pi_t(a)\big),\quad\text{for all }a,$$
    which you may recognize as being equivalent to our original algorithm $(2.12)$.
  • Thus it remains only to show that $\frac{\partial\pi_t(x)}{\partial H_t(a)}=\pi_t(x)\big(\mathbb I_{a=x}-\pi_t(a)\big)$. Using the standard quotient rule for derivatives:
    $$\begin{aligned}\frac{\partial\pi_t(x)}{\partial H_t(a)}&=\frac{\partial}{\partial H_t(a)}\left[\frac{e^{H_t(x)}}{\sum_{y=1}^{k}e^{H_t(y)}}\right]\\&=\frac{\mathbb I_{a=x}\,e^{H_t(x)}\sum_{y=1}^{k}e^{H_t(y)}-e^{H_t(x)}e^{H_t(a)}}{\big(\sum_{y=1}^{k}e^{H_t(y)}\big)^2}\\&=\frac{\mathbb I_{a=x}\,e^{H_t(x)}}{\sum_{y=1}^{k}e^{H_t(y)}}-\frac{e^{H_t(x)}e^{H_t(a)}}{\big(\sum_{y=1}^{k}e^{H_t(y)}\big)^2}\\&=\mathbb I_{a=x}\,\pi_t(x)-\pi_t(x)\pi_t(a)\\&=\pi_t(x)\big(\mathbb I_{a=x}-\pi_t(a)\big)\end{aligned}$$
    (A numerical check of this identity is sketched after this list.)
  • We have just shown that the expected update of the gradient bandit algorithm is equal to the gradient of expected reward, and thus that the algorithm is an instance of stochastic gradient ascent. This assures us that the algorithm has robust convergence properties.
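The derivation can also be spot-checked numerically; a small finite-difference sketch (my own, not from the book) verifies the soft-max derivative used above:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=5)

def pi(H):
    e = np.exp(H - H.max())
    return e / e.sum()

a, x, eps = 2, 0, 1e-6
H_plus, H_minus = H.copy(), H.copy()
H_plus[a] += eps
H_minus[a] -= eps
numeric = (pi(H_plus)[x] - pi(H_minus)[x]) / (2 * eps)   # d pi_t(x) / d H_t(a)
analytic = pi(H)[x] * ((a == x) - pi(H)[a])              # pi_t(x) (1_{a=x} - pi_t(a))
print(np.isclose(numeric, analytic))                     # True
```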

Associative Search / Contextual Bandits (关联搜索 / 上下文相关的赌博机)

Nonassociative tasks (非关联任务):

  • Tasks in which there is no need to associate different actions with different situations (无需将不同的动作与不同的情境联系起来)

Associative tasks:

  • There is more than one situation, and the goal is to learn a policy: a mapping from situations to the actions that are best in those situations. (从特定情境到最优动作的映射)
    • For example, maybe you are facing an actual slot machine that changes the color of its display as it changes its action values. Now you can learn a policy associating each task, signaled by the color you see, with the best action to take when facing that task—for instance, if red, select arm 1; if green, select arm 2.
    • This is an example of an associative search task / contextual bandits, so called because it involves both trial-and-error learning to search for the best actions, and association of these actions with the situations in which they are best.
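A minimal sketch of the slot-machine-with-colors example (my own code and parameters, assuming NumPy): keep a separate estimate table per context and run $\varepsilon$-greedy within each.

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps = 2, 0.1
q_star = {"red":   np.array([1.0, 0.0]),   # in the red context, arm 0 is best
          "green": np.array([0.0, 1.0])}   # in the green context, arm 1 is best

Q = {c: np.zeros(k) for c in q_star}       # one estimate table per context
N = {c: np.zeros(k, dtype=int) for c in q_star}

for _ in range(2000):
    color = str(rng.choice(list(q_star)))                  # observe the situation
    if rng.random() < eps:
        A = int(rng.integers(k))
    else:
        A = int(rng.choice(np.flatnonzero(Q[color] == Q[color].max())))
    R = rng.normal(q_star[color][A], 1.0)
    N[color][A] += 1
    Q[color][A] += (R - Q[color][A]) / N[color][A]

print({c: int(Q[c].argmax()) for c in Q})   # learned policy, e.g. {'red': 0, 'green': 1}
```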

Compare the methods’ performances

  • We can certainly run them all on the 10-armed testbed. To get a meaningful comparison we have to consider their performance as a function of their parameter.
  • Our graphs so far have shown the course of learning over time for each algorithm and parameter setting, producing a *learning curve* for that algorithm and parameter setting. Here we instead summarize each complete learning curve by its average reward over the 1000 steps and plot that value as a function of the algorithm's parameter; this kind of summary graph is called a *parameter study* (参数研究图). (A simplified sketch of such a sweep follows this list.)
    [Figure 2.6: a parameter study of the bandit algorithms from this chapter, plotting average reward over the first 1000 steps against each algorithm's parameter]
    • Note that the parameter values are varied by factors of two and presented on a log scale. Note also the characteristic inverted-U shapes of each algorithm’s performance; all the algorithms perform best at an intermediate value of their parameter, neither too large nor too small.
    • In assessing a method, we should attend not just to how well it does at its best parameter setting, but also to how sensitive it is to its parameter value. All of these algorithms are fairly insensitive, performing well over a range of parameter values varying by about an order of magnitude. Overall, on this problem, UCB seems to perform best.
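A simplified sketch of the loop behind such a parameter study (my own code, $\varepsilon$-greedy only, averaged over far fewer runs than the 2000 used for Figure 2.6):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_eps_greedy(eps, k=10, steps=1000):
    """One run on a fresh testbed task; returns the average reward over the run."""
    q_star = rng.normal(0.0, 1.0, size=k)
    Q, N, total = np.zeros(k), np.zeros(k, dtype=int), 0.0
    for _ in range(steps):
        if rng.random() < eps:
            A = rng.integers(k)
        else:
            A = rng.choice(np.flatnonzero(Q == Q.max()))
        R = rng.normal(q_star[A], 1.0)
        total += R
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]
    return total / steps

for eps in [1/128, 1/64, 1/32, 1/16, 1/8, 1/4]:                # varied by factors of two
    avg = np.mean([run_eps_greedy(eps) for _ in range(100)])   # average over 100 tasks
    print(f"eps = {eps:.4f}: average reward {avg:.3f}")
```

A full reproduction of Figure 2.6 would repeat the same sweep for the UCB, gradient bandit, and optimistic greedy methods, each over its own parameter.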