Notes of chapter 2: Multi-armed bandits

1 Summary

In this chapter we focus on the multi-armed bandit problem, in which there is only one state and thus one observation (the nonassociative setting), and different actions correspond to different expected rewards/values.

We need to determine the decision-making policy according to our value table and update the value table after receiving the rewards. It still follows the closed loop of “observe (not needed in this case)” – “act” – “receive reward” – “update the value-table estimate”.

The trade-off between exploitation and exploration is considered in the following methods. Note that we assume exploration is costly in this context.

1.1 Methods of updating the value table

1.1.1 Sample average method

Average all the rewards an action has actually received:
$$Q_{n+1} = \frac{R_1 + R_2 + \cdots + R_n}{n}$$
In incremental form, this is equivalent to updating the value estimate with a step-size parameter $\alpha = \frac{1}{n}$:
$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right] \tag{2.3}$$
When $n = 1$, (2.3) gives $Q_2 = R_1$ for arbitrary $Q_1$.
The sample-average method is guaranteed to converge to the actual action values in stationary problems.
Note that $n$ denotes the number of times the specific action has been selected; it must be tracked separately for each action.
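As a minimal sketch (not the book's code), the incremental sample-average update (2.3) keeps one count and one estimate per action; the 10-arm array size and the example rewards are illustrative assumptions:

```python
import numpy as np

k = 10                      # number of arms (assumed testbed size)
Q = np.zeros(k)             # value estimates, one per action
N = np.zeros(k, dtype=int)  # per-action selection counts

def update_sample_average(action, reward):
    """Apply Q_{n+1} = Q_n + (1/n) [R_n - Q_n] for the selected action."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]

# Example: action 3 receives rewards 1.0 and 0.0 -> estimate becomes their mean, 0.5
update_sample_average(3, 1.0)
update_sample_average(3, 0.0)
print(Q[3])  # 0.5
```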

1.1.2 Exponential recency-weighted average method (constant step size)

To give more weight to recent rewards in nonstationary problems, a constant step size can be adopted:
$$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right] \tag{2.5}$$ where $\alpha \in (0, 1]$ is constant.
When $n = 1$, $Q_2 = (1-\alpha)Q_1 + \alpha R_1$; here $Q_1$ must be chosen by the user from prior knowledge, so $Q$ is biased by the initial estimate.
Expanding the recursion, the estimate is a weighted average of the initial value and all past rewards:
$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i \tag{2.6}$$
The estimate never fully converges but keeps adapting to the most recently received rewards, which makes it more appropriate for nonstationary problems.
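A minimal sketch of the constant step-size update (2.5); the value α = 0.1 and the toy reward stream are illustrative assumptions:

```python
import numpy as np

k = 10
alpha = 0.1                 # constant step size (assumed value)
Q = np.zeros(k)             # initial estimates Q_1 (user-chosen prior)

def update_constant_step(action, reward):
    """Apply Q_{n+1} = Q_n + alpha [R_n - Q_n]; recent rewards get exponentially more weight."""
    Q[action] += alpha * (reward - Q[action])

# Recent rewards dominate: after many 0-rewards followed by a few 1-rewards,
# the estimate tracks the recent level instead of the overall average.
for r in [0.0] * 50 + [1.0] * 10:
    update_constant_step(0, r)
print(round(Q[0], 3))  # 1 - (1 - alpha)**10 ≈ 0.651, far above the overall mean of 1/6
```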

1.1.3 General form of incremental update and convergence criterion

  • General update rule
    $$NewEstimate \leftarrow OldEstimate + StepSize\left[Target - OldEstimate\right]$$ where $[Target - OldEstimate]$ is the error in the estimate.
  • Convergence criterion
    The step sizes must satisfy $\sum_{n=1}^{\infty}\alpha_n(a) = \infty$ and $\sum_{n=1}^{\infty}\alpha_n^2(a) < \infty$ (2.7): the first condition guarantees the steps are large enough to eventually overcome any initial conditions or random fluctuations, and the second guarantees that the steps eventually become small enough to assure convergence. The sample-average step size $\alpha_n = \frac{1}{n}$ satisfies both; a constant step size violates the second condition, so the estimates never completely converge (which is actually desirable in nonstationary problems).

1.2 The method of selecting actions

Exploration is needed because there is always uncertainty about the accuracy of the action-value estimates. The greedy actions are those that look best at present, but some of the other actions may actually be better.

1.2.1 Greedy action selection method

$$A_t = \mathop{\arg\max}\limits_{a} Q_t(a)$$

1.2.2 $\varepsilon$-greedy method

$$A_t = \begin{cases} \mathop{\arg\max}\limits_{a} Q_t(a), & \text{with probability } 1-\varepsilon \\ \text{a randomly selected action}, & \text{with probability } \varepsilon \end{cases}$$
Without optimistic initialization, the greedy method always exploits current knowledge to maximize immediate reward. The $\varepsilon$-greedy methods explore with probability $\varepsilon$, and after many steps the probability of selecting the optimal action converges to greater than $1-\varepsilon$.
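A minimal sketch of ε-greedy selection over the current estimates, with random tie-breaking among greedy actions; ε = 0.1 and the example estimates are illustrative assumptions (ε = 0 recovers the pure greedy rule above):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise a greedy one (ties broken randomly)."""
    if rng.random() < epsilon:
        return rng.integers(len(Q))
    best = np.flatnonzero(Q == Q.max())   # indices of all greedy actions
    return rng.choice(best)

Q = np.array([0.0, 0.5, 0.5, -1.0])
print(epsilon_greedy(Q))  # usually 1 or 2, occasionally any arm
```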

1.2.3 Upper-Confidence-Bound (UCB) action selection method

$$A_t = \mathop{\arg\max}\limits_{a}\left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right]$$ where $c > 0$ controls the degree of exploration.
UCB selects actions using an upper confidence bound, roughly the mean estimate plus an uncertainty term. As time goes on, actions that have been selected less often get a larger upper bound and become more likely to be selected. Taking the upper bound follows the principle of optimism in the face of uncertainty, which drives exploration in reinforcement learning. UCB achieves logarithmic regret.

UCB often performs well, as shown in the book, but it is more difficult than $\varepsilon$-greedy to extend beyond bandits to the more general reinforcement learning settings considered in the rest of the book. One difficulty is dealing with nonstationary problems; methods more complex than those presented in Section 2.5 would be needed. Another difficulty is dealing with large state spaces.
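A minimal sketch of the UCB selection rule; selecting never-tried actions first follows the book's convention of treating $N_t(a) = 0$ actions as maximizing, while c = 2 and the example numbers are illustrative assumptions:

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried actions are selected first."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / N)
    return int(np.argmax(Q + bonus))

Q = np.array([0.2, 0.5, 0.1])
N = np.array([10, 50, 2])
print(ucb_select(Q, N, t=62))  # the rarely tried arm 2 gets the largest bonus and is chosen
```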

1.3 Gradient Bandit Algorithms

1.3.1 Selecting actions

The gradient bandit algorithm does not use a value table to select actions; instead, it learns a numerical preference $H_t(a)$ for each action, and only the relative preferences matter. The probability of selecting each action is computed with the soft-max:
$$\Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} = \pi_t(a)$$
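A minimal sketch of the soft-max policy over preferences; shifting by the maximum preference before exponentiating is only a numerical-stability trick and does not change the probabilities, since only relative preferences matter:

```python
import numpy as np

def softmax_policy(H):
    """pi(a) = exp(H(a)) / sum_b exp(H(b)); only preference differences matter."""
    z = np.exp(H - H.max())      # shift by max for numerical stability
    return z / z.sum()

H = np.array([1.0, 2.0, 0.0])
pi = softmax_policy(H)
print(pi, pi.sum())              # probabilities summing to 1

# Sampling an action from the policy:
rng = np.random.default_rng(0)
action = rng.choice(len(H), p=pi)
```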

1.3.2 Updating the preference (the analogue of the value table) for each action

$$H_{t+1}(a) = H_t(a) + \alpha\,(R_t - \bar{R}_t)\left(\mathbb{1}_{a = A_t} - \pi_t(a)\right) \quad \text{for all } a$$ where $\bar{R}_t$ is the average of the previous rewards (the baseline).
This update can be derived as stochastic gradient ascent on the expected reward (refer to the derivation in the book, 60/38):
$$H_{t+1}(a) = H_t(a) + \alpha \frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}$$
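A minimal sketch of the preference update for all actions at once; the fixed baseline value and α = 0.1 are illustrative assumptions (the book uses the running average of past rewards as the baseline):

```python
import numpy as np

def gradient_bandit_update(H, pi, action, reward, baseline, alpha=0.1):
    """H(a) += alpha * (R - baseline) * (1{a == A_t} - pi(a)) for all a."""
    one_hot = np.zeros_like(H)
    one_hot[action] = 1.0
    return H + alpha * (reward - baseline) * (one_hot - pi)

H = np.zeros(3)
pi = np.ones(3) / 3            # uniform policy from equal preferences
baseline = 0.0                 # would normally be the running average of past rewards; fixed here for brevity
H = gradient_bandit_update(H, pi, action=1, reward=1.0, baseline=baseline)
print(H)                       # preference for action 1 goes up, the others go down
```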

1.4 Other methods

1.4.1 Optimistic Initial Values

  • The exponential recency-weighted method is biased by the initial value one gives. If we like, we can set the initial value estimates artificially high to encourage exploration in the short run (this is called optimistic initial values). This is a useful trick for stationary problems, but it does not apply as well to nonstationary problems because the added exploration is only temporary.

  • By using large initial values, actions that have not been selected before are more likely to be selected, which is a simple way to encourage exploration. For the sample-average method, the bias disappears once all actions have been selected at least once. For methods with a constant step size, the bias is permanent (though it decreases over time), since $Q_2 = (1-\alpha)Q_1 + \alpha R_1$. A small sketch follows below.
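A minimal sketch of optimistic initialization with greedy selection and a constant step size; the initial estimates of +5 and α = 0.1 follow the book's example, while the random testbed itself is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10
q_true = rng.normal(0.0, 1.0, k)     # true action values of a stationary testbed
Q = np.full(k, 5.0)                  # optimistic initial estimates
alpha = 0.1

for t in range(200):
    a = int(np.argmax(Q))            # purely greedy selection
    r = rng.normal(q_true[a], 1.0)   # noisy reward
    Q[a] += alpha * (r - Q[a])       # early rewards "disappoint" the estimate, forcing exploration

print(int(np.argmax(Q)), int(np.argmax(q_true)))   # greedy pick vs. truly best arm
```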

1.4.2 Thompson Sampling

Thompson sampling maintains a posterior distribution over each arm's reward, draws one sample from each posterior, and then picks the arm with the highest sample. It also achieves logarithmic regret. Note that it is hard to generalize to full reinforcement learning problems.
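A minimal sketch of Thompson sampling for Bernoulli-reward arms with Beta posteriors; the Bernoulli/Beta model and the success probabilities are illustrative assumptions (the chapter's testbed uses Gaussian rewards instead):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.3, 0.5, 0.7])       # true success probabilities (assumed)
successes = np.ones(3)                   # Beta(1, 1) uniform priors
failures = np.ones(3)

for t in range(1000):
    samples = rng.beta(successes, failures)   # one draw from each arm's posterior
    a = int(np.argmax(samples))               # pick the arm with the highest sample
    r = rng.random() < p_true[a]              # Bernoulli reward
    successes[a] += r
    failures[a] += 1 - r

print(successes / (successes + failures))     # posterior means concentrate near p_true
```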

2 Exercises

Exercise 2.1 $\varepsilon$-greedy

  • In $\varepsilon$-greedy action selection, for the case of two actions and $\varepsilon = 0.5$, what is the probability that the greedy action is selected?
  • Answer: $P = (1-\varepsilon) + \varepsilon \times \frac{1}{2} = 0.75$ (exploit with probability $1-\varepsilon$, and the random exploration picks the greedy action half of the time).

Exercise 2.2: Bandit example

  • Consider a k-armed bandit problem with $k = 4$ actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using $\varepsilon$-greedy action selection, sample-average action-value estimates, and initial estimates of $Q_1(a) = 0$ for all $a$. Suppose the initial sequence of actions and rewards is $A_1 = 1, R_1 = -1, A_2 = 2, R_2 = 1, A_3 = 2, R_3 = -2, A_4 = 2, R_4 = 2, A_5 = 3, R_5 = 0$. On some of these time steps the $\varepsilon$ case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?
  • Answer: The value table after each step is $\{0,0,0,0\} \rightarrow \{-1,0,0,0\} \rightarrow \{-1,1,0,0\} \rightarrow \{-1,-0.5,0,0\} \rightarrow \{-1,0.33,0,0\} \rightarrow \{-1,0.33,0,0\}$. At steps 4 and 5 the selected action was not greedy with respect to the current estimates, so the $\varepsilon$ case definitely occurred there. It could possibly have occurred at any of steps 1–5, since a random selection may also happen to pick the greedy action (see the sketch below).
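A small sketch (assumed helper code, not part of the exercise) that replays the given action–reward sequence with sample-average updates and flags the steps where the chosen action was not greedy, i.e. where the ε case definitely occurred:

```python
import numpy as np

actions = [1, 2, 2, 2, 3]                 # A_1..A_5 (1-indexed as in the exercise)
rewards = [-1, 1, -2, 2, 0]
Q = np.zeros(4)
N = np.zeros(4, dtype=int)

for step, (a, r) in enumerate(zip(actions, rewards), start=1):
    greedy = np.flatnonzero(Q == Q.max()) + 1   # greedy actions before this choice (1-indexed)
    if a not in greedy:
        print(f"step {step}: definitely exploratory")
    N[a - 1] += 1
    Q[a - 1] += (r - Q[a - 1]) / N[a - 1]       # sample-average update
    print(f"after step {step}: Q = {np.round(Q, 2)}")
```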

Exercise 2.3 Greedy and $\varepsilon$-greedy

  • In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.
  • $\varepsilon = 0.01$ will perform best in the long run, with % optimal action $= (1-\varepsilon) + \varepsilon \times 1/10 = 99.1\%$;
  • $\varepsilon = 0.1$ will perform less well, with % optimal action $= (1-\varepsilon) + \varepsilon \times 1/10 = 91\%$;
  • The greedy method will likely get stuck on a suboptimal action.

Exercise 2.4 Weighted average of varying step size

  • If the step-size parameters, $\alpha_n$, are not constant, then the estimate $Q_n$ is a weighted average of previously received rewards with a weighting different from that given by (2.6). What is the weighting on each prior reward for the general case, analogous to (2.6), in terms of the sequence of step-size parameters?
  • Answer: $Q_{n+1} = \prod_{i=1}^{n}(1-\alpha_i)\,Q_1 + \sum_{i=1}^{n}\left[\alpha_i \prod_{j=i+1}^{n}(1-\alpha_j)\right] R_i$, so the weight on reward $R_i$ is $\alpha_i \prod_{j=i+1}^{n}(1-\alpha_j)$ (a numerical check follows below).
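A quick numerical check (a sketch with arbitrary step sizes, rewards, and initial estimate, not part of the exercise) that the closed form above matches the recursion $Q_{n+1} = Q_n + \alpha_n(R_n - Q_n)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
alphas = rng.uniform(0.1, 0.9, n)   # arbitrary step sizes alpha_1..alpha_n
R = rng.normal(0.0, 1.0, n)         # arbitrary rewards R_1..R_n
Q1 = 2.0                            # arbitrary initial estimate

# Recursive computation
Q = Q1
for a, r in zip(alphas, R):
    Q += a * (r - Q)

# Closed form: prod(1-alpha_i) * Q1 + sum_i alpha_i * prod_{j>i}(1-alpha_j) * R_i
closed = np.prod(1 - alphas) * Q1 + sum(
    alphas[i] * np.prod(1 - alphas[i + 1:]) * R[i] for i in range(n)
)
print(np.isclose(Q, closed))        # True
```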

Exercise 2.5 (programming)

  • Similar to the homework.

Exercise 2.6: Mysterious Spikes

  • The results shown in Figure 2.3 should be quite reliable because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks. Why, then, are there oscillations and spikes in the early part of the curve for the optimistic method? In other words, what might make this method perform particularly better or worse, on average, on particular early steps?
  • It may be because both the low-value and the high-value actions have large initial estimates. The method has to try the low-value actions multiple times before their estimates are corrected, so it performs worse on the early steps, and the high-value actions are not preferred immediately because other actions still have high initial estimates. However, optimistic initialization only encourages exploration during the early steps, so in the long run it may perform worse than the $\varepsilon = 0.1$ greedy method, which is about 91% optimal.
  • Due to optimistic initialization, the first 10 actions are a sweep through the actions in some random order. On the 11th step, the action that did best in the first 10 steps is selected again; this action has a greater-than-chance probability of being the optimal one, which produces the spike. It still disappoints relative to the optimistic initial estimate, leading to the subsequent dip. (answer from the quiz)

Exercise 2.7: Unbiased Constant-Step-Size Trick

Use the step size $\beta_n = \alpha/\bar{o}_n$, where $\bar{o}_n = \bar{o}_{n-1} + \alpha(1-\bar{o}_{n-1})$ with $\bar{o}_0 = 0$. Then
$$\begin{aligned} Q_{n+1} &= Q_n + \beta_n (R_n - Q_n) \\ \bar{o}_n Q_{n+1} &= \bar{o}_n Q_n + \alpha (R_n - Q_n) \\ &= \bar{o}_{n-1} Q_n + \alpha (R_n - \bar{o}_{n-1} Q_n), \end{aligned}$$
where the last line uses $\bar{o}_n = (1-\alpha)\bar{o}_{n-1} + \alpha$, so that $\bar{o}_n Q_n - \alpha Q_n = (1-\alpha)\bar{o}_{n-1} Q_n$. The quantity $\bar{o}_n Q_{n+1}$ therefore obeys the same constant-step-size recursion as Eq. (2.6) in the book:
$$\bar{o}_n Q_{n+1} = (1-\alpha)^n \bar{o}_0 Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i.$$
Because $\bar{o}_0 = 0$, the $Q_1$ term disappears, so the estimate is unbiased by the initial value while the reward weights remain exponentially recency-weighted.
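A minimal sketch of the unbiased constant-step-size trick; the reward stream and the two initial estimates are assumptions used only to illustrate that $Q_1$ has no influence:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
rewards = rng.normal(1.0, 1.0, 100)

def run(Q1):
    Q, o_bar = Q1, 0.0
    for r in rewards:
        o_bar += alpha * (1.0 - o_bar)    # o_n = o_{n-1} + alpha (1 - o_{n-1})
        beta = alpha / o_bar              # step size beta_n = alpha / o_n (beta_1 = 1)
        Q += beta * (r - Q)
    return Q

print(run(Q1=0.0), run(Q1=100.0))   # identical: the initial estimate has no effect
```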

Exercise 2.8: UCB Spikes

  • Before the 11th step, every action that has not yet been selected must be selected once, because $N_t(a) = 0$ makes it a maximizing action. Thus on the 11th step, the action with the best $Q_t(a)$ is selected, since the uncertainty terms are all equal, and this produces the spike. On the 12th step the uncertainty term matters again, so exploration resumes and performance drops.
  • The uncertainty estimate of the action selected at time step 11 becomes smaller than that of the others, so this action is at a disadvantage on the next step. If $c$ is large, this effect dominates and the action that performed best in the first 10 steps is ruled out on step 12. (answer from the quiz)

Exercise 2.9

  • Show that in the case of two actions, the soft-max distribution is the same as that given by the logistic, or sigmoid, function often used in statistics and artificial neural networks.
  • Answer: With two actions, $\pi_t(1) = \frac{e^{H_t(1)}}{e^{H_t(1)} + e^{H_t(2)}} = \frac{1}{1 + e^{-(H_t(1) - H_t(2))}} = \sigma\!\left(H_t(1) - H_t(2)\right)$, the sigmoid of the preference difference.

Exercise 2.10

  • Suppose you face a 2-armed bandit task whose true action values change randomly from time step to time step. Specifically, suppose that, for any time step, the true values of actions 1 and 2 are respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and 0.8 with probability 0.5 (case B). If you are not able to tell which case you face at any step, what is the best expectation of success you can achieve and how should you behave to achieve it? Now suppose that on each step you are told whether you are facing case A or case B (although you still don’t know the true action values). This is an associative search task. What is the best expectation of success you can achieve in this task, and how should you behave to achieve it?
  • Without knowing the case, both actions have expected value 0.5 ($0.5 \times 0.1 + 0.5 \times 0.9$ for action 1 and $0.5 \times 0.2 + 0.5 \times 0.8$ for action 2), so the best expectation is 0.5 and any behavior, e.g. acting randomly, achieves it.
  • When told the case, you can estimate the values for each case independently. Select action 2 in case A and action 1 in case B; the best expectation is $0.5 \times 0.2 + 0.5 \times 0.9 = 0.55$.

Exercise 2.11 (programming)

  • Similar to the homework.

3 Questions

Q1 Bias in Sample-average methods

In Section 2.6 (page 34), it is said that “For the sample-average methods, the bias disappears once all actions have been selected at least once”, and I am confused about how this works.

In Section 2.7, it is said that “the sample-average methods, which also treat the beginning of time as a special event, averaging all subsequent rewards with equal weights”. I am also confused, because there is no initial value in Equation 2.1 on page 27.

To conclude: when some initial value or bias is set, how do the sample-average methods handle it? Why does it disappear once all actions have been selected? Does that mean the initial value is overwritten by Equation 2.1 when the action is selected for the first time?

I also posted this on Piazza, but I have not received any endorsed replies so far.

Q2 Exercise 2.6: Mysterious Spikes

I have answered this exercise in the exercises section above, but I still cannot fully understand the spike. Is it caused by the random selection pattern?

This article is for self-learners. If you are taking a course, please do not copy this note.
