Reinforcement Learning: Chapter 2 Multi-armed Bandits

EntropyPlus

1. Preface

The biggest difference between reinforcement learning and other kinds of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions.

1.1 A k-armed Bandit Problem

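Suppose you repeatedly face a choice among $k$ different options. After each choice you receive a numerical reward that depends on the option you selected, and your objective is to maximize the total reward accumulated over some period of time. A concrete example: a gambler walks into a casino and finds a row of slot machines that look identical but pay out with different probabilities. He does not know the payout distribution of any machine, so which machine should he play at each step to maximize his winnings? This is the multi-armed bandit problem.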

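In the k-armed bandit problem, each action has an expected reward, called the value of that action. Let $A_t$ denote the action selected at time step $t$ and $R_t$ the corresponding reward. The value $q_*(a)$ of an action $a$ is the expected reward given that $a$ is selected:
$q_*(a)=E[R_t \mid A_t=a]$
If we knew the value of every action, we would simply always select the action with the highest value. Since the values are unknown, we have to estimate them. Let $Q_t(a)$ denote the estimated value of action $a$ at time step $t$; we want $Q_t(a)$ to be as close as possible to $q_*(a)$.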

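At any time step there is at least one action whose estimated value is highest. Selecting it is called a greedy action, and it means you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, we say you are exploring. Always making the best choice according to the current estimates is not necessarily the best choice in the long run, so balancing exploitation and exploration is very important.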

1.2 Action-value Methods

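We start with one of the simplest ways to estimate values, the sample-average method, in which each estimate is an average of the sample of relevant rewards. Methods that estimate the values of actions and use those estimates to select actions are collectively called action-value methods. The sample-average estimate is:
$Q_t(a)=\frac{\sum_{i=1}^{t-1}R_i\cdot\mathbb{I}_{A_i=a}}{\sum_{i=1}^{t-1}\mathbb{I}_{A_i=a}}$ (2.1)
$\mathbb{I}_{predicate}$ is a random variable that is 1 when predicate is true and 0 otherwise. When the denominator is zero, $Q_t(a)$ is defined as some default value, such as 0. In words, the numerator is the sum of the rewards received when $a$ was taken during the first $t-1$ time steps, and the denominator is the number of times $a$ was taken during those steps.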
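
As a minimal sketch of the sample-average estimate (2.1), the snippet below computes $Q_t(a)$ from a recorded history. The function name `sample_average` and the toy history are illustrative assumptions, not part of the original notes.

```python
def sample_average(actions, rewards, a, default=0.0):
    """Estimate Q_t(a) as in (2.1) from the actions and rewards seen before time t."""
    # Keep only the rewards received on steps where action a was taken.
    picked = [r for act, r in zip(actions, rewards) if act == a]
    # Denominator of (2.1) is zero: fall back to the default value.
    if not picked:
        return default
    return sum(picked) / len(picked)


# Hypothetical history of five steps over actions 0..2.
history_actions = [0, 1, 0, 2, 0]
history_rewards = [1.0, 0.5, 3.0, -1.0, 2.0]
print(sample_average(history_actions, history_rewards, 0))  # 2.0
print(sample_average(history_actions, history_rewards, 3))  # 0.0 (never taken)
```

Recomputing the average from the full history costs memory and time that grow with $t$, which is what the incremental form later in the chapter avoids.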

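The simplest action selection rule is to select the action with the highest estimated value:
$A_t=\arg\max_a Q_t(a)$
that is, the action $a$ for which $Q_t(a)$ is largest. Greedy action selection always exploits, so we add a little randomness to make it explore as well: behave greedily most of the time, but every once in a while, say with small probability $\epsilon$, select randomly from among all the actions with equal probability. This is called the $\epsilon$-greedy method.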
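
The selection rule can be sketched in a few lines. This is an illustrative helper, not code from the book: the name `epsilon_greedy`, the `rng` parameter, and the random tie-breaking are assumptions made for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen epsilon-greedily from the estimates q_values."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely, including the greedy one.
        return rng.randrange(len(q_values))
    # Exploit: pick a greedy action, breaking ties at random.
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


estimates = [0.1, 0.9, 0.3]
print(epsilon_greedy(estimates, epsilon=0.0))  # 1, the greedy action
print(epsilon_greedy(estimates, epsilon=0.1))  # usually 1, occasionally any action
```

Setting `epsilon=0.0` recovers the purely greedy rule, so the same function covers both methods compared in the next section.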

1.3 The 10-armed Testbed

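$\epsilon$-greedy is not always better than greedy; it depends on the task. If the reward variances were zero, then the greedy method would know the true value of each action after trying it once. But even in the deterministic case there is a large advantage to exploring if we weaken some of the other assumptions.
The testbed is a set of 2000 randomly generated 10-armed bandit problems, each with 10 actions. For each problem, the true values $q_*(a)$ are drawn from a standard normal distribution, and the reward for selecting action $a$ is drawn from a normal distribution with mean $q_*(a)$ and unit variance. The reward distributions of one such problem can be shown as a violin plot:
[Figure: violin plot of the reward distributions of the 10 actions]
A run applies a learning method to one of these problems for 1000 time steps, and this is repeated for 2000 independent runs, each on a different bandit problem.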
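
A testbed of this kind can be sketched as follows, assuming the setup used in Sutton and Barto's book: true values drawn from $N(0,1)$ and rewards drawn from $N(q_*(a),1)$. The names `make_testbed` and `pull` are hypothetical.

```python
import random


def make_testbed(k=10, rng=random):
    """Draw the true action values q_*(a) ~ N(0, 1) for one k-armed problem."""
    return [rng.gauss(0.0, 1.0) for _ in range(k)]


def pull(q_star, action, rng=random):
    """Sample a reward for `action` from N(q_*(action), 1)."""
    return rng.gauss(q_star[action], 1.0)


rng = random.Random(0)          # seeded generator so the sketch is repeatable
q_star = make_testbed(rng=rng)  # one randomly generated 10-armed problem
best = q_star.index(max(q_star))
print("optimal action:", best)
print("one reward from it:", pull(q_star, best, rng))
```

A full experiment would generate a fresh `q_star` for each of the 2000 runs and average the per-step results across runs.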

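The authors compare the greedy method with two $\epsilon$-greedy methods ($\epsilon=0.01$ and $\epsilon=0.1$), with all action-value estimates computed by the sample-average method. The greedy method improves quickly at the very beginning, but it easily gets stuck performing suboptimal actions, and in the long run it does worse than the $\epsilon$-greedy methods.
[Figure: average reward and percentage of optimal actions over time for the greedy and $\epsilon$-greedy methods]
The greedy method found the optimal action in only about one-third of the tasks, whereas the $\epsilon$-greedy methods kept exploring. The $\epsilon=0.01$ method improved more slowly, but eventually would perform better than the $\epsilon=0.1$ method on both performance measures shown in the figure.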

1.4 Incremental Implementation

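To compute the estimate $Q_t(a)$ efficiently, consider a single action. Let $R_i$ denote the reward received after the $i$-th selection of this action, and let $Q_n$ denote the estimate of its value after it has been selected $n-1$ times. Equation (2.1) then simplifies to:
$Q_n=\frac{R_1+R_2+\cdots+R_{n-1}}{n-1}$
This can be rewritten incrementally, so that only $Q_n$ and $n$ need to be stored:
$Q_{n+1}=\frac{1}{n}\sum_{i=1}^{n}R_i=\frac{1}{n}\left(R_n+(n-1)Q_n\right)=Q_n+\frac{1}{n}\left[R_n-Q_n\right]$ (2.3)
Equation (2.3) has the general form:
$NewEstimate\leftarrow OldEstimate+StepSize\,[Target-OldEstimate]$ (2.4)
The $\epsilon$-greedy algorithm with incrementally computed sample averages proceeds as follows:
Initialize, for $a=1$ to $k$: $Q(a)\leftarrow 0$, $N(a)\leftarrow 0$
Loop forever:
    $A\leftarrow\arg\max_a Q(a)$ with probability $1-\epsilon$ (breaking ties randomly), or a random action with probability $\epsilon$
    $R\leftarrow bandit(A)$
    $N(A)\leftarrow N(A)+1$
    $Q(A)\leftarrow Q(A)+\frac{1}{N(A)}\left[R-Q(A)\right]$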
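
The $\epsilon$-greedy procedure with incremental sample averages might look like the sketch below. The function name `run_bandit`, its parameters, and the Gaussian reward model with unit variance are assumptions chosen to match the testbed described earlier.

```python
import random


def run_bandit(q_star, steps=1000, epsilon=0.1, seed=0):
    """Run epsilon-greedy with incremental sample averages on one bandit problem."""
    rng = random.Random(seed)
    k = len(q_star)
    Q = [0.0] * k  # value estimates
    N = [0] * k    # selection counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)  # explore
        else:
            best = max(Q)         # exploit, breaking ties at random
            a = rng.choice([i for i in range(k) if Q[i] == best])
        r = rng.gauss(q_star[a], 1.0)  # assumed reward model: N(q_*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # incremental update, step size 1/N(a)
        total += r
    return Q, N, total / steps


Q, N, avg_reward = run_bandit([0.0, 1.0, 5.0], steps=2000)
print("estimates:", [round(q, 2) for q in Q])
print("counts:", N, "average reward:", round(avg_reward, 2))
```

Only `Q` and `N` are stored per action, so memory and per-step cost stay constant no matter how long the run is.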

1.5 Tracking a Nonstationary Problem

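The sample-average update (2.3) is appropriate for stationary problems, in which the reward distribution for each action does not change over time.

For nonstationary problems it makes sense to give more weight to recent rewards than to long-past ones. One way to do this is to use a constant step-size parameter $\alpha\in(0,1]$:
$Q_{n+1}=Q_n+\alpha\left[R_n-Q_n\right]$ (2.5)
Unrolling the recursion gives:
$Q_{n+1}=(1-\alpha)^nQ_1+\sum_{i=1}^n\alpha(1-\alpha)^{n-i}R_i$ (2.6)
This is a weighted average, because the weights sum to one: $(1-\alpha)^n+\sum_{i=1}^n\alpha(1-\alpha)^{n-i}=1$. The weight $\alpha(1-\alpha)^{n-i}$ given to reward $R_i$ depends on how many steps ago, $n-i$, it was observed, and it decays exponentially as that gap grows, which is why this is also called an exponential recency-weighted average.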
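
The following sketch checks numerically that the recursive constant step-size update and the unrolled weighted-average form agree, and that the weights sum to one. The helper names `constant_step_estimate` and `recency_weights` are made up for this illustration.

```python
def constant_step_estimate(q1, rewards, alpha):
    """Apply Q <- Q + alpha * (R - Q) once per reward, starting from Q_1 = q1."""
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q


def recency_weights(n, alpha):
    """Weight on Q_1 followed by the weights on R_1..R_n in the unrolled form."""
    return [(1 - alpha) ** n] + [
        alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)
    ]


rewards = [1.0, 2.0, 3.0]
weights = recency_weights(len(rewards), alpha=0.5)
unrolled = sum(w * x for w, x in zip(weights, [5.0] + rewards))
print(weights)                                       # [0.125, 0.125, 0.25, 0.5]
print(unrolled)                                      # 2.75
print(constant_step_estimate(5.0, rewards, 0.5))     # 2.75
```

The most recent reward carries the largest weight, and the weight on the initial estimate $Q_1$ shrinks geometrically but never reaches zero.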

1.6 Optimistic Initial Values

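The initial estimates $Q_1(a)$ bias all of these methods. For the sample-average method (2.3), the bias disappears once every action has been selected at least once. For the constant step-size method, the bias is permanent, but its influence decreases over time, as (2.6) shows.

In practice this bias is usually not a problem, and it can even be useful: initial values set well above the achievable rewards make every action look disappointing after it is tried, which pushes even a greedy learner to try all the actions. The experiment below compares initial values of 0 and 5.
[Figure: comparison of initial estimates $Q_1(a)=0$ and $Q_1(a)=5$]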
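
To see why optimistic initial values encourage exploration, here is a deliberately simplified sketch: a purely greedy learner with constant step size on a toy problem with noise-free rewards. The noise-free rewards, the three-armed problem, and the name `greedy_run` are assumptions made so the behavior is easy to follow; this is not the experiment from the figure.

```python
def greedy_run(reward_fn, k, q_init, alpha=0.1, steps=20):
    """Greedy selection with constant step size; returns the list of chosen actions."""
    Q = [q_init] * k
    chosen = []
    for _ in range(steps):
        a = Q.index(max(Q))  # greedy; the lowest index wins ties
        Q[a] += alpha * (reward_fn(a) - Q[a])
        chosen.append(a)
    return chosen


true_values = [1.0, 2.0, 3.0]


def reward(a):
    """Noise-free reward, for illustration only."""
    return true_values[a]


print(greedy_run(reward, 3, q_init=0.0, steps=6))  # [0, 0, 0, 0, 0, 0]
print(greedy_run(reward, 3, q_init=5.0, steps=3))  # [0, 1, 2]
```

With $Q_1=0$ the learner locks onto the first action it tries, because any positive reward lifts that estimate above the untried ones. With $Q_1=5$ every tried action drops below the untried ones, so all actions are sampled early on.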

1.7 Upper-Confidence-Bound Action Selection (UCB)

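How to choose among the non-greedy actions deserves some care. $\epsilon$-greedy picks among them indiscriminately; it would be better to select them according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates:
$A_t=\arg\max_a\left[Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\right]$ (2.10)
$c > 0$: controls the degree of exploration
$N_t(a)$: the number of times action $a$ has been selected prior to time $t$; if $N_t(a)=0$, then $a$ is considered a maximizing action

The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value.

Each time $a$ is selected, $N_t(a)$ increases and, as it appears in the denominator, the uncertainty term decreases.
Each time an action other than $a$ is selected, $t$ increases but $N_t(a)$ does not; because $t$ appears in the numerator, the uncertainty estimate increases.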
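
A minimal sketch of UCB action selection follows. The function name `ucb_action` and the default `c=2.0` are illustrative assumptions; untried actions are treated as maximizing, as stated above.

```python
import math


def ucb_action(Q, N, t, c=2.0):
    """Return argmax_a [Q(a) + c * sqrt(ln t / N(a))] for time step t >= 1."""
    # An action that has never been tried counts as a maximizing action.
    for a, n in enumerate(N):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(Q, N)]
    return scores.index(max(scores))


# Action 1 has a lower estimate but has been tried only once, so its bound is wide.
print(ucb_action(Q=[1.0, 0.9], N=[100, 1], t=101))         # 1
# With c = 0 the uncertainty term vanishes and the rule is purely greedy.
print(ucb_action(Q=[1.0, 0.9], N=[100, 1], t=101, c=0.0))  # 0
```

Unlike $\epsilon$-greedy, the choice here is deterministic given `Q`, `N`, and `t`; exploration comes entirely from the bonus term.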

1.8 Gradient Bandit Algorithms

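In this section we consider learning a numerical preference for each action $a$, which we denote $H_t(a)$. The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. Only the relative preference of one action over another matters, and the action probabilities are determined by a soft-max (Gibbs or Boltzmann) distribution:
$\Pr\{A_t=a\}=\frac{e^{H_t(a)}}{\sum_{b=1}^{k}e^{H_t(b)}}=\pi_t(a)$ (2.11)
Here $\pi_t(a)$ is the probability of taking action $a$ at time $t$. Initially all preferences are equal, for example $H_1(a)=0$ for all $a$, so that all actions are equally likely to be selected.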
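
The soft-max distribution over preferences can be computed as in this sketch. Subtracting the maximum preference before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the resulting probabilities.

```python
import math


def softmax(H):
    """Turn a list of preferences H(a) into action probabilities pi(a)."""
    m = max(H)  # shift by the maximum so exp() cannot overflow
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax([0.0, 0.0]))       # [0.5, 0.5]
print(softmax([1.0, 2.0, 3.0]))  # larger preference, larger probability
```

Adding the same constant to every preference leaves the probabilities unchanged, which is the sense in which only relative preferences matter.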

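The stochastic gradient ascent update is:
$H_{t+1}(A_t)=H_t(A_t)+\alpha\left(R_t-\bar{R}_t\right)\left(1-\pi_t(A_t)\right)$
$H_{t+1}(a)=H_t(a)-\alpha\left(R_t-\bar{R}_t\right)\pi_t(a)$ for all $a\neq A_t$ (2.12)
The first line updates the preference of the action $A_t$ that was actually selected; the second line updates all the other actions.
In these equations:

$\alpha > 0$: the step size
$\bar{R}_t\in\mathbb{R}$: the average of the rewards received before time $t$, which serves as a baseline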
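
One step of the preference update can be sketched as below. The function name `gradient_bandit_step` and its argument layout are hypothetical; the baseline is passed in by the caller, who would typically maintain it as a running average of past rewards.

```python
import math


def gradient_bandit_step(H, action, reward, baseline, alpha):
    """Return the preferences after one update for the selected `action`."""
    m = max(H)
    exps = [math.exp(h - m) for h in H]
    z = sum(exps)
    pi = [e / z for e in exps]  # soft-max probabilities
    advantage = reward - baseline
    new_H = []
    for a, h in enumerate(H):
        if a == action:
            # Selected action: move by (1 - pi) in the direction of the advantage.
            new_H.append(h + alpha * advantage * (1 - pi[a]))
        else:
            # Every other action: move the opposite way, scaled by its probability.
            new_H.append(h - alpha * advantage * pi[a])
    return new_H


# Reward above the baseline: the selected action's preference rises, the other falls.
print(gradient_bandit_step([0.0, 0.0], action=0, reward=1.0, baseline=0.0, alpha=0.1))
# approximately [0.05, -0.05]
```

When the reward equals the baseline the preferences do not move at all, and in every case the changes sum to zero across actions.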

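The figure below shows the effect of the baseline on a variant of the testbed in which the true expected rewards have a mean of +4 instead of zero. This shifting up of all the rewards has absolutely no effect on the gradient bandit algorithm because of the reward baseline term, which instantaneously adapts to the new level. Without the baseline, performance is significantly degraded.
[Figure: average performance of the gradient bandit algorithm with and without a reward baseline]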

1.8.1 Theory of Gradient Ascent

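In exact gradient ascent, each preference is incremented in proportion to the effect of that increment on performance:
$H_{t+1}(a)=H_t(a)+\alpha \frac{\partial E[R_t]}{\partial H_t(a)}$
where $E[R_t]=\sum_x\pi_t(x)q_*(x)$. This cannot be computed directly, because the $q_*(x)$ are unknown.
Substituting $E[R_t]$ into the gradient gives:
$\frac{\partial E[R_t]}{\partial H_t(a)}=\frac{\partial}{\partial H_t(a)}\left[\sum_x\pi_t(x)q_*(x)\right]=\sum_xq_*(x)\frac{\partial\pi_t(x)}{\partial H_t(a)}=\sum_x\left(q_*(x)-B_t\right)\frac{\partial\pi_t(x)}{\partial H_t(a)}$
Here $B_t$ is called the baseline and can be any scalar that does not depend on $x$. Including it does not change the equality, because the probabilities always sum to one and therefore $\sum_x\frac{\partial\pi_t(x)}{\partial H_t(a)}=0$. Next we multiply each term by $\pi_t(x)/\pi_t(x)$ ($\pi_t(x)$ is defined in (2.11) and is the probability of selecting action $x$):
$\frac{\partial E[R_t]}{\partial H_t(a)}=\sum_x\pi_t(x)\left(q_*(x)-B_t\right)\frac{\partial\pi_t(x)}{\partial H_t(a)}\Big/\pi_t(x)$
This now has the form of an expectation, summing over all possible values $x$ of the random variable $A_t$:
$\frac{\partial E[R_t]}{\partial H_t(a)}=E\left[\left(q_*(A_t)-B_t\right)\frac{\partial\pi_t(A_t)}{\partial H_t(a)}\Big/\pi_t(A_t)\right]$
Since $E[R_t \mid A_t]=q_*(A_t)$, and choosing $B_t=\bar{R}_t$, we have:
$\frac{\partial E[R_t]}{\partial H_t(a)}=E\left[\left(R_t-\bar{R}_t\right)\frac{\partial\pi_t(A_t)}{\partial H_t(a)}\Big/\pi_t(A_t)\right]$
Suppose, as can be verified from (2.11) with the quotient rule, that:
$\frac{\partial\pi_t(x)}{\partial H_t(a)}=\pi_t(x)\left(\mathbb{I}_{a=x}-\pi_t(a)\right)$
Then:
$\frac{\partial E[R_t]}{\partial H_t(a)}=E\left[\left(R_t-\bar{R}_t\right)\pi_t(A_t)\left(\mathbb{I}_{a=A_t}-\pi_t(a)\right)\Big/\pi_t(A_t)\right]$
which can be written as:
$\frac{\partial E[R_t]}{\partial H_t(a)}=E\left[\left(R_t-\bar{R}_t\right)\left(\mathbb{I}_{a=A_t}-\pi_t(a)\right)\right]$
So, replacing the expectation with the sample observed at step $t$, the update becomes:
$H_{t+1}(a)=H_t(a)+\alpha\left(R_t-\bar{R}_t\right)\left(\mathbb{I}_{a=A_t}-\pi_t(a)\right)$ for all $a$
which can be written separately for $a=A_t$ and $a\neq A_t$, giving exactly the two lines of (2.12).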
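
The soft-max derivative $\frac{\partial\pi_t(x)}{\partial H_t(a)}=\pi_t(x)\left(\mathbb{I}_{a=x}-\pi_t(a)\right)$ that the derivation relies on can be sanity-checked numerically. The sketch below compares it against a central finite difference; the helper names and the test preferences are assumptions made for the example.

```python
import math


def softmax(H):
    """Soft-max probabilities for a list of preferences."""
    m = max(H)
    exps = [math.exp(h - m) for h in H]
    z = sum(exps)
    return [e / z for e in exps]


def analytic_grad(H, x, a):
    """pi(x) * (1[a == x] - pi(a)), the closed-form partial derivative."""
    pi = softmax(H)
    return pi[x] * ((1.0 if a == x else 0.0) - pi[a])


def numeric_grad(H, x, a, eps=1e-6):
    """Central finite difference of pi(x) with respect to H(a)."""
    up, down = list(H), list(H)
    up[a] += eps
    down[a] -= eps
    return (softmax(up)[x] - softmax(down)[x]) / (2 * eps)


H = [0.2, -1.0, 0.7]
worst = max(
    abs(analytic_grad(H, x, a) - numeric_grad(H, x, a))
    for x in range(len(H))
    for a in range(len(H))
)
print("largest discrepancy:", worst)  # tiny, limited by floating-point error
```

A check like this does not replace the quotient-rule proof, but it catches sign and index mistakes quickly.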

1.9 Summary

$\epsilon$-greedy methods choose randomly a small fraction of the time.
UCB methods choose deterministically but achieve exploration by adding to each estimate a bound that favors actions that have been sampled less often.
Gradient bandit algorithms estimate not action values, but action preferences, and favor the more preferred actions in a graded, probabilistic manner using a soft-max distribution.
