Chapter 2 Multi-armed Bandits: Study Notes

Contents

Preface
2.1 A k-armed Bandit Problem
2.2 Action-value Methods
2.3 The 10-armed Testbed
2.4 Incremental Implementation
2.5 Tracking a Nonstationary Problem
2.6 Optimistic Initial Values
2.7 Upper-Confidence-Bound Action Selection
2.8 Gradient Bandit Algorithms
2.9 Associative Search (Contextual Bandits)
2.10 Summary


Preface

In this chapter we study the evaluative aspect of reinforcement learning in a simplified setting, one that does not involve learning to act in more than one situation. This nonassociative setting is the one in which most prior work involving evaluative feedback has been done, and it avoids much of the complexity of the full reinforcement learning problem. Studying this case enables us to see most clearly how evaluative feedback differs from, and yet can be combined with, instructive feedback.
 
The multi-armed bandit is the simplest reinforcement learning problem.

 

2.1 A k-armed Bandit Problem

Action value:

In our k-armed bandit problem, each of the k actions has an expected or mean reward given that that action is selected; let us call this the value of that action.
 

At, Rt

We denote the action selected on time step t as At, and the corresponding reward as Rt.
 

q*(a)

The value then of an arbitrary action a, denoted q*(a), is the expected reward given that a is selected:

q*(a) = E[Rt | At = a]

Qt(a)

We denote the estimated value of action a at time step t as Qt(a). We would like Qt(a) to be close to q*(a).
 

greedy actions

If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions.
 

Exploitation and exploration

 
When you select one of these greedy actions, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring , because this enables you to improve your estimate of the nongreedy action’s value. Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run.

 

Many methods already exist for solving the k-armed bandit problem, but most of them make strong assumptions about stationarity and prior knowledge.

2.2 Action-value Methods

Action-value Methods

We begin by looking more closely at methods for estimating the values of actions and for using the estimates to make action selection decisions, which we collectively call action-value methods.

 the true value of an action

Recall that the true value of an action is the mean reward when that action is selected. One natural way to estimate this is by averaging the rewards actually received:

Qt(a) = (sum of the rewards received when a was selected prior to t) / (number of times a was selected prior to t)

By the law of large numbers, Qt(a) converges to q*(a).

sample-average method

We call this the sample-average method for estimating action values because each estimate is an average of the sample of relevant rewards. Of course this is just one way to estimate action values, and not necessarily the best one
 
Selection of the greedy action: At = argmax_a Qt(a).
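As a minimal sketch (not the post's original code), sample-average estimation combined with ε-greedy selection might look like the following; the names q_estimation and action_count and the reward model N(q*(a), 1) are assumptions consistent with the testbed described in the next section:

```python
import numpy as np

k, epsilon = 10, 0.1
q_true = np.random.randn(k)              # true action values q*(a), fixed (stationary)
q_estimation = np.zeros(k)               # Qt(a), sample-average estimates
action_count = np.zeros(k)               # Nt(a), times each action was selected

for t in range(1000):
    if np.random.rand() < epsilon:       # explore with probability epsilon
        action = np.random.randint(k)
    else:                                # exploit: greedy action, ties broken randomly
        action = np.random.choice(np.flatnonzero(q_estimation == q_estimation.max()))
    reward = np.random.randn() + q_true[action]       # reward ~ N(q*(a), 1)
    action_count[action] += 1
    # incremental sample average: Q(a) <- Q(a) + (R - Q(a)) / N(a)
    q_estimation[action] += (reward - q_estimation[action]) / action_count[action]
```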
 

2.3 The 10-armed Testbed

To compare the greedy and ε-greedy action-value methods, 2000 10-armed bandit problems are generated at random, and each one is run for 1000 steps.

 

The figure above was generated with:

```python
import numpy as np; import matplotlib.pyplot as plt
plt.violinplot(dataset=np.random.randn(200, 10) + np.random.randn(10))  # violin plot: 200 reward samples per arm, each arm offset by its own mean
```

Without the np.random.randn(10) offset, all ten arms would share the same zero-mean reward distribution.

2.3.1 Stationary problems

The figure compares a greedy method with two ε-greedy methods (ε = 0.01 and ε = 0.1), as described above, on the 10-armed testbed. All the methods formed their action-value estimates using the sample-average technique.

Changing the value of epsilon to 0.1, 0.05, and 0.01:

 

Running it again:

 

Adding the greedy method and running 3000 steps for a look:

What if we run 10000 steps instead? Will ε = 0.01 overtake ε = 0.05?

 
Over a long enough horizon, ε = 0.01 does eventually overtake ε = 0.05.
 
 

Next, let us see how the curves change when the number of runs varies:

(1) 1 run

(2) 50 runs

(3) 2000 runs

(4) 6000 runs

We find that the more runs are averaged, the less the curves jitter and the more stable they look.

(5) Finally, a 2000-run, 5000-step experiment with a few more values of epsilon. The intention was to try 0.075 and 0.025, but the values were accidentally typed as 0.75 and 0.25. It turns out that ε = 0.75 performs very poorly because it explores far too much; like a person who is never satisfied, the result only gets worse and worse, as the figure below shows.

(7) This time with ε = 0, 0.1, 0.075, 0.05, 0.025, and 0.01:

(8) To check a hunch about the fine detail, here is a single run of only 10 steps.

Summary: (1) How much advantage the ε-greedy methods bring depends on the task. If the reward variance were zero, then after trying each action once its true values would be known, and the pure greedy method would be best. Even in the deterministic case, however, keeping some exploration weakens our reliance on the modelling assumptions, for example in nonstationary problems, where the true values q*(a) change over time.

 

(2) In all eight figures above, the true values are initialized with self.q_true = np.random.randn(self.k) + self.true_reward, where self.true_reward is 0. That is, each q*(a) is drawn once from a standard normal distribution and then kept fixed for all steps, so the problem is stationary.
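A minimal sketch of such a stationary testbed, consistent with the snippet above: self.q_true, self.k, and self.true_reward come from the post, while the class name Bandit and the step method are assumptions.

```python
import numpy as np

class Bandit:
    def __init__(self, k=10, true_reward=0.0):
        self.k = k
        self.true_reward = true_reward
        # stationary case: q*(a) drawn once from a normal distribution, then held fixed
        self.q_true = np.random.randn(self.k) + self.true_reward
        self.best_action = np.argmax(self.q_true)

    def step(self, action):
        # reward ~ N(q*(a), 1); q_true is never changed, so the problem stays stationary
        return np.random.randn() + self.q_true[action]
```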

2.3.2 Nonstationary problems

Let us try to implement a nonstationary version in which the q* values change at every step:

self.q_true = np.random.normal(np.random.normal(loc=0, scale=0.01, size=self.k)) + self.true_reward  # re-sample q* every step: the mean is small N(0, 0.01) noise, the scale defaults to 1

with self.true_reward = 0. Compared with the stationary problem, this line is executed on every time step rather than only once at initialization.
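For reference, Exercise 2.5 in the book specifies a slightly different kind of nonstationarity: all the q*(a) start out equal and then take independent random walks, with an increment of mean 0 and standard deviation 0.01 added to every q*(a) on each step. A minimal sketch of that variant (the function name is an assumption):

```python
import numpy as np

def random_walk_step(q_true):
    # Exercise 2.5: all q*(a) take independent random walks,
    # adding an increment with mean 0 and standard deviation 0.01 on each step
    return q_true + np.random.normal(loc=0.0, scale=0.01, size=q_true.shape)
```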

(1) 2000 runs, 1000 steps; the results are shown below:

(2) 2000 runs, 10000 steps; the results are shown below:

 

Summary: this is Exercise 2.5, which asks for a program that compares the stationary and nonstationary problems and demonstrates why the nonstationary case is harder.

 

2.4 Incremental Implementation

The action value Q is estimated by averaging the observed rewards. To compute Q more efficiently, we use an incremental implementation.
 
Only the previous estimate Q and the count n need to be stored, plus a small amount of computation per step:

Qn+1 = Qn + (1/n) [Rn - Qn]

This kind of update rule is used throughout the book; its general form is

NewEstimate <- OldEstimate + StepSize [Target - OldEstimate]

The expression in brackets is an error in the estimate, which is reduced by taking a step toward the target. The target is the direction we would like to move in, even though it may be noisy; in this chapter the target is the n-th reward.
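A minimal sketch of the incremental update described above; the variable names are assumptions:

```python
def update_sample_average(q, n, reward):
    # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
    # here StepSize = 1/n and Target = the n-th reward
    n += 1
    q += (reward - q) / n
    return q, n
```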

2.5 Tracking a Nonstationary Problem

The averaging methods discussed so far are appropriate for stationary bandit problems, that is, for bandit problems in which the reward probabilities do not change over time. Most reinforcement learning problems, however, are nonstationary, so it makes more sense to give more weight to recent rewards than to rewards received long ago. One of the most common ways of doing this is to use a constant step-size parameter.

This method is commonly called the exponential recency-weighted average: Qn+1 = Qn + α [Rn - Qn], which gives reward Ri a weight α(1-α)^(n-i) that decays exponentially with its age.
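For comparison, a minimal sketch of the constant step-size update that produces this exponential recency weighting (alpha = 0.1 is just an assumed value):

```python
ALPHA = 0.1  # constant step size; recent rewards weigh more than old ones

def update_constant_step(q, reward, alpha=ALPHA):
    # Q_{n+1} = Q_n + alpha * (R_n - Q_n)
    #         = (1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i
    return q + alpha * (reward - q)
```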

2.6 Optimistic Initial Values

With optimistic initial values, a substantial amount of exploration takes place even though only greedy actions are ever selected.

Initially, the optimistic method performs worse because it explores more, but eventually it performs better because its exploration decreases with time. We call this technique for encouraging exploration optimistic initial values.
As the optimistic initial estimates wash out, the exploration dies down. We regard this as a simple trick that can be quite effective on stationary problems, but it is far from being a generally useful way of encouraging exploration.
In particular, it is not well suited to nonstationary problems, because its drive for exploration is inherently temporary.
 
The results from a 2000-run, 1000-step experiment are reliable. To investigate the early spike:
 
With 20 steps:

With 50 steps:

With 12 steps:

With 3000 steps:

We find that no matter how many steps are run, the optimistic-initial-value method shows a spike at around step 10.
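A minimal sketch of the optimistic-initial-value setup used in these experiments, assuming the book's standard configuration (Q1 = +5, ε = 0, constant α = 0.1); the variable names are mine, not the post's:

```python
import numpy as np

k, alpha, initial_q = 10, 0.1, 5.0
q_true = np.random.randn(k)
q_estimation = np.zeros(k) + initial_q   # optimistic initial estimates Q1(a) = +5

for t in range(1000):
    # purely greedy selection; the optimism itself drives the early exploration
    action = np.random.choice(np.flatnonzero(q_estimation == q_estimation.max()))
    reward = np.random.randn() + q_true[action]
    q_estimation[action] += alpha * (reward - q_estimation[action])  # constant step size
```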

2.7 Upper-Confidence-Bound Action Selection

It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. One effective way of doing this is to select actions according to

At = argmax_a [ Qt(a) + c sqrt( ln t / Nt(a) ) ]

where Nt(a) is the number of times action a has been selected prior to time t and c > 0 controls the degree of exploration.
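A minimal sketch of UCB action selection as written above; treating never-tried actions as maximizing follows the book, while the function and variable names are assumptions:

```python
import numpy as np

def ucb_action(q_estimation, action_count, t, c=2.0):
    # actions that have never been tried are considered maximizing
    untried = np.flatnonzero(action_count == 0)
    if untried.size > 0:
        return np.random.choice(untried)
    # t is the 0-based step index, so ln(t + 1) avoids log(0)
    ucb = q_estimation + c * np.sqrt(np.log(t + 1) / action_count)
    return np.random.choice(np.flatnonzero(ucb == ucb.max()))
```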

 

With 20 steps:

 

 

With 3000 steps:

To investigate the spike, change the UCB parameter c to 1:

Explaining the spike: why is it less pronounced when c = 1? Presumably because a smaller c shrinks the exploration bonus, so after the initial sweep in which every arm is tried once, the value estimates dominate sooner and the sharp jump-and-drop around step 11 is smoothed out.

 

Drawbacks of UCB: it is much more difficult than ε-greedy to extend to the general reinforcement learning problem, and it is really only suited to stationary problems.

 

2.8 Gradient Bandit Algorithms

This method does not estimate action values; instead it learns a numerical preference Ht(a) for each action.

Actions are selected according to a soft-max distribution over the preferences:

Pr{At = a} = πt(a) = exp(Ht(a)) / Σb exp(Ht(b))

 

The average of all the rewards received up to time t serves as the baseline.

Action-selection code (a sketch covering both the selection and the preference update is given below):

 

Updating the preferences; here the q table (q_estimation) is reused to store the preference values.
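A minimal sketch of both the soft-max selection and the preference update; following the post, the preferences could be stored in the existing q table, but here a separate array h is used for clarity, and alpha = 0.1 is an assumed step size:

```python
import numpy as np

k, alpha = 10, 0.1
h = np.zeros(k)            # preferences H_t(a); the post reuses q_estimation for this
average_reward = 0.0
q_true = np.random.randn(k)

for t in range(1, 1001):
    exp_h = np.exp(h - h.max())          # subtract max for numerical stability
    pi = exp_h / exp_h.sum()             # soft-max: pi_t(a) = e^{H(a)} / sum_b e^{H(b)}
    action = np.random.choice(k, p=pi)
    reward = np.random.randn() + q_true[action]
    average_reward += (reward - average_reward) / t   # baseline: mean of all rewards so far
    one_hot = np.zeros(k)
    one_hot[action] = 1.0
    # H_{t+1}(a) = H_t(a) + alpha * (R_t - baseline) * (1{a = A_t} - pi_t(a))
    h += alpha * (reward - average_reward) * (one_hot - pi)
```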

The results of running the code are shown in the figure below:

 

2.9 Associative Search (Contextual Bandits)

 

Later on I plan to design an associative-search (contextual) multi-bandit problem:

when the slot machine changes color, the color is a signal, and the question is how to use that signal to choose actions that maximize reward.
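One possible way to set up such an experiment, purely as an assumed design rather than anything from the original post: the observed color tells the agent which bandit it is currently facing, and the agent keeps a separate ε-greedy estimate table per color:

```python
import numpy as np

k, n_colors, epsilon = 10, 3, 0.1
q_true = np.random.randn(n_colors, k)        # one bandit per color
q_est = np.zeros((n_colors, k))
counts = np.zeros((n_colors, k))

for t in range(10000):
    color = np.random.randint(n_colors)      # the observed signal (the machine's color)
    if np.random.rand() < epsilon:
        action = np.random.randint(k)
    else:
        row = q_est[color]
        action = np.random.choice(np.flatnonzero(row == row.max()))
    reward = np.random.randn() + q_true[color, action]
    counts[color, action] += 1
    q_est[color, action] += (reward - q_est[color, action]) / counts[color, action]
```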

2.10 Summary

A parameter study is used to compare the methods above.

With 2000 runs of 1000 steps:

With 1 run of 1000 steps:
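A sketch of how such a parameter study might be organized; the parameter grids follow the book's log-scale sweep, while run_bandit is a hypothetical helper standing in for the experiment loops sketched earlier:

```python
import numpy as np

# parameter values on a log2 scale, as in the book's parameter-study figure
epsilons = [2.0 ** p for p in range(-7, -1)]        # epsilon-greedy
alphas   = [2.0 ** p for p in range(-5, 2)]         # gradient bandit step sizes
cs       = [2.0 ** p for p in range(-4, 3)]         # UCB exploration parameter
q0s      = [2.0 ** p for p in range(-2, 3)]         # optimistic initial values

# for each (method, parameter) pair, average the reward over many runs, e.g.:
# score = np.mean([run_bandit(method, param, steps=1000) for _ in range(2000)])
```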

In assessing a method, we should attend not only to how well it does at its best parameter setting, but also to how sensitive it is to its parameter value. All of these algorithms are fairly insensitive, performing well over a range of parameter values spanning about an order of magnitude. Overall, on this problem, UCB appears to perform best.

A well-studied approach to balancing exploration and exploitation in k-armed bandit problems is to compute a special kind of action value called a Gittins index. In certain important special cases this computation is tractable and leads directly to optimal solutions, although it requires complete knowledge of the prior distribution over possible problems, which we usually assume is unavailable. Moreover, neither the theory nor the computational tractability of this approach appears to generalize to the full reinforcement learning problem considered in the rest of the book.

 

 

 

 
