Bandit Algorithms Textbook Study Notes

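This article covers the theory and algorithms of adversarial bandits and stochastic bandits, including the Exp3 and Exp3-IX algorithms and their regret analysis. For adversarial bandits, it focuses on Exp3 and the regret upper bounds obtained from different proofs, showing by a chain of inequalities that the regret is at most $2\sqrt{nk\log(k)}$. The stochastic linear bandits part covers the LinUCB algorithm, the construction of its confidence sets, and its regret analysis. The article also covers contextual bandits, explaining how context affects decisions, and introduces the Exp4 algorithm for expert advice. Finally, it discusses the SETC algorithm for sparse linear bandits, online linear prediction, and lower bounds for stochastic linear bandits.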


References

  • https://banditalgs.com/2016/08/01/table-of-contents/

    https://tor-lattimore.com/downloads/talks/2018/aaai/finite-armed-bandits.pdf

    Explanations on the textbook's official website

  • https://www.bilibili.com/read/cv6567364/
    https://www.bilibili.com/video/BV1ki4y1x7cE
    A talk on Didi's recommender system, covering MAB and Contextual Bandits

  • https://www.cnblogs.com/kuliuheng/p/13808346.html
    An explanation of UCB


1 Part 3 Adversarial Bandits with Finitely Many Arms

References

  • https://banditalgs.com/2016/10/01/adversarial-bandits

1.1 Adversarial bandit basics

Compared with a stochastic bandit, the main difference in an adversarial bandit is how the reward is generated. A stochastic reward is drawn from a fixed distribution (for example a Gaussian). An adversarial reward is given by the environment $\nu = (x_1,\dots,x_n)\in [0,1]^{Kn}$, where $K$ (also written $k$ in these notes) is the number of arms; in effect, the adversary hands out rewards from a table.

The stochastic UCB algorithm picks its arm deterministically in each round: the arm with the highest upper confidence bound (UCB). The adversarial Exp3 algorithm picks its arm at random: in each round it forms a probability distribution $P_{t1},P_{t2},\dots,P_{tk}$ over the $k$ arms and samples that round's arm from it.


1.2 The Exp3 algorithm and its regret analysis

[Image: the Exp3 algorithm]

Here the estimate is

$$\hat X_{ti} = 1- \frac{\mathbb{I}\{A_t=i\}}{P_{ti}}\,(1-X_t)$$

In each round the algorithm computes each arm's probability from the estimates and samples an arm from that distribution, rather than choosing the highest-scoring arm as UCB does. The expected reward in round $t$, conditioned on the history, is therefore $\mathbb{E}_{t-1}[X_t]=\sum_{i=1}^k P_{ti}x_{ti}$.

The goal is for the expected regret to be small, namely sublinear in $n$, i.e. $o(n)$, so that $\lim_{n\rightarrow +\infty}\frac{R_n}{n}=0$.
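The sampling and update steps described above can be sketched in Python. This is a minimal sketch, not the textbook's reference implementation: the names `exp3`, `rewards`, `eta` and `seed` are illustrative assumptions, the probabilities are taken as $P_{ti} \propto \exp(\eta \hat S_{t-1,i})$ to match the $W_t$ used in the proof, and subtracting the maximum score before exponentiating is a numerical-stability detail that is not part of the notes.

```python
import math
import random

def exp3(rewards, eta, seed=0):
    """Run Exp3 on a reward table rewards[t][i] in [0, 1]; return the total reward collected."""
    rng = random.Random(seed)
    n, k = len(rewards), len(rewards[0])
    s_hat = [0.0] * k  # cumulative estimated rewards, one per arm
    total = 0.0
    for t in range(n):
        # P_{ti} proportional to exp(eta * S_hat_{t-1,i}); shifting by the max
        # leaves the distribution unchanged (stability detail, an assumption)
        m = max(s_hat)
        weights = [math.exp(eta * (s - m)) for s in s_hat]
        z = sum(weights)
        p = [w / z for w in weights]
        a = rng.choices(range(k), weights=p)[0]  # sample the arm A_t from P_t
        x = rewards[t][a]
        total += x
        # X_hat_{ti} = 1 - I{A_t = i} (1 - X_t) / P_{ti}
        for i in range(k):
            if i == a:
                s_hat[i] += 1.0 - (1.0 - x) / p[i]
            else:
                s_hat[i] += 1.0
    return total

# Hypothetical environment: arm 0 always pays 1, the other arms pay 0.
n, k = 1000, 3
table = [[1.0, 0.0, 0.0] for _ in range(n)]
eta = math.sqrt(math.log(k) / (n * k))  # learning rate from Section 1.2.1
print("regret against arm 0:", n - exp3(table, eta))
```

Only the chosen arm's estimate can fall below 1, so arms that are sampled and pay poorly lose probability mass in later rounds.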


1.2.1 Proof: $R(\pi,x) \leq 2\sqrt{nk\log(k)}$

Define $R_{n,i}$ as the regret computed with arm $i$ in place of the optimal arm:

$$R_{n,i} = \sum_{t=1}^n x_{ti} - \mathbb{E}\left[\sum_{t=1}^n X_t\right] = \mathbb{E}\left[\hat S_{ni}\right] - \mathbb{E}\left[\sum_{t=1}^n\sum_{j=1}^k P_{tj} \hat X_{tj}\right] = \mathbb{E}\left[\hat S_{ni} - \hat S_n\right]$$

where

$$\hat S_{ni} = \sum_{t=1}^n \hat X_{ti}, \qquad \mathbb{E}\left[\hat S_{ni}\right] = \sum_{t=1}^n x_{ti}, \qquad \hat S_n = \sum_{t,j} P_{tj} \hat X_{tj}$$

Since $P_{ti}$ is built from exponentials, define the following quantity, modelled on its denominator:

$$W_t = \sum_{j=1}^k \exp\left(\eta\hat S_{tj}\right)$$

Then

$$\begin{aligned}
\frac{W_t}{W_{t-1}} &= \sum_j \frac{\exp(\eta \hat S_{t-1,j})}{W_{t-1}} \exp(\eta \hat X_{tj}) = \sum_j P_{tj} \exp(\eta \hat X_{tj}) \\
&\le 1 + \eta \sum_j P_{tj} \hat X_{tj} + \eta^2 \sum_j P_{tj} \hat X_{tj}^2 \\
&\le \exp\left( \eta \sum_j P_{tj} \hat X_{tj} + \eta^2 \sum_j P_{tj} \hat X_{tj}^2 \right)
\end{aligned}$$

This uses $\exp(x) \le 1 + x + x^2$, which holds for $x \le 1$ and applies here because $\hat X_{tj} \le 1$ (so $\eta \hat X_{tj} \le 1$ whenever $\eta \le 1$), together with $1+x \le \exp(x)$.
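The two elementary inequalities can be spot-checked numerically. The sketch below is only a sanity check on a grid, not a proof; the helper names `grid`, `exp_le_quadratic` and `linear_le_exp` and the tolerance `tol` are assumptions made for illustration.

```python
import math

def grid(lo, hi, steps=2000):
    """Evenly spaced points on [lo, hi]."""
    return [lo + (hi - lo) * i / steps for i in range(steps + 1)]

def exp_le_quadratic(x, tol=1e-12):
    """Check exp(x) <= 1 + x + x^2 (the bound applied to eta * X_hat, valid for x <= 1)."""
    return math.exp(x) <= 1.0 + x + x * x + tol

def linear_le_exp(x, tol=1e-12):
    """Check 1 + x <= exp(x), which holds for every real x."""
    return 1.0 + x <= math.exp(x) + tol

print(all(exp_le_quadratic(x) for x in grid(-20.0, 1.0)))
print(all(linear_le_exp(x) for x in grid(-20.0, 20.0)))
print(exp_le_quadratic(2.0))
```

The last call returns `False`, since $e^2 \approx 7.39 > 7 = 1 + 2 + 2^2$. The quadratic bound really does need an upper limit on $x$, which is why the proof relies on $\hat X_{tj} \le 1$.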

At the step that bounds $\exp(\eta \hat X_{tj})$, a different inequality gives the sharper bound $R(\pi,x) \leq \sqrt{2nk\log(k)}$; this is covered in Section 1.2.2.

Bounding $\exp(\eta \hat S_{ni})$ and substituting the inequality above over all rounds gives

$$\begin{aligned}
\exp(\eta \hat S_{ni}) &\le \sum_{j} \exp(\eta \hat S_{nj}) = W_n = W_0 \frac{W_1}{W_0} \cdots \frac{W_n}{W_{n-1}} \\
&\le k\exp\left( \eta \sum_{t,j} P_{tj} \hat X_{tj} + \eta^2 \sum_{t,j} P_{tj} \hat X_{tj}^2 \right)
\end{aligned}$$

Taking logarithms and dividing by $\eta$ yields

$$\hat S_{ni} - \hat S_n \le \frac{\log(K)}{\eta} + \eta \sum_{t,j} P_{tj} \hat X_{tj}^2$$

Writing $y_{tj} = 1-x_{tj}$ and $Y_t=1-X_t$, expand the square:

[Image: expansion of the square]
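The image holding this step did not survive extraction. A plausible reconstruction, assuming it follows the standard argument and writing $A_{tj} = \mathbb{I}\{A_t = j\}$ (so that $A_{tj}^2 = A_{tj}$, $\mathbb{E}_{t-1}[A_{tj}] = P_{tj}$ and $Y_t = \sum_j A_{tj} y_{tj}$), is:

$$\begin{aligned}
\hat X_{tj}^2 &= \left(1 - \frac{A_{tj}\, y_{tj}}{P_{tj}}\right)^2 = 1 - \frac{2 A_{tj}\, y_{tj}}{P_{tj}} + \frac{A_{tj}\, y_{tj}^2}{P_{tj}^2} \\
\sum_j P_{tj} \hat X_{tj}^2 &= 1 - 2 Y_t + \sum_j \frac{A_{tj}\, y_{tj}^2}{P_{tj}} \\
\mathbb{E}\left[\sum_j P_{tj} \hat X_{tj}^2\right] &= \mathbb{E}\left[1 - 2Y_t + \sum_j y_{tj}^2\right] = \mathbb{E}\left[(1 - Y_t)^2 + \sum_{j \ne A_t} y_{tj}^2\right] \le 1 + (k-1) = k
\end{aligned}$$

Summing over the $n$ rounds gives $\mathbb{E}\left[\sum_{t,j} P_{tj} \hat X_{tj}^2\right] \le nk$, which is the $\eta n k$ term in the final bound.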

With $\eta = \sqrt{\log(K)/(nk)}$ we obtain

$$R_n = \max_i R_{n,i} \le \frac{\log(K)}{\eta} + \eta n k = 2\sqrt{nk\log(k)}$$
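The choice of $\eta$ is the minimiser of the right-hand side. A short numerical sketch makes this concrete; the function names `regret_bound` and `best_eta`, the constant `c` multiplying the second term, and the values of `n` and `k` are assumptions made for illustration.

```python
import math

def regret_bound(eta, n, k, c=1.0):
    """Right-hand side log(k)/eta + c*eta*n*k; c = 1 corresponds to this proof."""
    return math.log(k) / eta + c * eta * n * k

def best_eta(n, k, c=1.0):
    """Minimiser in eta, found by setting the derivative -log(k)/eta^2 + c*n*k to zero."""
    return math.sqrt(math.log(k) / (c * n * k))

n, k = 10_000, 5  # hypothetical horizon and number of arms
eta = best_eta(n, k)
print(eta, regret_bound(eta, n, k), 2 * math.sqrt(n * k * math.log(k)))
```

At the minimiser the two terms of the bound are equal, which is where the factor 2 comes from. Calling the same helpers with `c=0.5` reproduces the learning rate and the $\sqrt{2nk\log(k)}$ bound of Section 1.2.2.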


1.2.2 Proof: $R(\pi,x) \leq \sqrt{2nk\log(k)}$

Change the bounding strategy: apply $\exp(x)\le 1+x+\frac{x^2}{2}$, which holds for $x \le 0$ and applies here because $\hat X_{tj}-1\le 0$, together with $1+x \le \exp(x)$:

$$\exp(\eta \hat X_{tj}) = \exp(\eta) \exp\left(\eta (\hat X_{tj}-1)\right) \le \exp(\eta) \left\{1+ \eta (\hat X_{tj}-1) + \frac{\eta^2}{2} (\hat X_{tj}-1)^2\right\}$$

Since $\sum_j P_{tj}=1$,

$$\begin{aligned}
\frac{W_t}{W_{t-1}}=\sum_j P_{tj} \exp(\eta \hat X_{tj}) &\le \exp(\eta)\left[1-\eta +\sum_j P_{tj}\left(\eta \hat X_{tj} + \frac{\eta^2}{2} (\hat X_{tj}-1)^2\right)\right] \\
&\le \exp\left( \eta \sum_j P_{tj} \hat X_{tj} + \frac{\eta^2}{2} \sum_j P_{tj}(\hat X_{tj}-1)^2\right)
\end{aligned}$$
With $A_{tj} = \mathbb{I}\{A_t=j\}$,

$$\hat Y_{tj} = 1-\hat X_{tj} = \frac{A_{tj}}{P_{tj}}\, y_{tj}$$

$$P_{tj} (\hat X_{tj}-1)^2 = P_{tj} \hat Y_{tj}\hat Y_{tj} = A_{tj}\, y_{tj}\hat Y_{tj}\le \hat Y_{tj}$$

Therefore

$$\frac{W_t}{W_{t-1}} \le \exp\left( \eta \sum_j P_{tj} \hat X_{tj} + \frac{\eta^2}{2}\sum_j \hat Y_{tj} \right)$$

As before, substituting this into the bound on $\exp(\eta \hat S_{ni})$ gives

$$\hat S_{ni} - \hat S_n \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{t,j} \hat Y_{tj}$$

By the tower rule $\mathbb{E}(x)=\mathbb{E}(\mathbb{E}_{t-1}x)$,

$$\mathbb{E}\left(\sum_{t,j} \hat Y_{tj}\right)=\mathbb{E}\left(\sum_{t,j} \mathbb{E}_{t-1}\hat Y_{tj}\right)=\mathbb{E}\left(\sum_{t,j} y_{tj}\right)\le nk$$
With $\eta = \sqrt{2\log(K)/(nk)}$ we obtain

$$R_n = \max_i R_{n,i} \le \frac{\log(K)}{\eta} + \frac{\eta n k}{2}=\sqrt{2nk\log(k)}$$
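The identity $\mathbb{E}_{t-1}[\hat Y_{tj}] = y_{tj}$ used in this proof can be checked exactly by enumerating the possible actions. The sketch below assumes a hypothetical sampling distribution `p` and loss vector `y`; the function name `expected_y_hat` is likewise illustrative.

```python
def expected_y_hat(p, y):
    """Exact conditional expectation of Y_hat_{tj} for each arm j,
    given the sampling distribution p and the losses y_j = 1 - x_j."""
    k = len(p)
    out = [0.0] * k
    for a in range(k):  # enumerate the possible actions A_t = a
        for j in range(k):
            # Y_hat_{tj} = A_{tj} * y_{tj} / P_{tj}, nonzero only for the chosen arm
            y_hat = y[j] / p[j] if j == a else 0.0
            out[j] += p[a] * y_hat
    return out

p = [0.5, 0.3, 0.2]  # hypothetical P_t
y = [0.1, 0.7, 1.0]  # hypothetical losses y_t
print(expected_y_hat(p, y))
```

Up to floating-point rounding, the result equals `y` for any `p` with strictly positive entries: the factor $1/P_{tj}$ exactly cancels the probability of observing arm $j$.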


1.2.3 Comparing the two proofs

Comparing the two results, the difference lies in which inequality is used to bound the $\exp(\eta \hat X_{tj})$ factor in $\frac{W_t}{W_{t-1}}=\sum_j P_{tj} \exp(\eta \hat X_{tj})$. The resulting bounds are:

$$\hat S_{ni} - \hat S_n \le \frac{\log(K)}{\eta} + \eta \sum_{t,j} P_{tj} \hat X_{tj}^2$$

$$\hat S_{ni} - \hat S_n \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{t,j} \hat Y_{tj}$$

Only the second term differs, as a consequence of the different inequality.


1.3 The Exp3-IX algorithm and its regret analysis

Differences between Exp3 and Exp3-IX:

  • Exp3