References
- https://banditalgs.com/2016/08/01/table-of-contents/ and https://tor-lattimore.com/downloads/talks/2018/aaai/finite-armed-bandits.pdf: explanations from the textbook's official website
- https://www.bilibili.com/read/cv6567364/ and https://www.bilibili.com/video/BV1ki4y1x7cE: a talk on Didi's recommender system that covers MAB and contextual bandits
- https://www.cnblogs.com/kuliuheng/p/13808346.html: an explanation of UCB
1 Part 3 Adversarial Bandits with Finitely Many Arms
References
- https://banditalgs.com/2016/10/01/adversarial-bandits
1.1 Adversarial bandit basics
Compared with a stochastic bandit, the main difference in an adversarial bandit is how the reward is generated. In the stochastic setting the reward is drawn from a fixed distribution (for example a Gaussian). In the adversarial setting the rewards are given by the environment $\nu = (x_1,\dots,x_n)\in [0,1]^{kn}$, where $x_t = (x_{t1},\dots,x_{tk})$ holds the rewards of the $k$ arms in round $t$; in effect the adversary reads the rewards off a table.
The stochastic UCB algorithm chooses its arm deterministically: in each round it plays the arm with the highest upper confidence bound (UCB). The adversarial Exp3 algorithm is randomised: in each round it forms a probability distribution $P_{t1},P_{t2},\dots,P_{tk}$ over the $k$ arms and samples the arm to play from that distribution.
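To make the two ideas above concrete, here is a minimal Python sketch. The reward table, the scores, and the helper names `deterministic_choice` and `randomised_choice` are hypothetical and only serve as an illustration; they are not taken from the referenced material.

```python
import random

# Hypothetical reward table chosen by the adversary:
# rewards[t][i] is the reward of arm i in round t, each in [0, 1].
rewards = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.5],
    [0.7, 0.3, 0.5],
    [0.6, 0.4, 0.5],
]
n, k = len(rewards), len(rewards[0])

# Total reward of each fixed arm, and the best fixed arm in hindsight.
arm_totals = [sum(rewards[t][i] for t in range(n)) for i in range(k)]
best_arm = max(range(k), key=lambda i: arm_totals[i])

def deterministic_choice(scores):
    """UCB-style selection: always the arm with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def randomised_choice(probs, rng):
    """Exp3-style selection: sample an arm from a probability distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
print("arm totals:", arm_totals, "best fixed arm:", best_arm)
print("deterministic:", deterministic_choice([0.2, 0.7, 0.1]))
print("randomised:", randomised_choice([0.2, 0.7, 0.1], rng))
```

The deterministic rule returns the same arm every time it sees the same scores, while the randomised rule can return any arm with positive probability.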
1.2 The Exp3 algorithm and its regret analysis
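Exp3 estimates the reward of every arm with
$$\hat X_{ti} = 1- \frac{\mathbb{I}\{A_t=i\}}{P_{ti}}\,(1-X_t),$$
where $A_t$ is the arm played in round $t$ and $X_t$ is the reward observed.
In each round the algorithm computes the probability of every arm from these estimates (the probability of arm $i$ is proportional to $\exp(\eta \hat S_{t-1,i})$, where $\hat S_{t-1,i}$ is the sum of the estimates for arm $i$ so far and $\eta>0$ is the learning rate) and samples the arm from that distribution, rather than playing the highest-scoring arm as UCB does. The expected reward in round $t$, conditioned on the past, is therefore
$$\mathbb{E}_{t-1}[X_t]=\sum_{i=1}^k P_{ti}x_{ti}.$$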
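Two properties of this estimator are used repeatedly in the analysis: $\sum_j P_{tj}\hat X_{tj} = X_t$ for every realisation of $A_t$, and $\mathbb{E}_{t-1}[\hat X_{ti}] = x_{ti}$. The sketch below checks both exactly on a small example; the distribution `probs`, the rewards `x`, and the helper `reward_estimates` are hypothetical.

```python
import math

def reward_estimates(probs, chosen, observed_reward):
    """X_hat[i] = 1 - 1{A_t = i} / P[i] * (1 - X_t) for every arm i."""
    return [
        1.0 - (1.0 if i == chosen else 0.0) / p * (1.0 - observed_reward)
        for i, p in enumerate(probs)
    ]

probs = [0.5, 0.3, 0.2]   # hypothetical P_t
x = [0.9, 0.4, 0.1]       # hypothetical rewards x_t of the current round
k = len(probs)

# Identity: sum_j P_tj * X_hat_tj equals the observed reward, whatever A_t is.
identity_holds = all(
    math.isclose(
        sum(p * e for p, e in zip(probs, reward_estimates(probs, a, x[a]))),
        x[a],
    )
    for a in range(k)
)

# Unbiasedness: averaging X_hat_ti over A_t ~ P_t recovers x_ti.
expected_estimates = [
    sum(probs[a] * reward_estimates(probs, a, x[a])[i] for a in range(k))
    for i in range(k)
]

# Expected reward of the round: sum_i P_ti * x_ti.
expected_reward = sum(p * xi for p, xi in zip(probs, x))
print(identity_holds, expected_estimates, expected_reward)
```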
The goal is for the expected regret to be small, namely sublinear in $n$, that is $o(n)$, so that $\lim_{n\rightarrow +\infty}\frac{R_n}{n}=0$.
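The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `exp3`, the reward table, and the horizon are hypothetical, and the learning rate is the one used in the first proof of this section.

```python
import math
import random

def exp3(rewards, eta, rng):
    """Run Exp3 on a reward table rewards[t][i] in [0, 1]; return total reward."""
    n, k = len(rewards), len(rewards[0])
    s_hat = [0.0] * k  # s_hat[i]: sum of the reward estimates for arm i so far
    total = 0.0
    for t in range(n):
        # P_ti is proportional to exp(eta * s_hat[i]); shifting by the max
        # leaves the distribution unchanged and avoids overflow.
        top = max(s_hat)
        weights = [math.exp(eta * (s - top)) for s in s_hat]
        norm = sum(weights)
        probs = [w / norm for w in weights]
        arm = rng.choices(range(k), weights=probs, k=1)[0]
        x = rewards[t][arm]
        total += x
        # X_hat_ti = 1 - 1{A_t = i} / P_ti * (1 - X_t): unplayed arms get 1.
        for i in range(k):
            if i == arm:
                s_hat[i] += 1.0 - (1.0 - x) / probs[i]
            else:
                s_hat[i] += 1.0
    return total

# Hypothetical reward table in which arm 1 is best on average.
n, k = 2000, 3
rewards = [[0.3, 0.9 if t % 2 == 0 else 0.5, 0.5] for t in range(n)]
eta = math.sqrt(math.log(k) / (n * k))

best_total = max(sum(row[i] for row in rewards) for i in range(k))
total = exp3(rewards, eta, random.Random(0))
print("regret:", best_total - total, "regret / n:", (best_total - total) / n)
```

A single run gives a realised regret, which fluctuates with the random seed; the bounds proved below concern its expectation.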
1.2.1 Proof: $R(\pi,x) \leq 2 \sqrt{nk\log(k)}$
Define $R_{n,i}$ as the regret measured against arm $i$ in place of the optimal arm:
$$R_{n,i} = \sum_{t=1}^n x_{ti} - \mathbb{E}\left[\sum_{t=1}^n X_t\right]=\mathbb{E}\left[\hat S_{ni}\right]-\mathbb{E}\left[\sum_{t=1}^n\sum_{j=1}^k P_{tj} \hat X_{tj} \right]=\mathbb{E}\left[ \hat S_{ni}- \hat S_{n} \right]$$
where
$$\hat S_{ti}=\sum_{s=1}^{t} \hat X_{si}, \qquad \mathbb{E}\left[\hat S_{ni}\right] = \sum_{t=1}^n x_{ti}, \qquad \hat S_n = \sum_{t=1}^n\sum_{j=1}^k P_{tj} \hat X_{tj}.$$
The second equality uses the unbiasedness of $\hat X_{ti}$ and the identity $\sum_j P_{tj}\hat X_{tj} = X_t$.
Because $P_{ti}$ has an exponential form, mirror its denominator and define
$$W_t = \sum_{j=1}^k \exp\left(\eta\hat S_{tj}\right).$$
Then
$$\begin{aligned}
\frac{W_t}{W_{t-1}} &= \sum_j \frac{\exp(\eta \hat S_{t-1,j})}{W_{t-1}} \exp(\eta \hat X_{tj}) = \sum_j P_{tj} \exp(\eta \hat X_{tj}) \\
&\le 1 + \eta \sum_j P_{tj} \hat X_{tj} + \eta^2 \sum_j P_{tj} \hat X_{tj}^2 \\
&\le \exp\left( \eta \sum_j P_{tj} \hat X_{tj} + \eta^2 \sum_j P_{tj} \hat X_{tj}^2 \right).
\end{aligned}$$
This uses $\exp(x) \le 1 + x + x^2$, which holds for $x \le 1$, together with $1+x \le \exp(x)$. The first inequality applies because $\hat X_{tj} \le 1$, so $\eta \hat X_{tj} \le 1$ whenever $\eta \le 1$.
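As a quick sanity check of the two inequalities, the sketch below evaluates them on a grid. The grid range and the helper name `quad_upper` are arbitrary choices for illustration; a grid check is not a proof.

```python
import math

def quad_upper(x):
    """The quadratic upper bound 1 + x + x^2."""
    return 1.0 + x + x * x

xs = [i / 100.0 for i in range(-1000, 101)]   # grid on [-10, 1]

# exp(x) <= 1 + x + x^2 on the grid (all points satisfy x <= 1).
upper_ok = all(math.exp(x) <= quad_upper(x) + 1e-12 for x in xs)

# 1 + x <= exp(x) on the same grid (this one holds for every real x).
lower_ok = all(1.0 + x <= math.exp(x) + 1e-12 for x in xs)

# For larger x the quadratic bound eventually fails, for example at x = 2.
fails_at_2 = math.exp(2.0) > quad_upper(2.0)
print(upper_ok, lower_ok, fails_at_2)
```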
Using a different bound at this step gives the better result $R(\pi,x) \leq \sqrt{2nk\log(k)}$, which is covered later.
Bounding $\exp(\eta \hat S_{ni})$ and substituting the inequality above gives
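$$\begin{aligned}
\exp(\eta \hat S_{ni}) &\le \sum_{j} \exp(\eta \hat S_{nj}) = W_n = W_0 \frac{W_1}{W_0} \cdots \frac{W_n}{W_{n-1}} \\
&\le k\exp\left( \eta \sum_{t=1}^n\sum_j P_{tj} \hat X_{tj} + \eta^2 \sum_{t=1}^n\sum_j P_{tj} \hat X_{tj}^2 \right),
\end{aligned}$$
where $W_0 = k$ because $\hat S_{0j} = 0$. Taking logarithms and dividing by $\eta$ gives
$$\hat S_{ni} - \hat S_n \le \frac{\log(k)}{\eta} + \eta \sum_{t,j} P_{tj} \hat X_{tj}^2.$$
Write $y_{tj} = 1-x_{tj}$, $Y_t=1-X_t$ and $A_{tj} = \mathbb{I}\{A_t=j\}$, and expand the square:
$$\sum_j P_{tj}\hat X_{tj}^2 = \sum_j P_{tj}\left(1-\frac{A_{tj}\,y_{tj}}{P_{tj}}\right)^2 = 1 - 2Y_t + \sum_j \frac{A_{tj}\,y_{tj}^2}{P_{tj}}.$$
Since $\mathbb{E}_{t-1}[A_{tj}] = P_{tj}$, taking expectations gives
$$\mathbb{E}\left[\sum_j P_{tj}\hat X_{tj}^2\right] = \mathbb{E}\left[1 - 2Y_t + \sum_j y_{tj}^2\right] = \mathbb{E}\left[(1-Y_t)^2 + \sum_{j\ne A_t} y_{tj}^2\right] \le k,$$
and therefore $\mathbb{E}\left[\sum_{t,j} P_{tj}\hat X_{tj}^2\right] \le nk$.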
Setting $\eta = \sqrt{\log(k)/(nk)}$ gives the result
$$R_n = \max_i R_{n,i} \le \frac{\log(k)}{\eta} + \eta n k=2\sqrt{nk\log(k)}.$$
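The choice of $\eta$ balances the two terms of the bound. A small numerical check, with hypothetical values of `n` and `k` and a hypothetical helper `regret_bound`, shows that this choice attains $2\sqrt{nk\log(k)}$ and that moving away from it in either direction makes the bound worse.

```python
import math

def regret_bound(eta, n, k):
    """Right-hand side log(k) / eta + eta * n * k of the first proof."""
    return math.log(k) / eta + eta * n * k

n, k = 10_000, 5   # hypothetical horizon and number of arms
eta_star = math.sqrt(math.log(k) / (n * k))
closed_form = 2.0 * math.sqrt(n * k * math.log(k))

at_optimum = regret_bound(eta_star, n, k)
too_small = regret_bound(0.5 * eta_star, n, k)
too_large = regret_bound(2.0 * eta_star, n, k)
print(at_optimum, closed_form, too_small, too_large)
```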
1.2.2 Proof: $R(\pi,x) \leq \sqrt{2nk\log(k)}$
This time the bounding strategy changes: apply $\exp(x)\le 1+x+\frac{x^2}{2}$, which holds for $x \le 0$ and is applicable because $\hat X_{tj}-1\le 0$, together with $1+x \le \exp(x)$.
$$\exp(\eta \hat X_{tj}) = \exp(\eta) \exp\left( \eta (\hat X_{tj}-1) \right) \le \exp(\eta) \left\{1+ \eta (\hat X_{tj}-1) + \frac{\eta^2}{2} (\hat X_{tj}-1)^2\right\}$$
Using $\sum_j P_{tj}=1$,
$$\begin{aligned}
\frac{W_t}{W_{t-1}}=\sum_j P_{tj} \exp(\eta \hat X_{tj}) &\le \exp(\eta)\left[1-\eta +\sum_j P_{tj}\left(\eta \hat X_{tj} + \frac{\eta^2}{2} (\hat X_{tj}-1)^2\right)\right] \\
&\le \exp\left( \eta \sum_j P_{tj} \hat X_{tj} + \frac{\eta^2}{2} \sum_j P_{tj}(\hat X_{tj}-1)^2\right).
\end{aligned}$$
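The sharper inequality $\exp(x)\le 1+x+\frac{x^2}{2}$ is only valid for $x \le 0$, which is why the factor $\exp(\eta)$ is pulled out first. The sketch below checks the inequality on a grid and then verifies the resulting bound on $W_t/W_{t-1}$ for one round; `eta`, `probs`, `x`, and the helper `taylor2` are hypothetical.

```python
import math

def taylor2(x):
    """Second-order Taylor polynomial 1 + x + x^2 / 2."""
    return 1.0 + x + 0.5 * x * x

grid = [-i / 100.0 for i in range(0, 2001)]   # x in [-20, 0]
holds_on_nonpositive = all(math.exp(x) <= taylor2(x) + 1e-12 for x in grid)
fails_for_positive = math.exp(0.5) > taylor2(0.5)

# One-round check of the bound on W_t / W_{t-1} for every possible A_t.
eta = 0.1
probs = [0.5, 0.3, 0.2]   # hypothetical P_t
x = [0.9, 0.4, 0.1]       # hypothetical rewards x_t
k = len(probs)
ratio_bound_ok = True
for a in range(k):
    # X_hat_tj = 1 - 1{A_t = j} / P_tj * (1 - X_t), with X_t = x[a].
    x_hat = [1.0 - (1.0 - x[a]) / probs[j] if j == a else 1.0 for j in range(k)]
    lhs = sum(p * math.exp(eta * xh) for p, xh in zip(probs, x_hat))
    rhs = math.exp(
        eta * sum(p * xh for p, xh in zip(probs, x_hat))
        + 0.5 * eta ** 2 * sum(p * (xh - 1.0) ** 2 for p, xh in zip(probs, x_hat))
    )
    ratio_bound_ok = ratio_bound_ok and lhs <= rhs + 1e-12
print(holds_on_nonpositive, fails_for_positive, ratio_bound_ok)
```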
Let $\hat Y_{tj} = 1-\hat X_{tj} = \frac{A_{tj}}{P_{tj}}\, y_{tj}$, where $A_{tj} = \mathbb{I}\{A_t=j\}$ and $y_{tj} = 1 - x_{tj}$. Then
$$P_{tj} (\hat X_{tj}-1)^2 = P_{tj} \hat Y_{tj}\hat Y_{tj} = A_{tj}\, y_{tj}\,\hat Y_{tj}\le \hat Y_{tj},$$
since $y_{tj}\in[0,1]$ and $\hat Y_{tj}\ge 0$.
Therefore
$$\frac{W_t}{W_{t-1}} \le \exp\left( \eta \sum_j P_{tj} \hat X_{tj} + \frac{\eta^2}{2}\sum_j \hat Y_{tj} \right).$$
As before, substituting this into the bound on $\exp(\eta \hat S_{ni})$ gives
$$\hat S_{ni} - \hat S_n \le \frac{\log(k)}{\eta} + \frac{\eta}{2} \sum_{t,j} \hat Y_{tj}.$$
By the tower rule $\mathbb{E}[\,\cdot\,]=\mathbb{E}\left[\mathbb{E}_{t-1}[\,\cdot\,]\right]$,
$$\mathbb{E}\left[\sum_{t,j} \hat Y_{tj}\right]=\mathbb{E}\left[\sum_{t,j} \mathbb{E}_{t-1}\left[\hat Y_{tj}\right]\right]=\sum_{t,j} y_{tj}\le nk.$$
Setting $\eta = \sqrt{2\log(k)/(nk)}$ gives the result
$$R_n = \max_i R_{n,i} \le \frac{\log(k)}{\eta} + \frac{\eta n k}{2}=\sqrt{2nk\log(k)}.$$
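The three ingredients of this proof can be checked numerically: the pointwise bound $P_{tj}\hat Y_{tj}^2 \le \hat Y_{tj}$, the conditional expectation $\mathbb{E}_{t-1}\left[\sum_j \hat Y_{tj}\right] = \sum_j y_{tj} \le k$, and the value of the bound at the chosen $\eta$. The values of `probs`, `y`, `n` and the helper `loss_estimates` below are hypothetical.

```python
import math

probs = [0.5, 0.3, 0.2]   # hypothetical P_t
y = [0.1, 0.6, 0.9]       # hypothetical losses y_tj = 1 - x_tj
k = len(probs)

def loss_estimates(chosen):
    """Y_hat[j] = 1{A_t = j} * y[j] / P[j]."""
    return [y[j] / probs[j] if j == chosen else 0.0 for j in range(k)]

# Pointwise: P_tj * Y_hat_tj^2 <= Y_hat_tj for every arm j and every A_t.
pointwise_ok = all(
    probs[j] * loss_estimates(a)[j] ** 2 <= loss_estimates(a)[j] + 1e-12
    for a in range(k)
    for j in range(k)
)

# Averaging sum_j Y_hat_tj over A_t ~ P_t gives sum_j y_tj, which is at most k.
expected_sum = sum(probs[a] * sum(loss_estimates(a)) for a in range(k))

# Value of the bound at eta = sqrt(2 log(k) / (n k)).
n = 10_000
eta = math.sqrt(2.0 * math.log(k) / (n * k))
bound = math.log(k) / eta + eta * n * k / 2.0
print(pointwise_ok, expected_sum, bound, math.sqrt(2.0 * n * k * math.log(k)))
```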
1.2.3 Comparison of the two proofs
The two results differ only in which inequality is used to bound the factor $\exp(\eta \hat X_{tj})$ in $\frac{W_t}{W_{t-1}}=\sum_j P_{tj} \exp(\eta \hat X_{tj})$. The resulting bounds are, for the first and second proof respectively:
$$\hat S_{ni} - \hat S_n \le \frac{\log(k)}{\eta} + \eta \sum_{t,j} P_{tj} \hat X_{tj}^2$$
$$\hat S_{ni} - \hat S_n \le \frac{\log(k)}{\eta} + \frac{\eta}{2} \sum_{t,j} \hat Y_{tj}$$
Only the second term differs, and that difference comes entirely from the choice of inequality.
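After tuning $\eta$, the two bounds differ by a constant factor of $\sqrt{2}$, and both are sublinear in $n$. The sketch below tabulates them for a hypothetical `k` and a few hypothetical horizons; the function names are illustrative only.

```python
import math

def bound_first(n, k):
    """Bound from the first proof: 2 * sqrt(n k log k)."""
    return 2.0 * math.sqrt(n * k * math.log(k))

def bound_second(n, k):
    """Bound from the second proof: sqrt(2 n k log k)."""
    return math.sqrt(2.0 * n * k * math.log(k))

k = 10   # hypothetical number of arms
for n in (10**3, 10**5, 10**7):
    b1, b2 = bound_first(n, k), bound_second(n, k)
    # The ratio b1 / b2 is sqrt(2) for every n; b2 / n shrinks as n grows.
    print(n, round(b1, 1), round(b2, 1), round(b1 / b2, 4), b2 / n)
```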
1.3 The Exp3-IX algorithm and its regret analysis
Differences between Exp3 and Exp3-IX:
- E x p 3 Exp3 Exp