Online Learning 7: Linear Bandits

1 Setting (linear contextual bandits)

1.1 Setting


1.2 Model

Feature map

Map each context-arm pair to a $d$-dimensional feature vector. The reward is the inner product of that vector with the unknown parameter $\theta$, plus noise.
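As a rough illustration of this reward model, here is a minimal Python sketch. The particular `feature_map`, the value of `theta_star`, the dimension `d = 3`, and the Gaussian noise are all assumptions made up for the example; the notes do not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                      # feature dimension (assumed)
theta_star = np.array([0.5, -0.2, 0.1])    # hypothetical unknown parameter

def feature_map(context, arm):
    """Hypothetical feature map: scale the context by (arm + 1).

    Any map from (context, arm) into R^d would do; this one is only
    for illustration.
    """
    return context * (arm + 1)

def reward(context, arm, noise_std=0.1):
    """Linear reward: <theta_star, phi(context, arm)> plus Gaussian noise."""
    phi = feature_map(context, arm)
    return float(phi @ theta_star + noise_std * rng.standard_normal())

context = np.array([1.0, 2.0, 3.0])
phi = feature_map(context, arm=1)          # feature vector of the chosen arm
mean_reward = float(phi @ theta_star)      # noiseless expected reward
noisy_reward = reward(context, arm=1)      # what the learner actually observes
```

The learner only ever sees `noisy_reward`; `theta_star` stays hidden and has to be estimated from the observed feature-reward pairs.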

Model


1.3 Regret


2 Linear UCB

2.1 Algorithm

  1. Given the history $H_{t-1}$, construct a set $C_t \subset \mathbb{R}^d$ such that:
    • $\theta^* \in C_t$ w.h.p.
    • $C_t$ gets smaller as $t$ increases.
  2. Use $C_t$ to build an optimistic estimate of $\theta^*$, expressed as an index for each arm $a \in \mathbb{R}^d$:
    • $\mathrm{UCB}_t(a) = \max_{\theta \in C_t} \langle \theta, a \rangle$
  3. Play $A_t = \arg\max_{a \in \mathcal{A}_t} \mathrm{UCB}_t(a)$, where $\mathcal{A}_t$ is the set of arms available in round $t$.

The approach used: compute $\hat{\theta}_{t-1}$ with a regularized least-squares estimator, and take the confidence set to be an ellipsoid centered at $\hat{\theta}_{t-1}$ whose axes depend on the covariance matrix of the estimator.

$C_t$ is the confidence set.
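For an ellipsoidal confidence set, the maximization in the index has a closed form. The following is a sketch under assumed notation: $V_{t-1}$ for the regularized Gram matrix, $\lambda > 0$ for the regularizer, $X_s$ for the reward observed in round $s$, and $\beta_t$ for the squared radius. None of these symbols are fixed by the text above.

$$
V_{t-1} = \lambda I + \sum_{s=1}^{t-1} A_s A_s^\top, \qquad
\hat{\theta}_{t-1} = V_{t-1}^{-1} \sum_{s=1}^{t-1} A_s X_s
$$

$$
C_t = \left\{ \theta \in \mathbb{R}^d : \|\theta - \hat{\theta}_{t-1}\|_{V_{t-1}} \le \sqrt{\beta_t} \right\},
\qquad \|x\|_M = \sqrt{x^\top M x}
$$

$$
\max_{\theta \in C_t} \langle \theta, a \rangle
= \langle \hat{\theta}_{t-1}, a \rangle + \sqrt{\beta_t}\, \|a\|_{V_{t-1}^{-1}}
$$

The last equality follows from the Cauchy-Schwarz inequality applied to $\langle \theta - \hat{\theta}_{t-1}, a \rangle = \langle V_{t-1}^{1/2}(\theta - \hat{\theta}_{t-1}), V_{t-1}^{-1/2} a \rangle$. The first term is the estimated reward of the arm and the second is an exploration bonus that is large in directions that have been played rarely.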

There may be infinitely many arms, yet the regret does not depend on the number of arms, only on the dimension $d$ of the problem. Generalizing UCB to this setting works because the reward of any particular arm vector carries information about $\theta^*$, the unknown parameter vector shared by all arms.

The recipe: compute a score for each arm, play according to the score, explore, and build a confidence ellipsoid constructed so that $\theta^*$ lies inside it with high probability, raising the confidence level over time.

Raising the confidence level over time makes you more and more confident as time progresses. Nevertheless, you want the confidence ellipsoid itself to shrink, and it can, because the variances of the estimate are shrinking over time.

If sigma is extremely small, then three sigma is also extremely small. So the goal is an algorithm for which $\theta^*$ stays within the confidence ellipsoid while the ellipsoid keeps shrinking, so that the estimate and $\theta^*$ end up close to each other.
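The recipe above (score each arm, play the best score, update the ellipsoid) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's exact algorithm: the function name `lin_ucb`, the fixed finite arm set, the Gaussian noise, and in particular the constant squared radius `beta` are assumptions (the theory uses a radius that grows slowly with $t$).

```python
import numpy as np

def lin_ucb(arms, theta_star, n_rounds=2000, lam=1.0, beta=4.0,
            noise_std=0.1, seed=0):
    """Minimal LinUCB sketch on a fixed finite arm set (rows of `arms`).

    `beta` is a constant squared confidence radius (an assumption made
    for simplicity). `theta_star` is only used to simulate rewards.
    """
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    V = lam * np.eye(d)                  # regularized Gram matrix
    b = np.zeros(d)                      # running sum of reward * arm
    counts = np.zeros(len(arms), dtype=int)
    for _ in range(n_rounds):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b            # regularized least-squares estimate
        # widths[i] = ||a_i||_{V^{-1}}: uncertainty along arm i's direction
        widths = np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
        ucb = arms @ theta_hat + np.sqrt(beta) * widths
        i = int(np.argmax(ucb))          # optimistic choice
        x = arms[i] @ theta_star + noise_std * rng.standard_normal()
        V += np.outer(arms[i], arms[i])  # the ellipsoid shrinks along arms[i]
        b += x * arms[i]
        counts[i] += 1
    return np.linalg.solve(V, b), counts

arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
theta_star = np.array([1.0, 0.2])        # hypothetical true parameter
theta_hat, counts = lin_ucb(arms, theta_star)
```

In this toy instance the first arm has the highest expected reward, so `counts` should concentrate on it once the bonus terms of the other two arms have shrunk.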

Whenever you see a $V$-norm, your mental model should be: am I inside the confidence ball? Whenever you see a $V^{-1}$-norm, your mental model should be: how much variance is there along the direction $x$?
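A small numerical check can make these two mental models concrete. The helper names `v_norm`, `v_inv_norm` and `in_confidence_set`, and the choice of playing one direction 99 times, are made up for this sketch.

```python
import numpy as np

def v_norm(x, V):
    """||x||_V = sqrt(x^T V x)."""
    return float(np.sqrt(x @ V @ x))

def v_inv_norm(x, V):
    """||x||_{V^{-1}} = sqrt(x^T V^{-1} x), computed without an explicit inverse."""
    return float(np.sqrt(x @ np.linalg.solve(V, x)))

def in_confidence_set(theta, theta_hat, V, beta):
    """V-norm question: is theta inside the ellipsoid of squared radius beta?"""
    return v_norm(theta - theta_hat, V) <= np.sqrt(beta)

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

V = np.eye(2)                  # regularizer only, nothing played yet
for _ in range(99):
    V += np.outer(e1, e1)      # play direction e1 repeatedly

width_e1 = v_inv_norm(e1, V)   # small: e1 has been explored a lot
width_e2 = v_inv_norm(e2, V)   # still large: e2 was never played

# The ellipsoid is tight along e1 and loose along e2.
inside = in_confidence_set(np.array([0.05, 0.5]), np.zeros(2), V, beta=1.0)
outside = in_confidence_set(np.array([0.2, 0.0]), np.zeros(2), V, beta=1.0)
```

Here `V` ends up as `diag(100, 1)`, so the $V^{-1}$-norm of `e1` is ten times smaller than that of `e2`, and a deviation of 0.2 along `e1` already leaves the ellipsoid while a deviation of 0.5 along `e2` does not.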

2.2 Regret

