Online Learning 7: Linear Bandits

xiwang_chn

Published 2021-05-17 03:39:59

Category: Online Learning

Article link: https://blog.csdn.net/weixin_42017454/article/details/116630028

1 Setting (linear contextual bandits)
2 Linear UCB
- 2.1 Algorithm
- 2.2 Regret

1 Setting (linear contextual bandits)

1.1 Setting

[Figure omitted: image not preserved in the source]

1.2 Model

Feature map

Map each (context, arm) pair to a $d$-dimensional feature vector; the reward is the inner product of that vector with the unknown parameter $\theta^*$, plus some noise.
[Figure omitted: image not preserved in the source]
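A minimal sketch of this reward model follows. The feature map `phi` (which copies the context into a per-arm block), the Gaussian noise, and the names `theta_star` and `noise_std` are illustrative assumptions; the post does not specify any of them.

```python
import numpy as np


def phi(context, arm, n_arms):
    """Hypothetical feature map: copy the context into the block that belongs to `arm`."""
    m = context.shape[0]
    features = np.zeros(m * n_arms)
    features[arm * m:(arm + 1) * m] = context
    return features


def sample_reward(context, arm, theta_star, n_arms, noise_std, rng):
    """Reward = <theta_star, phi(context, arm)> + zero-mean Gaussian noise (assumed)."""
    mean = theta_star @ phi(context, arm, n_arms)
    return mean + noise_std * rng.normal()


rng = np.random.default_rng(0)
n_arms, m = 3, 2                          # d = m * n_arms = 6
theta_star = rng.normal(size=m * n_arms)  # unknown to the learner
context = np.array([1.0, -0.5])
reward = sample_reward(context, 1, theta_star, n_arms, noise_std=0.1, rng=rng)
```

With this choice of `phi`, each arm effectively has its own slice of $\theta^*$; other feature maps let arms share parameters.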

Model

[Figure omitted: image not preserved in the source]

1.3 Regret

[Figure omitted: image not preserved in the source]

2 Linear UCB

2.1 Algorithm

Given the history $H_{t-1}$, construct a confidence set $C_t \subseteq \mathbb{R}^d$ such that:
- $\theta^* \in C_t$ with high probability (w.h.p.).
- $C_t$ shrinks as $t$ increases.
Use $C_t$ to form an optimistic estimate of the reward of each arm $a\in\mathbb{R}^d$ through the index
- $\mathrm{UCB}_t(a)=\max_{\theta\in C_t}\langle\theta,a\rangle$
Play $A_t=\arg\max_{a\in\mathcal{A}_t}\mathrm{UCB}_t(a)$, where $\mathcal{A}_t$ is the set of arms available in round $t$.

The approach used: compute $\hat{\theta}_{t-1}$ with a regularized least-squares estimator, and take an ellipsoid centered at $\hat{\theta}_{t-1}$ whose axes depend on the covariance matrix of the estimator.
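With an ellipsoidal set, the maximization in the index can be evaluated without searching over $C_t$. As a sketch, assume the ellipsoid has the standard form $C_t=\{\theta:\|\theta-\hat{\theta}_{t-1}\|_{V_{t-1}}\le\sqrt{\beta_t}\}$ with $V_{t-1}=\lambda I+\sum_{s<t}A_sA_s^\top$; the symbols $\lambda$, $\beta_t$ and this exact form are assumptions, since the post does not write them out. Substituting $\theta=\hat{\theta}_{t-1}+V_{t-1}^{-1/2}u$ with $\|u\|_2\le\sqrt{\beta_t}$ and applying Cauchy-Schwarz gives

$$
\max_{\theta\in C_t}\langle\theta,a\rangle
=\langle\hat{\theta}_{t-1},a\rangle+\max_{\|u\|_2\le\sqrt{\beta_t}}\langle u,V_{t-1}^{-1/2}a\rangle
=\langle\hat{\theta}_{t-1},a\rangle+\sqrt{\beta_t}\,\|a\|_{V_{t-1}^{-1}},
\qquad \|a\|_{V^{-1}}=\sqrt{a^\top V^{-1}a}.
$$

The first term is the estimated reward and the second is an exploration bonus that is large in directions that have been played rarely.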

$C_t$ is the confidence set.
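The procedure can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the post's exact algorithm: the class name `LinUCB`, the ridge parameter `lam`, and a squared radius `beta` that stays constant are all hypothetical choices (the analysis typically lets the radius grow slowly with $t$), and the index uses the closed form $\langle\hat{\theta},a\rangle+\sqrt{\beta}\,\sqrt{a^\top V^{-1}a}$ for an ellipsoidal confidence set.

```python
import numpy as np


class LinUCB:
    """Sketch of linear UCB with a ridge estimator and an ellipsoidal confidence set."""

    def __init__(self, d, lam=1.0, beta=1.0):
        self.V = lam * np.eye(d)  # regularized Gram matrix: lam*I + sum of a a^T
        self.b = np.zeros(d)      # sum of reward * action
        self.beta = beta          # squared ellipsoid radius (constant here: an assumption)

    def theta_hat(self):
        """Regularized least-squares estimate, i.e. the solution of V theta = b."""
        return np.linalg.solve(self.V, self.b)

    def ucb(self, actions):
        """UCB index for each row of `actions`, an array of shape (K, d)."""
        V_inv_a = np.linalg.solve(self.V, actions.T)               # shape (d, K)
        widths = np.sqrt(np.einsum("kd,dk->k", actions, V_inv_a))  # sqrt(a^T V^-1 a)
        return actions @ self.theta_hat() + np.sqrt(self.beta) * widths

    def select(self, actions):
        return int(np.argmax(self.ucb(actions)))

    def update(self, action, reward):
        self.V += np.outer(action, action)
        self.b += reward * action


# Toy run: 10 random unit-norm arms per round, Gaussian noise (both assumed).
rng = np.random.default_rng(0)
d = 3
theta_star = np.array([0.5, -0.2, 0.8])
agent = LinUCB(d)
for t in range(500):
    actions = rng.normal(size=(10, d))
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)
    a = actions[agent.select(actions)]
    r = a @ theta_star + 0.1 * rng.normal()
    agent.update(a, r)
```

Solving linear systems with `np.linalg.solve` avoids forming $V^{-1}$ explicitly; for large $d$ a rank-one update of the inverse would be cheaper per round.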

You may have an infinite number of arms, yet the regret does not depend on the number of arms, only on the dimension of the problem. By generalizing UCB to this setting, every vector we play gives information about $\theta^*$, the unknown parameter vector, and therefore about the reward of every other arm as well.

The recipe: compute a score for each arm, play according to the score, explore, and build a confidence ellipsoid constructed so that $\theta^*$ always lies within it, scaling up the confidence level over time.

That is, raise the confidence level over time so that you are more and more confident as time progresses. Nevertheless, the confidence ellipsoid itself should shrink over time, because the variances themselves shrink over time.

If sigma is extremely small, three sigma is also extremely small. So you want an algorithm such that $\theta^*$ always lies within the confidence ellipsoid while the ellipsoid keeps shrinking, so that the estimate $\hat{\theta}_{t-1}$ and $\theta^*$ end up close to each other.

Whenever you see a $V$-norm, your mental model should be: am I inside the confidence ball? Whenever you see a $V^{-1}$-norm, your mental model should be: how much variance is there along the direction $x$?
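A small numerical illustration of the two norms, assuming $V=\lambda I+\sum_s x_sx_s^\top$ with $\lambda=1$ (the helper names `v_norm` and `v_inv_norm` are hypothetical). After playing the direction $e_1$ 99 times and never playing $e_2$, the $V^{-1}$-norm is small along $e_1$ and unchanged along $e_2$:

```python
import numpy as np


def v_norm(z, V):
    """||z||_V = sqrt(z^T V z): distance used to test membership in the ellipsoid."""
    return float(np.sqrt(z @ V @ z))


def v_inv_norm(x, V):
    """||x||_{V^-1} = sqrt(x^T V^-1 x): remaining uncertainty along direction x."""
    return float(np.sqrt(x @ np.linalg.solve(V, x)))


e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

V = np.eye(2)                # lambda = 1 (assumed)
for _ in range(99):
    V += np.outer(e1, e1)    # e1 observed 99 times, so V = diag(100, 1)

width_e1 = v_inv_norm(e1, V)  # 1 / sqrt(100): well-explored direction
width_e2 = v_inv_norm(e2, V)  # 1 / sqrt(1): unexplored direction
error_cost_e1 = v_norm(e1, V) # sqrt(100): an estimation error along e1 is penalized heavily
```

The same error vector is therefore "far" in $V$-norm along well-explored directions, which is why the ellipsoid is thin there.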

2.2 Regret

[Figure omitted: image not preserved in the source]
