Online Learning 7: Linear Bandits
1 Setting (linear contextual bandits)
1.1 Setting
1.2 Model
Feature map
Map the context and arm to a $d$-dimensional vector, then take its dot product with $\theta$ and add some noise to get the reward.
Model
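Following the feature-map description above, the reward model can be written as $r_t = \langle \theta^*, \phi(x_t, a_t) \rangle + \eta_t$. The symbols $\phi$ (feature map), $x_t$ (context), $a_t$ (arm), and $\eta_t$ (zero-mean noise) are notation assumed here for illustration. A minimal simulation sketch, assuming Gaussian noise and a hypothetical feature map that copies the context into the block belonging to the chosen arm:

```python
import numpy as np

def feature_map(context, arm, n_arms):
    """Hypothetical feature map: copy the context into the block of the chosen arm."""
    k = context.shape[0]
    phi = np.zeros(k * n_arms)
    phi[arm * k:(arm + 1) * k] = context
    return phi

def sample_reward(theta_star, phi, noise_std, rng):
    """Noisy linear reward <theta_star, phi> + noise (Gaussian noise is an assumption)."""
    return float(theta_star @ phi + noise_std * rng.normal())

rng = np.random.default_rng(0)
n_arms, k = 3, 2
theta_star = rng.normal(size=n_arms * k)   # unknown to the learner
context = np.array([1.0, -0.5])
phi = feature_map(context, arm=1, n_arms=n_arms)
reward = sample_reward(theta_star, phi, noise_std=0.1, rng=rng)
```

Any other map into $\mathbb{R}^d$ would fit the same model. The block structure is only one convenient choice for a finite number of arms.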
1.3 Regret
2 Linear UCB
2.1 Algorithm
- Given the history $H_{t-1}$, construct a set $C_t \subseteq \mathbb{R}^d$ such that:
  - $\theta^* \in C_t$ w.h.p.
  - $C_t$ gets smaller with increasing $t$.
- Use $C_t$ to form an optimistic estimate of $\theta^*$ through an index for each arm $a \in \mathbb{R}^d$:
  - $\mathrm{UCB}_t(a) = \max_{\theta \in C_t} \langle \theta, a \rangle$
- Play $A_t = \arg\max_{a \in \mathcal{A}_t} \mathrm{UCB}_t(a)$, where $\mathcal{A}_t$ is the set of arms available at round $t$.
The approach used: compute $\hat{\theta}_{t-1}$ with a regularized least-squares estimator, and take $C_t$ to be an ellipsoid centered at $\hat{\theta}_{t-1}$ whose axes depend on the covariance matrix of the estimator.
$C_t$ is the confidence set.
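One common way to write this ellipsoid is sketched below. The symbols $V_{t-1}$ (regularized design matrix), $\lambda > 0$ (regularization), $\beta_t$ (squared radius), and $X_s$ (reward observed at round $s$) are notation assumed here, since the notes do not fix them at this point. With $\|x\|_V = \sqrt{x^\top V x}$, the maximization that defines the index has a closed form by the Cauchy-Schwarz inequality:

$$
\begin{aligned}
V_{t-1} &= \lambda I + \sum_{s=1}^{t-1} A_s A_s^\top, \qquad
\hat{\theta}_{t-1} = V_{t-1}^{-1} \sum_{s=1}^{t-1} A_s X_s, \\
C_t &= \bigl\{ \theta \in \mathbb{R}^d : \|\theta - \hat{\theta}_{t-1}\|_{V_{t-1}}^2 \le \beta_t \bigr\}, \\
\max_{\theta \in C_t} \langle \theta, a \rangle
&= \langle \hat{\theta}_{t-1}, a \rangle + \sqrt{\beta_t}\, \|a\|_{V_{t-1}^{-1}} .
\end{aligned}
$$

The first term is the estimated reward of arm $a$. The second is an exploration bonus that is large along directions that have rarely been played.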
You may have an infinite number of arms, yet the regret does not depend on the number of arms, only on the dimension $d$ of the problem. By generalizing UCB to this setting, every arm we play, whatever its vector, gives information about $\theta^*$, the unknown parameter vector shared by all arms.
The recipe: compute a score for each arm, play according to the score (this is where exploration comes from), build a confidence ellipsoid such that $\theta^*$ lies within it with high probability, and scale the confidence ellipsoid over time.
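The loop of scoring, playing, and updating can be sketched as follows. This is a minimal illustration rather than the exact algorithm from the lecture: the class name `LinUCB`, the constant squared radius `beta`, the regularizer `reg`, and the simulated environment (random unit-norm arms, Gaussian noise) are all assumptions made for the example.

```python
import numpy as np

class LinUCB:
    """Sketch of linear UCB with a ridge estimate and an ellipsoidal confidence set."""

    def __init__(self, dim, reg=1.0, beta=1.0):
        self.V = reg * np.eye(dim)   # regularized design matrix
        self.b = np.zeros(dim)       # running sum of reward * arm vector
        self.beta = beta             # squared ellipsoid radius (held constant here)

    def theta_hat(self):
        # Regularized least-squares estimate: solve V theta = b.
        return np.linalg.solve(self.V, self.b)

    def ucb(self, arms):
        # Index per arm: <theta_hat, a> + sqrt(beta) * ||a||_{V^{-1}}.
        V_inv = np.linalg.inv(self.V)
        width = np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
        return arms @ self.theta_hat() + np.sqrt(self.beta) * width

    def select(self, arms):
        return int(np.argmax(self.ucb(arms)))

    def update(self, arm_vec, reward):
        self.V += np.outer(arm_vec, arm_vec)
        self.b += reward * arm_vec

rng = np.random.default_rng(0)
theta_star = np.array([0.6, -0.2, 0.3])          # hidden from the agent
agent = LinUCB(dim=3)
for t in range(2000):
    arms = rng.normal(size=(10, 3))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)   # unit-norm arm vectors
    i = agent.select(arms)
    reward = arms[i] @ theta_star + 0.1 * rng.normal()
    agent.update(arms[i], reward)
```

In an analysis the radius would typically grow slowly with $t$ instead of staying constant. The constant here only keeps the sketch short.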
Essentially, you scale up the confidence level over time so that you are more and more confident as time progresses, but you still want the confidence ellipsoid itself to shrink over time, because the variances themselves are shrinking over time.
If sigma is extremely small, then three sigma is also extremely small: a larger multiplier on a shrinking standard deviation still gives a small width. So you want an algorithm for which $\theta^*$ lies within the confidence ellipsoid at all times (with high probability), while the ellipsoid shrinks so that the estimate $\hat{\theta}_t$ and $\theta^*$ end up close to each other.
Whenever you see a $V$-norm, your mental model should be: am I inside a confidence ball? Whenever you see a $V^{-1}$-norm, your mental model should be: how much variance is there along the direction $x$?
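Both mental models can be checked numerically. The sketch below assumes $\|x\|_V = \sqrt{x^\top V x}$, a regularizer of 1, and a squared radius of 1. The helper names `v_norm` and `v_inv_norm` are made up for this example.

```python
import numpy as np

def v_norm(x, V):
    """||x||_V = sqrt(x^T V x)."""
    return float(np.sqrt(x @ V @ x))

def v_inv_norm(x, V):
    """||x||_{V^{-1}} = sqrt(x^T V^{-1} x), computed with a linear solve."""
    return float(np.sqrt(x @ np.linalg.solve(V, x)))

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
V = np.eye(2)                      # regularizer only (lambda = 1 assumed)
for _ in range(99):
    V += np.outer(e1, e1)          # play direction e1 repeatedly, never e2

width_e1 = v_inv_norm(e1, V)       # well-explored direction: small uncertainty
width_e2 = v_inv_norm(e2, V)       # unexplored direction: large uncertainty

theta_hat = np.zeros(2)
theta = np.array([0.05, 0.5])
# V-norm as a membership test: is theta inside the ellipsoid of squared radius 1?
inside = v_norm(theta - theta_hat, V) ** 2 <= 1.0
```

Here `V` ends up as `diag(100, 1)`, so the $V^{-1}$-norm is 0.1 along `e1` and 1.0 along `e2`. The ellipsoid is therefore narrow along `e1`: a deviation of 0.05 in that direction uses as much of the radius budget as a deviation of 0.5 along `e2`.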
2.2 Regret