Online Learning 6: Experts, Least Squares
1 Setting
1.1 Player, adversary, experts
There is a player and there is an adversary, but now there are also several experts who can help the player make decisions.
Here, instead, what the player does is choose an expert at each time step. The experts do not learn. What the player is learning is which expert is the right one; the experts themselves are frozen.
Player’s selection of action: The player chooses a distribution over experts, $Q_t^{1\times m}$, and multiplies it by the matrix of the experts’ distributions over the $k$ arms, $E_t^{m\times k}$. The product $P_t = Q_t E_t$ is the distribution with which the player plays the $k$ arms.
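This matrix product can be sketched in a few lines (the sizes and the expert distributions below are illustrative assumptions, not values from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 3, 5  # m experts, k arms (illustrative sizes)

Q_t = np.full((1, m), 1.0 / m)           # player's distribution over experts (1 x m)
E_t = rng.dirichlet(np.ones(k), size=m)  # each expert's distribution over arms (m x k)

P_t = Q_t @ E_t  # resulting 1 x k distribution over the arms
# Q_t sums to 1 and every row of E_t sums to 1, so P_t is a valid distribution.
```

Because each factor is row-stochastic, no explicit normalization of $P_t$ is needed.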
1.2 Adversarial setting
- At each time $t$, the adversary chooses a reward for each arm $j$.
- It knows the distribution with which the player is going to play the arms.
- The adversary also knows all the previous sample actions and rewards that the player receives.
- Based on all this, it’s allowed to set a reward for each one of these arms.
- The player has information about its own actions and rewards. Using that distribution as its policy, the player essentially tosses a coin and, depending on how the coin lands, decides to play a particular arm $j$; the player then incurs the reward $x_t$ of that arm at time $t$.
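The coin toss in the last bullet can be sketched as follows (the uniform arm distribution and the random rewards are illustrative assumptions; in the adversarial setting the rewards are chosen by the adversary before the toss):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5
P_t = np.full(k, 1.0 / k)  # arm distribution from Q_t E_t (uniform here for illustration)
x_t = rng.uniform(size=k)  # rewards for all k arms, fixed before the coin toss

A_t = rng.choice(k, p=P_t)  # "toss the coin": sample an arm from P_t
reward = x_t[A_t]           # the player observes only the chosen arm's reward
```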
1.3 Regret
Regret is measured against the best fixed expert in hindsight.
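One common way to write this formally (the exact notation here is an assumption, built from the symbols above):

```latex
R_n = \max_{i \in [m]} \mathbb{E}\!\left[\sum_{t=1}^{n} \langle E_t^{(i)}, x_t \rangle\right]
      - \mathbb{E}\!\left[\sum_{t=1}^{n} x_{t, A_t}\right],
```

where $E_t^{(i)}$ denotes the $i$-th row of $E_t$ (expert $i$'s arm distribution) and $x_t$ is the reward vector at time $t$.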
2 Exp4 (exponential weighting algorithm for exploration and exploitation with experts) algorithm
2.1 Algorithm
- Experts reveal $E_t^{m\times k}$
- Player chooses arm $A_t \sim P_t = Q_t^{1\times m} E_t^{m\times k}$
- Player sees the reward $x_t$
- Do importance sampling for the rewards/losses.
- Update the expert distribution $Q_t$.
Now, we have chosen an expert, who in turn chose an arm, and we got a reward. Using that reward and the importance-sampling estimator, we obtain an expert-level estimate of the reward for every expert. With these expert-level reward estimates in hand, we use them to boost the good experts. We do not actually observe each expert’s reward, so we use the estimator instead, and we know this estimator is unbiased.
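The loop above can be sketched as follows (a sketch under stated assumptions, not the lecture’s exact pseudocode: `reward_fn` and `experts_fn` are assumed callback interfaces, and the update uses exponential weights on the importance-sampled reward estimates):

```python
import numpy as np

def exp4(reward_fn, experts_fn, n, m, k, eta, seed=0):
    """One run of Exp4. experts_fn(t) returns the m x k matrix E_t;
    reward_fn(t) returns the adversary's length-k reward vector x_t in [0, 1]."""
    rng = np.random.default_rng(seed)
    Q = np.full(m, 1.0 / m)        # start with a uniform distribution over experts
    total_reward = 0.0
    for t in range(n):
        E = experts_fn(t)          # experts reveal E_t
        P = Q @ E                  # distribution over arms
        A = rng.choice(k, p=P)     # toss the coin, play an arm
        x = reward_fn(t)
        total_reward += x[A]
        # importance-sampling estimate of the full reward vector:
        # unbiased because E[x_hat[j]] = P[j] * x[j] / P[j] = x[j]
        x_hat = np.zeros(k)
        x_hat[A] = x[A] / P[A]
        y_hat = E @ x_hat          # estimated reward of each expert
        Q = Q * np.exp(eta * y_hat)  # boost experts with high estimated reward
        Q /= Q.sum()
    return total_reward
```

For the learning rate one would plug in the value from the regret analysis below, e.g. `eta = np.sqrt(2 * np.log(m) / (n * k))`.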
2.2 Regret
With $\eta=\sqrt{\frac{2\ln m}{nk}}$, the regret satisfies
$$R_n \leq \sqrt{2nk\ln m}$$
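Plugging illustrative numbers into these two formulas shows how mild the dependence on the number of experts is (the values of $n$, $m$, $k$ below are assumptions for illustration only):

```python
import math

# illustrative values: n rounds, m experts, k arms
n, m, k = 10_000, 10, 5

eta = math.sqrt(2 * math.log(m) / (n * k))   # tuned learning rate
bound = math.sqrt(2 * n * k * math.log(m))   # regret bound R_n <= sqrt(2 n k ln m)
# the bound grows only as sqrt(ln m) in the number of experts,
# but as sqrt(nk) in the horizon and number of arms
```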