Online Learning 6: Experts, Least squares

1 Setting

1.1 Player, adversary, experts

There is a player and there is an adversary, but now there are also several experts who can help the player make decisions.

Here, what the player does instead is choose an expert at each time step. The experts themselves do not learn: what the player learns is which expert is the right one, while the experts stay frozen.

Player’s selection of action: the player chooses a distribution over experts, $Q_t \in \mathbb{R}^{1\times m}$, and multiplies it by the experts’ distributions over arms, $E_t \in \mathbb{R}^{m\times k}$. The product $P_t = Q_t E_t$ is the distribution with which the player plays the $k$ arms.
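As a minimal numerical sketch of this mixing step (the numbers here are purely hypothetical):

```python
import numpy as np

# Hypothetical example with m = 2 experts and k = 3 arms.
# Each row of E_t is one expert's distribution over the k arms.
E_t = np.array([[0.8, 0.1, 0.1],    # expert 1 strongly favors arm 0
                [0.2, 0.3, 0.5]])   # expert 2 leans toward arm 2

# Q_t is the player's current distribution over the m experts.
Q_t = np.array([0.6, 0.4])

# P_t = Q_t E_t is the induced distribution over the k arms.
P_t = Q_t @ E_t
print(P_t)          # [0.56 0.18 0.26]
print(P_t.sum())    # 1.0 -- a mixture of distributions is again a distribution
```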

1.2 Adversarial setting

  1. At each time $t$, the adversary chooses a reward for each arm $j$.
    • It knows the distribution with which the player is going to play the arms.
    • It also knows all of the previous actions the player took and the rewards the player received.
    • Based on all of this, it is allowed to set a reward for each arm.
  2. The player only has information about its own actions and rewards. Using its policy, i.e. the distribution $P_t$, the player essentially tosses a coin and, depending on how the coin lands, plays a particular arm $j$; in that case the player receives the reward $x_{t,j}$ set for that arm at time $t$ (a one-round sketch of this interaction follows this list).
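A one-round sketch of this interaction, with a toy adversary (the rule it uses is hypothetical and only meant to show the order of events):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3

def adversary(P_t):
    # Hypothetical rule: pay 0 on the arm the player currently favors most,
    # and 1 on every other arm. The adversary sees P_t but not A_t.
    x_t = np.ones(k)
    x_t[np.argmax(P_t)] = 0.0
    return x_t

P_t = np.array([0.56, 0.18, 0.26])   # induced arm distribution (as in the snippet above)
x_t = adversary(P_t)                 # rewards are fixed before the "coin toss"
A_t = rng.choice(k, p=P_t)           # player samples which arm to play
reward = x_t[A_t]                    # the only reward the player observes
print(A_t, reward)
```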

1.3 Regret

Regret is calculated with respect to the best fixed expert in hindsight.
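In symbols, with the notation from Section 1.1 (this is one standard way to write it; the exact placement of the expectation varies between texts), the regret after $n$ rounds is

$$R_n \;=\; \max_{i\in[m]} \, \mathbb{E}\!\left[\sum_{t=1}^{n} E_t^{(i)} x_t\right] \;-\; \mathbb{E}\!\left[\sum_{t=1}^{n} x_{t,A_t}\right],$$

where $E_t^{(i)}$ is the $i$-th row of $E_t$ (expert $i$’s distribution over arms) and $x_t$ is the vector of rewards at round $t$.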

2 The Exp4 algorithm (exponential weighting for exploration and exploitation with experts)

2.1 Algorithm

  • Experts reveal $E_t \in \mathbb{R}^{m\times k}$
  • Player samples an arm $A_t \sim P_t = Q_t E_t$, with $Q_t \in \mathbb{R}^{1\times m}$
  • Player sees the reward $x_{t,A_t}$
  • Do importance sampling to estimate the rewards/losses of all arms
  • Update the expert distribution $Q_t$

Now, we have chosen an expert, who in turn chose an arm, and we got a reward. Using that reward and the importance-sampling estimation, we obtain an expert-level estimate of the reward for every expert. With these expert-level estimates, we boost the weights of the good experts. We do not actually observe each expert’s reward, so we use the estimator instead, and we know this estimator is unbiased.
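A minimal sketch of these steps (reward version, using the simple importance-weighted reward estimator; `experts` and `rewards` are hypothetical callables standing in for the environment):

```python
import numpy as np

def exp4(n, m, k, experts, rewards, eta, seed=0):
    """Minimal Exp4 sketch.

    experts(t)      -> E_t, an (m, k) array whose rows are the experts'
                       distributions over the k arms at round t.
    rewards(t, P_t) -> x_t, a length-k reward vector in [0, 1]; the adversary
                       may look at P_t (and the past) but not at A_t.
    """
    rng = np.random.default_rng(seed)
    Q = np.full(m, 1.0 / m)            # start from a uniform distribution over experts
    total_reward = 0.0

    for t in range(n):
        E_t = experts(t)               # experts reveal their advice
        P_t = Q @ E_t                  # induced distribution over arms
        x_t = rewards(t, P_t)          # adversary fixes a reward for every arm
        A_t = rng.choice(k, p=P_t)     # player samples an arm from P_t
        total_reward += x_t[A_t]       # only x_t[A_t] is actually observed

        # Importance-weighted estimate of the full reward vector:
        # unbiased because E[1{A_t = j} / P_t[j]] = 1 for every arm j.
        x_hat = np.zeros(k)
        x_hat[A_t] = x_t[A_t] / P_t[A_t]

        # Expert-level reward estimates, then exponential-weight update.
        y_hat = E_t @ x_hat
        Q = Q * np.exp(eta * y_hat)
        Q /= Q.sum()

    return total_reward
```

Here $\eta$ would be set as in Section 2.2 below, e.g. `eta = np.sqrt(2 * np.log(m) / (n * k))`.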

2.2 Regret

Choosing $\eta = \sqrt{\frac{2\ln m}{nk}}$, the regret satisfies

$$R_n \le \sqrt{2nk\ln m}.$$
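As a sketch of where this choice of $\eta$ comes from (assuming the standard Exp4 analysis, which bounds the regret by an exploration term plus an estimation term and then optimizes over $\eta$):

$$R_n \;\le\; \frac{\ln m}{\eta} + \frac{\eta \, n k}{2},
\qquad
\eta = \sqrt{\frac{2\ln m}{nk}}
\;\Longrightarrow\;
R_n \;\le\; \sqrt{2nk\ln m}.$$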
