Online Learning 6: Experts, Least Squares
1 Setting
1.1 Player, adversary, experts
There is a player and there is an adversary, but now there are also several experts who can help the player make decisions.
Here, instead, what the player does is choose an expert at each time step. The experts do not learn. What the player is learning is which expert is the right one; the experts themselves are frozen.
Player’s selection of action: The player chooses a distribution over experts, $Q_t^{1\times m}$, and multiplies it by the matrix of the experts’ distributions over the $k$ arms, $E_t^{m\times k}$. The product $P_t = Q_t E_t$ is the distribution with which the player plays the $k$ arms.
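This matrix product can be sketched in a few lines (the sizes and the expert distributions below are illustrative assumptions, not values from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 3, 5  # m experts, k arms (illustrative sizes)

Q_t = np.full((1, m), 1.0 / m)           # player's distribution over experts (1 x m)
E_t = rng.dirichlet(np.ones(k), size=m)  # each expert's distribution over arms (m x k)

P_t = Q_t @ E_t  # resulting 1 x k distribution over the arms
# Q_t sums to 1 and every row of E_t sums to 1, so P_t is a valid distribution.
```

Because each factor is row-stochastic, no explicit normalization of $P_t$ is needed.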
1.2 Adversarial setting
- At each time $t$, the adversary chooses a reward for each arm $j$.
- It knows the distribution with which the player is going to play the arms.
- The adversary also knows all the previous sample actions and rewards that the player receives.
- Based on all this, it’s allowed to set a reward for each one of these arms.
- The player has information about its own actions and rewards. Using that distribution as its policy, the player essentially tosses a coin and, depending on how the coin lands, decides to play a particular arm $j$; the player then incurs the reward $x_t$ of that arm at time $t$.
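The coin toss in the last bullet can be sketched as follows (the uniform arm distribution and the random rewards are illustrative assumptions; in the adversarial setting the rewards are chosen by the adversary before the toss):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5
P_t = np.full(k, 1.0 / k)  # arm distribution from Q_t E_t (uniform here for illustration)
x_t = rng.uniform(size=k)  # rewards for all k arms, fixed before the coin toss

A_t = rng.choice(k, p=P_t)  # "toss the coin": sample an arm from P_t
reward = x_t[A_t]           # the player observes only the chosen arm's reward
```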
1.3 Regret
Regret is measured against the best fixed expert in hindsight.
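One common way to write this formally (the exact notation here is an assumption, built from the symbols above):

```latex
R_n = \max_{i \in [m]} \mathbb{E}\!\left[\sum_{t=1}^{n} \langle E_t^{(i)}, x_t \rangle\right]
      - \mathbb{E}\!\left[\sum_{t=1}^{n} x_{t, A_t}\right],
```

where $E_t^{(i)}$ denotes the $i$-th row of $E_t$ (expert $i$'s arm distribution) and $x_t$ is the reward vector at time $t$.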
2 Exp4 (exponential weighting algorithm for exploration and exploitation with experts) algorithm
2.1 Algorithm
- Experts reveal $E_t^{m\times k}$
- Player chooses arm $A_t \sim P_t = Q_t^{1\times m} E_t^{m\times k}$
- Player sees the reward $x_t$
- Do importance sampling for the rewards/losses.
- Update the expert distribution $Q_t$.
Now, we have chosen an expert, who in turn chose an arm, and we got a reward. Using that reward and the importance-sampling estimator, we obtain an expert-level estimate of the reward for every expert. With these expert-level reward estimates in hand, we use them to boost the good experts. We do not actually observe each expert’s reward, so we use the estimator instead, and we know this estimator is unbiased.
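The loop above can be sketched as follows (a sketch under stated assumptions, not the lecture’s exact pseudocode: `reward_fn` and `experts_fn` are assumed callback interfaces, and the update uses exponential weights on the importance-sampled reward estimates):

```python
import numpy as np

def exp4(reward_fn, experts_fn, n, m, k, eta, seed=0):
    """One run of Exp4. experts_fn(t) returns the m x k matrix E_t;
    reward_fn(t) returns the adversary's length-k reward vector x_t in [0, 1]."""
    rng = np.random.default_rng(seed)
    Q = np.full(m, 1.0 / m)        # start with a uniform distribution over experts
    total_reward = 0.0
    for t in range(n):
        E = experts_fn(t)          # experts reveal E_t
        P = Q @ E                  # distribution over arms
        A = rng.choice(k, p=P)     # toss the coin, play an arm
        x = reward_fn(t)
        total_reward += x[A]
        # importance-sampling estimate of the full reward vector:
        # unbiased because E[x_hat[j]] = P[j] * x[j] / P[j] = x[j]
        x_hat = np.zeros(k)
        x_hat[A] = x[A] / P[A]
        y_hat = E @ x_hat          # estimated reward of each expert
        Q = Q * np.exp(eta * y_hat)  # boost experts with high estimated reward
        Q /= Q.sum()
    return total_reward
```

For the learning rate one would plug in the value from the regret analysis below, e.g. `eta = np.sqrt(2 * np.log(m) / (n * k))`.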
2.2 Regret
With $\eta=\sqrt{\frac{2\ln m}{nk}}$, the regret satisfies
$$R_n \leq \sqrt{2nk\ln m}$$
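Plugging illustrative numbers into these two formulas shows how mild the dependence on the number of experts is (the values of $n$, $m$, $k$ below are assumptions for illustration only):

```python
import math

# illustrative values: n rounds, m experts, k arms
n, m, k = 10_000, 10, 5

eta = math.sqrt(2 * math.log(m) / (n * k))   # tuned learning rate
bound = math.sqrt(2 * n * k * math.log(m))   # regret bound R_n <= sqrt(2 n k ln m)
# the bound grows only as sqrt(ln m) in the number of experts,
# but as sqrt(nk) in the horizon and number of arms
```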