The Epsilon-Greedy Algorithm

The epsilon-greedy algorithm (often written using the actual Greek letter epsilon, as in the image below), is very simple and occurs in several areas of machine learning. One common use of epsilon-greedy is in the so-called multi-armed bandit problem.

Suppose you are standing in front of k = 3 slot machines. Each machine pays out according to a different probability distribution, and these distributions are unknown to you. And suppose you can play a total of 100 times.

You have two goals. The first goal is to experiment with a few coins to try and determine which machine pays out the best. The second, related, goal is to get as much money as possible. The terms “explore” and “exploit” are used to indicate that you have to use some coins to explore to find the best machine, and you want to use as many coins as possible on the best machine to exploit your knowledge.

Epsilon-greedy is almost too simple. As you play the machines, you keep track of the average payout of each machine. Then, you select the machine with the highest current average payout with probability = (1 – epsilon) + (epsilon / k) where epsilon is a small value like 0.10. And you select machines that don’t have the highest current payout average with probability = epsilon / k.

It much easier to understand with a concrete example. Suppose, after your first 12 pulls, you played machine #1 four times and won $1 two times and $0 two times. The average for machine #1 is $2/4 = $0.50.

And suppose you’ve played machine #2 five times and won $1 three times and $0 two times. The average payout for machine #2 is $3/5 = $0.60.

And suppose you’ve played machine #3 three times and won $1 one time and $0 two times. The average payout for machine #3 is $1/3 = $0.33.

Now you have to select a machine to play on try number 13. You generate a random number p, between 0.0 and 1.0. Suppose you have set epsilon = 0.10. If p > 0.10 (which it will be 90% of the time), you select machine #2 because it has the current highest average payout. But if p < 0.10 (which it will be only 10% of the time), you select a random machine, so each machine has a 1/3 chance of being selected.

Notice that machine #2 might get picked anyway because you select randomly from all machines.

Over time, the best machine will be played more and more often because it will pay out more often. In short, epsilon-greedy means pick the current best option ("greedy") most of the time, but pick a random option with a small (epsilon) probability sometimes.

There are many other algorithms for the multi-armed bandit problem. But epsilon-greedy is incredibly simple, and often works as well as, or even better than, more sophisticated algorithms such as UCB ("upper confidence bound") variations.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
1. Softmax算法: Softmax算法是一种基于概率的多臂老虎机算法。它的本质是将每个臂的平均奖励值转化为概率分布,选择最大概率的臂进行探索。 算法步骤: - 初始化每个臂的计数器和平均奖励值为0 - 对于每一轮,计算每个臂的概率分布,即p(i)=exp(q(i)/tau)/sum(exp(q(j)/tau)) - 从概率分布中选择一个臂i - 执行臂i并得到奖励r(i) - 更新臂i的计数器和平均奖励值 2. Epsilon-Greedy算法: Epsilon-Greedy算法是一种简单的多臂老虎机算法。它以1-epsilon的概率选择当前平均奖励值最高的臂,以epsilon的概率随机选择一个臂进行探索。 算法步骤: - 初始化每个臂的计数器和平均奖励值为0 - 对于每一轮,以1-epsilon的概率选择当前平均奖励值最高的臂,以epsilon的概率随机选择一个臂 - 执行选择的臂并得到奖励r(i) - 更新臂i的计数器和平均奖励值 3. BetaThompson sampling算法: BetaThompson sampling算法是一种贝叶斯多臂老虎机算法。它根据每个臂的奖励概率分布,使用贝叶斯推断方法计算每个臂被选择的概率。 算法步骤: - 初始化每个臂的计数器和奖励计数器为0 - 对于每一轮,计算每个臂的奖励概率分布,并从中抽取一个值作为当前臂的奖励概率 - 选择奖励概率最高的臂i - 执行臂i并得到奖励r(i) - 更新臂i的计数器和奖励计数器 4. UCB算法: UCB算法是一种基于置信区间的多臂老虎机算法。它使用置信区间作为选择臂的依据,同时平衡探索和利用的策略。 算法步骤: - 初始化每个臂的计数器和平均奖励值为0 - 对于每一轮,计算每个臂的置信区间上界,即UCB(i)=q(i)+c*sqrt(ln(t)/N(i)) - 选择置信区间上界最大的臂i - 执行臂i并得到奖励r(i) - 更新臂i的计数器和平均奖励值 5. LinUCB算法: LinUCB算法是一种基于线性模型的多臂老虎机算法。它使用线性回归模型来估计每个臂的奖励值,并使用置信区间来选择臂。 算法步骤: - 初始化每个臂的特征向量和奖励计数器为0 - 对于每一轮,计算每个臂的置信区间上界,即UCB(i)=theta(i)*x(t)+c*sqrt(x(t)T*A(i)^-1*x(t)) - 选择置信区间上界最大的臂i - 执行臂i并得到奖励r(i) - 更新臂i的特征向量和奖励计数器,并更新线性回归模型的参数
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值