The K-Bandit Problem with Reinforcement Learning

Basic Problems in Reinforcement Learning

A very important part of reinforcement learning is how to evaluate the actions an agent performs. In this post we will use the K-bandit problem to show different ways of evaluating these actions. It’s important to keep in mind that the K-bandit problem is just a simplified version of many reinforcement learning situations: it is simple because each action the agent performs is evaluated individually, unlike other techniques where the evaluation is done over a sequence of actions.

The K-bandit Problem

Imagine that you are in a casino surrounded by 4 different slot machines. Each machine gives you a different prize under a different probability law. Your job is to maximize the prize those machines give you and to discover which machine gives the best prizes. You will do this by experimenting and playing on the machines a thousand times! The prizes and the probability of getting them don’t change over time.

What strategy would you use to maximize the prize and discover the best machine? Maybe a first approach would be to play each machine an equal number of times (250 each). However, wouldn’t it be better to play more on the machine that is giving us the better prizes? And what if we choose the wrong machine?

It’s because of these questions that we need to balance our actions. Sometimes it will be better to explore (try playing on different machines) and sometimes it will be better to exploit our knowledge (play on the machine that we currently think is best).

Before we continue to solve these questions, I will introduce some notation that will be used throughout the problem:

  • Aₜ = the action performed at time t (in the example, the machine we choose to play at time t).
  • Rₜ = the prize obtained at step t.
  • q∗(a) = the expected prize when performing action a; mathematically this is:
$$q_*(a) = \mathbb{E}\left[\, R_t \mid A_t = a \,\right]$$

The expected prize q∗(a) is the most important value in the problem, because if we knew its true value we would always know which machine to play. In practice this value is unknown, which is why we need to explore (play) the different machines to estimate it. As we proceed in time we will get a better and better approximation of the expected value (if we could play infinitely many times we would obtain the exact value). The notation used for the approximation is Qₜ(a): the estimate of the expected value of action a at time t.

Estimating the Expected Values of the Agent’s Actions

There are many ways to estimate the expected values of the actions, Qₜ(a). Here we will use the approach that, to me, is the most natural: sum all the prizes that were obtained by performing a certain action (this goes in the numerator) and divide by the number of times that action was performed:

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}(A_i = a)}{\sum_{i=1}^{t-1} \mathbb{1}(A_i = a)}$$

With the help of this equation, and as we advance in time, the value of Qₜ(a) will get closer and closer to q∗(a), but it will still be important to balance exploring and exploiting the set of possible actions. In this article the ε-greedy method will be used to explore. It is a very simple method: we choose the option currently estimated to be the best, but with a certain probability (ε) we instead try an action at random. So if ε=.1 is used, that means that over a thousand time steps, roughly 900 times the best-looking option will be chosen (exploitation) and roughly 100 times exploration will be done.

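A useful detail for the implementation further down: this sample average does not need to store every prize. If Qₙ is the current estimate for a machine that has been played n−1 times and Rₙ is the prize just obtained, the same average can be maintained incrementally with

$$Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$$

which is exactly the one-line update used inside the main loop of the code below (one counter and one estimate per machine, nothing else needs to be stored).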

Using Python

Now that the problem has been defined and the methods for solving it described, we will solve it using Python. So it’s time to open your favorite IDE and start coding! I will describe and explain the code piece by piece and put it in full at the end. You can copy it in full and, in case of doubt, check the specific piece. For this program we will only use two libraries, so we proceed to import them:

###Import necessary libraries
import random as rd
import numpy as np

The second step will be to create two auxiliary functions. The first one will help us emulate probabilistic outcomes: it draws a random number between 0 and 1, returns True if that number is lower than the probability we pass in, and returns False otherwise.

###Define a random number in the interval [0,1] to simulate results of
###probabilistic experiments.
def decision(probability):
    return rd.random() < probability

The second function is the ε-greedy algorithm, which decides which machine to play: either the machine with the highest estimated value or a random one.

### Choose which machine to play following the ε-greedy method.
def greedy(no_machines, probability):
    aux = decision(probability)
    if aux:
        ### Explore: pick one of the machines at random.
        index = rd.randint(0, len(no_machines) - 1)
    else:
        ### Exploit: pick the machine with the highest estimated value.
        index = np.argmax(no_machines)
    return index

To test different ideas we will run the algorithm several times. In this case we’ll play 1,000 times in each cycle, repeat the cycle 10,000 times and try ε=[0,.05,.1,.15,.2,.25,.3,.35,.4,.45,.5]. This way we will know how greedy we should be with our knowledge. Let’s first define the variables that will aid the experiment.

### This variable holds the real probability of winning that each machine has.
### This variable is unknown to the player and it is what we'll try to estimate.
prob_win=[.8,.85,.9,.7]
### We will try different epsilons to see which one is better.
epsilon=[0,.05,.1,.15,.2,.25,.3,.35,.4,.45,.5]
### Variables that hold the results of each different simulation (ε = 0, .05, .1, ...).
p_total_reward=[]
p_chosen_machine=[]

Given the construction of the problem, three for loops will be necessary:

  • The first loop goes through all ε=[0,.05,.1,.15,.2,.25,.3,.35,.4,.45,.5].
  • The second loop runs the 10,000 cycles of the experiment.
  • The third loop plays the 1,000 rounds within each cycle.

The code:

for j in range(len(epsilon)):
    ### These variables track the performance of the algorithm for the current ε.
    ### average_prize stores the evolution of the running average prize.
    average_prize=[]
    ### Variable that holds the total prize of each cycle.
    total_reward_acum_g=[]
    ### At the end of each cycle we choose the machine with the highest
    ### estimated prize and save it in this variable.
    chosen_machine_g=[]
    for x in range(10000):
        ### The algorithm is repeated many times to measure its performance.
        ### Variable that accumulates the prize obtained by playing 1000 times.
        total_reward=0
        ### Number of times played.
        i=0
        ### Number of times each machine has been played.
        iteraciones_por_accion=[0,0,0,0]
        ### The estimated expected prize of each machine. The value starts at 10
        ### so that initially all machines are tried (optimistic initialization).
        expected_prize_action=[10,10,10,10]
        for y in range(1000):
            ### index is the machine chosen to play this round.
            index=greedy(expected_prize_action,epsilon[j])
            ### This part emulates whether you won or lost.
            res=decision(prob_win[index])
            if res:
                g=2
            else:
                g=1
            ### Total reward.
            total_reward=total_reward+g
            i=i+1
            ### Running average prize.
            average_prize.append(total_reward/i)
            ### Number of times played per machine.
            iteraciones_por_accion[index]=iteraciones_por_accion[index]+1
            ### Update the estimated expected prize (incremental sample average).
            expected_prize_action[index]=expected_prize_action[index]+(1/iteraciones_por_accion[index])*(g-expected_prize_action[index])
        ### Results after playing 1000 times.
        total_reward_acum_g.append(total_reward)
        chosen_machine_g.append(np.argmax(expected_prize_action))
    print(epsilon[j])
    print("On average "+str(sum(total_reward_acum_g)/len(total_reward_acum_g))+" points were obtained.")
    print("The machine was chosen correctly "+str(chosen_machine_g.count(np.argmax(prob_win)))+" times.")
    p_total_reward.append(sum(total_reward_acum_g)/len(total_reward_acum_g))
    p_chosen_machine.append(chosen_machine_g.count(np.argmax(prob_win)))

Finally we will use the matplotlib library to visualize the results in a plot.

import matplotlib.pyplot as plt

values=p_total_reward
values2=p_chosen_machine
eje_x=epsilon
fig, ax = plt.subplots(figsize=(20, 14))
plt.xticks(rotation=90)
plt.plot(eje_x, values, marker="o", label="Average Total Prize")
plt.legend(prop={'size': 24})
plt.title("Figure 1", loc='center', fontsize=18)
plt.xlabel("Epsilon", fontsize=18)
plt.ylabel("Average Total Prize", fontsize=18)
plt.show()
[Figure 1: average total prize obtained as a function of ε]

Observing “Figure 1” we realize that the maximum prize is reached when we set ε=0.15; this means that it is convenient to explore 15% of the time and be greedy the other 85%. Now, for the second part of the problem we asked the algorithm to tell us which machine it thought was best. Let’s see this graphically too:

import matplotlib.pyplot as plt

values2=p_chosen_machine
eje_x=epsilon
fig, ax = plt.subplots(figsize=(20, 14))
plt.xticks(rotation=90)
plt.plot(eje_x, values2, marker="o", label="Number of times the best machine was chosen correctly")
plt.legend(prop={'size': 16})
plt.title("Figure 2", loc='center', fontsize=18)
plt.xlabel("Epsilon", fontsize=18)
plt.ylabel("Times the best machine was chosen", fontsize=18)
plt.show()
[Figure 2: number of times the best machine was identified, as a function of ε]

Figure 2 shows us that with more exploration the algorithm tends to identify the best machine more reliably. This is a somewhat obvious result, since having more information about the different machines lets us choose better. However, it comes at a price: doing more exploration reduces the total prize.

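A rough back-of-the-envelope calculation, using the win probabilities defined in the code above and assuming the estimates have already settled on the best machine, illustrates the trade-off. A win pays 2 and a loss pays 1, so machine i pays 1 + pᵢ on average: the best machine pays 1.9, and a machine picked uniformly at random pays 1 + (0.8+0.85+0.9+0.7)/4 = 1.8125. The long-run prize per play is then roughly

$$(1-\varepsilon)\cdot 1.9 + \varepsilon \cdot 1.8125 = 1.9 - 0.0875\,\varepsilon$$

so every extra bit of exploration shaves a little off the average prize; its benefit comes earlier, while the best machine is still uncertain.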

Going Further

This article is just a first approach to the k-bandit problem, so for the interested reader I would like to leave some questions to ponder:

  1. Would the result for ε be different if instead of a thousand times we played a hundred times?
  2. How would ε change if the prizes the machines gave were very different from one another?
  3. Could it be convenient to make ε change as time proceeds? (A minimal sketch of one such schedule appears after this list.)
  4. How are these three questions related?
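
For question 3, one common idea (not covered in this article) is to let ε decay over time: explore a lot at first and become greedier as the estimates improve. Below is a minimal sketch, assuming the greedy() helper defined above; the schedule and its parameters are hypothetical, not something tuned here.

### Hypothetical epsilon schedule that decays over time (a sketch, not tuned).
def decaying_epsilon(t, start=0.5, end=0.01, decay=0.995):
    ### Exponential decay from `start` towards a floor of `end` as the play count t grows.
    return max(end, start * (decay ** t))

### Usage idea: inside the inner loop, replace the fixed epsilon[j] with
### index = greedy(expected_prize_action, decaying_epsilon(y))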

Thank you very much for your attention, and I hope to see you again. You can find and run the code ahead.

Originally published at https://datasciencestreet.com on September 22, 2020.

Translated from: https://towardsdatascience.com/the-k-bandit-problem-with-reinforcement-learning-440b2f3ddee0
