Suppose we have a bandit with k = 10 arms whose rewards follow Gaussian (normal) distributions. The mean and variance of each arm's distribution are:
#the real mean value of each action's reward
qa_star = np.array([0.2,-0.3,1.5,0.5,1.2,-1.6,-0.2,-1,1.1,-0.6])
#the variance of each action's reward
var_qa = np.array([1,1,1,1,1,1,1,1,1,1])
Below is an implementation of the ε-greedy algorithm:
#an action is the choice of one arm
#Qa is the current estimate of each action's reward
#qa_star is the real mean reward of each action
#totalReward[t] is the running average reward from the 1st step up to step t
import numpy as np
import matplotlib.pyplot as plt

#the number of steps and arms
steps = 10000
armNum = 10
#constant step size for the Qa update
alpha = 0.1
totalReward = np.zeros(steps)
#the real mean value of each action's reward
qa_star = np.array([0.2,-0.3,1.5,0.5,1.2,-1.6,-0.2,-1,1.1,-0.6])
#the variance of each action's reward
var_qa = np.array([1,1,1,1,1,1,1,1,1,1])
#the estimate of each action's reward
Qa = np.zeros(armNum)
#the number of times each action has been taken
actionTimes = np.zeros(armNum)

def selectAnArm():
    #pick one of the armNum arms uniformly at random
    return np.random.randint(0, armNum)

def getReward(selectedAction):
    #sample a reward from the selected arm's Gaussian;
    #np.random.normal expects the standard deviation, hence the sqrt
    meanQa = qa_star[selectedAction]
    varQa = var_qa[selectedAction]
    return np.random.normal(meanQa, np.sqrt(varQa))

def updateQa(selectedAction, t, Ra):
    #constant-step-size update; t is unused here but kept for the
    #sample-average variant sketched after the code
    Qa[selectedAction] = Qa[selectedAction] + alpha*(Ra - Qa[selectedAction])
    actionTimes[selectedAction] = actionTimes[selectedAction] + 1

def main():
    for t in range(1, steps):
        if t == 1:
            #for the 1st step, select an arm randomly
            selectedAction = selectAnArm()
        else:
            select = np.random.randint(1000)
            if select < 10:
                #explore: choose an action randomly with probability 0.01
                selectedAction = selectAnArm()
            else:
                #exploit: choose the action with the largest estimate (probability 0.99)
                index = np.where(Qa == np.max(Qa))
                numMax = np.shape(index)[1]
                if numMax > 1:
                    #if there is more than one maximum, break the tie randomly
                    i = np.random.randint(0, numMax)
                    selectedAction = index[0][i]
                else:
                    selectedAction = index[0][0]
        #get the reward of the selected action
        Ra = getReward(selectedAction)
        #use the reward to update the action's estimate
        updateQa(selectedAction, t, Ra)
        #incremental update of the running average reward
        totalReward[t] = ((t-1)/t)*totalReward[t-1] + Ra/t
    x = np.linspace(1, steps, steps)
    plt.plot(x, totalReward, label='average reward')
    plt.legend()
    print(Qa)
    plt.show()

if __name__ == '__main__':
    main()
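The code above updates Qa with a constant step size α = 0.1 even though actionTimes already counts how many times each arm has been pulled. For a stationary bandit like this one, the standard alternative is the incremental sample-average update Q ← Q + (R − Q)/n, which converges to the true mean reward. Below is a minimal sketch of that variant as a drop-in replacement for updateQa; the name updateQaSampleAverage is mine, not from the original code:

def updateQaSampleAverage(selectedAction, t, Ra):
    #count this pull first so n >= 1 (hypothetical variant, reusing the global arrays)
    actionTimes[selectedAction] = actionTimes[selectedAction] + 1
    n = actionTimes[selectedAction]
    #incremental sample average: equivalent to averaging all rewards seen for this arm
    Qa[selectedAction] = Qa[selectedAction] + (Ra - Qa[selectedAction])/n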
Next we try different values of ε and observe their effect on the algorithm (a sketch that automates this comparison follows the list below). The true mean rewards of the actions are [0.2, -0.3, 1.5, 0.5, 1.2, -1.6, -0.2, -1, 1.1, -0.6].
1. With ε = 0 the algorithm is purely greedy and takes the current maximum at every step. As the figure of average reward versus iteration shows, it converges very quickly, but the final estimates reveal that it essentially never learns the true values: [-0.02251624 0. 0. 0. 0. 0. 0. 0. 1.26768416 0.].
2. With ε = 0.01 (the value used in the code above), the figure shows slower convergence but a higher average reward, meaning that by the end of the run our estimate of each action is closer to its true value. The final estimates are [ 0.18721383 -0.46207661 1.05806519 0.34051736 0.71894154 -1.6338036 0.03203917 -1.01940141 0.74095258 -0.81726031], much better than the purely greedy estimates.
3. With ε = 0.1, the final estimates are [ 0.16915131 -0.20132821 0.61555592 0.30789267 0.40873467 -0.91459323 -0.16957235 -0.46381608 1.15304414 -0.2309847 ]. Overall the estimation quality is close to case 2; convergence is slower still, but the average reward is higher.
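To reproduce the three experiments above without editing the script by hand, the exploration probability can be pulled out as a parameter. Below is a minimal self-contained sketch; the function runBandit and its epsilon argument are my own naming, and the loop simply reruns the same ε-greedy logic for each setting:

import numpy as np

qa_star = np.array([0.2,-0.3,1.5,0.5,1.2,-1.6,-0.2,-1,1.1,-0.6])

def runBandit(epsilon, steps=10000, alpha=0.1):
    #one epsilon-greedy run; returns the final estimates and the running average reward
    Qa = np.zeros(len(qa_star))
    avgReward = np.zeros(steps)
    for t in range(1, steps):
        if np.random.rand() < epsilon:
            a = np.random.randint(len(qa_star))  #explore
        else:
            a = np.argmax(Qa)  #exploit (np.argmax breaks ties by the first index)
        R = np.random.normal(qa_star[a], 1.0)  #std = 1 since every variance is 1
        Qa[a] = Qa[a] + alpha*(R - Qa[a])
        avgReward[t] = ((t-1)/t)*avgReward[t-1] + R/t
    return Qa, avgReward

#compare the three settings discussed above
for eps in (0.0, 0.01, 0.1):
    Qa, avg = runBandit(eps)
    print("epsilon =", eps, ", final estimates:", np.round(Qa, 3))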