Python code for the multi-armed bandits problem

Suppose we have a slot machine with k = 10 arms whose rewards follow Gaussian (normal) distributions. The mean and variance of the normal distribution for each arm are:

#the real mean value of each action's reward
qa_star = np.array([0.2,-0.3,1.5,0.5,1.2,-1.6,-0.2,-1,1.1,-0.6])
#the variances of each action's reward
var_qa = np.array([1,1,1,1,1,1,1,1,1,1])
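
For reference, a single pull of arm a can be simulated by sampling from that arm's normal distribution. Note that np.random.normal expects the standard deviation, so the variance should be square-rooted (here all variances equal 1, so it makes no numerical difference). This is only a minimal sketch; the helper name pull is illustrative:

import numpy as np

qa_star = np.array([0.2,-0.3,1.5,0.5,1.2,-1.6,-0.2,-1,1.1,-0.6])
var_qa = np.array([1,1,1,1,1,1,1,1,1,1])

def pull(a):
    #draw one reward for arm a from N(qa_star[a], var_qa[a])
    return np.random.normal(qa_star[a], np.sqrt(var_qa[a]))

print(pull(2))   #a noisy sample around the true mean 1.5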

Below is an implementation of the \epsilon-greedy algorithm:

#the action is to choose an arm
#Qa is the evaluation of the rewards of actions
#qa_star is the real mean rewards of actions
#totalReward is the running average reward of the policy from the 1st step up to step t
import numpy as np
import matplotlib.pyplot as plt

steps = 10000
armNum = 10
alpha = 0.1        #constant step size for the Qa update
epsilon = 0.01     #probability of exploring a random arm
totalReward = np.zeros(steps)

#the real mean value of each action's reward
qa_star = np.array([0.2,-0.3,1.5,0.5,1.2,-1.6,-0.2,-1,1.1,-0.6])
#the variances of each action's reward
var_qa = np.array([1,1,1,1,1,1,1,1,1,1])
#the current estimate of each action's value
Qa = np.zeros(armNum)
#the number of times each action has been taken
actionTimes = np.zeros(armNum)

def selectAnArm():
    #choose one of the armNum arms uniformly at random
    return np.random.randint(0, armNum)

def getReward(selectedAction):
    #sample a reward from the selected arm's normal distribution
    meanQa = qa_star[selectedAction]
    #np.random.normal expects the standard deviation, not the variance
    stdQa = np.sqrt(var_qa[selectedAction])
    #print("meanQa=",meanQa,", stdQa=",stdQa)
    return np.random.normal(meanQa, stdQa)

def updateQa(selectedAction,t,Ra):
    #constant step-size update: Qa <- Qa + alpha*(Ra - Qa)
    Qa[selectedAction] = Qa[selectedAction] + alpha*(Ra-Qa[selectedAction])
    actionTimes[selectedAction] = actionTimes[selectedAction] + 1

def main():
    #run the epsilon-greedy bandit for the given number of steps
    for t in range(1,steps):
        #for the 1st step, select an arm randomly
        if t==1:
            selectedAction = selectAnArm()
        else:
            #explore: with probability epsilon, choose an arm at random
            if np.random.random() < epsilon:
                selectedAction = selectAnArm()
                #print("exploration happened...")
            else:
                #exploit: choose the action with the largest estimated value
                #print("choose the best action")
                index = np.where(Qa == np.max(Qa))
                numMax = np.shape(index)[1]
                #if there is more than 1 maximum, break the tie randomly
                if numMax>1:
                    i = np.random.randint(0,numMax)
                    selectedAction = index[0][i]
                else:
                    selectedAction = index[0][0]
        #print("t=",t,", action=",selectedAction)
        #get the reward
        Ra = getReward(selectedAction)
        #use the selected action to update its estimate Qa
        updateQa(selectedAction,t,Ra)
        #running average of the rewards obtained so far
        totalReward[t] = ((t-1)/t)*totalReward[t-1] + Ra/t
    x = np.linspace(1,steps,steps)
    plt.plot(x,totalReward,label="average reward")
    plt.legend()
    print(Qa)
    plt.show()

if __name__ == '__main__':
    main()
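
For reference, the update in updateQa is the constant step-size rule Q_{n+1} = Q_n + \alpha (R_n - Q_n), while totalReward[t] tracks the running average reward via the incremental mean \bar{R}_t = \frac{t-1}{t}\bar{R}_{t-1} + \frac{1}{t}R_t. A minimal sketch (the names rewards and avg are illustrative) checking that this incremental form equals the batch mean:

import numpy as np

rewards = np.random.normal(0, 1, 100)        #any reward sequence
avg = 0.0
for t, r in enumerate(rewards, start=1):
    avg = ((t - 1) / t) * avg + r / t        #incremental mean, same form as in main()
#the incremental result matches the batch mean
assert np.isclose(avg, rewards.mean())
print("incremental mean =", avg, ", batch mean =", rewards.mean())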

Next we try different values of \epsilon and observe their effect on the algorithm. The true mean rewards of the actions are [0.2,-0.3,1.5,0.5,1.2,-1.6,-0.2,-1,1.1,-0.6]. A compact sketch for reproducing this comparison follows the list below.

1. With \epsilon = 0 the algorithm is purely greedy and takes the current maximum at every step. Its average reward over the iterations is shown in the figure: it converges very quickly, but the final estimates show that the algorithm essentially fails to recover the true values: [-0.02251624, 0., 0., 0., 0., 0., 0., 0., 1.26768416, 0.].

2. With \epsilon = 0.1, as shown in the next figure, convergence is slower but the average reward is larger, indicating that after the run our estimate of each action's value is closer to the true value. The final estimate is [0.18721383, -0.46207661, 1.05806519, 0.34051736, 0.71894154, -1.6338036, 0.03203917, -1.01940141, 0.74095258, -0.81726031], much better than the purely greedy estimate.

3. With \epsilon = 0.01, the final estimate is [0.16915131, -0.20132821, 0.61555592, 0.30789267, 0.40873467, -0.91459323, -0.16957235, -0.46381608, 1.15304414, -0.2309847]. Overall the estimation quality is similar to case 2; convergence is slower, but the average reward is higher.
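
To reproduce the comparison above in one place, a compact sketch could run the same \epsilon-greedy loop for several values of \epsilon. The helper run_bandit below is illustrative: it mirrors main() but takes \epsilon as a parameter and assumes the same qa_star environment with unit variances.

import numpy as np

qa_star = np.array([0.2,-0.3,1.5,0.5,1.2,-1.6,-0.2,-1,1.1,-0.6])

def run_bandit(epsilon, steps=10000, alpha=0.1):
    Qa = np.zeros(len(qa_star))
    avgReward = np.zeros(steps)
    for t in range(1, steps):
        if np.random.random() < epsilon:
            a = np.random.randint(len(qa_star))                     #explore
        else:
            a = np.random.choice(np.flatnonzero(Qa == Qa.max()))    #exploit, ties broken randomly
        r = np.random.normal(qa_star[a], 1.0)
        Qa[a] += alpha * (r - Qa[a])
        avgReward[t] = ((t - 1) / t) * avgReward[t - 1] + r / t
    return Qa, avgReward

for eps in (0.0, 0.01, 0.1):
    Qa, avg = run_bandit(eps)
    print("epsilon =", eps, ", final average reward =", round(avg[-1], 3))
    print("estimated Qa =", Qa)

With \epsilon = 0 the loop exploits from the start and typically locks onto whichever arm happened to look good early, which matches the poor estimates reported in case 1; the exploring variants spread pulls across the arms and recover the true means more closely.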
