Suppose we have a bandit with k = 10 arms whose rewards follow Gaussian (normal) distributions. The mean and variance of each arm's distribution are:
#the real mean value of each action's reward
qa_star = np.array([0.2,-0.3,1.5,0.5,1.2,-1.6,-0.2,-1,1.1,-0.6])
#the variance of each action's reward
var_qa = np.array([1,1,1,1,1,1,1,1,1,1])
Below is an implementation of the ε-greedy algorithm:
#an action is the choice of one arm
#Qa is the current estimate of each action's reward
#qa_star is the real mean reward of each action
#totalReward[t] is the running average reward from the 1st step up to step t
import numpy as np
import matplotlib.pyplot as plt

#the number of steps and arms
steps = 10000
armNum = 10
#constant step size for the Qa update
alpha = 0.1
totalReward = np.zeros(steps)
#the real mean value of each action's reward
qa_star = np.array([0.2,-0.3,1.5,0.5,1.2,-1.6,-0.2,-1,1.1,-0.6])
#the variance of each action's reward
var_qa = np.array([1,1,1,1,1,1,1,1,1,1])
#the estimate of each action's reward
Qa = np.zeros(armNum)
#the number of times each action has been taken
actionTimes = np.zeros(armNum)

def selectAnArm():
    #pick one of the armNum arms uniformly at random
    return np.random.randint(0, armNum)

def getReward(selectedAction):
    #sample a reward from the selected arm's Gaussian;
    #np.random.normal expects the standard deviation, hence the sqrt
    meanQa = qa_star[selectedAction]
    varQa = var_qa[selectedAction]
    return np.random.normal(meanQa, np.sqrt(varQa))

def updateQa(selectedAction, t, Ra):
    #constant-step-size update; t is unused here but kept for the
    #sample-average variant sketched after the code
    Qa[selectedAction] = Qa[selectedAction] + alpha*(Ra - Qa[selectedAction])
    actionTimes[selectedAction] = actionTimes[selectedAction] + 1

def main():
    for t in range(1, steps):
        if t == 1:
            #for the 1st step, select an arm randomly
            selectedAction = selectAnArm()
        else:
            select = np.random.randint(1000)
            if select < 10:
                #explore: choose an action randomly with probability 0.01
                selectedAction = selectAnArm()
            else:
                #exploit: choose the action with the largest estimate (probability 0.99)
                index = np.where(Qa == np.max(Qa))
                numMax = np.shape(index)[1]
                if numMax > 1:
                    #if there is more than one maximum, break the tie randomly
                    i = np.random.randint(0, numMax)
                    selectedAction = index[0][i]
                else:
                    selectedAction = index[0][0]
        #get the reward of the selected action
        Ra = getReward(selectedAction)
        #use the reward to update the action's estimate
        updateQa(selectedAction, t, Ra)
        #incremental update of the running average reward
        totalReward[t] = ((t-1)/t)*totalReward[t-1] + Ra/t
    x = np.linspace(1, steps, steps)
    plt.plot(x, totalReward, label='average reward')
    plt.legend()
    print(Qa)
    plt.show()

if __name__ == '__main__':
    main()
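The code above updates Qa with a constant step size α = 0.1 even though actionTimes already counts how many times each arm has been pulled. For a stationary bandit like this one, the standard alternative is the incremental sample-average update Q ← Q + (R − Q)/n, which converges to the true mean reward. Below is a minimal sketch of that variant as a drop-in replacement for updateQa; the name updateQaSampleAverage is mine, not from the original code:

def updateQaSampleAverage(selectedAction, t, Ra):
    #count this pull first so n >= 1 (hypothetical variant, reusing the global arrays)
    actionTimes[selectedAction] = actionTimes[selectedAction] + 1
    n = actionTimes[selectedAction]
    #incremental sample average: equivalent to averaging all rewards seen for this arm
    Qa[selectedAction] = Qa[selectedAction] + (Ra - Qa[selectedAction])/n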
Next we try different values of ε and observe their effect on the algorithm (a sketch that automates this comparison follows the list below). The true mean rewards of the actions are [0.2, -0.3, 1.5, 0.5, 1.2, -1.6, -0.2, -1, 1.1, -0.6].
1. With ε = 0 the algorithm is purely greedy and takes the current maximum at every step. As the figure of average reward versus iteration shows, it converges very quickly, but the final estimates reveal that it essentially never learns the true values: [-0.02251624 0. 0. 0. 0. 0. 0. 0. 1.26768416 0.].
2. With ε = 0.01 (the value used in the code above), the figure shows slower convergence but a higher average reward, meaning that by the end of the run our estimate of each action is closer to its true value. The final estimates are [ 0.18721383 -0.46207661 1.05806519 0.34051736 0.71894154 -1.6338036 0.03203917 -1.01940141 0.74095258 -0.81726031], much better than the purely greedy estimates.
3. With ε = 0.1, the final estimates are [ 0.16915131 -0.20132821 0.61555592 0.30789267 0.40873467 -0.91459323 -0.16957235 -0.46381608 1.15304414 -0.2309847 ]. Overall the estimation quality is close to case 2; convergence is slower still, but the average reward is higher.
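To reproduce the three experiments above without editing the script by hand, the exploration probability can be pulled out as a parameter. Below is a minimal self-contained sketch; the function runBandit and its epsilon argument are my own naming, and the loop simply reruns the same ε-greedy logic for each setting:

import numpy as np

qa_star = np.array([0.2,-0.3,1.5,0.5,1.2,-1.6,-0.2,-1,1.1,-0.6])

def runBandit(epsilon, steps=10000, alpha=0.1):
    #one epsilon-greedy run; returns the final estimates and the running average reward
    Qa = np.zeros(len(qa_star))
    avgReward = np.zeros(steps)
    for t in range(1, steps):
        if np.random.rand() < epsilon:
            a = np.random.randint(len(qa_star))  #explore
        else:
            a = np.argmax(Qa)  #exploit (np.argmax breaks ties by the first index)
        R = np.random.normal(qa_star[a], 1.0)  #std = 1 since every variance is 1
        Qa[a] = Qa[a] + alpha*(R - Qa[a])
        avgReward[t] = ((t-1)/t)*avgReward[t-1] + R/t
    return Qa, avgReward

#compare the three settings discussed above
for eps in (0.0, 0.01, 0.1):
    Qa, avg = runBandit(eps)
    print("epsilon =", eps, ", final estimates:", np.round(Qa, 3))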