Reinforcement Learning: k-armed bandit problem, Python programming Exercise 2.5

Data from 100 independent simulations of 100 loops each (compute was limited)

# -*- coding: utf-8 -*-
"""
Created on Fri May  1 14:47:35 2020

@author: Ziz
"""

import numpy as np
import time
import matplotlib.pyplot as plt
import pickle
# To save the results after a run:
# with open('dataFile.txt', 'wb') as fw:
#     pickle.dump([x, y], fw, -1)

# To load them back:
# with open('dataFile.txt', 'rb') as fr:
#     x, y = pickle.load(fr)


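x = []  # number of training steps in each experiment
y = []  # average reward per step in each experiment
inner_loop = 5  # independent agent runs per bandit instance
outer_loop = 5  # independent bandit instances per experiment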
for loop in range(1,51):
    start_time = time.time()
    train_steps = loop*20  # 20, 40, ..., 1000 steps
    average_expect_out = 0
    for _ in range(outer_loop):  # "_" avoids shadowing the outer loop variable
        # True action values q*(a), drawn from a standard normal distribution
        q_values = np.random.randn(10)

        def get_Rt(a):
            # Reward for arm a: its true value plus unit-variance Gaussian noise
            return np.random.randn() + q_values[a]
        
        alpha = 0.1
        average_expect=0  
        for n in range(inner_loop):
            q_n = np.zeros((10,1))
            Q = np.zeros((10,1))
            total_expect = 0
            
            for i in range(train_steps):
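                if i == 0:
                    # First step: pick an arm uniformly at random
                    a = np.random.randint(0, 10)
                elif np.random.rand() < 0.1:
                    # Explore with probability epsilon = 0.1
                    a = np.random.randint(0, 10)
                else:
                    # Exploit: first arm with the largest estimate
                    a = int(np.argmax(Q))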
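                R_n = get_Rt(a)
                total_expect += R_n
                # Constant step-size update. For the sample-average variant,
                # uncomment the counter below and divide by q_n[a] instead of
                # multiplying by alpha.
                # q_n[a]+=1
                Q[a] = Q[a] + (R_n-Q[a])*alpha #/q_n[a]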
            total_expect/=train_steps
            average_expect += total_expect
        average_expect/=inner_loop
        average_expect_out+=average_expect
    average_expect_out/=outer_loop
    x.append(train_steps)
    y.append(average_expect_out)
    end_time = time.time()
    print(x)
    print(y)
    print('time consume = ',end_time-start_time)
plt.plot(x,y)
plt.xlabel('training steps')
plt.ylabel('average reward per step')
plt.show()
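
The script above keeps the true action values fixed and switches between the two update rules by hand (the commented-out `q_n[a]` counter versus `alpha`). As a rough sketch of how the two rules could be compared side by side on a drifting bandit, the version below uses only the standard library. The names `run_bandit` and `compare`, the random-walk standard deviation `walk_std = 0.01`, and the choice to start all true values at 0 are assumptions made for illustration, not part of the original script.

```python
import random


def run_bandit(steps, constant_alpha=None, epsilon=0.1, k=10,
               walk_std=0.01, seed=0):
    """Run one epsilon-greedy agent on a k-armed bandit whose true values drift.

    constant_alpha=None uses sample averages (step size 1/n);
    otherwise the given constant step size is used.
    Returns the average reward per step over the run.
    """
    rng = random.Random(seed)
    q_true = [0.0] * k   # assumption: all true values start equal at 0
    Q = [0.0] * k        # value estimates
    counts = [0] * k     # number of times each arm was pulled
    total_reward = 0.0

    for _ in range(steps):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = rng.randrange(k)
        else:
            # first arm with the largest estimate
            a = max(range(k), key=lambda i: Q[i])

        # reward: true value of the chosen arm plus unit-variance noise
        reward = rng.gauss(q_true[a], 1.0)
        total_reward += reward

        # incremental update: Q <- Q + step_size * (R - Q)
        counts[a] += 1
        if constant_alpha is None:
            step_size = 1.0 / counts[a]   # sample average
        else:
            step_size = constant_alpha    # constant step size
        Q[a] += step_size * (reward - Q[a])

        # assumption: every true value takes an independent random-walk step
        for i in range(k):
            q_true[i] += rng.gauss(0.0, walk_std)

    return total_reward / steps


def compare(steps=2000, runs=20):
    """Average both update rules over the same seeds (hypothetical helper)."""
    sample_avg = sum(run_bandit(steps, None, seed=s) for s in range(runs)) / runs
    const_step = sum(run_bandit(steps, 0.1, seed=s) for s in range(runs)) / runs
    return sample_avg, const_step
```

Calling `compare()` returns the mean reward per step for the sample-average rule and for the constant step size `0.1`. With longer runs one would generally expect the constant step size to track drifting values better, but the numbers depend on the seeds and run length chosen here.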
