Atari Game Training -- 5. The Learning Step with Experience Replay and a Target Network

(To be completed...)

Walkthrough:

1. After collecting a sufficient number of experiences, randomly sample batch_size transitions for the Q-learning update:

def learn(self, model, target_model, memory, gamma, batch_size):
    samples = random.sample(memory, batch_size)    # shape_samples = (64, 5)
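For context, a minimal replay memory can be a bounded `deque` of `(state, action, reward, next_state, done)` tuples, from which `random.sample` draws a batch without replacement. A sketch (the shapes match the post; the memory contents here are placeholders):

```python
import random
from collections import deque

import numpy as np

# Hypothetical replay memory: a bounded deque of (s, a, r, s', done) tuples
memory = deque(maxlen=10000)
for i in range(100):
    state = np.zeros((4, 84, 84), dtype=np.float32)       # stack of 4 frames
    next_state = np.zeros((4, 84, 84), dtype=np.float32)
    memory.append((state, i % 9, 1.0, next_state, False))

batch_size = 64
samples = random.sample(memory, batch_size)  # 64 distinct transitions
print(len(samples), len(samples[0]))         # 64 transitions, each a 5-tuple
```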
2. Unpack the fields the update needs (states, actions, rewards, next_states, dones) into separate batch arrays:

    states, actions, rewards, next_states, dones = map(np.array, zip(*samples))  # shape_next_states = (64, 4, 84, 84)
    print('shape_states =', np.shape(states))
    print('shape_actions =', np.shape(actions))
    print('shape_rewards =', np.shape(rewards))
    print('shape_next_states =', np.shape(next_states))
    print('shape_dones =', np.shape(dones))

    # shape_states      = (64, 4, 84, 84)
    # shape_actions     = (64,)
    # shape_rewards     = (64,)
    # shape_next_states = (64, 4, 84, 84)
    # shape_dones       = (64,)
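The `zip(*samples)` idiom above transposes the list of transitions into per-field tuples, and `map(np.array, ...)` then stacks each field along a new batch dimension. A small demonstration with three toy transitions:

```python
import numpy as np

# Three toy transitions: (state, action, reward, next_state, done)
samples = [
    (np.zeros((4, 84, 84)), 0, 1.0, np.zeros((4, 84, 84)), False),
    (np.ones((4, 84, 84)), 2, 0.0, np.ones((4, 84, 84)), True),
    (np.zeros((4, 84, 84)), 1, -1.0, np.zeros((4, 84, 84)), False),
]

# zip(*samples) groups the i-th element of every tuple together;
# np.array stacks each group into one batch array
states, actions, rewards, next_states, dones = map(np.array, zip(*samples))

print(states.shape)   # (3, 4, 84, 84)
print(actions.shape)  # (3,)
print(dones)          # [False  True False]
```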
    
3. Run the next states through the target network to get Q-values for every action:

    next_Qs = target_model(torch.from_numpy(next_states))    # shape_next_Qs = torch.Size([64, 9])
4. Take the maximum over the action axis, giving max_a' Q_target(s', a') for each sample:

    next_Q = np.amax(next_Qs.detach().numpy(), axis=1)    # shape_next_Q = (64,)
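Steps 3-4 reduce the target network's `[batch, n_actions]` output to one value per sample. With a made-up 2-sample, 3-action array:

```python
import numpy as np

# Fake target-network output: 2 samples, 3 actions
next_Qs = np.array([[0.1, 0.7, 0.3],
                    [0.9, 0.2, 0.4]])

# axis=1 takes the max over actions, leaving one value per sample
next_Q = np.amax(next_Qs, axis=1)
print(next_Q)        # [0.7 0.9]
print(next_Q.shape)  # (2,)
```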




5. Build the TD targets. Terminal transitions (done == True) contribute only the reward; inverting the done flags gives a 0/1 mask that keeps the bootstrap term gamma * next_Q for the rest. (Note: the `np.float` alias was removed in NumPy 1.24; use `np.float32` instead.)

    print('dones =', dones)
    print('np.invert(dones) =', np.invert(dones).astype(np.float32))

    # dones  = [False False ... False False False]
    # np.invert(dones) = [1. 1. ... 1. 1. 1.]

    targets = rewards + np.invert(dones).astype(np.float32) * gamma * next_Q
    # shape_targets = (64,)
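The masked sum implements the standard TD target, r + gamma * max_a' Q_target(s', a') for non-terminal transitions and plain r for terminal ones. A worked example with toy numbers:

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, 2.0])
next_Q = np.array([5.0, 3.0, 4.0])
dones = np.array([False, True, False])

# np.invert flips the booleans; casting yields a 0/1 mask that zeroes
# out the bootstrap term wherever the episode ended
mask = np.invert(dones).astype(np.float32)
targets = rewards + mask * gamma * next_Q
print(targets)  # [5.95 0.   5.96]
```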

The full function:

import random

import numpy as np
import torch


def learn(self, model, target_model, memory, gamma, batch_size):
    samples = random.sample(memory, batch_size)    # shape_samples = (64, 5)
    states, actions, rewards, next_states, dones = map(np.array, zip(*samples))
    next_Qs = target_model(torch.from_numpy(next_states))    # torch.Size([64, 9])
    next_Q = np.amax(next_Qs.detach().numpy(), axis=1)       # (64,)
    targets = rewards + np.invert(dones).astype(np.float32) * gamma * next_Q
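The function above stops once the targets are computed. The rest of a DQN learning step usually regresses the online network's Q(s, a) for the actions actually taken toward those targets; a hedged sketch (the loss choice, optimizer, and the stand-in `model` architecture here are assumptions, not from the original post):

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for the Atari Q-network: flattens 4x84x84 frames, 9 actions
model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 84 * 84, 9))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

batch_size = 4
states = np.random.rand(batch_size, 4, 84, 84).astype(np.float32)
actions = np.random.randint(0, 9, size=batch_size)
targets = np.random.rand(batch_size).astype(np.float32)  # from the TD step

# Q(s, a) for the actions that were actually taken
Qs = model(torch.from_numpy(states))                              # [4, 9]
Q = Qs.gather(1, torch.from_numpy(actions).long().unsqueeze(1)).squeeze(1)

# Regress Q(s, a) toward the fixed TD targets and take one gradient step
loss = nn.functional.mse_loss(Q, torch.from_numpy(targets))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```

In full DQN the target network's weights are periodically copied over from the online network, e.g. `target_model.load_state_dict(model.state_dict())`, which is what keeps the bootstrap targets stable between syncs.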