Deep Q-Learning Explained (Reinforcement Learning)

一、Algorithm Details

At the end of this article you will find my own implementation of Deep Q-Learning playing the Space Invaders game.
[Figure: demo of the trained agent playing Space Invaders]

This article covers Q-learning with a neural network. Traditional Q-learning can only handle a finite number of states and actions; it cannot deal effectively with infinitely many. In the real world the number of environment states is essentially unlimited, and a neural network is exactly the tool for that situation. Deep Q-learning therefore uses a neural network to represent the Q table: whether or not a state has been seen before, we can feed it into the network and evaluate its value function.

1.1 A few concepts

1.1.1 What is a critic?

Critic: a judge or commentator.
In this algorithm, what gets updated is a critic, not an actor (agent). The critic is used to guide the actor's decisions indirectly; its job is to evaluate how good a given actor is.

1.1.2 The state value function $V^{\pi}(s)$, the state-action value function $Q^{\pi}(s,a)$, and the relationship between them

$\pi$ denotes an actor, $s$ is a state, and $a$ is the action the actor takes when it sees state $s$.

  1. $V^{\pi}(s)$ is the expected cumulative reward when actor $\pi$ sees state $s$.
    An example makes this easier to grasp; what we compute is still the expectation of the cumulative reward. In my own understanding, if we wrote out a concrete formula for $V^{\pi}(s)$, it would look like this:
    $$V^{\pi}(s)=\sum_{\tau}R(\tau,s)\,p(\tau)\tag{1}$$
    Explanation: $R(\tau,s)$ is the sum of all rewards collected after $s$ within one episode. For example, with $\tau=\{s_{1},a_{1},r_{1},s_{2},a_{2},r_{2},\dots,s_{T},a_{T},r_{T},End\}$, we have $R(\tau,s_{2})=r_{2}+\dots+r_{T}$; $r_{1}$ is not counted, only the rewards collected after seeing $s_{2}$. $p(\tau)$ is the probability that episode $\tau$ occurs.
    Here is an example taken from Sutton's reinforcement learning book.
    Example 1:
    Suppose we sample 8 episodes:
    1. $s_{a}, r=0,\; s_{b}, r=0,\; End$
    2. $s_{b}, r=1,\; End$
    3. $s_{b}, r=1,\; End$
    4. $s_{b}, r=1,\; End$
    5. $s_{b}, r=1,\; End$
    6. $s_{b}, r=1,\; End$
    7. $s_{b}, r=1,\; End$
    8. $s_{b}, r=0,\; End$
    These 8 sampled episodes (actions are ignored here) are used to approximate $V^{\pi}(s)$. We resort to sampling because both the environment and the actor itself are stochastic; as formula (1) shows, if there are infinitely many possible episodes, a computer simply cannot evaluate $V^{\pi}(s)$ exactly.
    From these 8 episodes we can compute
    $$V^{\pi}(s_{b})=\frac{1+1+1+1+1+1}{8}=\frac{3}{4}$$
    Only in episodes 2 through 7 does visiting state $s_{b}$ yield a reward; in the first episode the cumulative reward after reaching $s_{b}$ is 0, and the same holds in the 8th. Even if the first episode were $s_{a}, r=1,\; s_{b}, r=0,\; End$, $V^{\pi}(s_{b})$ would still be $\frac{3}{4}$. A short code sketch after this list reproduces this calculation.

  2. $Q^{\pi}(s,a)$ is the expected cumulative reward when the actor sees state $s$ and is constrained to take action $a$.
    Why "constrained to take action $a$"? When the actor sees state $s$ there are usually many actions to choose from, but here we fix the choice to $a$ rather than any other action. Another reason is that action selection is stochastic: with $\varepsilon$-greedy, for example, there is some probability of picking a random action.

  3. The relationship between the two
    Suppose we have $\dots,s_{t},a_{t},r_{t},s_{t+1},\dots$; then $Q^{\pi}(s_{t},a_{t})=E\left(r_{t}+V^{\pi}(s_{t+1})\right)$.
    Strictly speaking this is still an expectation. In practice we either sample, or simply use $Q^{\pi}(s_{t},a_{t})=r_{t}+V^{\pi}(s_{t+1})$, i.e. a single sample to approximate the expectation. The network may then not converge as well, since a single sample is a rather crude approximation.
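To make Example 1 above concrete, here is a minimal Python sketch of the Monte Carlo estimate: average the return observed after each visit to a state over the sampled episodes. The list-of-(state, reward)-pairs episode format is my own choice for illustration, not something from the article.

```python
# Estimate V(s) by averaging the return observed after each occurrence of s.
episodes = [
    [("sa", 0), ("sb", 0)],            # episode 1
    *[[("sb", 1)] for _ in range(6)],  # episodes 2-7
    [("sb", 0)],                       # episode 8
]

def mc_estimate(state, episodes):
    returns = []
    for ep in episodes:
        for i, (s, _) in enumerate(ep):
            if s == state:
                returns.append(sum(r for _, r in ep[i:]))  # rewards from s to the end
                break
    return sum(returns) / len(returns)

print(mc_estimate("sb", episodes))  # 6/8 = 0.75
print(mc_estimate("sa", episodes))  # 0.0
```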

1.1.3 How to compute $V^{\pi}(s)$

  1. Monte Carlo (MC) method
    As shown in the figure below, we use a neural network to compute $V^{\pi}(s)$ and optimize it based on how the actor actually plays the game. This network plays the role of the Q table (as noted above, we use a network instead of a table so that arbitrarily complex states can be handled), and it is used to guide the actor's decisions.
    [Figure: a neural network takes state $s$ as input and outputs $V^{\pi}(s)$]
    Example: suppose we have states $s_{a}$ and $s_{b}$. The steps for computing $V^{\pi}(s_{a})$ and $V^{\pi}(s_{b})$ are as follows.
    For $V^{\pi}(s_{a})$:
    (1) The actor (agent) plays the game, i.e. interacts with the environment.
    (2) State $s_{a}$ appears in some episode.
    (3) Record the cumulative reward from the appearance of $s_{a}$ until the end of that episode; call it $G_{a}$.
    (4) Feed state $s_{a}$ into the neural network, which outputs a scalar $V^{\pi}(s_{a})$.
    (5) Optimize the network by regression so that its output $V^{\pi}(s_{a})$ approaches $G_{a}$.
    (6) Repeat this process for any state until the network converges.
    $V^{\pi}(s_{b})$ is computed in the same way. (A code sketch of this regression, together with the TD update, appears after this list.)

  2. Temporal-Difference (TD) method
    As the name suggests, the temporal-difference method needs the states from two consecutive time steps to train the network.
    Suppose two consecutive steps in an episode are $\dots,s_{t},a_{t},r_{t},s_{t+1},\dots$
    First, $$V^{\pi}(s_{t})=r_{t}+V^{\pi}(s_{t+1})\tag{2}$$
    where $\pi$ denotes the agent (actor).
    The TD method performs one training update for every pair of consecutive time steps, so it converges relatively quickly. The steps are:
    (1) At time step $t$, observe state $s_{t}$.
    (2) The actor takes action $a_{t}$ based on the current state $s_{t}$ and receives reward $r_{t}$.
    (3) The actor observes the state $s_{t+1}$ at the next time step $t+1$.
    (4) Feed $s_{t}$ and $s_{t+1}$ into the network to get $V^{\pi}(s_{t})$ and $V^{\pi}(s_{t+1})$.
    (5) Using formula (2), make the difference between $V^{\pi}(s_{t})$ and $V^{\pi}(s_{t+1})$ regress towards the reward $r_{t}$; the network is again trained by regression. (See the sketch after this list.)
    [Figure: the TD training setup. Image from Hung-yi Lee's reinforcement learning course; it will be removed upon request.]

  3. The relationship between the two methods
    Monte Carlo has larger variance, and larger variance hurts training; in comparison, the temporal-difference method trains more efficiently and converges faster, because TD only needs the information from two consecutive time steps, whereas MC has to wait for a complete episode before it can update. You may have noticed that this resembles supervised learning: in deep Q-learning the "supervision signal" is the reward fed back by the environment, and we use that reward to guide the actor's learning.
    For the same sampled data, the two methods can give different values for the same state $s_{a}$.
    Consider the following example.
    Suppose we sample 8 episodes:
    1. $s_{a}, r=0,\; s_{b}, r=0,\; End$
    2. $s_{b}, r=1,\; End$
    3. $s_{b}, r=1,\; End$
    4. $s_{b}, r=1,\; End$
    5. $s_{b}, r=1,\; End$
    6. $s_{b}, r=1,\; End$
    7. $s_{b}, r=1,\; End$
    8. $s_{b}, r=0,\; End$
    First, the Monte Carlo estimate of $V^{\pi}(s_{a})$:
    Looking at the sampled data, state $s_{a}$ appears only in the first episode, and both rewards collected from that point until the end of the game are 0, so $G_{a}=0$. Since we make $V^{\pi}(s_{a})$ regress towards $G_{a}$, ideally $V^{\pi}(s_{a})=0$.
    Next, the temporal-difference estimate of $V^{\pi}(s_{a})$:
    In the first episode $s_{a}$ and $s_{b}$ are adjacent states, so by formula (2), computing $V^{\pi}(s_{a})$ requires the value of $V^{\pi}(s_{b})$.
    Using the 8 samples, we approximate $V^{\pi}(s_{b})$ by the average of the returns observed after $s_{b}$ in each episode:
    $$V^{\pi}(s_{b})=\frac{1+1+1+1+1+1}{8}=\frac{3}{4}$$
    In the first episode the reward received at state $s_{a}$ is 0, so $V^{\pi}(s_{a})=0+V^{\pi}(s_{b})=\frac{3}{4}$.
    As you can see, the two methods give different values for the same value function.
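As a rough illustration of the two update rules described above, here is a minimal sketch that regresses a small value network either towards the whole-episode return $G$ (MC) or towards $r_{t}+V^{\pi}(s_{t+1})$ (TD), as in formula (2). The 4-dimensional state and the tiny fully connected network are assumptions chosen for brevity, not the article's Atari setup.

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def mc_update(states, returns):
    """states: [N, 4] tensor; returns: [N] tensor of observed returns G."""
    v = value_net(states).squeeze(-1)
    loss = ((v - returns) ** 2).mean()           # regression towards G
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def td_update(s_t, r_t, s_next):
    """s_t, s_next: [4] tensors; r_t: float reward for this step."""
    v_t = value_net(s_t)
    with torch.no_grad():                        # bootstrap target is held fixed
        v_next = value_net(s_next)
    loss = (v_t - (r_t + v_next)).pow(2).mean()  # regression towards r_t + V(s_{t+1})
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```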

二、Algorithm Workflow

Step 1: The actor (agent) $\pi$ interacts with the environment.
Step 2: Use the Monte Carlo method or the temporal-difference method to compute the state value function $V^{\pi}(s)$ or the state-action value function $Q^{\pi}(s,a)$.
Step 3: Train the neural network by regression.
Step 4: The value function guides the actor towards a better policy. Repeat these steps until convergence.

Q-learning based on the TD method is shown in the figure below:
[Figure: flow chart of Q-learning based on the TD method]
In the flow chart, "butter" should read "buffer". A compact, self-contained skeleton of this loop is sketched below.
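To make the four steps concrete, here is a compact sketch of the loop on CartPole (chosen only to keep the code short); it follows the TD-based flow in the figure but omits the target-network and replay-buffer tricks introduced in the next section, and it assumes the same older gym API (4-tuple `step`) used in the example later in this article. The hyperparameters are arbitrary.

```python
import random
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v0")
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, env.action_space.n))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, epsilon = 0.99, 0.1

for episode in range(10):
    s, done = env.reset(), False
    while not done:
        # Step 1: the actor interacts with the environment (epsilon-greedy action)
        q_values = q_net(torch.as_tensor(s, dtype=torch.float32))
        a = env.action_space.sample() if random.random() < epsilon else int(q_values.argmax())
        s_next, r, done, _ = env.step(a)
        # Step 2: TD estimate of the target value r + gamma * max_a' Q(s', a')
        with torch.no_grad():
            target = r + (0.0 if done else gamma * q_net(torch.as_tensor(s_next, dtype=torch.float32)).max())
        # Step 3: regression, move Q(s, a) towards the target
        loss = (q_values[a] - target) ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Step 4: the updated Q function guides the actor at the next step
        s = s_next
env.close()
```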

三、A Few Tricks

3.1 Trick 1: target network and predict network

Traditional Q-learning already involves the notions of q_target and q_predict. Deep Q-learning likewise trains with a target network and a predict network; the two share the same architecture, the target network is kept fixed, and only after the predict network has been updated towards the target network's outputs for a number of steps are the predict network's parameters copied back into the target network.
If needed, have a look at the classic algorithm first: traditional Q-learning explained.

We follow Hung-yi Lee's course to explain how training works.
[Figure: training with a fixed target network, from Hung-yi Lee's course]
Training here uses the temporal-difference method.
The parameters of the target network on the right are held fixed for a while; its output is the regression target for the network on the left. The predict network on the left is updated towards this target, and after a number of updates the predict network's parameters are copied back into the target network. This repeats until convergence.
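A minimal sketch of this parameter-copying scheme, assuming a small fully connected network for brevity: the predict network is trained, the target network is frozen, and every `sync_every` updates the predict parameters are copied into the target.

```python
import copy
import torch
import torch.nn as nn

q_predict = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
q_target = copy.deepcopy(q_predict)
for p in q_target.parameters():
    p.requires_grad = False        # the target network is never trained directly

def maybe_sync(step, sync_every=100):
    # copy the predict network's parameters into the target network periodically
    if step % sync_every == 0:
        q_target.load_state_dict(q_predict.state_dict())
```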

3.2 Trick 2: $\varepsilon$-greedy action selection

This trick encourages the actor to explore and ensures that, given enough training steps, every action gets updated. The probability of random exploration gradually decreases as training progresses.
We already covered this trick in the post Reinforcement Learning (RL): Q-Learning Explained, so it is not repeated here.
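For reference, here is a minimal sketch of $\varepsilon$-greedy selection with a decaying $\varepsilon$; the linear decay schedule and its parameters are my own illustrative choices.

```python
import random
import torch

def epsilon_greedy(q_values, step, eps_start=1.0, eps_end=0.05, decay=10000):
    # epsilon shrinks linearly with the number of training steps
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / decay)
    if random.random() < eps:
        return random.randrange(q_values.shape[-1])   # explore: random action
    return int(torch.argmax(q_values).item())         # exploit: greedy action
```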

3.3 Trick 3: Boltzmann action selection

Actions are chosen according to the magnitude of the state-action value function: the larger the value, the higher the probability that the corresponding action is selected. The selection probability is
$$P(a|s)=\frac{\exp(Q(s,a))}{\sum_{a}\exp(Q(s,a))}\tag{3}$$
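A minimal sketch of formula (3), sampling an action from the softmax of the Q values; adding a temperature parameter is a common refinement but is omitted here.

```python
import torch

def boltzmann_action(q_values):
    probs = torch.softmax(q_values, dim=-1)         # P(a|s) = exp(Q) / sum_a exp(Q)
    return int(torch.multinomial(probs, 1).item())  # sample an action with these probabilities
```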

3.4 Trick 4: Replay Buffer

Design a buffer that stores the experience the actor has collected, so that the same data can be reused to update the network.
With the temporal-difference method, each entry in the buffer only needs two consecutive time steps, e.g. $s_{t},a_{t},r_{t},s_{t+1}$. Once enough transitions have been stored, we can also train on mini-batches.
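A minimal sketch of such a buffer, assuming transitions of the form $(s_{t},a_{t},r_{t},s_{t+1},done)$ and uniform random sampling of mini-batches:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old experience is dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # tuples of states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```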

四、A Small Example

A small example based on a convolutional neural network.

4.1 readme

  1. Install gym
  2. Install atari-py
    Step 2 often fails with an "ale_c.dll not found" error.
    The fix takes three steps:
    Step 1: uninstall atari-py first: pip uninstall atari-py
    Step 2: reinstall it from this index: pip install --no-index -f https://github.com/Kojoley/atari-py/releases atari_py
    Step 3: pip install gym
# -*- coding: utf-8 -*-

import gym
import torch.nn as nn
import torch as t
from torch.nn import functional as F
import random

discount_factor = 0.9
epsilon = 0.1
lr = 0.001
epochs = 50
nums_p2t = 100  # every 100 updates, copy Q_Net_predict's parameters into Q_Net_target, then keep the target network fixed again
env = gym.make("SpaceInvaders-v0")  # build a Space Invaders environment

# The network below is used for prediction (the predict network)
class Q_Net_predict(nn.Module):    
    def __init__(self, nums_action):
        super(Q_Net_predict,self).__init__()
        #Define the layers: the plan is two convolutional layers and two fully connected layers
        self.conv1 = nn.Conv2d(3, 16, 5, 2)
        self.conv2 = nn.Conv2d(16, 16, 5, 2)
        
        self.linear1 = nn.Linear(1728,256)
        self.linear2 = nn.Linear(256,nums_action)
        
    def forward(self, x):
        
        #First convert the numpy observation into a tensor
        state = t.from_numpy(x[:,:,::-1].copy())
        state = state.permute((2,0,1)).unsqueeze(dim=0).float()
        
        #Apply convolutions, max pooling, and the linear layers
        out = self.conv1(state)
        out = F.relu(out)
        out = F.max_pool2d(out,(2,2))
        
        out = self.conv2(out)
        out = F.relu(out)
        out = F.max_pool2d(out,(2,2))
        s = out.size()
        
        out = out.view(1,s[1]*s[2]*s[3])
        
        out = F.relu(self.linear1(out))
        out = self.linear2(out)
        return out
 
    
# The network below serves as the target that Q_Net_predict regresses towards
class Q_Net_target(nn.Module):    
    def __init__(self, nums_action):
        super(Q_Net_target,self).__init__()
        #Define the layers: the plan is two convolutional layers and two fully connected layers
        self.conv1 = nn.Conv2d(3, 16, 5, 2)
        self.conv2 = nn.Conv2d(16, 16, 5, 2)
        
        self.linear1 = nn.Linear(1728,256)
        self.linear2 = nn.Linear(256,nums_action)
        
    def forward(self, x):
        
        #First convert the numpy observation into a tensor
        state = t.from_numpy(x[:,:,::-1].copy())
        state = state.permute((2,0,1)).unsqueeze(dim=0).float()
        
        #Apply convolutions, max pooling, and the linear layers
        out = self.conv1(state)
        out = F.relu(out)
        out = F.max_pool2d(out,(2,2))
        
        out = self.conv2(out)
        out = F.relu(out)
        out = F.max_pool2d(out,(2,2))
        s = out.size()
        
        out = out.view(1,s[1]*s[2]*s[3])
        
        out = F.relu(self.linear1(out))
        out = self.linear2(out)
        return out
    

    
def choose_action(logits):
    ## Choose the action the agent should take, using epsilon-greedy
    v = random.uniform(0, 1)
    q_value, index = t.topk(logits, 1, dim = 1)
    
    #epsilon-greedy: with probability (1 - epsilon) act greedily, otherwise explore
    if v > epsilon:
        
        #Greedy branch: take the action with the largest state-action value
        q_value_t = logits[0,index[0][0]]
        action = index[0][0].item()
        
    else:
        #Exploration branch: pick a random action
        action = random.randint(0, 5)
        q_value_t = logits[0, action]
    return action, q_value_t


def q_learning():

    all_count = 0
    #Start training
    #First build the two value networks (predict and target)
    q_target = Q_Net_target(6)
    q_predict = Q_Net_predict(6)
    
    #Build an optimizer for the predict network
    opt_Adam = t.optim.Adam(q_predict.parameters(),lr = lr)
    #Freeze the parameters of the target network
    for p in q_target.parameters():
        p.requires_grad = False
    
    for _ in range(epochs):
        done = False
        #Initialize a state
        observation = env.reset() #the initial state of each episode
        while not done:
            env.render()
            
            #Every nums_p2t updates, copy the predict network's parameters into the target network
            if all_count % nums_p2t == 0:
                target_paras = q_target.state_dict()
                predict_paras = q_predict.state_dict()
                
                target_paras.update(predict_paras)
                q_target.load_state_dict(target_paras)
            #Choose an action from the output of the q_predict network
            predict_logits = q_predict(observation)
            action, q_value_t = choose_action(logits= predict_logits)
            
            #Take the action to get the reward and the observation at the next time step
            observation, reward, done, info = env.step(action)

            #With the new observation, use the target network to compute its value
            target_qvalue = q_target(observation)
            q_value_t_ = max(target_qvalue[0]).item()
            
            #Move q_value_t towards the TD target: reward plus the discounted target-network value
            #(no bootstrap term if the episode has ended); reset gradients before backprop
            td_target = reward + (0.0 if done else discount_factor * q_value_t_)
            loss = (td_target - q_value_t)**2
            opt_Adam.zero_grad()
            loss.backward()
            opt_Adam.step()
            all_count +=1
    env.close()
            
            
if __name__ == '__main__':
    q_learning()
    

五、References

1. Hung-yi Lee's reinforcement learning lectures
2. Morvan Zhou's (莫烦Python) reinforcement learning series
3. Fix for the missing ale_c.dll file
4. The OpenAI official website
