Training a DQN with image frames as states in gym's MountainCar environment

Applying DQN to the gym MountainCar-v0 environment

1. Gym Environment

First, start the environment. Taking a random action returns several variables; a minimal code example:

env = gym.make('MountainCar-v1') # open an environment; this is the modified version explained below
env.reset() # reset the environment
action = env.action_space.sample() # sample a random action from the action space
state, reward, done, info = env.step(action) # take the action and receive the state, reward, done flag and extra info
1.1 ACTION SPACE

There are three discrete actions: push left, do nothing, and push right, i.e. action = [0, 1, 2].

1.2 STATE SPACE

The original state has two components, the car's position and velocity: state = [position, velocity]

where position ∈ [-1.2, 0.6] and velocity ∈ [-0.07, 0.07].

The conventional approach updates the Q-value from this explicit state. In this experiment, image frames are used as the state instead: a convolutional neural network takes the frames and outputs Q-values, from which the action is then selected.
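As a quick sketch of what the raw state looks like, a frame can be rendered directly from the environment (using the stock MountainCar-v0 here; the exact frame size depends on the render backend, (400, 600, 3) is typical for classic-control environments but not guaranteed):

import gym

env = gym.make('MountainCar-v0')
env.reset()

# render the current screen as an RGB array instead of opening a window
frame = env.render(mode='rgb_array')
print(frame.shape)  # e.g. (400, 600, 3): height x width x RGB channels
env.close()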

1.3 REWARD

The reward is -1 per time step.

1.4 DONE

Normally done=False; once the task is completed, done=True. However, the built-in version is configured so that if the goal is not reached within 200 steps, the episode ends and the environment has to be reset with env.reset().

❗️Since done is set to True both when the 200-step limit is hit and when the goal is actually reached, and the agent rarely reaches the goal within 200 steps, this setting is unfavourable for training. We therefore change the registration.

The registration file is located at: XXX/anaconda3/envs/py36/lib/python3.6/site-packages/gym/envs/__init__.py

Opening it, you can see the following two registrations, the discrete and the continuous variant.

Copy the first one, change the maximum episode steps to 100000, register it as MountainCar-v1 and save the file.

register(
    id='MountainCar-v0',
    entry_point='gym.envs.classic_control:MountainCarEnv',
    max_episode_steps=200,
    reward_threshold=-110.0,
)
register(
    id='MountainCarContinuous-v0',
    entry_point='gym.envs.classic_control:Continuous_MountainCarEnv',
    max_episode_steps=999,
    reward_threshold=90.0,
)
# register a new environment
register(
    id='MountainCar-v1',
    entry_point='gym.envs.classic_control:MountainCarEnv',
    max_episode_steps=100000,
    reward_threshold=-110.0,
)

Update (2021-03-26): another way to change gym's maximum episode steps:

import gym
env = gym.make("CartPole-v0")
print("The default max episode steps is", env._max_episode_steps)
env._max_episode_steps = 500
print("After changing, the max episode steps is", env._max_episode_steps)

The default max episode steps is 200
After changing, the max episode steps is 500
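Another option, if you would rather not re-register the environment at all, is to distinguish the two done cases directly. In reasonably recent gym versions the TimeLimit wrapper reports truncation through info (whether this key is present depends on the gym version, so treat this as a sketch):

import gym

env = gym.make('MountainCar-v0')
env.reset()
done = False
while not done:
    _, reward, done, info = env.step(env.action_space.sample())

# True only when the episode ended because of the 200-step limit,
# not because the car actually reached the goal
print('time limit hit:', info.get('TimeLimit.truncated', False))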

2. Deep Q-learning

2.1 Preprocess Frame

In practice, such large images are not needed for training. The preprocessing consists of the following steps:

  • Convert the RGB image to grayscale
  • (Optional) crop away useless regions, e.g. a game's scoreboard area
  • Normalize the pixel values to [0, 1] to reduce computation
  • Resize the image to [84, 84].
def preprocess_frame(frame):
    gray = rgb2gray(frame)
    # crop the frame
    # cropped_frame = gray[:,:]
    normalized_frame = gray/255.0
    preprocessed_frame = transform.resize(normalized_frame, [84,84])
    return preprocessed_frame
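A quick usage check, assuming the environment from section 1 has been created and reset (the input frame size does not matter, since the output is always resized to 84×84):

frame = env.render(mode='rgb_array')    # raw RGB frame
processed = preprocess_frame(frame)
print(processed.shape)                  # (84, 84): a single grayscale frame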
2.2 Stack Frames

Every four frames are stacked together, giving a state of shape [84, 84, 4].

Two cases have to be handled. First, create a deque whose slots each hold an [84, 84] frame, with a maximum length of 4:

  • New episode (done=True)

    • Initialize by copying the first frame four times to fill the deque
  • Current episode not finished (done=False)

    • Append the newest frame to the deque; the oldest frame is removed automatically

def stack_frames(stacked_frames, state, is_new_episode):
    # Preprocess frame
    frame = preprocess_frame(state)
    
    if is_new_episode:
        # Clear our stacked_frames
        stacked_frames = deque([np.zeros((84,84), dtype=np.int) for i in range(stack_size)], maxlen=4)
        
        # Because we're in a new episode, copy the same frame 4x
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        
        # Stack the frames
        stacked_state = np.stack(stacked_frames, axis=2)
        
    else:
        # Append frame to deque, automatically removes the oldest frame
        stacked_frames.append(frame)

        # Build the stacked state (first dimension specifies different frames)
        stacked_state = np.stack(stacked_frames, axis=2) 
    
    return stacked_state, stacked_frames
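A minimal usage sketch of stack_frames, assuming env, stack_size and the deque above are already defined:

frame = env.render(mode='rgb_array')
# at the start of an episode: the deque is filled with four copies of the first frame
state, stacked_frames = stack_frames(stacked_frames, frame, True)
print(state.shape)  # (84, 84, 4)

# on later steps: push the newest frame, the oldest one drops out automatically
frame = env.render(mode='rgb_array')
state, stacked_frames = stack_frames(stacked_frames, frame, False)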

After the frames have been stored, they can also be read back and displayed.

batch_size = 64
# instantiate the DQN class to create a memory
memory = DQNetwork()

replay_batch = memory.sample(batch_size) 

s_batch = [replay[0] for replay in replay_batch][np.random.randint(0,batch_size)] 
# print("the shape of s_batch is",s_batch.shape)
# (s_batch[63]).shape       84*84*4 
next_s_batch = [replay[3] for replay in replay_batch][np.random.randint(0,batch_size)] # (s_batch[63]).shape       84*84*4 
# print("the shape of next_s_batch is",next_s_batch.shape)
# the corresponding image can then be displayed; multiply by 255 and pick one channel.

plt.imshow(255*s_batch[:,:,3])
2.3 Replay Buffer
  • When predicting the Q-value, we do not use the current state directly but states sampled from the replay buffer
  • Three operations are defined in turn:
    • add an experience
    • sample a batch of experiences from the buffer
    • pick a single experience from that sampled batch
# batch_size defines how many experiences are drawn from the replay buffer at a time
    def add(self, experience):
        self.buffer.append(experience)
        
    def sample(self, batch_size):
        buffer_size = len(self.buffer) 
        index = np.random.choice(np.arange(buffer_size), size = batch_size, replace = True) # sample with replacement, since early on the buffer holds fewer than batch_size experiences
        return [self.buffer[i] for i in index]
    
    def train(self, batch_size=64):
        replay_batch = self.sample(batch_size)
        # to keep things simple, after sampling the batch, randomly pick one experience from it
        batch_number = np.random.randint(0, batch_size)
        s_batch = [replay[0] for replay in replay_batch][batch_number]
        next_s_batch = [replay[3] for replay in replay_batch][batch_number]
        

Now let us analyze the dimensions (a quick sanity check follows this list):

  • replay_batch: size 64 × 5, i.e. 64 experiences, each made up of 5 parts
  • replay: size 1 × 5; batch_number decides which of the 64 experiences it is.
    • replay consists of 5 parts: [(84,84,4), action ∈ [0,1,2], reward = -1, (84,84,4), False or True]
  • s_batch: replay[0], of size 84 × 84 × 4, four consecutive frames stacked together.
  • next_s_batch: replay[3], of size 84 × 84 × 4, four consecutive frames stacked together.
  • action: replay[1], one of the actions [0, 1, 2]
  • reward: replay[2], the reward at the current time step
  • Done: replay[4], True or False, whether the current episode has ended.
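A small sanity check of these shapes, assuming memory is the DQNetwork instance from above and its buffer already holds at least 64 experiences of the form (state, action, reward, next_state, done):

replay_batch = memory.sample(64)
state, action, reward, next_state, done = replay_batch[np.random.randint(0, 64)]

print(state.shape)       # (84, 84, 4)
print(action, reward)    # e.g. 1 -1.0
print(next_state.shape)  # (84, 84, 4)
print(done)              # True or False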
2.4 Q-target network and Q network
class DQNetwork:
    def __init__(self):
        self.step = 0
        self.update_freq = 50  # target model update frequency
    # some methods are omitted; the key statements are:
    # every update_freq steps, copy the weights of model into target_model
    def train(self):
        self.step += 1
        if self.step % self.update_freq == 0:
            self.target_model.set_weights(self.model.get_weights())
    # more statements omitted; the key ones are:
        Q = self.model.predict(s_batch.reshape(-1, 84, 84, 4))
        Q_next = self.target_model.predict(next_s_batch.reshape(-1, 84, 84, 4))
  • Every fixed number of steps, the weights of the Q network are copied into the Q-target network.
  • The Q-target network has exactly the same architecture as the Q network.

Here reshape(-1, 84, 84, 4) adds a batch dimension, turning (84, 84, 4) into (1, 84, 84, 4). The network expects input of the form (batch_size, height, width, channels), so even for a single state (batch size 1) the input has to be reshaped.
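A quick check of this reshape in plain NumPy, independent of the network:

import numpy as np

s = np.zeros((84, 84, 4))
print(s.reshape(-1, 84, 84, 4).shape)   # (1, 84, 84, 4): a batch containing one state
print(np.stack([s, s]).shape)           # (2, 84, 84, 4): a batch of two states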

2.5 Network Model
    def create_model(self):

        # this is the network architecture of the original Atari DQN
        inputs = layers.Input(shape=(84, 84, 4,))

        # Convolutions on the frames on the screen
        layer1 = layers.Conv2D(32, 8, strides=4, activation="relu")(inputs)
        layer2 = layers.Conv2D(64, 4, strides=2, activation="relu")(layer1)
        layer3 = layers.Conv2D(64, 3, strides=1, activation="relu")(layer2)

        layer4 = layers.Flatten()(layer3)

        layer5 = layers.Dense(512, activation="relu")(layer4)
        action = layers.Dense(3, activation="linear")(layer5)

        model=keras.Model(inputs=inputs, outputs=action)

        return model

Calling model.summary() shows its structure:

Model: "model_18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_19 (InputLayer)        [(None, 84, 84, 4)]       0         
_________________________________________________________________
conv2d_54 (Conv2D)           (None, 20, 20, 32)        8224      
_________________________________________________________________
conv2d_55 (Conv2D)           (None, 9, 9, 64)          32832     
_________________________________________________________________
conv2d_56 (Conv2D)           (None, 7, 7, 64)          36928     
_________________________________________________________________
flatten_18 (Flatten)         (None, 3136)              0         
_________________________________________________________________
dense_36 (Dense)             (None, 512)               1606144   
_________________________________________________________________
dense_37 (Dense)             (None, 3)                 1539      
=================================================================
Total params: 1,685,667
Trainable params: 1,685,667
Non-trainable params: 0
2.6 Update Q-Value

The main update formula is:

$$Q_{new}=(1-\alpha)\,Q_{old}+\alpha\,(reward+\gamma \max Q_{future})$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.

  • $Q_{future}$ is taken from the Q-target network
  • $Q_{old}$ is taken from the Q network

After the update, the value $Q_{new}$ is written back to the Q network.

Q[0][a] = ( 1 - lr) * Q[0][a] + lr * (reward + factor * np.amax(Q_next[0]))
  • For example, Q takes the value:
Q = array([[ 0.2  , -0.525,  0.4  ]])
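As a worked numeric example of the update, take hypothetical values lr = 0.1, factor = 0.95, action a = 2 and max(Q_next[0]) = 0.3:

import numpy as np

Q = np.array([[0.2, -0.525, 0.4]])
Q_next = np.array([[0.1, 0.3, 0.25]])
lr, factor, reward, a = 0.1, 0.95, -1.0, 2

# (1 - 0.1) * 0.4 + 0.1 * (-1 + 0.95 * 0.3) = 0.36 - 0.0715 = 0.2885
Q[0][a] = (1 - lr) * Q[0][a] + lr * (reward + factor * np.amax(Q_next[0]))
print(Q)  # [[ 0.2    -0.525   0.2885]]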

To summarize, the update procedure is as follows:

  1. Initialize Q = Q target

    repeat

    1. Update Q using the formula above, with values from Q and Q target
    2. Every fixed number of steps, copy the weights of Q into Q target to update Q target
2.7 Save and Load Weights
class DQNetwork:
    def save_model(self, file_path='MountainCar-v1-dqn.h5'):
        print('model saved')
        self.model.save(file_path)
agent_test = DQNetwork()
agent_test.model.load_weights(r'/home/shy/桌面/MountainCar-v1-dqn.h5')

How can we confirm that the weights were loaded successfully? In TensorFlow 1, loading them prints the list of stored weight tensors.

In TensorFlow 2 this information may not be shown. After loading, call get_weights(): if a newly created instance ends up with the same weights, the load succeeded; otherwise, since random initialization gives different weights each time, mismatching weights mean the load failed.
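A minimal check along these lines, loading the file saved in section 2.7 twice and comparing the weights (only standard Keras/NumPy calls are used):

agent_a = DQNetwork()
agent_a.model.load_weights('MountainCar-v1-dqn.h5')

agent_b = DQNetwork()
agent_b.model.load_weights('MountainCar-v1-dqn.h5')

# if loading worked, both models hold identical weights;
# two randomly initialized models would almost surely differ
same = all(np.array_equal(a, b) for a, b in zip(agent_a.model.get_weights(),
                                                agent_b.model.get_weights()))
print('weights loaded consistently:', same)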

3. Some Details

3.1 Save Frames
  • In gym, env.render(mode='rgb_array') renders the current state as an image, but this is very CPU-intensive; many people have raised the same issue at https://github.com/openai/gym/issues/659. Garbage collection might help to save some memory.

The classic-control games do not offer an interface that returns the frame without rendering, i.e. env.render(mode='rgb_array', close=True), whereas some Atari games do support this call to save memory.

  • Without saving frames to disk, assigning env.render(mode='rgb_array') to one variable, taking an action, and then assigning env.render(mode='rgb_array') to a second variable gives two variables that point to the same buffer, so both images end up identical.
  • Therefore the frames are saved to disk and read back again to work around this problem.
def save_gym_state(env,i):

    next_state = env.render(mode='rgb_array')

    plt.imsave('/home/shy/state/state{}.png'.format(i),preprocess_frame(next_state)) 
    next_state = plt.imread('/home/shy/state/state{}.png'.format(i))
    
    return next_state
3.2 Train and Test Agent

This is split into three steps:

  • With no pre-trained weights, initialize the weights randomly and train a Q network
  • Load the learned weights and train the Q network further
  • Test the Q network by directly returning the action with the largest Q-value

Because of limited memory, the second step can set episodes to 1000 and be repeated several times to fit a more accurate Q network.

4. Full Code

4.1 Import
import matplotlib.pyplot as plt
import tensorflow as tf
import gym
import scipy
import numpy as np
from skimage import transform # Help us to preprocess the frames
from skimage.color import rgb2gray # Help us to gray our frames
from collections import deque
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
config = ConfigProto()
config.allow_soft_placement=True
config.gpu_options.per_process_gpu_memory_fraction=0.8
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
4.2 Defined Functions and Classes
# helper functions: preprocess a frame, and either save it as an image or return the processed array
def preprocess_frame(frame):
    gray = rgb2gray(frame)
    # crop the frame
    # cropped_frame = gray[:,:]
    normalized_frame = gray/255.0
    preprocessed_frame = transform.resize(normalized_frame, [84,84])
    return preprocessed_frame


stack_size = 4 # We stack 4 frames

# Initialize deque with zero-images one array for each image
stacked_frames  =  deque([np.zeros((84,84), dtype=np.int) for i in range(stack_size)], maxlen=4)

def stack_frames(stacked_frames, state, is_new_episode):
    # Preprocess frame
    frame = preprocess_frame(state)
    
    if is_new_episode:
        # Clear our stacked_frames
        stacked_frames = deque([np.zeros((84,84), dtype=np.int) for i in range(stack_size)], maxlen=4)
        
        # Because we're in a new episode, copy the same frame 4x
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        
        # Stack the frames
        stacked_state = np.stack(stacked_frames, axis=2)
        
    else:
        # Append frame to deque, automatically removes the oldest frame
        stacked_frames.append(frame)

        # Build the stacked state (first dimension specifies different frames)
        stacked_state = np.stack(stacked_frames, axis=2) 
    
    return stacked_state, stacked_frames

def save_gym_state(env,i):

    next_state = env.render(mode='rgb_array')

    plt.imsave('/home/shy/state/state{}.png'.format(i),preprocess_frame(next_state)) 
    next_state = plt.imread('/home/shy/state/state{}.png'.format(i))
    
    return next_state
class DQNetwork:
    def __init__(self):
        self.step = 0
        self.update_freq = 50  # target model update frequency
        
        self.buffer = deque(maxlen = 200)
        self.model = self.create_model()

        self.target_model = self.create_model()
    def add(self, experience):
        self.buffer.append(experience)
    
    def sample(self, batch_size):
        buffer_size = len(self.buffer)  # note: early in training the buffer may still be almost empty
        
        index = np.random.choice(np.arange(buffer_size),
                                size = batch_size,
                                replace = True)  # sample with replacement, since early on the buffer holds fewer than batch_size experiences
        
        return [self.buffer[i] for i in index]
        
    def create_model(self):

        # this is the network architecture of the original Atari DQN
        inputs = layers.Input(shape=(84, 84, 4,))

        # Convolutions on the frames on the screen
        layer1 = layers.Conv2D(32, 8, strides=4, activation="relu")(inputs)
        layer2 = layers.Conv2D(64, 4, strides=2, activation="relu")(layer1)
        layer3 = layers.Conv2D(64, 3, strides=1, activation="relu")(layer2)

        layer4 = layers.Flatten()(layer3)

        layer5 = layers.Dense(512, activation="relu")(layer4)
        action = layers.Dense(3, activation="linear")(layer5)

        model=keras.Model(inputs=inputs, outputs=action)

        return model

    
    def act(self, state, epsilon=0.1):
        """预测动作"""
        # 刚开始时,加一点随机成分,产生更多的状态
        if np.random.uniform() < epsilon - self.step * 0.0002:
            return np.random.choice([0, 1, 2])
        return np.argmax(self.model.predict(state.reshape(-1, 84, 84, 4)))
                         
    def save_model(self, file_path='MountainCar-v0-dqn.h5'):
        print('model saved')
        self.model.save(file_path)
                         
    def train(self, batch_size=64, lr=0.1, factor=0.95):

        self.step += 1
        # every update_freq steps, copy the weights of model into target_model
        if self.step % self.update_freq == 0:
            self.target_model.set_weights(self.model.get_weights())
        
        replay_batch = self.sample(batch_size) 
        
        # num = np.random.randint(0, batch_size)  # simplest case: randomly pick one sample from the batch as input
        
        s_batch = [replay[0] for replay in replay_batch][np.random.randint(0,batch_size)] 
        #print("the shape of s_batch is",np.array(s_batch).shape)
        # (s_batch[63]).shape       84*84*4   
        next_s_batch = [replay[3] for replay in replay_batch][np.random.randint(0,batch_size)] # shape (84, 84, 4)
        #print("the shape of next_s_batch is",np.array(next_s_batch).shape)

        Q = self.model.predict(s_batch.reshape(-1, 84, 84, 4))
        Q_next = self.target_model.predict(next_s_batch.reshape(-1, 84, 84, 4))

        # update the Q values in the batch using the formula above
        for i, replay in enumerate(replay_batch):
#             print("the shape of replay_batch is",np.array(replay_batch).shape)
#             print("the shape of replay is ",replay[0].shape)
#             print("the action of replay is",replay[1])
#             print("the reward of replay is ",replay[2])
#             print("the shape of next_replay is ",replay[3].shape)
#             print("the last replay is ",replay[4])
            a = replay[1]
            reward = replay[2]
            Q[0][a] = (1 - lr) * Q[0][a] + lr * (reward + factor * np.amax(Q_next[0]))
        
        # feed the batch into the network for training
        self.model.compile(loss='mean_squared_error',
                           optimizer=keras.optimizers.Adam(0.001))
        self.model.fit(s_batch.reshape(-1, 84, 84, 4), Q, verbose=0) 
4.3 Main Function (Random Initialization)
##########################################
# Main function for the initial training #
##########################################
env = gym.make('MountainCar-v1')
episodes = 1000  # number of training iterations
score_list = [] 
agent = DQNetwork()
env.reset()
##########
# 1. After modifying the registration, restart the kernel
# 2. env.render(mode='rgb_array',close=True) can save memory, but it does not work for this environment
# 3. After raising the max steps in the registration, increase episodes; otherwise the default episode ends automatically after 200 steps and a new one starts
###########

#state = env.render(mode='rgb_array')
#plt.imsave('/home/shy/state/state.png',preprocess_frame(state))
#state = plt.imread('/home/shy/state/state.png')

state = save_gym_state(env,'init')

stacked_frames  =  deque([np.zeros((84,84), dtype=np.int) for i in range(stack_size)], maxlen=4)
score = 0
for i in range(episodes):
    #action = env.action_space.sample() # initially, pick a random action from the environment 
    action = agent.act(state)
    _ , reward, done, _ = env.step(action)

    if i % 200 == 0:
        print("# It has finished {} episodes".format(i))
    
    #next_state = env.render(mode='rgb_array')
    #plt.imsave('/home/shy/state/state{}.png'.format(i),preprocess_frame(next_state)) 
    #next_state = plt.imread('/home/shy/state/state{}.png'.format(i))
    next_state = save_gym_state(env,i)
    
    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
    if done: # the current episode is over
        next_state = np.zeros([84,84,4]) # the final step of this episode has no future frames, so fill the next state with zeros
        score += reward
        score_list.append(score)
        agent.add((state, action, reward, next_state, done))
        print("# reward of the episode ending at step {}:".format(i), score)
        env.reset() 
        state = env.render(mode='rgb_array') 
        score = 0 # reset the score
        #plt.imsave('/home/shy/state/state{}.png'.format(i),preprocess_frame(state))
        #state = plt.imread('/home/shy/state/state{}.png'.format(i))
        state = save_gym_state(env,i)
        
        state, stacked_frames = stack_frames(stacked_frames, state, True)
    else:
        #print("the shape of state is ",state.shape)
        #print("the shape of next_state is ",next_state.shape)
        agent.add((state, action, reward, next_state, done)) # add the experience to the replay memory
        agent.train() # use the same instance here, otherwise the memory cannot be shared
        score += reward
        score_list.append(score)
        state = next_state
agent.save_model(r'/home/shy/桌面/MountainCar-v1-dqn.h5')
4.4 Main Function (Trained Weights Initialization)
############################################################
# Main function for continuing training with saved weights #
############################################################
env = gym.make('MountainCar-v1')
episodes = 1000  # number of training iterations
score_list = [] 
agent_train = DQNetwork()
agent_train.model.load_weights(r'/home/shy/桌面/MountainCar-v1-dqn.h5')
env.reset()
##########
# 1. After modifying the registration, restart the kernel
# 2. env.render(mode='rgb_array',close=True) can save memory, but it does not work for this environment
# 3. After raising the max steps in the registration, increase episodes; otherwise the default episode ends automatically after 200 steps and a new one starts
###########

#state = env.render(mode='rgb_array')
#plt.imsave('/home/shy/state/state.png',preprocess_frame(state))
#state = plt.imread('/home/shy/state/state.png')

state =  save_gym_state(env,'init')
stacked_frames  =  deque([np.zeros((84,84), dtype=np.int) for i in range(stack_size)], maxlen=4)
score = 0
for i in range(episodes):
    #action = env.action_space.sample() # initially, pick a random action from the environment 
    action = agent_train.act(state)
    _ , reward, done, _ = env.step(action)

    if i % 200 == 0:
        print("# It has finished {} episodes".format(i))
    
    #next_state = env.render(mode='rgb_array')
    #plt.imsave('/home/shy/state/state{}.png'.format(i),preprocess_frame(next_state)) 
    #next_state = plt.imread('/home/shy/state/state{}.png'.format(i))
    next_state = save_gym_state(env,i)
    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
    if done: # the current episode is over
        next_state = np.zeros([84,84,4]) # the final step of this episode has no future frames, so fill the next state with zeros
        score += reward
        score_list.append(score)
        agent_train.add((state, action, reward, next_state, done))
        print("# 第{}个episode的reward为".format(i),score)
        env.reset() 
        state = env.render(mode='rgb_array') 
        score = 0 # reset the score
        #plt.imsave('/home/shy/state/state{}.png'.format(i),preprocess_frame(state))
        #state = plt.imread('/home/shy/state/state{}.png'.format(i))
        state =  save_gym_state(env,i)
        
        state, stacked_frames = stack_frames(stacked_frames, state, True)
    else:
        #print("the shape of state is ",state.shape)
        #print("the shape of next_state is ",next_state.shape)
        agent_train.add((state, action, reward, next_state, done)) # add the experience to the replay memory
        agent_train.train() # use the same instance here, otherwise the memory cannot be shared
        score += reward
        score_list.append(score)
        state = next_state
agent_train.save_model(r'/home/shy/桌面/MountainCar-v1-dqn_weights.h5')
4.5 Main Function (Test)
# changes for testing
     action = agent_weight_train.act(state) # directly choose the action with the largest Q-value
# remove the training step and the replay-memory step; only keep stacking the 4 frames into the queue as the network input
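A sketch of a full test loop under these changes (agent_weight_train is assumed to be a DQNetwork whose weights were loaded as in section 2.7; here the action is taken greedily via np.argmax on the model output instead of act(), which still keeps a little randomness):

# minimal test loop (sketch): greedy actions, no training, no replay memory
env = gym.make('MountainCar-v1')
env.reset()

agent_weight_train = DQNetwork()
agent_weight_train.model.load_weights(r'/home/shy/桌面/MountainCar-v1-dqn.h5')

state = save_gym_state(env, 'test_init')
stacked_frames = deque([np.zeros((84, 84), dtype=int) for _ in range(stack_size)], maxlen=4)
state, stacked_frames = stack_frames(stacked_frames, state, True)

done, score = False, 0
while not done:
    # pick the action with the largest predicted Q-value
    action = np.argmax(agent_weight_train.model.predict(state.reshape(-1, 84, 84, 4)))
    _, reward, done, _ = env.step(action)
    score += reward

    next_state = save_gym_state(env, 'test')
    state, stacked_frames = stack_frames(stacked_frames, next_state, False)

print("test episode reward:", score)
env.close()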

Demo: Bilibili
