A Detailed Analysis of 莫烦 (Morvan)'s DQN Code

For getting started with Python, 莫烦's tutorials are a great choice — go search for his videos on Bilibili!
As a complete beginner, I went through 莫烦's reinforcement learning introduction; here I review and summarize the DQN part as my own notes.
The main content is a set of detailed comments on the code.
DQN uses two networks with identical structure: an eval network and a target network. Only the eval network is trained; after a fixed number of steps its parameters are copied into the target network.
maze_env.py is the environment file. It builds a simple maze game with traps, so I will not analyze it in detail.
RL_brain.py is the file that builds the network structure:
The DeepQNetwork class contains five functions.
n_actions is the size of the action space (4 here: up, down, left and right); n_features is the number of state features (2 here: the position coordinates).
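
For reference, a minimal sketch of what the constructor of such a class typically looks like in this tutorial family. The default values (200-step target replacement, 2000-row memory, batch size 32) are assumptions based on commonly used settings rather than quotes from this post, and numpy/TensorFlow 1.x are assumed to be imported as np and tf at the top of RL_brain.py:

class DeepQNetwork:
    # sketch of the constructor; default values are assumed, not quoted from the post
    def __init__(self, n_actions, n_features,
                 learning_rate=0.01, reward_decay=0.9, e_greedy=0.9,
                 replace_target_iter=200, memory_size=2000, batch_size=32,
                 e_greedy_increment=None):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay                        # discount factor
        self.epsilon_max = e_greedy                      # final greedy probability
        self.replace_target_iter = replace_target_iter   # how often to sync target_net
        self.memory_size = memory_size                   # rows in the replay memory
        self.batch_size = batch_size
        self.epsilon_increment = e_greedy_increment
        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max
        self.learn_step_counter = 0
        # each row stores [s, a, r, s_]: n_features*2 + 2 columns
        self.memory = np.zeros((self.memory_size, n_features * 2 + 2))
        self._build_net()
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())
        self.cost_his = []                               # loss history, appended in learn()
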
Function _build_net(self) (honestly, the comments below could hardly be more detailed):
Building the eval network:

# ------------------ build evaluate_net ------------------
# input: receives the observation
self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')
# for calculating loss: receives the q_target values
self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')
# two layers l1, l2; the hidden layer has 10 neurons, the output layer has one output per action
# variable_scope() is a context manager for the ops that create variables (layers)
with tf.variable_scope('eval_net'):
    # c_names (collections_names) are the collections used to store variables;
    # they are needed later when copying parameters into target_net
    # the backslash is a line continuation (no [] or () wrapping here)
    c_names, n_l1, w_initializer, b_initializer = \
        ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
        tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)
    # config of layers: n_l1 is the number of neurons in the first layer

    # first layer of eval_net; the collections are used when updating target_net parameters
    with tf.variable_scope('l1'):
        w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
        b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
        l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)

    # second layer of eval_net; the collections are used when updating target_net parameters
    with tf.variable_scope('l2'):
        w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
        b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
        self.q_eval = tf.matmul(l1, w2) + b2
        # estimated Q value for every action

with tf.variable_scope('loss'):   # compute the loss
    self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
with tf.variable_scope('train'):  # gradient descent with RMSProp
    self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

Two fully connected layers: the hidden layer l1 has 10 neurons and the output layer l2 has one unit per action; the final output is q_eval, from which the loss is computed.
The target network is built in much the same way and has the same structure; it takes the next state s_ as input and its output is q_next (a sketch follows below).
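
The post does not reproduce the target-net code; below is a minimal sketch of what it and the parameter-copy op typically look like in this style of code, reusing n_l1, w_initializer and b_initializer from the eval_net block above (the exact names are assumptions that mirror that block):

# ------------------ build target_net ------------------
self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')   # next state
with tf.variable_scope('target_net'):
    c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]
    with tf.variable_scope('l1'):
        w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
        b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
        l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)
    with tf.variable_scope('l2'):
        w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
        b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
        self.q_next = tf.matmul(l1, w2) + b2    # Q values produced by the target network

# the op that copies eval_net parameters into target_net (run inside learn())
t_params = tf.get_collection('target_net_params')
e_params = tf.get_collection('eval_net_params')
self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

Storing the variables under the 'eval_net_params' and 'target_net_params' collections is exactly what makes this one-line replace_target_op possible.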

Function store_transition(): store a transition in memory

def store_transition(self, s, a, r, s_):
    # hasattr() checks whether an object has the given attribute:
    # it returns True if the attribute exists, otherwise False
    if not hasattr(self, 'memory_counter'):
        self.memory_counter = 0
    # record one transition [s, a, r, s_]
    transition = np.hstack((s, [a, r], s_))
    # numpy.hstack(tup): tup can be a tuple, list, or numpy array;
    # it stacks the pieces horizontally into a single 1-D row

    # the total memory size is fixed; once it is full, old memories are overwritten by new ones
    index = self.memory_counter % self.memory_size
    self.memory[index, :] = transition

    self.memory_counter += 1

Each transition is written row by row into the memory; once the memory is full, the oldest rows are overwritten (a circular buffer).
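
With n_features = 2, each stored row therefore has 2 + 1 + 1 + 2 = 6 columns. A small numerical illustration (the values are made up):

import numpy as np

s  = np.array([0.5, -0.5])          # current state (2 features)
a, r = 2, 1.0                       # action index and reward
s_ = np.array([0.25, -0.5])         # next state (2 features)

transition = np.hstack((s, [a, r], s_))
# array([ 0.5 , -0.5 ,  2.  ,  1.  ,  0.25, -0.5 ])
# columns [0:2] -> s   (batch_memory[:, :self.n_features] in learn())
# column  [2]   -> a   (batch_memory[:, self.n_features])
# column  [3]   -> r   (batch_memory[:, self.n_features + 1])
# columns [4:6] -> s_  (batch_memory[:, -self.n_features:])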

Function choose_action(): choose an action

def choose_action(self, observation):
    # add a batch dimension so the shape matches the tf placeholder: (1, size_of_observation)
    observation = observation[np.newaxis, :]
    # np.newaxis adds an axis: [] becomes [[]], turning the 1-D vector into a 2-D row

    if np.random.uniform() < self.epsilon:
        # forward-feed the observation through eval_net to get the Q value of every action,
        # then pick the action with the largest value
        actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
        action = np.argmax(actions_value)  # index of the maximum value
    else:
        action = np.random.randint(0, self.n_actions)
    return action

If a uniform random number is below epsilon, the action is the index of the largest value in q_eval (greedy); otherwise a random action is drawn from the action space. Note that in this code epsilon is the probability of acting greedily, and it is gradually increased during learning.
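
To make the shape handling concrete, a quick illustration (the Q values here are made up):

import numpy as np

observation = np.array([0.5, -0.5])       # shape (2,)
observation = observation[np.newaxis, :]  # shape (1, 2): one row per batch element

actions_value = np.array([[0.1, 0.5, -0.2, 0.3]])  # hypothetical q_eval output, shape (1, 4)
print(np.argmax(actions_value))                    # 1: index of the greedy action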

Function learn(): the agent's learning process

def learn(self):
    # check whether it is time to replace the target_net parameters
    if self.learn_step_counter % self.replace_target_iter == 0:
        self.sess.run(self.replace_target_op)  # copy eval_net parameters into target_net
        print('\ntarget_params_replaced\n')

    # sample a batch of memories from the whole memory
    if self.memory_counter > self.memory_size:
        sample_index = np.random.choice(self.memory_size, size=self.batch_size)
    else:
        sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
        # only sample from the rows that have actually been filled
    batch_memory = self.memory[sample_index, :]  # the randomly selected memories

    # get q_next (produced by target_net) and q_eval (produced by eval_net)
    q_next, q_eval = self.sess.run(
        [self.q_next, self.q_eval],
        feed_dict={
            self.s_: batch_memory[:, -self.n_features:],  # next states -> target_net (fixed params)
            self.s: batch_memory[:, :self.n_features],  # current states -> eval_net (newest params)
        })

    # change q_target w.r.t. q_eval's actions: first let q_target = q_eval
    q_target = q_eval.copy()

    batch_index = np.arange(self.batch_size, dtype=np.int32)
    # an index array of length self.batch_size: array([0, 1, 2, ..., 31])
    eval_act_index = batch_memory[:, self.n_features].astype(int)
    # the actions taken in the batch (length 32): column self.n_features (= 2, i.e. the third
    # column counting from 0) of batch_memory, which is exactly the action stored by
    # RL.store_transition(observation, action, reward, observation_)

    reward = batch_memory[:, self.n_features + 1]
    # the rewards of the batch (length 32): the column right after the action column

    q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

    """
    For example in this batch I have 2 samples and 3 actions:
    q_eval =
    [[1, 2, 3],
     [4, 5, 6]]
    q_target = q_eval =
    [[1, 2, 3],
     [4, 5, 6]]
    Then change q_target with the real q_target value w.r.t the q_eval's action.
    For example in:
        sample 0, I took action 0, and the max q_target value is -1;
        sample 1, I took action 2, and the max q_target value is -2:
    q_target =
    [[-1, 2, 3],
     [4, 5, -2]]
    So the (q_target - q_eval) becomes:       q值并不是对位相减
    [[(-1)-(1), 0, 0],
     [0, 0, (-2)-(6)]]
    We then backpropagate this error w.r.t the corresponding action to network,
    最后我们将这个 (q_target - q_eval) 当成误差, 反向传递会神经网络.
    所有为 0 的 action 值是当时没有选择的 action, 之前有选择的 action 才有不为0的值.
    我们只反向传递之前选择的 action 的值,
    leave other action as error=0 cause we didn't choose it.
    """

    # train the eval network
    _, self.cost = self.sess.run([self._train_op, self.loss],
                                 feed_dict={self.s: batch_memory[:, :self.n_features],
                                            self.q_target: q_target})
    self.cost_his.append(self.cost)  # record the cost (loss)

    # gradually increase epsilon to reduce the randomness of the behaviour
    self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
    self.learn_step_counter += 1

The eval network's parameters are updated continuously and are the ones actually trained; the target network is only used to produce q_next for the TD target in the loss. Every replace_target_iter (200 in this example) learning steps, the eval network's parameters are copied into the target network.

I am not sure why no one-hot mask is used here, which is also where 莫烦 got a little tangled when explaining the subtraction. The idea is simply: copy q_eval into q_target, then overwrite only the entries at the indices of the chosen actions with the TD target reward + gamma * max(q_next); all other entries keep their q_eval values, so when the loss is computed (q_target - q_eval) is exactly 0 at those positions, and only the Q values of the chosen actions are pushed toward the target by backpropagation.
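
A small, self-contained numpy illustration of this indexing trick (a batch of 2 samples and 3 actions, with made-up numbers):

import numpy as np

gamma = 0.9
q_eval = np.array([[1., 2., 3.],
                   [4., 5., 6.]])          # eval_net output for the batch
q_next = np.array([[0., 1., 0.],
                   [2., 0., 0.]])          # target_net output for the next states
eval_act_index = np.array([0, 2])          # actions actually taken in the two samples
reward = np.array([-1., 0.])

q_target = q_eval.copy()
batch_index = np.arange(2)
q_target[batch_index, eval_act_index] = reward + gamma * q_next.max(axis=1)
# q_target is now [[-0.1, 2. , 3. ],
#                  [ 4. , 5. , 1.8]]
# so (q_target - q_eval) is non-zero only at the chosen actions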

run_this.py is the file that runs everything:

def run_maze():
    step = 0   # counts steps, used to decide when to start learning
    for episode in range(100):
        # initialize the environment
        observation = env.reset()
        # print(observation)
        while True:
            # refresh (render) the environment
            env.render()
            # the DQN chooses an action based on the observation
            action = RL.choose_action(observation)
            # the environment returns the next state, the reward, and whether the episode is over
            observation_, reward, done = env.step(action)
            # the DQN stores this transition
            RL.store_transition(observation, action, reward, observation_)
            # start learning only after more than 200 transitions have been collected,
            # and then learn once every 5 steps
            if (step > 200) and (step % 5 == 0):
                RL.learn()
            # the next state becomes the current state of the next loop iteration
            observation = observation_
            # break out of the loop if the episode is over
            if done:
                break
            step += 1

    # end of game
    print('game over')
    env.destroy()

The execution flow is now fairly clear: the functions above are called in turn — interact with the environment to get an observation, choose an action, store the transition, learn, and train the network.
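
For completeness, the entry point that wires run_maze together with the environment and the agent typically looks roughly like the sketch below; the hyperparameter values are assumptions consistent with those mentioned above, and the maze environment is a tkinter window, hence the after()/mainloop() calls:

from maze_env import Maze
from RL_brain import DeepQNetwork

if __name__ == "__main__":
    env = Maze()
    RL = DeepQNetwork(env.n_actions, env.n_features,
                      learning_rate=0.01,
                      reward_decay=0.9,
                      e_greedy=0.9,
                      replace_target_iter=200,  # sync target_net every 200 learn() calls
                      memory_size=2000)         # replay-memory capacity
    env.after(100, run_maze)   # schedule run_maze inside the tkinter event loop
    env.mainloop()
    RL.plot_cost()             # if the class provides it: plot the recorded cost_his curve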

That is my understanding of the DQN code. Many thanks to 莫烦. My own level is limited, so if anything above is wrong, please point it out; questions and discussion are also welcome.

