Quickly Setting Up a DQN Environment with TensorFlow

Build a DQN neural network quickly and concisely with TensorFlow: there are only three pieces of code (building the network, using the network, and training the network), so the structure is easy to follow.

This post is mainly about setting up a DQN environment quickly. I think it is written fairly clearly; the way the network is built borrows from Morvan Python (莫烦python) and other bloggers. My code uses two agents, with the two players updating alternately (a rough sketch of that loop is shown below). The code is at: https://github.com/nsszlh/tensorflow-DQN
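As a rough illustration of that alternating setup, here is what the outer loop can look like. This is a sketch only, not code from the linked repo: the environment API and the AgentNet method names (`choose_action`, `store_transition`, `learn`) are assumptions in the style of Morvan's DQN tutorial, and AgentNet itself is built in the next section.

```
# Sketch only: assumes an env with reset()/step(), and an AgentNet class
# (defined below) exposing choose_action / store_transition / learn.
agents = [agent1, agent2]                    # two AgentNet instances sharing one tf.Session
state = env.reset()
done = False
turn = 0
while not done:
    agent = agents[turn % 2]                 # the two players take turns
    action = agent.choose_action(state)
    next_state, reward, done = env.step(action)
    agent.store_transition(state, action, reward, next_state)
    agent.learn()                            # each agent learns only from its own moves
    state = next_state
    turn += 1
```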

1 Building the network

The basic parameters the agent needs:


```
import tensorflow as tf   # TF1-style graph API (tf.Session, tf.variable_scope, ...)

class AgentNet():
    def __init__(
            self,
            name,
            n_action,
            n_state,
            sess,
            learning_rate=0.001,
            reward_decay=0.9,
            e_greedy=0.9,
            replace_target_iter=300,
            memory_size=50000,
            batch_size=32,
            e_greedy_increment=None
    ):
        self.n_action = n_action                        # number of actions
        self.n_state = n_state                          # number of state features
        self.alpha = learning_rate                      # learning rate
        self.gamma = reward_decay                       # discount factor in the Q-learning update
        self.epsilon_max = e_greedy                     # upper bound of the epsilon-greedy ratio
        self.replace_target_iter = replace_target_iter  # copy the eval net to the target net every this many learning steps
        self.memory_size = memory_size                  # replay-memory capacity
        self.batch_size = batch_size                    # number of transitions per learning step
        self.epsilon_increment = e_greedy_increment     # gradually increase the greedy ratio
        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max
        self.name = name                                # distinguishes the networks when several agents are built
        self.sess = sess                                # shared TensorFlow session (passed in here rather than taken from an outer scope)
        self.build()

        self.learn_step_counter = 0
        self.memory = []
        self.losses = []

        '''copy the eval network's parameters into the target network'''
        t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='target_net')
        e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='eval_net')
        with tf.variable_scope('hard_replacement'):
            self.target_replace_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]
```
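The snippet above ends right after the hard-replacement op and does not show the `build()` method that `__init__` calls. As a reference, here is a minimal sketch of what such a method could look like with the same TF1-style API: one hidden layer each for the eval net and the target net. The layer size, initializers, and optimizer are my assumptions, not the author's original code; note also that with several agents the variable scopes (and the `get_collection` calls above) would normally be prefixed with `self.name` so the two agents' networks stay separate.

```
    def build(self):
        # Sketch of a possible build(); not the repo's original implementation.
        self.s  = tf.placeholder(tf.float32, [None, self.n_state], name='s')    # current state
        self.s_ = tf.placeholder(tf.float32, [None, self.n_state], name='s_')   # next state
        self.q_target = tf.placeholder(tf.float32, [None, self.n_action], name='q_target')

        w_init = tf.random_normal_initializer(0.0, 0.3)
        b_init = tf.constant_initializer(0.1)

        # eval net: updated at every learning step
        with tf.variable_scope('eval_net'):
            h = tf.layers.dense(self.s, 20, tf.nn.relu,
                                kernel_initializer=w_init, bias_initializer=b_init, name='h1')
            self.q_eval = tf.layers.dense(h, self.n_action,
                                          kernel_initializer=w_init, bias_initializer=b_init, name='q')

        # target net: a frozen copy, refreshed every replace_target_iter steps
        with tf.variable_scope('target_net'):
            h = tf.layers.dense(self.s_, 20, tf.nn.relu,
                                kernel_initializer=w_init, bias_initializer=b_init, name='h1')
            self.q_next = tf.layers.dense(h, self.n_action,
                                          kernel_initializer=w_init, bias_initializer=b_init, name='q')

        with tf.variable_scope('loss'):
            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
        with tf.variable_scope('train'):
            self.train_op = tf.train.RMSPropOptimizer(self.alpha).minimize(self.loss)
```

With the `sess` argument added above, two agents can then be created against one shared session, for example:

```
sess = tf.Session()
agent1 = AgentNet('player1', n_action=4, n_state=8, sess=sess)   # illustrative sizes
agent2 = AgentNet('player2', n_action=4, n_state=8, sess=sess)   # in the repo, 'name' presumably keeps the two agents' variables apart
sess.run(tf.global_variables_initializer())
```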
DQN is a deep reinforcement learning algorithm that can be used to solve reinforcement learning problems. Building a DQN model in TensorFlow 2 takes the following steps:

1. Import the required libraries

```
import tensorflow as tf
import numpy as np                      # needed later by the replay buffer
from tensorflow.keras import layers
```

2. Build the model

The DQN model consists of two parts: the network and the loss function.

The network part:

```
class DQN(tf.keras.Model):
    def __init__(self, num_actions):
        super(DQN, self).__init__()
        self.conv1 = layers.Conv2D(32, 8, strides=4, activation='relu')
        self.conv2 = layers.Conv2D(64, 4, strides=2, activation='relu')
        self.conv3 = layers.Conv2D(64, 3, strides=1, activation='relu')
        self.flatten = layers.Flatten()
        self.dense1 = layers.Dense(512, activation='relu')
        self.dense2 = layers.Dense(num_actions)

    def call(self, inputs):
        x = self.conv1(inputs)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.flatten(x)
        x = self.dense1(x)
        x = self.dense2(x)
        return x
```

The loss-function part:

```
@tf.function
def compute_loss(model, target_model, states, actions, rewards, next_states, is_terminal, gamma):
    Q = model(states)
    Q_target = target_model(next_states)
    Q_target = tf.stop_gradient(Q_target)                 # the target net is not trained here
    max_Q = tf.reduce_max(Q_target, axis=1)
    # cast so numpy float64/bool batches combine cleanly with the float32 network output
    target_Q = tf.cast(rewards, tf.float32) + (1.0 - tf.cast(is_terminal, tf.float32)) * gamma * max_Q
    action_masks = tf.one_hot(actions, Q.shape[-1])       # one-hot over the action dimension of the network output
    Q_action = tf.reduce_sum(tf.multiply(Q, action_masks), axis=1)
    loss = tf.reduce_mean(tf.square(target_Q - Q_action))
    return loss
```

3. Define the optimizer

```
optimizer = tf.optimizers.Adam(learning_rate=learning_rate)   # learning_rate is a hyperparameter you set
```

4. Define the replay buffer

```
class ReplayBuffer(object):
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.num_experiences = 0
        self.buffer = []

    def get_batch(self, batch_size):
        if self.num_experiences < batch_size:
            return None
        indices = np.random.choice(self.num_experiences, size=batch_size, replace=False)
        states_batch = np.array([self.buffer[i][0] for i in indices])
        actions_batch = np.array([self.buffer[i][1] for i in indices])
        rewards_batch = np.array([self.buffer[i][2] for i in indices])
        next_states_batch = np.array([self.buffer[i][3] for i in indices])
        is_terminal_batch = np.array([self.buffer[i][4] for i in indices])
        return states_batch, actions_batch, rewards_batch, next_states_batch, is_terminal_batch

    def size(self):
        return self.num_experiences      # number of stored transitions (not the capacity)

    def add(self, state, action, reward, next_state, is_terminal):
        experience = (state, action, reward, next_state, is_terminal)
        if self.num_experiences < self.buffer_size:
            self.buffer.append(experience)
            self.num_experiences += 1
        else:
            self.buffer.pop(0)
            self.buffer.append(experience)
```

5. Train the model

```
def train(model, target_model, optimizer, replay_buffer, gamma, batch_size):
    states_batch, action_batch, reward_batch, next_states_batch, is_terminal_batch = replay_buffer.get_batch(batch_size)
    with tf.GradientTape() as tape:      # record the forward pass so gradients can be computed
        loss = compute_loss(model, target_model, states_batch, action_batch, reward_batch,
                            next_states_batch, is_terminal_batch, gamma)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
```

6. The training loop

```
# assumes env, model, target_model, replay_buffer and the hyperparameters
# (max_episodes, epsilon, num_steps = 0, ...) have been set up beforehand
for episode in range(max_episodes):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = choose_action(model, state, num_actions, epsilon)     # epsilon-greedy action selection
        next_state, reward, done, info = env.step(action)
        total_reward += reward
        replay_buffer.add(state, action, reward, next_state, done)
        state = next_state
        if replay_buffer.size() > batch_size:
            loss = train(model, target_model, optimizer, replay_buffer, gamma, batch_size)
        if num_steps % update_target_model_freq == 0:
            update_target_model(model, target_model)                   # refresh the target network
        if num_steps % epsilon_decay_steps == 0:
            epsilon = max(epsilon * epsilon_decay, epsilon_min)
        num_steps += 1
    print("Episode:", episode, "Total Reward:", total_reward)
```

That is the complete workflow for building a DQN model with TensorFlow 2. Note that the epsilon-greedy policy (`choose_action`) and the target-network update (`update_target_model`) still have to be implemented on top of the code above.
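Here is a minimal sketch of those two helpers, matching the names and call signatures used in the training loop above; the bodies are my own illustration, not part of the original answer.

```
import numpy as np
import tensorflow as tf

def choose_action(model, state, num_actions, epsilon):
    """Epsilon-greedy policy: explore with probability epsilon, otherwise act greedily."""
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    # add a batch dimension and cast to float32 for the conv layers
    q_values = model(np.expand_dims(state, axis=0).astype(np.float32))
    return int(tf.argmax(q_values[0]).numpy())

def update_target_model(model, target_model):
    """Hard update: copy the online network's weights into the target network."""
    target_model.set_weights(model.get_weights())
```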
