Reinforcement Learning (4): Implementing DQN

An example of DQN in PyTorch.

  1. A review of how DQN works

DQN uses a neural network to approximate the action-value function $Q$. To make training more stable, it adds two mechanisms: experience replay and a fixed target network. Experience replay means that at each step the transition tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in a replay buffer $D$, and training batches are sampled at random from this buffer. The fixed target means two networks are used: a target network and the current evaluation network. The target network's parameters are updated with a delay relative to the evaluation network, which gives the evaluation network a stable target to approach.
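Putting the two tricks together, the objective DQN minimizes can be written as follows (standard DQN notation, consistent with the update rule derived later; $\omega^-$ denotes the delayed parameters of the target network and $D$ is the replay buffer):

$$
L(\omega) = \mathbb{E}_{(s,a,r,s')\sim D}\Big[\big(r + \gamma \max_{a'} \hat{Q}(s', a', \omega^-) - \hat{Q}(s, a, \omega)\big)^2\Big]
$$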

  2. Implementing DQN

First, import the required packages; their purposes are noted in the comments.

import torch as T                   # PyTorch
import torch.nn as nn               # neural-network modules
import torch.nn.functional as F     # activation functions and other functional ops
import torch.optim as optim         # optimizers
import numpy as np                  # array handling

The DQN implementation breaks down into three parts:

  1. the neural network
  2. the replay buffer
  3. the agent

1. The neural network

In PyTorch the forward pass of a network must be written explicitly, while the backward pass is computed automatically by loss.backward(). So this part does two things:

  1. define the network structure explicitly
  2. implement the forward() function

The code is as follows:

class DeepQNetwork(nn.Module):
    def __init__(self, lr, input_dims, fc1_dims, fc2_dims, n_actions):
        '''
        Initialize the deep Q-network, which has 2 fully connected (fc) hidden layers,
        takes the state vector as input and outputs one value per action.

        Args:
            lr : learning rate
            input_dims : the dimensionality of the state (as a list/tuple, e.g. [8])
            fc1_dims : the number of neurons in the first fc layer
            fc2_dims : the number of neurons in the second fc layer
            n_actions : the number of actions
        '''
        super(DeepQNetwork, self).__init__()
        # store args
        self.input_dims = input_dims
        self.fc1_dims = fc1_dims
        self.fc2_dims = fc2_dims
        self.n_actions = n_actions
        # build network
        self.fc1 = nn.Linear(*self.input_dims, self.fc1_dims)
        self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims)
        self.fc3 = nn.Linear(self.fc2_dims, self.n_actions)
        # build the optimizer, the loss function and the device
        self.optimizer = optim.Adam(self.parameters(), lr=lr)
        self.loss = nn.MSELoss()
        self.device = T.device('cuda:0' if T.cuda.is_available() else 'cpu')
        self.to(self.device)

    def forward(self, state):
        """
        calculate the ouput of the deep q network
        
        Args:
        	state : current state as input
        """
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        actions = self.fc3(x)

        return actions

As __init__() shows, the network consists of two fully connected hidden layers plus an output layer; it uses Adam as the optimizer and MSE as the loss function, and it runs on the GPU (CUDA) whenever one is available.

[TODO: a note on Adam]

In forward(), ReLU is used as the activation function of the hidden layers.

[TODO: a note on ReLU]
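As a quick sanity check (a small sketch added here, not part of the original code; the variable names are only for illustration), the network can be instantiated and fed a batch of random states. The dimensions below assume the 8-dimensional observation and 4 actions of the LunarLander-v2 environment used later:

# hypothetical sanity check: build the network defined above and run one forward pass
net = DeepQNetwork(lr=1e-4, input_dims=[8], fc1_dims=256, fc2_dims=256, n_actions=4)
dummy_states = T.rand(32, 8).to(net.device)   # a fake batch of 32 states
q_values = net.forward(dummy_states)          # one Q-value per action
print(q_values.shape)                         # torch.Size([32, 4])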

2. The agent

From a design point of view, the agent and the replay buffer can be implemented either separately or together. Since the buffer here is represented by plain per-field arrays and its operations are simple, it is merged into the agent.

Back to the agent itself: it needs the following four functions:

  1. initialize the parameters and the replay buffer
  2. store transitions into the buffer
  3. choose actions according to the Q-network
  4. train, i.e. update the parameters

  1. Initialization

The initializer __init__() is as follows:

class Agent():
    def __init__(self, gamma, epsilon, lr, input_dims, batch_size, n_actions,
            max_mem_size=100000, eps_end=0.05, eps_dec=5e-4, target_update_freq=100):
        """
        Initialize the agent.

        Args:
            gamma : the discount factor in the Bellman equation
            epsilon : the initial value of epsilon in the epsilon-greedy policy
            lr : learning rate
            input_dims : the dimensionality of the state
            batch_size : the size of one training batch
            n_actions : the number of actions
            max_mem_size : the capacity of the replay buffer
            eps_end : the minimum value of epsilon in the epsilon-greedy policy
            eps_dec : how much epsilon decays after each learning step (linear decay)
            target_update_freq : the period for updating the parameters of the target network
        """
        self.gamma = gamma
        self.epsilon = epsilon
        self.eps_min = eps_end
        self.eps_dec = eps_dec
        self.lr = lr
        # the action space, used for random choices in epsilon-greedy
        self.action_space = [i for i in range(n_actions)]

        self.batch_size = batch_size
        self.mem_size = max_mem_size
        # the current index in replay buffer
        self.mem_cntr = 0
        # the counter of steps
        self.iter_cntr = 0
        self.target_update_freq = target_update_freq
        # evaluation network
        self.Q_eval = DeepQNetwork(self.lr, input_dims=input_dims,
                            fc1_dims=256, fc2_dims=256, n_actions=n_actions)
        # target network
        self.Q_target = DeepQNetwork(self.lr, input_dims=input_dims,
                            fc1_dims=256, fc2_dims=256, n_actions=n_actions)
        # replay buffer (one array per element of the transition tuple)
        self.state_memory = np.zeros((self.mem_size, *input_dims), dtype=np.float32)
        self.action_memory = np.zeros(self.mem_size, dtype=np.int32)
        self.reward_memory = np.zeros(self.mem_size, dtype=np.float32)    # rewards are real-valued
        self.new_state_memory = np.zeros((self.mem_size, *input_dims), dtype=np.float32)
        self.terminate_memory = np.zeros(self.mem_size, dtype=np.bool_)   # np.bool is deprecated, use np.bool_

From __init__() we can see that:

  1. the agent keeps one replay array per element of the transition tuple, which makes each field easy to manage without tracking index order inside a single tuple
  2. the "done" flag is stored as well, so that terminal states can be given a value of 0 during training
  3. there are two networks: the evaluation network and the target network
  4. there are two counters: one locates the write position in the replay buffer, the other counts the agent's interactions with the environment and is used to update the target network

  2. Storing transitions into the buffer

The code of store_transition() is as follows:

    def store_transition(self, state, action, reward, state_, terminate):
        """
        store tuple (s_t,a_t,r_t,s_{t+1},done) to replay buffer
        
        Args:
        	state : s_t
            action : a_t
            reward : r_t
            state_ : s_{t+1}
            terminate : done
        """
        index = self.mem_cntr % self.mem_size
        self.state_memory[index] = state
        self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.new_state_memory[index] = state_
        self.terminate_memory[index] = terminate
        
        self.mem_cntr += 1

This function is quite simple: it writes the transition at index mem_cntr % mem_size and then advances the counter by one. Since the counter keeps growing past the buffer size, the modulo (%) operation makes new data wrap around and overwrite the oldest entries.
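For example (an illustrative calculation with a hypothetical counter value, not taken from the original code), with max_mem_size = 100000 the 100006-th transition overwrites slot 5:

mem_size = 100000
mem_cntr = 100005            # 100005 transitions have already been stored
index = mem_cntr % mem_size  # = 5, so the old data in slot 5 is overwritten
print(index)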

  3. Choosing actions with the Q-network

DQN is based on Q-learning, so actions are chosen with an epsilon-greedy policy. The code is as follows:

    def choose_action(self, observation):   # obs = state
        """
        choose an action in current state by using epsilon-greedy
        
        Args:
        	observation : current state
        """
        if np.random.random() > self.epsilon:
            # choose the best action
            # add a batch dimension and cast to float32 to match the network weights
            state = T.tensor(np.array([observation]), dtype=T.float32).to(self.Q_eval.device)
            actions = self.Q_eval.forward(state)
            action = T.argmax(actions).item()
        else:
            action = np.random.choice(self.action_space)

        return action

As the choose_action() function above shows, with probability $\varepsilon$ a random action is chosen, and with probability $1-\varepsilon$ the action with the highest current value is chosen. Concretely:

  1. to choose a random action, simply sample from the action space
  2. to choose the best action, feed the current state through the evaluation network to get the value of every action, then take the action with the largest output (a short note on the epsilon schedule follows this list)
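A small numeric note on the epsilon schedule (an illustrative sketch; the values are the defaults of this post's Agent and the eps_end used in the test script below): epsilon is decreased linearly by eps_dec after every call to learn() until it reaches its minimum, so with these settings it stops decaying after roughly 1980 learning steps.

epsilon, eps_min, eps_dec = 1.0, 0.01, 5e-4
steps_to_min = round((epsilon - eps_min) / eps_dec)
print(steps_to_min)   # 1980 learn() calls until epsilon stays at its minimum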
  4. Training: updating the parameters

This is the core function of the agent: it updates the network parameters so that learning actually happens. The code is as follows:

    def learn(self):
        """
        sample data of size batch_size from replay buffer, and the ues them to update the network(s)
        """
        # if the replay buffer holds fewer samples than one batch, skip learning
        if self.mem_cntr < self.batch_size:
            return
        # clear any old gradients first
        self.Q_eval.optimizer.zero_grad()

        max_mem = min(self.mem_size, self.mem_cntr)
        # sample data
        batch = np.random.choice(max_mem, self.batch_size, replace=False)
        # indices 0..batch_size-1, used to pair each sampled row with its chosen action
        batch_index = np.arange(self.batch_size, dtype=np.int32)

        # get all the data from replay buffer
        state_batch = T.tensor(self.state_memory[batch]).to(self.Q_eval.device)
        action_batch = self.action_memory[batch]    # kept as a numpy array, used only for indexing
        reward_batch = T.tensor(self.reward_memory[batch]).to(self.Q_eval.device)
        new_state_batch = T.tensor(self.new_state_memory[batch]).to(self.Q_eval.device)
        terminal_batch = T.tensor(self.terminate_memory[batch]).to(self.Q_eval.device)

        # feed the sampled states to the evaluation network and select q(s, a) for the actions taken
        q_eval = self.Q_eval.forward(state_batch)[batch_index, action_batch]
        # use the target network on s_{t+1} to build the target value
        q_next = self.Q_target.forward(new_state_batch)
        # terminal states have a value of 0
        q_next[terminal_batch] = 0.0
        # q_target : r + gamma * max_a Q_target(s_{t+1}, a)
        q_target = reward_batch + self.gamma * T.max(q_next, dim=1)[0]
        # q-error : q_target - q(s, a), squared and averaged by the MSE loss
        loss = self.Q_eval.loss(q_target, q_eval).to(self.Q_eval.device)
        # update the evaluation network
        loss.backward()
        self.Q_eval.optimizer.step()

        self.iter_cntr += 1
        self.epsilon = self.epsilon - self.eps_dec if self.epsilon > self.eps_min \
                            else self.eps_min
        
        # periodically copy the evaluation network's weights into the target network
        if self.iter_cntr % self.target_update_freq == 0:
            self.Q_target.load_state_dict(self.Q_eval.state_dict())

As the learn() function shows, a batch of data is sampled first, and then the network is updated according to:

$$
\begin{aligned}
\Delta\omega &= \alpha\big(\mathrm{TD\_target}-\hat{Q}(s,a,\omega)\big)\nabla_\omega\hat{Q}(s,a,\omega) \\
&= \alpha\big(r+\gamma\max_{a'}\hat{Q}(s',a',\textcolor{red}{\omega^-})-\hat{Q}(s,a,\omega)\big)\nabla_\omega\hat{Q}(s,a,\omega)
\end{aligned}
$$

The learning rate $\alpha$ is already handled inside Adam, and PyTorch computes the gradients automatically, so all we need to compute ourselves is the q-error.
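The only slightly subtle line in learn() is the one that picks out q(s, a) for the actions that were actually taken. As a standalone illustration (hypothetical values, not from the post), the advanced-indexing expression used above is equivalent to torch.gather:

q_all = T.tensor([[0.1, 0.5, 0.2],
                  [0.4, 0.3, 0.9]])        # Q-values for 2 states and 3 actions
batch_index = np.arange(2)                 # [0, 1]
action_batch = np.array([1, 2])            # the actions that were taken
picked = q_all[batch_index, action_batch]  # tensor([0.5000, 0.9000])
same = q_all.gather(1, T.tensor(action_batch, dtype=T.int64).unsqueeze(1)).squeeze(1)
print(T.allclose(picked, same))            # True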

3. Testing in an environment

The LunarLander-v2 environment from gym is used to check that the algorithm works. Before that, the required packages need to be installed:

pip install gym
pip install box2d

According to the official documentation, the environment can be described as follows:

The landing pad is always at coordinates (0,0), and these coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad at zero speed is about 100…140 points. If the lander moves away from the landing pad it loses reward. An episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points. Each leg with ground contact is worth +10 points. Firing the main engine costs -0.3 points per frame. Solved is 200 points. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. There are four discrete actions: do nothing, fire the left orientation engine, fire the main engine, fire the right orientation engine.
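Before writing the training script, it can help to confirm what the agent will receive from this environment (a small check script, assuming the same gym version used below; the printed values are what gym reports for LunarLander-v2):

import gym

env = gym.make('LunarLander-v2')
print(env.observation_space.shape)   # (8,)  -> an 8-dimensional state vector
print(env.action_space.n)            # 4     -> the four discrete actions listed above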

The code is as follows:

import gym
from dqn import Agent
import numpy as np
from tensorboardX import SummaryWriter

env_name = 'LunarLander-v2'

if __name__ == '__main__':
    # initialize the env, the agent and some bookkeeping information
    env = gym.make(env_name)
    agent = Agent(gamma=0.99, epsilon=1.0, lr=1e-4,
                  input_dims=env.observation_space.shape, batch_size=64,
                  n_actions=env.action_space.n, eps_end=0.01)
    writer = SummaryWriter(comment="-" + env_name)
    scores, eps_history = [], []
    target_score = 250
    avg_score = 0
    episode = 0
    # training
    while avg_score < target_score:
        score = 0
        done = False
        # get first state
        observation = env.reset()
        while not done:
            # choose action and interact with env
            action = agent.choose_action(observation)
            observation_, reward, done, info = env.step(action)
            score += reward
            # store transitions
            agent.store_transition(observation, action, reward, 
                                        observation_, done)
            agent.learn()
            observation = observation_  # important
        scores.append(score)
        eps_history.append(agent.epsilon)

        avg_score = np.mean(scores[-100:])
        episode += 1
        writer.add_scalar("reward", score, episode)
        writer.add_scalar("avg_reward", avg_score, episode)

        print('episode:', episode, 'score %.2f' % score,
                'average score %.2f' %avg_score,
                'epsilon %.2f' %agent.epsilon)
    writer.close()

In the main loop, at every step the agent chooses an action, interacts with the environment to produce a transition tuple, stores it, and updates the network parameters, until the environment is solved.
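Once the loop exits (the running average score has reached target_score), the learned policy can be watched directly. This is a minimal add-on sketch, not part of the original script; it assumes the same old-style gym API used above (env.render() and a 4-tuple return from env.step):

# evaluation sketch: act greedily with the trained agent for one episode
agent.epsilon = 0.0                  # disable exploration
observation = env.reset()
done, score = False, 0
while not done:
    env.render()                     # visualize the landing (optional)
    action = agent.choose_action(observation)
    observation, reward, done, info = env.step(action)
    score += reward
print('evaluation score %.2f' % score)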

The training results are shown below.

Average return:

[figure: average return over training]

Return of each episode:

[figure: return of each episode]
