Mobile Robot Navigation Based on the Reinforcement-Learning SAC_LSTM Algorithm

【Foreword】Robots navigating among crowds typically rely on collision-avoidance algorithms to move safely and efficiently. To address the navigation problem of a robot in a crowd, this article uses the SAC reinforcement learning algorithm combined with a long short-term memory (LSTM) network to improve the mobile robot's navigation performance. In this method, the robot learns a collision-avoidance policy from rewards, and the reward can penalize robot behavior that disturbs pedestrian motion.
Problem Description
State: The navigation of a mobile robot in a crowd can be formulated as a partially observable Markov decision process (POMDP). The robot's state is s_t = [s_t^o, s_t^h], consisting of the state observable to other agents, s_t^o, and the robot's own hidden state, s_t^h.
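A common choice for these components, following the CrowdNav-style formulation of references [1] and [2] and assumed here for concreteness, is:

s_t^o = [p_x, p_y, v_x, v_y, r],    s_t^h = [g_x, g_y, v_pref, θ]

i.e., an agent's position, velocity, and radius are observable to others, while its goal position (g_x, g_y), preferred speed v_pref, and heading θ remain hidden.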
Action: The robot's action consists of a linear velocity and an angular velocity, i.e., a_t = [v, ω]. In this method, the robot's action space is continuous.
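Since the SAC policy used below outputs a tanh-squashed Gaussian sample in [-1, 1] per dimension, the raw action has to be mapped to the robot's velocity limits. A minimal sketch of such a mapping, with hypothetical limits v_max and w_max (the open-source repo may scale actions differently):

import numpy as np

def scale_action(raw_action, v_max=1.0, w_max=1.0):
    """Map a squashed action in [-1, 1]^2 to (linear, angular) velocity.

    v_max and w_max are illustrative limits, not values taken from the repo.
    """
    v = (raw_action[0] + 1.0) / 2.0 * v_max   # linear velocity in [0, v_max]
    w = raw_action[1] * w_max                 # angular velocity in [-w_max, w_max]
    return np.array([v, w])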
Reward function: The robot receives a reward for reaching its goal and a penalty for colliding with a pedestrian or getting too close to one. In addition, a potential-based reward-shaping term is used to guide the robot toward the goal.
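As a rough sketch of what such a reward typically looks like in CrowdNav-style environments (every constant and threshold below is an illustrative assumption, not the exact value used in this work):

def compute_reward(dist_to_goal, prev_dist_to_goal, min_dist_to_human,
                   reached_goal, collided):
    """Illustrative CrowdNav-style reward with potential-based shaping."""
    if reached_goal:
        return 1.0                        # success reward (assumed value)
    if collided:
        return -0.25                      # collision penalty (assumed value)
    reward = 0.0
    discomfort_dist = 0.2                 # "too close" threshold (assumed value)
    if min_dist_to_human < discomfort_dist:
        # penalize getting uncomfortably close to a pedestrian
        reward -= 0.1 * (discomfort_dist - min_dist_to_human)
    # potential-based shaping: reward progress made toward the goal
    reward += prev_dist_to_goal - dist_to_goal
    return reward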
【Training Strategy】
A long short-term memory (LSTM) network can accept inputs of variable size, so the pedestrian states (s_t^o) are encoded as a state sequence and used as the LSTM input. The network structure is shown in the figure below:
(Figure: LSTM network structure)
The LSTM stores pedestrian-related information in its hidden state and forgets the less important parts of the input. After the last pedestrian state has been fed in, the final hidden state of the LSTM is taken as the encoded environment state. When there are many agent states, to mitigate the effect of the network forgetting early inputs, the states are fed in order of decreasing distance to the robot, so that the closest agent has the largest influence on the final hidden state h_n. h_n is then concatenated with the robot's own state, joint_state = [self_state, h_n], and joint_state is finally used as the input of a fully connected network.
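A minimal sketch of this reordering step, assuming the first two columns of each pedestrian state are its position relative to the robot (the column layout and function name are assumptions; the open-source repo may organize the state differently):

import torch

def sort_humans_by_distance(human_states):
    """Reorder pedestrian states by decreasing distance to the robot.

    human_states: tensor [batch, num_humans, human_dim], where the first two
    columns are assumed to be the pedestrian position relative to the robot.
    The closest pedestrian ends up last, so it is fed into the LSTM last and
    has the strongest influence on the final hidden state hn.
    """
    dist = torch.norm(human_states[:, :, :2], dim=2)       # [batch, num_humans]
    order = torch.argsort(dist, dim=1, descending=True)    # farthest first
    idx = order.unsqueeze(-1).expand_as(human_states)
    return torch.gather(human_states, 1, idx)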

The actor and critic network structure of SAC_LSTM is shown in the figure below.
(Figure: SAC_LSTM actor-critic network structure)
For policy updates we use the SAC algorithm with automatic entropy tuning; see reference [3] for a description of SAC. Open-source code: https://github.com/ShelyH/CrowdNavST.git
The Q-network and policy-network parts of SAC_LSTM are as follows:

# imports used by the code snippets below
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class DQFunc(nn.Module):
    """Twin Q-network: an LSTM encodes the pedestrian states, and two output
    heads provide the clipped double-Q estimate used by SAC."""

    def __init__(self, self_state_dim, human_dim, action_dim, hidden_size, lstm_hidden_dim):
        super(DQFunc, self).__init__()
        self.action_dim = action_dim
        self.self_state_dim = self_state_dim
        self.lstm_hidden_dim = lstm_hidden_dim

        # LSTM that encodes the variable-length sequence of pedestrian states
        self.lstm = nn.LSTM(human_dim, lstm_hidden_dim, batch_first=True)
        self.linear1 = nn.Linear(action_dim + lstm_hidden_dim + self_state_dim, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        # two Q-value heads (Q1, Q2)
        self.network1 = nn.Linear(hidden_size, 1)
        self.network2 = nn.Linear(hidden_size, 1)

    def forward(self, state, action):
        # state: [batch, num_humans, self_state_dim + human_dim]
        size = state.shape
        self_state = state[:, 0, :self.self_state_dim]      # robot's own state
        human_state = state[:, :, self.self_state_dim:]     # one row per pedestrian
        # zero initial hidden/cell states, on the same device as the input
        h0 = torch.zeros(1, size[0], self.lstm_hidden_dim, device=state.device)
        c0 = torch.zeros(1, size[0], self.lstm_hidden_dim, device=state.device)
        output, (hn, cn) = self.lstm(human_state, (h0, c0))
        hn = hn.squeeze(0)                                   # encoded crowd state
        joint_state = torch.cat([hn, self_state], dim=1)
        action = action.reshape(-1, self.action_dim)
        x = torch.cat((joint_state, action), -1)
        x = torch.tanh(self.linear1(x))
        x = torch.tanh(self.linear2(x))
        x1 = self.network1(x)
        x2 = self.network2(x)

        return x1, x2

class PolicyNetGaussian(nn.Module):
    """Gaussian policy: an LSTM encodes the pedestrian states, and a fully
    connected network outputs the mean and log-std of the action."""

    def __init__(self, self_state_dim, human_dim, lstm_hidden_dim, hidden_size, action_dim):
        super(PolicyNetGaussian, self).__init__()
        self.self_state_dim = self_state_dim
        self.lstm_hidden_dim = lstm_hidden_dim
        # LSTM that encodes the pedestrian states
        self.lstm1 = nn.LSTM(human_dim, lstm_hidden_dim, batch_first=True)
        self.linear1 = nn.Linear(lstm_hidden_dim + self_state_dim, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size)

        self.mean_linear = nn.Linear(hidden_size, action_dim)
        self.log_std_linear = nn.Linear(hidden_size, action_dim)

    def forward(self, state):
        size = state.shape
        self_state = state[:, 0, :self.self_state_dim]       # robot's own state
        human_state = state[:, :, self.self_state_dim:]      # one row per pedestrian
        h0 = torch.zeros(1, size[0], self.lstm_hidden_dim, device=state.device)
        c0 = torch.zeros(1, size[0], self.lstm_hidden_dim, device=state.device)
        output, (hn, cn) = self.lstm1(human_state, (h0, c0))
        hn = hn.squeeze(0)                                    # encoded crowd state
        joint_state = torch.cat([self_state, hn], dim=1)

        x = torch.relu(self.linear1(joint_state))
        x = torch.relu(self.linear2(x))
        mean = self.mean_linear(x)

        # clamp log-std for numerical stability
        log_std = self.log_std_linear(x)
        log_std = torch.clamp(log_std, -20, 2)

        return mean, log_std
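The optim_sac method below calls self.policynet.sample(...), which is not part of this excerpt. A minimal sketch of what such a method usually looks like for a tanh-squashed Gaussian SAC policy, written as it would appear inside PolicyNetGaussian (an assumption based on standard SAC implementations, not necessarily the repo's exact code):

    # assumes: from torch.distributions import Normal
    def sample(self, state, deterministic=False):
        """Sample a tanh-squashed action and its log-probability."""
        mean, log_std = self.forward(state)
        std = log_std.exp()
        normal = Normal(mean, std)
        # mean action for evaluation, reparameterized sample for training
        u = mean if deterministic else normal.rsample()
        action = torch.tanh(u)
        # change-of-variables correction for the tanh squashing
        log_prob = normal.log_prob(u) - torch.log(1 - action.pow(2) + 1e-6)
        log_prob = log_prob.sum(dim=1, keepdim=True)
        return action, log_prob, torch.tanh(mean)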

The policy optimization part:

    def optim_sac(self):
        q1_loss, q2_loss, pi_loss, alpha_loss = 0, 0, 0, 0

        # sample a minibatch of transitions from the replay buffer
        samples = self.replay_pool.sample(self.batchsize)

        state_batch = torch.FloatTensor(np.array(samples.state)).to(device)
        nextstate_batch = torch.FloatTensor(np.array(samples.nextstate)).to(device)
        action_batch = torch.FloatTensor(np.array(samples.action)).to(device)
        reward_batch = torch.FloatTensor(np.array(samples.reward)).to(device).unsqueeze(1)
        done_batch = torch.FloatTensor(samples.real_done).to(device).unsqueeze(1)
        
        # update q-funcs
        with torch.no_grad():
            # sample next actions and their log-probabilities from the current policy
            nextaction_batch, logprobs_batch, _ = self.policynet.sample(nextstate_batch, deterministic=False)

            q_t1, q_t2 = self.target_q(nextstate_batch, nextaction_batch)

            # take min to mitigate positive bias in q-function training
            q_target = torch.min(q_t1, q_t2)
            value_target = reward_batch + (1 - done_batch) * self.gamma * (
                    q_target - self.alpha * logprobs_batch)
        q_1, q_2 = self.qfunsac(state_batch, action_batch)
        q1_loss_step = F.mse_loss(q_1, value_target)
        q2_loss_step = F.mse_loss(q_2, value_target)
        q_loss_step = q1_loss_step + q2_loss_step

        self.q_optimizer.zero_grad()
        q_loss_step.backward()
        self.q_optimizer.step()

        # update policy and temperature parameter
        # freeze the Q-networks so that only the policy is updated in this step
        for p in self.qfunsac.parameters():
            p.requires_grad = False
        action_batch, logprobs_batch, _ = self.policynet.sample(state_batch, deterministic=False)

        q_b1, q_b2 = self.qfunsac(state_batch, action_batch)

        qval_batch = torch.min(q_b1, q_b2)
        pi_loss_step = (self.alpha * logprobs_batch - qval_batch).mean()
        # temperature (alpha) loss for automatic entropy tuning
        alpha_loss_step = -self.alpha * (logprobs_batch.detach() + self.target_entropy).mean()

        self.policy_optimizer.zero_grad()
        pi_loss_step.backward()
        self.policy_optimizer.step()

        self.temp_optimizer.zero_grad()
        alpha_loss_step.backward()
        self.temp_optimizer.step()

        for p in self.qfunsac.parameters():
            p.requires_grad = True

        q1_loss += q1_loss_step.detach().item()
        q2_loss += q2_loss_step.detach().item()
        pi_loss += pi_loss_step.detach().item()
        alpha_loss += alpha_loss_step.detach().item()
        # soft-update the target Q-network toward the online Q-network
        self.update_target()

        return q1_loss, q2_loss, pi_loss, alpha_loss
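optim_sac also relies on self.alpha, self.target_entropy, self.temp_optimizer, and self.update_target(), which are set up elsewhere in the repo. A hedged sketch of a typical setup for the learnable temperature and a Polyak soft update of the target Q-network (the helper name, default values, and the soft-update choice are assumptions; the repo may, for example, optimize log(alpha) or use a hard target copy instead):

    # Hypothetical helper; attribute names match the excerpt above.
    def init_entropy_and_target(self, action_dim, lr=3e-4, tau=0.005):
        self.target_entropy = -float(action_dim)    # common heuristic: -|A|
        # learnable temperature, updated directly by temp_optimizer via alpha_loss
        self.alpha = torch.tensor(0.2, requires_grad=True, device=device)
        self.temp_optimizer = torch.optim.Adam([self.alpha], lr=lr)
        self.tau = tau                              # soft-update coefficient

    def update_target(self):
        # Polyak averaging: target <- tau * online + (1 - tau) * target
        with torch.no_grad():
            for p, p_targ in zip(self.qfunsac.parameters(), self.target_q.parameters()):
                p_targ.mul_(1.0 - self.tau)
                p_targ.add_(self.tau * p)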

Training Results
In the training-result animation, the yellow circle represents the robot and the remaining circles represent the moving pedestrians.

References
1. M. Everett, Y. F. Chen, and J. P. How, "Collision Avoidance in Pedestrian-Rich Environments with Deep Reinforcement Learning," IEEE Access, 2021.
2. M. Everett, Y. F. Chen, and J. P. How, "Motion Planning Among Dynamic, Decision-Making Agents with Deep Reinforcement Learning," IROS, 2018.
3. T. Haarnoja et al., "Soft Actor-Critic Algorithms and Applications," arXiv:1812.05905, 2018.
