Dueling DQN for 3D Path Planning

Based on the cliff-walking dilemma, build a three-dimensional grid-map environment in which the agent is an Autonomous Underwater Vehicle (AUV). The AUV's action space may be designed freely (either discrete or continuous) and its dimensionality is up to you, but it must not be smaller than 4 dimensions (up/down, forward/backward, left/right). Design a reasonable reward function that takes practical factors such as travel time, motion cost, and safety risk into account.

Task: with fixed obstacles and a fixed start and destination, train the agent with a DRL method to reach the destination.

Contents

1. Preparation: environment design
1.1 Design of the motion environment
1.1.1 Main components
1.1.2 Environment operations
1.1.3 Reward mechanism
1.2 Design of the motion space
1.2.1 Action space design (discrete)
1.2.2 State space design (continuous)
2. Algorithm design
2.1 Network architecture
2.2 Algorithm flow
2.3 Algorithm pseudocode
2.4 Definition of the Dueling DQN class
3. Results
Full code


1. Preparation: environment design

1.1 Design of the motion environment

1.1.1 Main components

(1) Grid map: a three-dimensional grid map of size 10x10x10; each cell is a position the AUV can occupy.

(2) AUV (Autonomous Underwater Vehicle): the agent that moves through the grid map, executing a sequence of actions to reach the goal position.

(3) Obstacles: a number of fixed obstacle cells in the map that the AUV cannot pass through.

(4) Start and goal: the start and goal are drawn at random from a predefined set of four candidate points, so that they differ between training runs (a minimal selection sketch follows this list).

(5) Reward mechanism: defines the reward obtained for each state transition, used to guide the AUV's learning and decision making.
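As a concrete illustration of component (4), the snippet below sketches how a start and goal could be drawn from a small candidate set at reset time. It is only a sketch: the candidate list and the sample_start_goal helper are hypothetical names, and the full code at the end of this post actually fixes the start at (0, 0, 0) and the goal at (9, 9, 9).

import random

# Hypothetical candidate points; the full code below uses a fixed start and goal instead.
CANDIDATE_POINTS = [(0, 0, 0), (9, 9, 9), (0, 9, 0), (9, 0, 9)]

def sample_start_goal(candidates=CANDIDATE_POINTS):
    """Pick two distinct points from the candidate set as start and goal."""
    start, goal = random.sample(candidates, 2)
    return start, goal

start, goal = sample_start_goal()
print("start:", start, "goal:", goal)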

1.1.2 Environment operations

(1) Initialization: at the start of each training episode, the environment selects the start and goal and resets the AUV to the start position.

(2) State transition: the AUV's position is updated according to the chosen action; if the action would move the AUV into an obstacle or out of bounds, the action has no effect and the AUV stays where it is.

(3) Action execution: the AUV chooses one of six discrete actions (up, down, forward, backward, left, right); each action attempts to change its position in the grid.

(4) Termination check: after every step, check whether the AUV has reached the goal; if so, the episode ends (a minimal rollout of this reset/step interface is sketched after this list).
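To make the interface above concrete, here is a minimal random-policy rollout. It assumes the Environment class from the full code at the end of this post (reset() returns the state, step(action) returns (next_state, reward, done)); it only exercises the interface and does no learning.

import random

env = Environment()                      # Environment class from the full code below
state = env.reset()
for t in range(50):                      # cap the rollout at 50 steps
    action = random.randint(0, 5)        # one of the six discrete actions
    next_state, reward, done = env.step(action)
    print(t, state, "->", next_state, "reward:", reward)
    state = next_state
    if done:                             # reached the goal
        break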

1.1.3 Reward mechanism

(1) Goal reward: reaching the goal yields a large reward (+500), encouraging the AUV to reach the target as quickly as possible.

(2) Collision penalty: hitting an obstacle incurs a penalty (-10), so the AUV learns to avoid obstacles.

(3) Step penalty: every move incurs a small penalty (-1), pushing the AUV to reach the goal quickly rather than wander (a compact sketch of this reward logic follows this list).
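The reward logic can be summarized in a small standalone function. This is a sketch that mirrors the values used in the full code (+500, -10, -1); the helper name compute_reward and its arguments are illustrative only.

def compute_reward(next_state, goal, obstacles):
    """Reward for arriving at next_state: +500 at the goal, -10 on an obstacle, -1 otherwise."""
    if next_state == goal:
        return 500      # large terminal reward: reach the goal as fast as possible
    if next_state in obstacles:
        return -10      # collision penalty: learn to avoid obstacles
    return -1           # small per-step penalty: discourage wandering

print(compute_reward((9, 9, 9), goal=(9, 9, 9), obstacles=[(5, 5, 5)]))  # 500
print(compute_reward((5, 5, 5), goal=(9, 9, 9), obstacles=[(5, 5, 5)]))  # -10
print(compute_reward((1, 0, 0), goal=(9, 9, 9), obstacles=[(5, 5, 5)]))  # -1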

1.2 Design of the motion space

1.2.1 Action space design (discrete)

(1) Action space definition: the action space consists of six discrete actions: up, down, forward, backward, left, and right.

(2) Action representation: each action is encoded as an integer, e.g. 0 = up, 1 = down, 2 = forward, 3 = backward, 4 = left, 5 = right (see the offset-table sketch after this list).

(3) Action constraints: every action is subject to boundary and obstacle checks, so that executing it never moves the AUV out of bounds or through an obstacle.
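One compact way to express this encoding is a table of coordinate offsets. The full code implements the same mapping with an if/elif chain in Environment.step; the ACTION_DELTAS dictionary below is an equivalent, purely illustrative refactor (axis convention as in the full code: up/down change y, forward/backward change x, left/right change z).

# Action index -> (dx, dy, dz) offset; equivalent to the if/elif chain in Environment.step
ACTION_DELTAS = {
    0: (0, 1, 0),    # up
    1: (0, -1, 0),   # down
    2: (1, 0, 0),    # forward
    3: (-1, 0, 0),   # backward
    4: (0, 0, -1),   # left
    5: (0, 0, 1),    # right
}

def apply_action(state, action, grid_size=10):
    """Apply an action and clamp to the grid so the AUV never leaves the map."""
    dx, dy, dz = ACTION_DELTAS[action]
    x, y, z = state
    return (min(max(x + dx, 0), grid_size - 1),
            min(max(y + dy, 0), grid_size - 1),
            min(max(z + dz, 0), grid_size - 1))

print(apply_action((0, 0, 0), 0))  # (0, 1, 0): moved up
print(apply_action((0, 0, 0), 1))  # (0, 0, 0): clamped at the boundary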

1.2.2 State space design (continuous)

(1) State space definition: the state space consists of the AUV's position in the grid, expressed as a three-dimensional coordinate (x, y, z).

(2) State representation: the current state is simply the AUV's current coordinates (x, y, z), converted to a tensor before being fed to the network (see the sketch after this list).

(3) State transition: the state changes according to the executed action, and the new state is given by the new coordinates.
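Since the network consumes the raw (x, y, z) coordinates, the only preprocessing is converting the Python tuple into a float tensor, exactly as select_action does in the full code. A minimal sketch:

import torch

state = (0, 0, 0)                                         # AUV position in the grid
state_tensor = torch.tensor(state, dtype=torch.float32)   # network input, shape (3,)
print(state_tensor)        # tensor([0., 0., 0.])
print(state_tensor.shape)  # torch.Size([3])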

2. Algorithm design

2.1 Network architecture

The network is a Dueling DQN model, used to train the AUV agent in the three-dimensional grid environment.

(1) Input and output:

Input: the three-dimensional state (3 values for the position: x, y, z)

Output: Q-values for the six discrete actions (up/down, forward/backward, left/right)

(2) Network layers:

Input layer: 3 input nodes, corresponding to the three state dimensions (x, y, z).

Hidden layer: one fully connected layer with 128 neurons and ReLU activation.

Branch layers:

Value stream: a fully connected layer that outputs a single value (the state value).

Advantage stream: a fully connected layer that outputs an advantage value for each of the 6 actions.

Output layer: the value and advantage streams are combined to produce the Q-value of each action (see the aggregation sketch below).
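The combination in the output layer follows the standard dueling aggregation Q(s, a) = V(s) + A(s, a) - (1/|A|) * Σ_a' A(s, a'); subtracting the mean advantage keeps the value and advantage streams identifiable. A tiny numeric sketch (the numbers are arbitrary):

import torch

value = torch.tensor([[2.0]])                                  # V(s), shape (1, 1)
advantage = torch.tensor([[1.0, 0.0, -1.0, 2.0, 0.5, -2.5]])   # A(s, a), shape (1, 6)

# Q(s, a) = V(s) + A(s, a) - mean_a A(s, a); the mean is taken over the action dimension
q_values = value + (advantage - advantage.mean(dim=-1, keepdim=True))
print(q_values)         # six Q-values, one per action
print(q_values.mean())  # equals V(s) = 2.0 here, since the centered advantages average to zero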

2.2 Algorithm flow

(1) Environment initialization:

Define the 3D grid environment with fixed obstacles.

Select the start and goal from the predefined candidate points.

(2) Dueling DQN network:

Build the Dueling DQN network with an input layer, a hidden layer, a value stream, and an advantage stream.

The value stream outputs a single value and the advantage stream outputs one value per action; they are combined into Q-values.

(3) Experience replay:

Store transitions in a replay memory and sample them at random for training.

(4) Action selection:

Use an epsilon-greedy policy: with some probability choose the greedy (highest-Q) action, otherwise choose a random action.

(5) Model optimization:

Sample a mini-batch from the replay memory, compute the loss, and back-propagate to update the network.

(6) Training procedure:

Run many episodes, each consisting of many steps.

Record the score and step count of each episode.

Print training statistics every 10 episodes (a skeleton covering steps (3)-(6) is sketched after this list).

(7) Prediction and plotting:

Use the trained model to predict a path.

Plot the AUV's 3D path from start to goal.
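A skeleton of steps (3)-(6), mirroring the train function in the full code at the end of this post (the Environment and Agent classes defined there are assumed to exist), might look like this:

# Training-loop skeleton for steps (3)-(6); assumes Environment and Agent from the full code below.
env = Environment()
agent = Agent(env)

for episode in range(10):                      # a handful of episodes for illustration
    state = env.reset()
    score = 0
    for t in range(500):                       # cap the episode length
        action = agent.select_action(state)    # (4) epsilon-greedy action selection
        next_state, reward, done = env.step(action)
        agent.memory.push((state, action, reward, next_state, done))  # (3) experience replay
        agent.optimize_model()                 # (5) mini-batch update of the policy network
        agent.update_target_model()            # periodic target-network synchronization
        state = next_state
        score += reward
        if done:
            break
    print(f"episode {episode + 1}: score {score}, steps {t + 1}")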

2.3 Algorithm pseudocode

Algorithm: Dueling Deep Q-Learning with Experience Replay

1. Initialize the replay memory D to capacity N

2. Initialize the action-value function Q with random weights

3. Initialize the target action-value function Q- with weights θ^- = θ (identical to Q initially)

4. For episode = 1 to M do

    a. Initialize sequence s1 = [x1] and preprocessed sequence φ1 = φ(s1)

    b. For t = 1 to T do

        i. With probability ε, select a random action at

        ii. Otherwise, select at = argmaxa[Q(φ(st), a; θ)] (select action with highest Q-value)

        iii. Execute action at in the environment and observe reward rt and the next state xt+1

        iv. Set st+1 = (st, at, xt+1) and preprocess φt+1 = φ(st+1)

        v. Store transition (φt, at, rt, φt+1) in D

        vi. Sample a random minibatch of transitions (φj, aj, rj, φj+1) from D

        vii. If episode terminates at step j+1, set yj = rj

        viii. Otherwise, set the target value:

               yj = rj + γ * max_a' Q^-(φj+1, a'; θ^-)

                     where the dueling target network computes Q^-(s, a; θ^-) = V(s; θ^-) + A(s, a; θ^-) - (1/|A|) * Σ_a' A(s, a'; θ^-)

        ix. Perform a gradient descent step on (yj - Q(φj, aj; θ))^2 with respect to the network parameters θ

        x. Every C steps reset Q- = Q (synchronize target network with the main network)

    c. End For (end of episode)

5. End For (end of all episodes)
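Steps vii-ix correspond to the following tensor computation, which mirrors optimize_model in the full code below (the policy/target networks and the mini-batch tensors are assumed to already exist):

import torch
import torch.nn as nn

def dueling_dqn_loss(policy_net, target_net, state_batch, action_batch,
                     reward_batch, next_state_batch, done_batch, gamma=0.99):
    """TD loss for one mini-batch; a sketch mirroring Agent.optimize_model in the full code."""
    # Q(φ_j, a_j; θ) for the actions actually taken
    q_taken = policy_net(state_batch).gather(1, action_batch)
    with torch.no_grad():
        # y_j = r_j                                     if the episode ends at step j+1
        # y_j = r_j + γ * max_a' Q^-(φ_{j+1}, a'; θ^-)  otherwise
        next_q = target_net(next_state_batch).max(1)[0]
        target = reward_batch + (1.0 - done_batch) * gamma * next_q
    # gradient-descent step on (y_j - Q(φ_j, a_j; θ))^2
    return nn.functional.mse_loss(q_taken, target.unsqueeze(1))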

2.4 Definition of the Dueling DQN class

class DuelingDQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DuelingDQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, NUM_HIDDEN)
        self.fc_value = nn.Linear(NUM_HIDDEN, 1)
        self.fc_advantage = nn.Linear(NUM_HIDDEN, output_dim)
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        value = self.fc_value(x)          # state value V(s)
        advantage = self.fc_advantage(x)  # advantages A(s, a)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a); average over the action
        # dimension so the aggregation is per-sample for batched input
        return value + (advantage - advantage.mean(dim=-1, keepdim=True))
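A quick usage check of the class (a sketch; NUM_HIDDEN = 128 as in the constants of the full code):

import torch
import torch.nn as nn

NUM_HIDDEN = 128                              # same hidden size as in the full code below

net = DuelingDQN(input_dim=3, output_dim=6)   # 3-D state in, Q-values for 6 actions out
state = torch.tensor([0.0, 0.0, 0.0])         # a single (x, y, z) state
q_values = net(state)
print(q_values.shape)                         # torch.Size([6])
print(torch.argmax(q_values).item())          # index of the greedy action

batch = torch.rand(32, 3) * 9                 # a batch of 32 random positions in the grid
print(net(batch).shape)                       # torch.Size([32, 6])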

3. Results

This task requires the start, goal, and obstacles to be fixed, so a very large amount of training is not needed. To make convergence clearly visible in the training plots, however, the number of episodes is set to 1000 and the maximum number of steps per episode to 500. The start and goal of this run are (0, 0, 0) and (9, 9, 9), respectively. To make the vehicle's path easier to follow, successive steps are drawn in different colors, as shown in the figure below.

[Figure: 3D path taken by the AUV, with successive steps drawn in different colors, together with the training curves for duration, score, and steps per episode]

(1) Duration per Episode:

The duration plot shows how long each training episode takes. There is some fluctuation, most noticeably around episode 200, possibly because training has reached saturation, but there is no clear upward or downward trend overall, indicating that episode duration stays relatively stable throughout training.

(2) Score per Episode:

The score plot reflects the agent's performance in each episode. The score is low at first and rises as training progresses, showing that the agent keeps refining its policy through learning.

(3) Steps per Episode:

The step-count plot records how many steps the agent takes in each episode. It fluctuates strongly early on but stabilizes as training proceeds, which suggests the agent learns to take more direct, effective paths to the goal.

Taken together with the training setup, these plots show that the agent's performance improves over the course of training, reflected in the rising scores and the stabilizing step counts. This matches the expectation in reinforcement learning that the agent gradually optimizes its policy by interacting with the environment.

Full code

import numpy as np
import random
import time
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the '3d' projection on older Matplotlib versions

# Constants
GRID_SIZE = 10
NUM_EPISODES = 1000
MAX_STEPS = 10000
BATCH_SIZE = 32
EPSILON_START = 1.0
EPSILON_END = 0.01
EPSILON_DECAY = 0.995
GAMMA = 0.99
TARGET_UPDATE = 10
LR = 0.001
REPLAY_MEMORY_SIZE = 10000
NUM_HIDDEN = 128


# Define AUV environment
class Environment:
    def __init__(self):
        self.grid_size = GRID_SIZE
        self.start = (0, 0, 0)
        self.goal = (GRID_SIZE - 1, GRID_SIZE - 1, GRID_SIZE - 1)
        self.obstacles = [(5, 5, 5), (2, 3, 4), (8, 8, 8)]  # Example obstacles

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        x, y, z = self.state

        # Determine the next state based on the action
        if action == 0:  # Move up
            next_state = (x, y + 1, z) if y < self.grid_size - 1 else (x, y, z)
        elif action == 1:  # Move down
            next_state = (x, y - 1, z) if y > 0 else (x, y, z)
        elif action == 2:  # Move forward
            next_state = (x + 1, y, z) if x < self.grid_size - 1 else (x, y, z)
        elif action == 3:  # Move backward
            next_state = (x - 1, y, z) if x > 0 else (x, y, z)
        elif action == 4:  # Move left
            next_state = (x, y, z - 1) if z > 0 else (x, y, z)
        elif action == 5:  # Move right
            next_state = (x, y, z + 1) if z < self.grid_size - 1 else (x, y, z)
        else:
            raise ValueError("Invalid action")

        # Check for obstacles and adjust if needed
        if next_state in self.obstacles:
            reward = -10
            next_state = self.state  # Stay in the same place if obstacle encountered
        elif next_state[0] < 0 or next_state[0] >= self.grid_size or \
                next_state[1] < 0 or next_state[1] >= self.grid_size or \
                next_state[2] < 0 or next_state[2] >= self.grid_size:
            reward = -10  # Penalty for going out of bounds
            next_state = self.state  # Stay in the same place if out of bounds
        else:
            reward = -1

        # Check if reached the goal
        done = next_state == self.goal
        if done:
            reward = 500

        self.state = next_state
        return next_state, reward, done


# Dueling DQN network
class DuelingDQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DuelingDQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, NUM_HIDDEN)
        self.fc_value = nn.Linear(NUM_HIDDEN, 1)
        self.fc_advantage = nn.Linear(NUM_HIDDEN, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        value = self.fc_value(x)          # state value V(s)
        advantage = self.fc_advantage(x)  # advantages A(s, a)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a); average over the action
        # dimension so the aggregation is per-sample for batched input
        return value + (advantage - advantage.mean(dim=-1, keepdim=True))


# Replay memory
class ReplayMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []

    def push(self, transition):
        self.memory.append(transition)
        if len(self.memory) > self.capacity:
            del self.memory[0]

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


# Agent class
class Agent:
    def __init__(self, env):
        self.env = env
        self.epsilon = EPSILON_START
        self.policy_net = DuelingDQN(3, 6)  # 3-dimensional state, 6 actions
        self.target_net = DuelingDQN(3, 6)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.target_net.eval()
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=LR)
        self.memory = ReplayMemory(REPLAY_MEMORY_SIZE)
        self.steps_done = 0

    def select_action(self, state):
        sample = random.random()
        # Decay epsilon multiplicatively towards EPSILON_END; EPSILON_DECAY (0.995) is a per-step decay factor
        self.epsilon = max(EPSILON_END, self.epsilon * EPSILON_DECAY)
        self.steps_done += 1
        if sample > self.epsilon:
            with torch.no_grad():
                q_values = self.policy_net(torch.tensor(state, dtype=torch.float32))
                action = torch.argmax(q_values).item()
        else:
            action = random.randint(0, 5)
        return action

    def optimize_model(self):
        if len(self.memory) < BATCH_SIZE:
            return
        transitions = self.memory.sample(BATCH_SIZE)
        batch = list(zip(*transitions))
        state_batch = torch.tensor(batch[0], dtype=torch.float32)
        action_batch = torch.tensor(batch[1], dtype=torch.long).view(-1, 1)
        reward_batch = torch.tensor(batch[2], dtype=torch.float32)
        next_state_batch = torch.tensor(batch[3], dtype=torch.float32)
        done_batch = torch.tensor(batch[4], dtype=torch.float32)  # Convert to float

        state_action_values = self.policy_net(state_batch).gather(1, action_batch)

        with torch.no_grad():
            next_state_values = self.target_net(next_state_batch).max(1)[0].detach()
            expected_state_action_values = reward_batch + (
                        1.0 - done_batch) * GAMMA * next_state_values  # Ensure 1.0 instead of 1

        loss = nn.functional.mse_loss(state_action_values, expected_state_action_values.unsqueeze(1))
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def update_target_model(self):
        if self.steps_done % TARGET_UPDATE == 0:
            self.target_net.load_state_dict(self.policy_net.state_dict())


# Training function
def train(agent):
    episode_durations = []
    scores = []
    steps_list = []
    start_time = time.time()

    for episode in range(NUM_EPISODES):
        state = env.reset()
        steps = 0
        score = 0

        for t in range(MAX_STEPS):
            action = agent.select_action(state)
            next_state, reward, done = env.step(action)
            agent.memory.push((state, action, reward, next_state, done))
            agent.optimize_model()
            agent.update_target_model()
            state = next_state
            score += reward
            steps += 1

            if done or t == MAX_STEPS - 1:
                episode_durations.append(time.time() - start_time)
                scores.append(score)
                steps_list.append(steps)
                if (episode + 1) % 10 == 0:
                    print(f"Episode {episode + 1}, Steps: {steps}, Time: {episode_durations[-1]:.2f}s, Score: {score}")
                break

    return episode_durations, scores, steps_list


# Function to plot 3D path
def plot_3d_path(path):
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')

    # Unpack path into x, y, z coordinates
    x, y, z = zip(*path)

    # Plot path segments, cycling through the 10 default colors so each step is distinguishable
    for i in range(len(x) - 1):
        ax.plot([x[i], x[i + 1]], [y[i], y[i + 1]], [z[i], z[i + 1]], marker='o',
                color=f'C{i % 10}', label=f'Step {i + 1}-{i + 2}')

    # Mark start and goal points
    ax.scatter(x[0], y[0], z[0], marker='o', color='g', label='Start')
    ax.scatter(x[-1], y[-1], z[-1], marker='o', color='r', label='Goal')

    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.legend()
    ax.set_title('3D Path Taken by Agent')
    plt.savefig('q1/path.png')
    plt.close()


# Initialize environment and agent
env = Environment()
agent = Agent(env)

# Train the agent
episode_durations, scores, steps_list = train(agent)

plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.plot(range(1, NUM_EPISODES + 1), episode_durations)
plt.xlabel('Episode')
plt.ylabel('Duration (s)')
plt.title('Duration per Episode')

plt.subplot(1, 3, 2)
plt.plot(range(1, NUM_EPISODES + 1), scores)
plt.xlabel('Episode')
plt.ylabel('Score')
plt.title('Score per Episode')

plt.subplot(1, 3, 3)
plt.plot(range(1, NUM_EPISODES + 1), steps_list)
plt.xlabel('Episode')
plt.ylabel('Steps')
plt.title('Steps per Episode')
plt.tight_layout()
plt.savefig('q1/score_episodes.png')
plt.close()

'''# Predict and output movement steps
state = env.reset()
steps = 0
path = [state]
while state != env.goal:
    action = agent.select_action(state)
    next_state, _, _ = env.step(action)
    state = next_state
    path.append(state)
    steps += 1

print(f"Steps taken to reach the goal: {steps}")

# Plot the 3D path
plot_3d_path(path)'''
