Large Model Notes (1): Hands-On Code for Mixture-of-Experts (MoE) Models

MoE, short for Mixture of Experts, is an architectural pattern for large models. Its core design idea is "specialization": tasks are split into categories and handed to multiple "experts" to solve. The counterpart to MoE is the dense model, which can be thought of as a "generalist". A generalist can handle many different tasks, but a group of specialists can solve many problems more efficiently and with deeper expertise.

1 The Development of MoE

The concept of MoE originates from the 1991 paper Adaptive Mixture of Local Experts. The idea is similar to ensemble learning: build a supervisory mechanism over a system composed of several separate networks, where each network handles a different subset of the training samples and focuses on a specific region of the input space.

Later, the paper Learning Factored Representations in a Deep Mixture of Experts explored using MoE as a component of deeper networks. This approach allows an MoE to be embedded as one layer of a multi-layer network, making the model both large and efficient. Around the same time, research also began exploring ways to dynamically activate or deactivate network components based on the input tokens.

The paper Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017) applied the idea to LSTM-based models with up to 137B parameters. By introducing sparsity, this work achieved fast inference while keeping the model extremely large.

In short, MoE has made it feasible to train models with hundreds of billions, or even trillions, of parameters.

2 Implementing MoE

The key to implementing an MoE system lies in the architectural design. One prominent approach integrates mixture-of-experts layers into the Switch Transformer architecture, which lets the system combine the specialized knowledge of multiple expert models when making decisions and predictions.

When building a Switch Transformer-style architecture, the mixture-of-experts layers must be integrated seamlessly. These layers process and analyze the data from different perspectives, enabling the system to make informed decisions based on diverse insights. The main implementation considerations are described below.

Another key point in implementing an MoE system is dynamic routing and load balancing. This strategy ensures that tokens are distributed effectively across the system, optimizing performance and resource utilization. Dynamic routing assigns tasks and data to different expert models in real time, based on workload and the complexity of the information; by routing tokens dynamically, the system can adapt to changing demand and maintain peak efficiency at runtime. Load balancing plays a crucial role in spreading computation and data processing evenly across the experts, preventing bottlenecks and wasted resources and ultimately improving overall system performance. Common ways to implement this strategy are listed below, followed by a small sketch of an auxiliary balancing loss:

(1) Monitor the workload of each expert model and dynamically redistribute tokens to keep it balanced.

(2) Implement algorithms that prioritize tasks by urgency and complexity to optimize processing efficiency.

(3) Use feedback mechanisms to continuously tune the load-balancing parameters.
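
A common way to realize points (1) and (3) in code is an auxiliary balancing loss computed from the gate outputs and added to the task loss. The sketch below is a minimal, illustrative example (the function names are not from any particular library; the SMoE implementation in Section 4 uses the same cv_squared idea):

import torch

def cv_squared(x, eps=1e-10):
    # Squared coefficient of variation: near 0 when experts are used evenly,
    # large when a few experts receive most of the traffic
    return x.float().var() / (x.float().mean() ** 2 + eps)

def balance_loss(gates):
    # gates: (batch_size, num_experts) routing weights, zero for unselected experts
    importance = gates.sum(0)            # total routing weight received by each expert
    load = (gates > 0).sum(0).float()    # number of samples sent to each expert
    return cv_squared(importance) + cv_squared(load)

# Usage: total_loss = task_loss + loss_coef * balance_loss(gates)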

Model characteristics

  • Diverse data: the model is trained on a diverse dataset covering a wide range of tasks or domains. This diversity is essential because it exposes each expert to different kinds of data and problems.
  • Gating mechanism: an MoE model has a gating mechanism that decides which expert handles which part of the input. During training, the gating network learns to send different kinds of data to the experts best suited to them.
  • Expert models: as training progresses, each expert gradually becomes better at handling a particular type of data or task. This specialization happens because each expert receives, and learns from, the kind of data it handles most effectively.
  • Feedback loop: a feedback loop is at work: if one expert performs better on a certain type of data, the gating mechanism becomes more likely to send similar data to that expert, which reinforces each expert's specialization.
  • Regularization and loss functions: training usually includes regularization techniques and specialized loss functions that encourage effective learning and prevent any single expert from becoming a jack-of-all-trades, ensuring that specialization is spread across the experts.
  • Capacity limits: by imposing capacity limits, the model ensures that no single expert is overloaded, promoting a balanced distribution of learning across all experts (a small sketch of a capacity-limited router follows this list).
  • Fine-tuning and adaptation: the model may go through a fine-tuning stage in which certain kinds of tasks are emphasized, further refining each component's expertise. Instruction tuning has become a key strategy for surpassing the performance of traditional dense models: by optimizing the instructions each expert executes, higher accuracy and efficiency can be achieved without increasing the model's complexity.
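
To make the capacity-limit point concrete, here is a minimal sketch of a top-1 router with a per-expert capacity. The capacity_factor parameter and the drop-overflow behaviour are illustrative assumptions rather than the method of any specific library:

import torch

def top1_route_with_capacity(gates, capacity_factor=1.25):
    # gates: (num_tokens, num_experts) routing probabilities
    num_tokens, num_experts = gates.shape
    expert_capacity = int(capacity_factor * num_tokens / num_experts)

    expert_idx = gates.argmax(dim=1)                  # chosen expert per token
    keep = torch.zeros(num_tokens, dtype=torch.bool)  # tokens that fit within capacity
    for e in range(num_experts):
        token_ids = (expert_idx == e).nonzero(as_tuple=True)[0]
        keep[token_ids[:expert_capacity]] = True      # overflow tokens are dropped (or re-routed)
    return expert_idx, keep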

Joint training
An important strategy for optimizing an MoE model is to train the gating network jointly with the other components. During joint training, the gating network's parameters are updated by backpropagation through the whole model, which lets it adjust its routing decisions based on the feedback it receives from the experts. By optimizing all components together, the model can better balance the contributions of the different experts, refine the routing mechanism, and reach a better overall result (a minimal end-to-end sketch follows).
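
A minimal sketch of that idea, assuming an MoE module like the one built in Section 3 whose parameters include both the gating network and the experts (the function and argument names here are placeholders):

import torch
import torch.nn as nn

def train_jointly(moe_model, x_train, y_train, num_epochs=100, lr=1e-3):
    # One optimizer over all parameters: the gating network is updated by
    # backpropagating the same task loss that updates the experts
    optimizer = torch.optim.Adam(moe_model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(num_epochs):
        optimizer.zero_grad()
        loss = criterion(moe_model(x_train), y_train)
        loss.backward()
        optimizer.step()
    return moe_model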

Instruction tuning
Instruction tuning methods can likewise be used to train MoE systems. These methods focus on adjusting the routing mechanism and the expert models to improve performance across different tasks and input data distributions. Common fine-tuning strategies include the following (a small sketch contrasting soft and hard routing follows the list):

  • Soft routing: soft routing uses continuous probabilities to weight each expert's contribution to the final output. By modeling the routing decision as a probability distribution, the model can route more smoothly and flexibly, especially when several experts provide complementary information.
  • Hard routing: unlike soft routing, hard routing makes a discrete decision and selects a single expert for a given input. Hard routing is simpler to implement and interpret, which makes it suitable for tasks that require explicit expert selection.
  • Regularization techniques: regularization methods such as L1 or L2 regularization can be applied to the routing parameters to prevent overfitting and improve generalization. By penalizing overly complex routing decisions, regularization encourages the model to learn more robust and interpretable routing strategies.
  • Adaptive routing: adaptive routing mechanisms dynamically adjust the routing probabilities or decisions based on the input data and the current state of the model. This lets the model change its routing behavior as patterns in the data change, leading to more flexible and efficient expert selection.
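
A minimal sketch contrasting the two routing styles; gate (a module returning one score per expert) and experts (a list of modules with a shared output shape) are hypothetical placeholders. The per-sample loop in hard_route is only for clarity; real implementations batch the dispatch, as moe.py in Section 4 does:

import torch

def soft_route(x, gate, experts):
    # Soft routing: every expert runs, outputs are mixed by continuous gate probabilities
    probs = torch.softmax(gate(x), dim=-1)               # (batch, num_experts)
    outs = torch.stack([e(x) for e in experts], dim=-1)  # (batch, out_dim, num_experts)
    return (outs * probs.unsqueeze(1)).sum(dim=-1)       # (batch, out_dim)

def hard_route(x, gate, experts):
    # Hard routing: a discrete decision sends each sample to exactly one expert
    choice = torch.argmax(gate(x), dim=-1)               # (batch,)
    outputs = [experts[int(choice[i])](x[i:i + 1]) for i in range(x.size(0))]
    return torch.cat(outputs, dim=0)                     # (batch, out_dim)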

Adding an MoE module inside a Transformer

(Figure) Left: a basic encoder block, multi-head self-attention followed by a feed-forward network; right: the same encoder block with the feed-forward network replaced by multiple experts.
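
As a rough sketch of the right-hand block, the layer below replaces the position-wise FFN of an encoder block with an MoE layer. It assumes post-norm residual connections and a moe_layer argument that can be any module mapping (batch, seq_len, d_model) to the same shape, for example a token-wise mixture of experts:

import torch.nn as nn

class MoEEncoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, moe_layer):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = moe_layer                 # replaces the usual position-wise FFN
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention sub-layer with a residual connection
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # MoE sub-layer in place of the feed-forward network
        return self.norm2(x + self.moe(x))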

The following open-source projects can be used to train MoE models:

Megablocks: https://github.com/stanford-futuredata/megablocks

Fairseq: https://github.com/facebookresearch/fairseq/tree/main/examples/moe_lm

OpenMoE: https://github.com/XueFuzhao/OpenMoE

3 A simple MoE example

The example consists of three parts: three expert networks, a gating network, and the MoE model that combines them.

  • The basic expert network is a two-layer perceptron. Expert 1 only sees samples with labels 0 and 1; expert 2 sees labels 1 and 2; expert 3 sees labels 0 and 2.
  • The gating network is a small multi-layer perceptron, with dropout layers in between to reduce overfitting and computation.
  • MoE model: the training input x is fed into the gating network to produce weights w, and x is also fed into every expert; the expert outputs are stacked and combined using those weights.

Here, which expert trains on which portion of the data has already been decided at the dataset stage; the gating network simply learns how much to weight each (always active) expert's opinion. Given a batch of inputs, it outputs one weight per expert, and the same weights are applied to every dimension of a sample's output. If sample i's output is y_i = [y_{i1}, y_{i2}, y_{i3}] (one entry per class), then the j-th component is y_{ij} = w_{i1}*y_{ij1} + w_{i2}*y_{ij2} + w_{i3}*y_{ij3}, where y_{ijk} is the j-th component of expert k's output for sample i. In other words, a given sample applies the same expert weights across all of its output dimensions.

(1) The basic expert network

import torch
import torch.nn as nn
import torch.optim as optim

class Expert(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Expert, self).__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        return torch.softmax(self.layer2(x), dim=1)

(2) The gating network

Input: (batch_size, input_dim)

Output: (batch_size, num_experts)

class Gating(nn.Module):
    def __init__(self, input_dim,num_experts, dropout_rate=0.1):
        super(Gating, self).__init__()
        self.layer1 = nn.Linear(input_dim, 128)
        self.dropout1 = nn.Dropout(dropout_rate)

        self.layer2 = nn.Linear(128, 256)
        self.leaky_relu1 = nn.LeakyReLU()
        self.dropout2 = nn.Dropout(dropout_rate)

        self.layer3 = nn.Linear(256, 128)
        self.leaky_relu2 = nn.LeakyReLU()
        self.dropout3 = nn.Dropout(dropout_rate)

        self.layer4 = nn.Linear(128, num_experts)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = self.dropout1(x)

        x = self.layer2(x)
        x = self.leaky_relu1(x)
        x = self.dropout2(x)

        x = self.layer3(x)
        x = self.leaky_relu2(x)
        x = self.dropout3(x)

        return torch.softmax(self.layer4(x), dim=1)

(3) The MoE model

class MoE(nn.Module):
    def __init__(self, trained_experts):
        super(MoE, self).__init__()
        self.experts = nn.ModuleList(trained_experts)
        num_experts = len(trained_experts)
        input_dim = trained_experts[0].layer1.in_features
        self.gating = Gating(input_dim, num_experts)

    def forward(self, x):
        weights = self.gating(x)
        outputs = torch.stack(
            [expert(x) for expert in self.experts], dim=2)
        weights = weights.unsqueeze(1).expand_as(outputs)
        return torch.sum(outputs * weights, dim=2)

This code does not use mini-batches: the 2,500 samples set aside for the experts are trained on all at once. Each input is 4-dimensional, so an input batch has shape (2500, 4), and each expert outputs (2500, 3): for every sample, the predicted probability of each of the labels 0, 1, 2. Inside the MoE, stacking the three expert outputs along dim=2 gives (2500, 3, 3). The gate weights are broadcast to the same shape as the stacked outputs and then used in a weighted sum over the expert dimension.
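
A quick shape check of that combination step with dummy tensors (a batch of 2 samples, 3 classes, 3 experts):

import torch

weights = torch.rand(2, 3)                          # gate output: (batch, num_experts)
outputs = torch.rand(2, 3, 3)                       # stacked experts: (batch, output_dim, num_experts)
weights = weights.unsqueeze(1).expand_as(outputs)   # (batch, 1, num_experts) broadcast over output_dim
combined = torch.sum(outputs * weights, dim=2)      # weighted sum over the expert dimension
print(combined.shape)                               # torch.Size([2, 3])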

Randomly generate some sample data and organize it

num_samples = 5000
input_dim = 4
hidden_dim = 32


y_data = torch.cat([
    torch.zeros(num_samples // 3),
    torch.ones(num_samples // 3),
    torch.full((num_samples - 2 * (num_samples // 3),), 2)
]).long()

# Biasing the data based on the labels
x_data = torch.randn(num_samples, input_dim)
for i in range(num_samples):
    if y_data[i] == 0:
        x_data[i, 0] += 1  # Making x[0] more positive
    elif y_data[i] == 1:
        x_data[i, 1] -= 1  # Making x[1] more negative
    elif y_data[i] == 2:
        x_data[i, 0] -= 1  # Making x[0] more negative


indices = torch.randperm(num_samples)
x_data = x_data[indices]
y_data = y_data[indices]


print(y_data.bincount())  # verify the label distribution

# Shuffle the data to ensure x_data and y_data remain aligned
shuffled_indices = torch.randperm(num_samples)
x_data = x_data[shuffled_indices]
y_data = y_data[shuffled_indices]


x_train_experts = x_data[:int(num_samples / 2)]
y_train_experts = y_data[:int(num_samples / 2)]

mask_expert1 = (y_train_experts == 0) | (y_train_experts == 1)
mask_expert2 = (y_train_experts == 1) | (y_train_experts == 2)
mask_expert3 = (y_train_experts == 0) | (y_train_experts == 2)


num_samples_per_expert = \
    min(mask_expert1.sum(), mask_expert2.sum(), mask_expert3.sum())


x_expert1 = x_train_experts[mask_expert1][:num_samples_per_expert]
y_expert1 = y_train_experts[mask_expert1][:num_samples_per_expert]

x_expert2 = x_train_experts[mask_expert2][:num_samples_per_expert]
y_expert2 = y_train_experts[mask_expert2][:num_samples_per_expert]

x_expert3 = x_train_experts[mask_expert3][:num_samples_per_expert]
y_expert3 = y_train_experts[mask_expert3][:num_samples_per_expert]


# The rest of the data is used to train the MoE and for testing
x_remaining = x_data[int(num_samples / 2):]
y_remaining = y_data[int(num_samples / 2):]

# Training set
split = int(0.8 * len(x_remaining))
x_train_moe = x_remaining[:split]
y_train_moe = y_remaining[:split]
# Test set
x_test = x_remaining[split:]
y_test = y_remaining[split:]

print(x_train_moe.shape, "\n", x_test.shape, "\n",
      x_expert1.shape, "\n",
      x_expert2.shape, "\n", x_expert3.shape)

Start training

Train the expert models first, then the MoE model, and finally evaluate everything on the same test set. Since this is a multi-class classification problem, CrossEntropyLoss is used.

# Define hidden dimension
output_dim = 3
hidden_dim = 32

epochs = 500
learning_rate = 0.001

# Instantiate the expert models
expert1 = Expert(input_dim, hidden_dim, output_dim)
expert2 = Expert(input_dim, hidden_dim, output_dim)
expert3 = Expert(input_dim, hidden_dim, output_dim)

# Set up loss
criterion = nn.CrossEntropyLoss()

# Optimizers for experts
optimizer_expert1 = optim.Adam(expert1.parameters(), lr=learning_rate)
optimizer_expert2 = optim.Adam(expert2.parameters(), lr=learning_rate)
optimizer_expert3 = optim.Adam(expert3.parameters(), lr=learning_rate)

# Training loop for expert 1
for epoch in range(epochs):
    optimizer_expert1.zero_grad()
    outputs_expert1 = expert1(x_expert1)
    loss_expert1 = criterion(outputs_expert1, y_expert1)
    loss_expert1.backward()
    optimizer_expert1.step()

# Training loop for expert 2
for epoch in range(epochs):
    optimizer_expert2.zero_grad()
    outputs_expert2 = expert2(x_expert2)
    loss_expert2 = criterion(outputs_expert2, y_expert2)
    loss_expert2.backward()
    optimizer_expert2.step()

# Training loop for expert 3
for epoch in range(epochs):
    optimizer_expert3.zero_grad()
    outputs_expert3 = expert3(x_expert3)
    loss_expert3 = criterion(outputs_expert3, y_expert3)
    loss_expert3.backward()
    optimizer_expert3.step()

# Create the MoE model with the trained experts
moe_model = MoE([expert1, expert2, expert3])

# Train the MoE model
optimizer_moe = optim.Adam(moe_model.parameters(), lr=learning_rate)
for epoch in range(epochs):
    optimizer_moe.zero_grad()
    outputs_moe = moe_model(x_train_moe)
    loss_moe = criterion(outputs_moe, y_train_moe)
    loss_moe.backward()
    optimizer_moe.step()

# All training above is finished
# Evaluate all models
def evaluate(model, x, y):
    model.eval()  # disable dropout in the gating network during evaluation
    with torch.no_grad():
        # outputs: (batch_size, num_classes)
        outputs = model(x)
        # torch.max returns the maximum values and their indices along dim 1;
        # the indices are the predicted classes, shape (batch_size,)
        _, predicted = torch.max(outputs, 1)
        correct = (predicted == y).sum().item()
        accuracy = correct / len(y)
    return accuracy

# Evaluate each expert and the MoE on the same test set
accuracy_expert1 = evaluate(expert1, x_test, y_test)
accuracy_expert2 = evaluate(expert2, x_test, y_test)
accuracy_expert3 = evaluate(expert3, x_test, y_test)
accuracy_moe = evaluate(moe_model, x_test, y_test)

print("Expert 1 Accuracy:", accuracy_expert1)
print("Expert 2 Accuracy:", accuracy_expert2)
print("Expert 3 Accuracy:", accuracy_expert3)
print("Mixture of Experts Accuracy:", accuracy_moe)

# Expert 1 Accuracy: 0.466
# Expert 2 Accuracy: 0.496
# Expert 3 Accuracy: 0.378
# Mixture of Experts Accuracy: 0.614

Complete code

import torch
import torch.nn as nn
import torch.optim as optim


class Expert(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Expert, self).__init__()
        # Each expert takes input of shape [batch_size, input_dim]
        # A two-layer perceptron that outputs class probabilities via softmax
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        return torch.softmax(self.layer2(x), dim=1)



class Gating(nn.Module):
    def __init__(self, input_dim,num_experts, dropout_rate=0.1):
        super(Gating, self).__init__()
        # Input: [batch_size, input_dim]
        # A multi-layer perceptron whose output [batch_size, num_experts] gives the weight placed on each expert's opinion
        self.layer1 = nn.Linear(input_dim, 128)
        self.dropout1 = nn.Dropout(dropout_rate)

        self.layer2 = nn.Linear(128, 256)
        self.leaky_relu1 = nn.LeakyReLU()
        self.dropout2 = nn.Dropout(dropout_rate)

        self.layer3 = nn.Linear(256, 128)
        self.leaky_relu2 = nn.LeakyReLU()
        self.dropout3 = nn.Dropout(dropout_rate)

        self.layer4 = nn.Linear(128, num_experts)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = self.dropout1(x)

        x = self.layer2(x)
        x = self.leaky_relu1(x)
        x = self.dropout2(x)

        x = self.layer3(x)
        x = self.leaky_relu2(x)
        x = self.dropout3(x)

        return torch.softmax(self.layer4(x), dim=1)


class MoE(nn.Module):
    def __init__(self, trained_experts):
        super(MoE, self).__init__()
        # Assumes all experts take the same input shape
        # trained_experts provides the number of experts and input_dim
        self.experts = nn.ModuleList(trained_experts)
        num_experts = len(trained_experts)
        input_dim = trained_experts[0].layer1.in_features
        self.gating = Gating(input_dim, num_experts)

    def forward(self, x):
        # The gating network maps the input to a weight for each expert.
        # Input x: (batch_size, input_dim)
        # A single expert's output expert(x): (batch_size, output_dim)
        # outputs = torch.stack(..., dim=2): (batch_size, output_dim, num_experts)
        #   i.e. for every output dimension of every sample, each expert gives its own prediction
        # weights starts as (batch_size, num_experts); unsqueeze(1) -> (batch_size, 1, num_experts);
        #   expand_as(outputs) -> (batch_size, output_dim, num_experts),
        #   so within one sample the same expert weights apply to every output dimension
        # outputs * weights: element-wise product, (batch_size, output_dim, num_experts)
        # torch.sum(..., dim=2): sum over the expert dimension -> (batch_size, output_dim),
        #   the weighted combination of the experts' opinions for each sample
        weights = self.gating(x)
        outputs = torch.stack(
            [expert(x) for expert in self.experts], dim=2)
        weights = weights.unsqueeze(1).expand_as(outputs)
        return torch.sum(outputs * weights, dim=2)



num_samples = 5000
input_dim = 4
hidden_dim = 32

# Generate roughly equal numbers of labels 0, 1, and 2:
# torch.zeros / torch.ones / torch.full build the label-0, label-1 and label-2 parts,
# torch.cat concatenates them into one tensor, and .long() converts them to integer labels
y_data = torch.cat([
    torch.zeros(num_samples // 3),
    torch.ones(num_samples // 3),
    torch.full((num_samples - 2 * (num_samples // 3),), 2)
]).long()

# Biasing the data based on the labels
x_data = torch.randn(num_samples, input_dim)
for i in range(num_samples):
    if y_data[i] == 0:
        x_data[i, 0] += 1  # Making x[0] more positive
    elif y_data[i] == 1:
        x_data[i, 1] -= 1  # Making x[1] more negative
    elif y_data[i] == 2:
        x_data[i, 0] -= 1  # Making x[0] more negative

# Shuffle the data to randomize the order:
# torch.randperm(num_samples) returns a random permutation of the integers 0 .. num_samples - 1;
# indexing x_data and y_data with the same permutation keeps them aligned
indices = torch.randperm(num_samples)
x_data = x_data[indices]
y_data = y_data[indices]

# Verify the label distribution:
# bincount() counts how often each non-negative integer label occurs in y_data
print(y_data.bincount())

# Shuffle the data to ensure x_data and y_data remain aligned
shuffled_indices = torch.randperm(num_samples)
x_data = x_data[shuffled_indices]
y_data = y_data[shuffled_indices]

# Split the data in two: the first half trains the experts, the second half trains the MoE and provides the test set
x_train_experts = x_data[:int(num_samples / 2)]
y_train_experts = y_data[:int(num_samples / 2)]

mask_expert1 = (y_train_experts == 0) | (y_train_experts == 1)
mask_expert2 = (y_train_experts == 1) | (y_train_experts == 2)
mask_expert3 = (y_train_experts == 0) | (y_train_experts == 2)

# Compute the number of samples each expert can use, so every expert gets the same amount
num_samples_per_expert = \
    min(mask_expert1.sum(), mask_expert2.sum(), mask_expert3.sum())

# Select each expert's training samples using its mask, keeping an equal number per expert.
# x_train_experts[mask_expert1] keeps only the rows where the boolean mask is True, e.g.
#   x = torch.tensor([[0.1, 0.2], [0.4, 0.5], [0.7, 0.8]])
#   mask = torch.tensor([True, False, True])
#   x[mask] -> tensor([[0.1, 0.2], [0.7, 0.8]])
# so x_expertN contains exactly the samples whose labels that expert is trained on
x_expert1 = x_train_experts[mask_expert1][:num_samples_per_expert]
y_expert1 = y_train_experts[mask_expert1][:num_samples_per_expert]

x_expert2 = x_train_experts[mask_expert2][:num_samples_per_expert]
y_expert2 = y_train_experts[mask_expert2][:num_samples_per_expert]

x_expert3 = x_train_experts[mask_expert3][:num_samples_per_expert]
y_expert3 = y_train_experts[mask_expert3][:num_samples_per_expert]


# The rest of the data is used to train the MoE and for testing
x_remaining = x_data[int(num_samples / 2):]
y_remaining = y_data[int(num_samples / 2):]

# Training set
split = int(0.8 * len(x_remaining))
x_train_moe = x_remaining[:split]
y_train_moe = y_remaining[:split]
# Test set
x_test = x_remaining[split:]
y_test = y_remaining[split:]

print(x_train_moe.shape, "\n", x_test.shape, "\n",
      x_expert1.shape, "\n",
      x_expert2.shape, "\n", x_expert3.shape)

# Define hidden dimension
output_dim = 3
hidden_dim = 32

epochs = 500
learning_rate = 0.001

# Instantiate the expert models
expert1 = Expert(input_dim, hidden_dim, output_dim)
expert2 = Expert(input_dim, hidden_dim, output_dim)
expert3 = Expert(input_dim, hidden_dim, output_dim)

# Set up loss
criterion = nn.CrossEntropyLoss()

# Optimizers for experts
optimizer_expert1 = optim.Adam(expert1.parameters(), lr=learning_rate)
optimizer_expert2 = optim.Adam(expert2.parameters(), lr=learning_rate)
optimizer_expert3 = optim.Adam(expert3.parameters(), lr=learning_rate)

# Training loop for expert 1
for epoch in range(epochs):
    optimizer_expert1.zero_grad()
    outputs_expert1 = expert1(x_expert1)
    loss_expert1 = criterion(outputs_expert1, y_expert1)
    loss_expert1.backward()
    optimizer_expert1.step()

# Training loop for expert 2
for epoch in range(epochs):
    optimizer_expert2.zero_grad()
    outputs_expert2 = expert2(x_expert2)
    loss_expert2 = criterion(outputs_expert2, y_expert2)
    loss_expert2.backward()
    optimizer_expert2.step()

# Training loop for expert 3
for epoch in range(epochs):
    optimizer_expert3.zero_grad()
    outputs_expert3 = expert3(x_expert3)
    loss_expert3 = criterion(outputs_expert3, y_expert3)
    loss_expert3.backward()
    optimizer_expert3.step()

# Create the MoE model with the trained experts
moe_model = MoE([expert1, expert2, expert3])

# Train the MoE model
optimizer_moe = optim.Adam(moe_model.parameters(), lr=learning_rate)
for epoch in range(epochs):
    optimizer_moe.zero_grad()
    outputs_moe = moe_model(x_train_moe)
    loss_moe = criterion(outputs_moe, y_train_moe)
    loss_moe.backward()
    optimizer_moe.step()

# All training above is finished
# Evaluate all models
def evaluate(model, x, y):
    model.eval()  # disable dropout in the gating network during evaluation
    with torch.no_grad():
        # outputs: (batch_size, num_classes)
        outputs = model(x)
        # torch.max returns the maximum values and their indices along dim 1;
        # the indices are the predicted classes, shape (batch_size,)
        _, predicted = torch.max(outputs, 1)
        correct = (predicted == y).sum().item()
        accuracy = correct / len(y)
    return accuracy

# Evaluate each expert and the MoE on the same test set
accuracy_expert1 = evaluate(expert1, x_test, y_test)
accuracy_expert2 = evaluate(expert2, x_test, y_test)
accuracy_expert3 = evaluate(expert3, x_test, y_test)
accuracy_moe = evaluate(moe_model, x_test, y_test)

print("Expert 1 Accuracy:", accuracy_expert1)
print("Expert 2 Accuracy:", accuracy_expert2)
print("Expert 3 Accuracy:", accuracy_expert3)
print("Mixture of Experts Accuracy:", accuracy_moe)

# Expert 1 Accuracy: 0.466
# Expert 2 Accuracy: 0.496
# Expert 3 Accuracy: 0.378
# Mixture of Experts Accuracy: 0.614


4 Sparsely gated mixture of experts (SMoE)

Paper: https://arxiv.org/abs/1701.06538

Code: https://github.com/davidmrau/mixture-of-experts

The repository consists of two files, example.py and moe.py.

example.py

This example runs a single training step on 5 samples, each 1000-dimensional, with 20 output classes, 10 expert networks, and top-k = 4 experts activated per sample.

import torch
from torch import nn
from torch.optim import Adam
from moe import MoE


def train(x, y, model, loss_fn, optim):
    # model returns the prediction and the loss that encourages all experts to have equal importance and load
    y_hat, aux_loss = model(x.float())
    # calculate prediction loss
    loss = loss_fn(y_hat, y)
    # combine losses
    total_loss = loss + aux_loss
    optim.zero_grad()
    total_loss.backward()
    optim.step()

    print("Training Results - loss: {:.2f}, aux_loss: {:.3f}".format(loss.item(), aux_loss.item()))
    return model


def eval(x, y, model, loss_fn):
    model.eval()
    # model returns the prediction and the loss that encourages all experts to have equal importance and load
    y_hat, aux_loss = model(x.float())
    loss = loss_fn(y_hat, y)
    total_loss = loss + aux_loss
    print("Evaluation Results - loss: {:.2f}, aux_loss: {:.3f}".format(loss.item(), aux_loss.item()))


def dummy_data(batch_size, input_size, num_classes):
    # dummy input
    x = torch.rand(batch_size, input_size)

    # dummy target
    y = torch.randint(num_classes, (batch_size, 1)).squeeze(1)
    return x, y


# arguments
input_size = 1000
num_classes = 20
num_experts = 10
hidden_size = 64
batch_size = 5
k = 4

# determine device
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# instantiate the MoE layer
model = MoE(input_size, num_classes, num_experts, hidden_size, k=k, noisy_gating=True)
model = model.to(device)
loss_fn = nn.CrossEntropyLoss()
optim = Adam(model.parameters())

x, y = dummy_data(batch_size, input_size, num_classes)

# train
model = train(x.to(device), y.to(device), model, loss_fn, optim)
# evaluate
x, y = dummy_data(batch_size, input_size, num_classes)
eval(x.to(device), y.to(device), model, loss_fn)

moe.py

import torch
import torch.nn as nn
from torch.distributions.normal import Normal
import numpy as np


class SparseDispatcher(object):
    """Helper for implementing a mixture of experts.
    The purpose of this class is to create input minibatches for the
    experts and to combine the results of the experts to form a unified
    output tensor.
    There are two functions:
    dispatch - take an input Tensor and create input Tensors for each expert.
    combine - take output Tensors from each expert and form a combined output
      Tensor.  Outputs from different experts for the same batch element are
      summed together, weighted by the provided "gates".
    The class is initialized with a "gates" Tensor, which specifies which
    batch elements go to which experts, and the weights to use when combining
    the outputs.  Batch element b is sent to expert e iff gates[b, e] != 0.
    The inputs and outputs are all two-dimensional [batch, depth].
    Caller is responsible for collapsing additional dimensions prior to
    calling this class and reshaping the output to the original shape.
    See common_layers.reshape_like().
    Example use:
    gates: a float32 `Tensor` with shape `[batch_size, num_experts]`
    inputs: a float32 `Tensor` with shape `[batch_size, input_size]`
    experts: a list of length `num_experts` containing sub-networks.
    dispatcher = SparseDispatcher(num_experts, gates)
    expert_inputs = dispatcher.dispatch(inputs)
    expert_outputs = [experts[i](expert_inputs[i]) for i in range(num_experts)]
    outputs = dispatcher.combine(expert_outputs)
    The preceding code sets the output for a particular example b to:
    output[b] = Sum_i(gates[b, i] * experts[i](inputs[b]))
    This class takes advantage of sparsity in the gate matrix by including in the
    `Tensor`s for expert i only the batch elements for which `gates[b, i] > 0`.
    """

    def __init__(self, num_experts, gates):
        """Create a SparseDispatcher."""
        # gates: (batch_size, num_experts), each sample's routing weight for every expert
        self._gates = gates
        self._num_experts = num_experts
        sorted_experts, index_sorted_experts = torch.nonzero(gates).sort(0)
        # drop indices
        _, self._expert_index = sorted_experts.split(1, dim=1)
        # get according batch index for each expert
        self._batch_index = torch.nonzero(gates)[index_sorted_experts[:, 1], 0]
        # calculate num samples that each expert gets
        self._part_sizes = (gates > 0).sum(0).tolist()
        # expand gates to match with self._batch_index
        gates_exp = gates[self._batch_index.flatten()]
        self._nonzero_gates = torch.gather(gates_exp, 1, self._expert_index)

    def dispatch(self, inp):
        """Create one input Tensor for each expert.
        The `Tensor` for a expert `i` contains the slices of `inp` corresponding
        to the batch elements `b` where `gates[b, i] > 0`.
        Args:
          inp: a `Tensor` of shape "[batch_size, <extra_input_dims>]`
        Returns:
          a list of `num_experts` `Tensor`s with shapes
            `[expert_batch_size_i, <extra_input_dims>]`.
        """

        # assigns samples to experts whose gate is nonzero
        # expand according to batch index so we can just split by _part_sizes
        inp_exp = inp[self._batch_index].squeeze(1)
        return torch.split(inp_exp, self._part_sizes, dim=0)

    def combine(self, expert_out, multiply_by_gates=True):
        """Sum together the expert output, weighted by the gates.
        The slice corresponding to a particular batch element `b` is computed
        as the sum over all experts `i` of the expert output, weighted by the
        corresponding gate values.  If `multiply_by_gates` is set to False, the
        gate values are ignored.
        Args:
          expert_out: a list of `num_experts` `Tensor`s, each with shape
            `[expert_batch_size_i, <extra_output_dims>]`.
          multiply_by_gates: a boolean
        Returns:
          a `Tensor` with shape `[batch_size, <extra_output_dims>]`.
        """
        # concatenate the per-expert outputs back into a single tensor
        stitched = torch.cat(expert_out, 0)

        if multiply_by_gates:
            stitched = stitched.mul(self._nonzero_gates)
        zeros = torch.zeros(self._gates.size(0), expert_out[-1].size(1), requires_grad=True, device=stitched.device)
        # combine samples that have been processed by the same k experts
        combined = zeros.index_add(0, self._batch_index, stitched.float())
        return combined

    def expert_to_gates(self):
        """Gate values corresponding to the examples in the per-expert `Tensor`s.
        Returns:
          a list of `num_experts` one-dimensional `Tensor`s with type `tf.float32`
              and shapes `[expert_batch_size_i]`
        """
        # split nonzero gates for each expert
        return torch.split(self._nonzero_gates, self._part_sizes, dim=0)

class MLP(nn.Module):
    def __init__(self, input_size, output_size, hidden_size):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
        self.soft = nn.Softmax(1)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.soft(out)
        return out


class MoE(nn.Module):

    """Call a Sparsely gated mixture of experts layer with 1-layer Feed-Forward networks as experts.
    Args:
    input_size: integer - size of the input
    output_size: integer - size of the output
    num_experts: an integer - number of experts
    hidden_size: an integer - hidden size of the experts
    noisy_gating: a boolean
    k: an integer - how many experts to use for each batch element
    """

    def __init__(self, input_size, output_size, num_experts, hidden_size, noisy_gating=True, k=4):
        super(MoE, self).__init__()
        self.noisy_gating = noisy_gating
        self.num_experts = num_experts
        self.output_size = output_size
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.k = k
        # instantiate experts
        self.experts = nn.ModuleList([MLP(self.input_size, self.output_size, self.hidden_size) for i in range(self.num_experts)])
        self.w_gate = nn.Parameter(torch.zeros(input_size, num_experts), requires_grad=True)
        self.w_noise = nn.Parameter(torch.zeros(input_size, num_experts), requires_grad=True)

        self.softplus = nn.Softplus()
        self.softmax = nn.Softmax(1)
        self.register_buffer("mean", torch.tensor([0.0]))
        self.register_buffer("std", torch.tensor([1.0]))
        assert(self.k <= self.num_experts)

    def cv_squared(self, x):
        """The squared coefficient of variation of a sample.
        Useful as a loss to encourage a positive distribution to be more uniform.
        Epsilons added for numerical stability.
        Returns 0 for an empty Tensor.
        Args:
        x: a `Tensor`.
        Returns:
        a `Scalar`.
        """
        eps = 1e-10
        # if only num_experts = 1

        if x.shape[0] == 1:
            return torch.tensor([0], device=x.device, dtype=x.dtype)
        return x.float().var() / (x.float().mean()**2 + eps)

    def _gates_to_load(self, gates):
        """Compute the true load per expert, given the gates.
        The load is the number of examples for which the corresponding gate is >0.
        Args:
        gates: a `Tensor` of shape [batch_size, n]
        Returns:
        a float32 `Tensor` of shape [n]
        """
        return (gates > 0).sum(0)

    def _prob_in_top_k(self, clean_values, noisy_values, noise_stddev, noisy_top_values):
        """Helper function to NoisyTopKGating.
        Computes the probability that value is in top k, given different random noise.
        This gives us a way of backpropagating from a loss that balances the number
        of times each expert is in the top k experts per example.
        In the case of no noise, pass in None for noise_stddev, and the result will
        not be differentiable.
        Args:
        clean_values: a `Tensor` of shape [batch, n].
        noisy_values: a `Tensor` of shape [batch, n].  Equal to clean values plus
          normally distributed noise with standard deviation noise_stddev.
        noise_stddev: a `Tensor` of shape [batch, n], or None
        noisy_top_values: a `Tensor` of shape [batch, m].
           "values" Output of tf.top_k(noisy_top_values, m).  m >= k+1
        Returns:
        a `Tensor` of shape [batch, n].
        """
        batch = clean_values.size(0)
        m = noisy_top_values.size(1)
        top_values_flat = noisy_top_values.flatten()

        # Index of the (k+1)-th largest noisy value for each row in the flattened tensor
        # (e.g. batch=3, m=3, k=1 gives positions [1, 4, 7]); self.k = experts used per example
        threshold_positions_if_in = torch.arange(batch, device=clean_values.device) * m + self.k
        threshold_if_in = torch.unsqueeze(torch.gather(top_values_flat, 0, threshold_positions_if_in), 1)
        is_in = torch.gt(noisy_values, threshold_if_in)  # boolean: is each noisy value above the threshold?
        threshold_positions_if_out = threshold_positions_if_in - 1
        threshold_if_out = torch.unsqueeze(torch.gather(top_values_flat, 0, threshold_positions_if_out), 1)
        # is each value currently in the top k.
        normal = Normal(self.mean, self.std)
        prob_if_in = normal.cdf((clean_values - threshold_if_in)/noise_stddev)
        prob_if_out = normal.cdf((clean_values - threshold_if_out)/noise_stddev)
        prob = torch.where(is_in, prob_if_in, prob_if_out)
        return prob

    def noisy_top_k_gating(self, x, train, noise_epsilon=1e-2):
        """Noisy top-k gating.
          See paper: https://arxiv.org/abs/1701.06538.
          Args:
            x: input Tensor with shape [batch_size, input_size]
            train: a boolean - we only add noise at training time.
            noise_epsilon: a float
          Returns:
            gates: a Tensor with shape [batch_size, num_experts]
            load: a Tensor with shape [num_experts]
        """
        clean_logits = x @ self.w_gate
        if self.noisy_gating and train:
            raw_noise_stddev = x @ self.w_noise
            noise_stddev = ((self.softplus(raw_noise_stddev) + noise_epsilon))
            noisy_logits = clean_logits + (torch.randn_like(clean_logits) * noise_stddev)
            logits = noisy_logits
        else:
            logits = clean_logits

        # calculate topk + 1 that will be needed for the noisy gates
        logits = self.softmax(logits)
        top_logits, top_indices = logits.topk(min(self.k + 1, self.num_experts), dim=1)
        top_k_logits = top_logits[:, :self.k]
        top_k_indices = top_indices[:, :self.k]
        top_k_gates = top_k_logits / (top_k_logits.sum(1, keepdim=True) + 1e-6)  # normalization

        zeros = torch.zeros_like(logits, requires_grad=True)
        gates = zeros.scatter(1, top_k_indices, top_k_gates)

        if self.noisy_gating and self.k < self.num_experts and train:
            load = (self._prob_in_top_k(clean_logits, noisy_logits, noise_stddev, top_logits)).sum(0)
        else:
            load = self._gates_to_load(gates)
        return gates, load

    def forward(self, x, loss_coef=1e-2):
        """Args:
        x: tensor shape [batch_size, input_size]
        train: a boolean scalar.
        loss_coef: a scalar - multiplier on load-balancing losses

        Returns:
        y: a tensor with shape [batch_size, output_size].
        extra_training_loss: a scalar.  This should be added into the overall
        training loss of the model.  The backpropagation of this loss
        encourages all experts to be approximately equally used across a batch.
        """
        gates, load = self.noisy_top_k_gating(x, self.training)
        # calculate importance loss
        importance = gates.sum(0)
        # balance loss: penalize uneven importance and load across experts
        loss = self.cv_squared(importance) + self.cv_squared(load)
        loss *= loss_coef

        dispatcher = SparseDispatcher(self.num_experts, gates)
        expert_inputs = dispatcher.dispatch(x)
        gates = dispatcher.expert_to_gates()
        expert_outputs = [self.experts[i](expert_inputs[i]) for i in range(self.num_experts)]
        y = dispatcher.combine(expert_outputs)
        return y, loss

5 Application scenarios

1. Natural language processing
In NLP, the MoE mechanism can be applied to tasks such as text classification, text generation, and translation. For example, in text classification, different kinds of text can be assigned to different experts; in text generation, MoE can be used to produce text in multiple styles; in machine translation, MoE can handle translation between different language pairs.
2. Computer vision
In computer vision, MoE can be applied to image classification, object detection, image generation, and similar tasks. For example, different kinds of images can be routed to different experts for classification; MoE can support multi-scale object detection; and it can generate images in diverse styles.
3. Recommender systems
In recommender systems, MoE can dynamically choose the most suitable recommendation algorithm or model based on a user's interests and behavior. For new users or cold-start scenarios, MoE can combine several recommendation algorithms; for established or active users, it can select the algorithm or model that best matches their history and interests for personalized recommendations.

6 Comparison with other attention mechanisms

1. Flexibility
The MoE mechanism can dynamically select the most appropriate expert for each input, so it is highly flexible. By contrast, conventional attention mechanisms usually apply a fixed computation to the input and cannot adapt dynamically to its characteristics.
2. Diversity
MoE handles complex tasks by combining multiple expert models, which gives it great diversity: different experts can cover different kinds of inputs or different parts of a task, providing full coverage of a complex problem. Conventional attention mechanisms typically focus on only part of the input and cannot fully exploit all of its information.
3. Extensibility
Experts can easily be added to or removed from an MoE system to adapt to new tasks or data; it is enough to train a new expert and plug it in. Conventional attention mechanisms usually require redesigning the whole model to adapt to new tasks or data.

                  

