Implementing a Sparse Mixture of Experts Large Language Model from Scratch

hjyai94

Original article: https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch?continueFlag=b0e4428a70cfe28a6725bd07eec222b9

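This is a translated post; the original is makeMoE: Implement a Sparse Mixture of Experts Language Model from Scratch.
The full text was translated with GPT-4, so some passages may read awkwardly. Comments and discussion are welcome.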

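This post walks through implementing a sparse mixture of experts language model from scratch. The project is inspired by, and largely based on, Andrej Karpathy's project "makemore", and borrows many of its reusable components. Like makemore, makeMoE is an autoregressive character-level language model, but it uses the sparse mixture of experts architecture mentioned above. The rest of this post focuses on the key elements of that architecture and how they are implemented. My goal is for you to come away with an intuitive understanding of the whole process after reading this post and looking through the code repository.
The complete code is available in this GitHub repository: https://github.com/AviSoori1x/makeMoE/tree/main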

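With the release of Mixtral and the discussion about whether Llama 3 might be a mixture of experts large language model, interest in this architecture has grown considerably. In sparse mixture of experts language models, however, many components are shared with the conventional Transformer. Although the idea looks simple, empirical evidence suggests that training stability is one of the main problems with these models. A small, hackable implementation such as the one described here may help with quickly trying out new approaches. In this implementation, I made a few important changes to the makemore architecture: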

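A sparse mixture of experts instead of a single feed-forward neural network.
Top-k gating and noisy top-k gating implementations.
Initialization: Kaiming He initialization is used here, but the point of this post is hackability, so you can swap in Xavier/Glorot initialization or others and experiment.
The following, however, is unchanged from makemore:
The dataset, preprocessing (tokenization), and language modeling task Andrej originally chose: generating Shakespeare-like text
Causal self-attention implementation
Training loop
Inference logic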

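Let's get started!

As expected, sparse mixture of experts language models rely on self-attention to understand context. We will explore the intricacies of the mixture of experts module later. First, let's dig into self-attention to refresh our understanding.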

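Understanding the intuition of causal scaled dot-product self-attention

The code provided demonstrates the mechanics and basic concepts of self-attention, focusing on classic scaled dot-product self-attention. In this variant, the query, key, and value matrices all come from the same input sequence. To preserve the integrity of autoregressive generation, particularly in decoder-only models, the code implements masking. This masking is essential because it hides any information after the current token position, so the model attends only to earlier parts of the sequence. This form of attention is called causal self-attention. Note that sparse mixture of experts models are not limited to decoder-only Transformer architectures. In fact, much of the important work in this area, notably that of Shazeer et al., revolves around the T5 architecture, which includes both the encoder and decoder components of the Transformer.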

# This code is borrowed from Andrej Karpathy's makemore repository linked in the repo.
# The self attention layers in Sparse mixture of experts models are the same as in regular transformer models
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)
# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1) #B,T,T
v = value(x) #B,T,H
out = wei @ v # (B,T,T) @ (B,T,H) -> (B,T,H)
out.shape

torch.Size([4, 8, 16])
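
To see what the mask does to the attention weights themselves, here is a minimal framework-free sketch in plain Python. The helper name `causal_attention_weights` is made up for this illustration and is not part of makeMoE; it mirrors the `masked_fill` plus softmax steps above for a single sequence.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix with future positions masked out."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Positions j > t are set to -inf, so exp() turns them into exactly 0.
        masked = [scores[t][j] if j <= t else float("-inf") for j in range(T)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With all-zero scores, every visible position gets the same weight.
uniform = causal_attention_weights([[0.0] * 4 for _ in range(4)])
for row in uniform:
    print([round(w, 3) for w in row])
```

Row t spreads its weight evenly over positions 0..t and gives exactly zero to everything after it, which is the behavior the lower-triangular `tril` mask enforces.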

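The code for causal self-attention and multi-head causal self-attention can be organized as follows. Multi-head self-attention applies several attention heads in parallel, each attending to a separate portion of the channels (the embedding dimension). Multi-head self-attention essentially improves the learning process and, because it is inherently parallel, makes model training more efficient. Note that I use dropout throughout this implementation for regularization, that is, to prevent overfitting.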

#Causal scaled dot product self-Attention Head
n_embd = 64
n_head = 4
n_layer = 4
head_size = 16
dropout = 0.1
block_size = 32 # maximum context length; assumed here because this excerpt never defines it

class Head(nn.Module):
    """ one head of self-attention """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # compute attention scores ("affinities")
        # note: the scale uses C = n_embd, as in the borrowed code; the textbook formulation uses head_size
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

Multi-head attention is implemented as follows:

#Multi-Headed Self Attention
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

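Creating an expert module, i.e. a simple multi-layer perceptron

In the sparse mixture of experts (MoE) architecture, the self-attention mechanism inside each Transformer block stays the same. The structure of each block changes significantly, however: the standard feed-forward network is replaced by several sparsely activated feed-forward networks, known as experts. "Sparse activation" means that each token in the sequence is routed to only a limited number of experts (typically one or two) rather than to the whole available pool. This helps training and inference speed, because only a few experts are activated in each forward pass. All the experts must still reside in GPU memory, though, which creates interesting deployment problems when the total parameter count reaches hundreds of billions or even trillions.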

#Expert module
class Expert(nn.Module):
    """ An MLP is a simple linear layer followed by a non-linearity i.e. each Expert """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
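
The gap between what must be stored and what is used per token can be worked out by hand. The sketch below counts the parameters of the two linear layers in `Expert` (weights plus biases) and compares all experts with the top-k that run for a single token. The values `n_embd = 64`, `num_experts = 8`, and `top_k = 2` are illustrative assumptions taken from snippets elsewhere in this post, and the helper names are made up for this example.

```python
def linear_params(fan_in, fan_out):
    # Weight matrix plus bias vector, as in nn.Linear with its default bias=True.
    return fan_in * fan_out + fan_out

def expert_params(n_embd):
    # Expert = Linear(n_embd, 4 * n_embd) -> ReLU -> Linear(4 * n_embd, n_embd) -> Dropout
    return linear_params(n_embd, 4 * n_embd) + linear_params(4 * n_embd, n_embd)

n_embd, num_experts, top_k = 64, 8, 2  # illustrative values, not a recommendation

per_expert = expert_params(n_embd)
stored = num_experts * per_expert  # parameters that must sit in memory
active = top_k * per_expert        # parameters actually used for one token

print(per_expert, stored, active)  # 33088 264704 66176
print(active / stored)             # 0.25
```

With these numbers, each token touches a quarter of the expert parameters in a layer, while all of them still occupy memory.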

Top-k gating intuition through an example

![[Pasted image 20240129135607.png]]

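The gating network, also called the router, decides which expert network receives the output for each token coming out of multi-head attention. Consider a simple example: suppose there are 4 experts and each token is routed to the top 2. First, we feed the tokens into the gating network through a linear layer. This layer projects the input tensor from shape (2, 4, 32), that is (batch size, tokens, n_embed), where n_embed is the channel dimension of the input, to a new shape (2, 4, 4), corresponding to (batch size, tokens, num_experts), where num_experts is the number of expert networks. Next, we find the k=2 highest values and their indices along the last dimension.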

#Understanding how gating works
num_experts = 4
top_k=2
n_embed=32


#Example multi-head attention output for a simple illustrative example, consider n_embed=32, context_length=4 and batch_size=2
mh_output = torch.randn(2, 4, n_embed)

topkgate_linear = nn.Linear(n_embed, num_experts) # nn.Linear(32, 4)

logits = topkgate_linear(mh_output)
top_k_logits, top_k_indices = logits.topk(top_k, dim=-1)  # Get top-k experts
top_k_logits, top_k_indices

#output:
(tensor([[[ 0.0246, -0.0190],
          [ 0.1991,  0.1513],
          [ 0.9749,  0.7185],
          [ 0.4406, -0.8357]],
 
         [[ 0.6206, -0.0503],
          [ 0.8635,  0.3784],
          [ 0.6828,  0.5972],
          [ 0.4743,  0.3420]]], grad_fn=<TopkBackward0>),
 tensor([[[2, 3],
          [2, 1],
          [3, 1],
          [2, 1]],
 
         [[0, 2],
          [0, 3],
          [3, 2],
          [3, 0]]]))

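The sparse gating output is obtained by keeping only the top-k values at their respective indices along the last dimension. Fill the rest with "-inf" and pass the result through a softmax activation. This pushes the "-inf" entries to zero, makes the top two values more pronounced, and makes them sum to 1. Summing to 1 helps with weighting the expert outputs.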

zeros = torch.full_like(logits, float('-inf')) #full_like clones a tensor and fills it with a specified value (like infinity) for masking or calculations.
sparse_logits = zeros.scatter(-1, top_k_indices, top_k_logits)
sparse_logits

#output
tensor([[[   -inf,    -inf,  0.0246, -0.0190],
         [   -inf,  0.1513,  0.1991,    -inf],
         [   -inf,  0.7185,    -inf,  0.9749],
         [   -inf, -0.8357,  0.4406,    -inf]],

        [[ 0.6206,    -inf, -0.0503,    -inf],
         [ 0.8635,    -inf,    -inf,  0.3784],
         [   -inf,    -inf,  0.5972,  0.6828],
         [ 0.3420,    -inf,    -inf,  0.4743]]], grad_fn=<ScatterBackward0>)
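
The softmax step described above can be checked by hand on one row. This is a minimal plain-Python sketch, not the PyTorch call itself; it takes the first row of the `sparse_logits` output shown above and applies a softmax in which `exp(-inf)` evaluates to exactly 0.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    # Subtract the largest finite value for numerical stability.
    finite_max = max(v for v in row if v != NEG_INF)
    exps = [math.exp(v - finite_max) for v in row]  # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

sparse_row = [NEG_INF, NEG_INF, 0.0246, -0.0190]  # first row of the output above
gates = softmax(sparse_row)
print([round(g, 4) for g in gates])  # [0.0, 0.0, 0.5109, 0.4891]
```

The two masked experts receive a gate of exactly zero, and the two selected experts split the full weight between them in proportion to their logits.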

Generalizing and modularizing the code above, and adding noisy top-k gating for load balancing

# First define the top k router module 
class TopkRouter(nn.Module):
    def __init__(self, n_embed, num_experts, top_k):
        super(TopkRouter, self).__init__()
        self.top_k = top_k
        self.linear =nn.Linear(n_embed, num_experts)
    
    def forward(self, mh_output):
        # mh_output is the output tensor from multihead self attention block
        logits = self.linear(mh_output)
        top_k_logits, indices = logits.topk(self.top_k, dim=-1)
        zeros = torch.full_like(logits, float('-inf'))
        sparse_logits = zeros.scatter(-1, indices, top_k_logits)
        router_output = F.softmax(sparse_logits, dim=-1)
        return router_output, indices

Let's test that it works:

#Testing this out:
num_experts = 4
top_k = 2
n_embd = 32

mh_output = torch.randn(2, 4, n_embd)  # Example input
top_k_gate = TopkRouter(n_embd, num_experts, top_k)
gating_output, indices = top_k_gate(mh_output)
gating_output.shape, gating_output, indices
#And it works!!

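Although the recently released Mixtral paper does not mention it, I believe noisy top-k gating is an important tool for training MoE models. Essentially, you do not want all tokens sent to the same set of "favored" experts. You need a fine balance between exploitation and exploration. To that end, for load balancing, it helps to add standard normal noise, scaled by a learned per-expert factor, to the logits from the gating linear layer. This makes training more efficient.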
![[Pasted image 20240129141044.png]]

#Changing the above to accommodate noisy top-k gating
class NoisyTopkRouter(nn.Module):
    def __init__(self, n_embed, num_experts, top_k):
        super(NoisyTopkRouter, self).__init__()
        self.top_k = top_k
        #layer for router logits
        self.topkroute_linear = nn.Linear(n_embed, num_experts)
        self.noise_linear =nn.Linear(n_embed, num_experts)

    
    def forward(self, mh_output):
        # mh_output is the output tensor from multihead self attention block
        logits = self.topkroute_linear(mh_output)

        #Noise logits
        noise_logits = self.noise_linear(mh_output)

        #Adding scaled unit gaussian noise to the logits
        noise = torch.randn_like(logits)*F.softplus(noise_logits)
        noisy_logits = logits + noise

        top_k_logits, indices = noisy_logits.topk(self.top_k, dim=-1)
        zeros = torch.full_like(noisy_logits, float('-inf'))
        sparse_logits = zeros.scatter(-1, indices, top_k_logits)
        router_output = F.softmax(sparse_logits, dim=-1)
        return router_output, indices

Let's test this implementation as well:

#Testing this out, again:
num_experts = 8
top_k = 2
n_embd = 16

mh_output = torch.randn(2, 4, n_embd)  # Example input
noisy_top_k_gate = NoisyTopkRouter(n_embd, num_experts, top_k)
gating_output, indices = noisy_top_k_gate(mh_output)
gating_output.shape, gating_output, indices
#It works!!

#output
(torch.Size([2, 4, 8]),
 tensor([[[0.4181, 0.0000, 0.5819, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.4693, 0.5307, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.0000, 0.4985, 0.5015, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.2641, 0.0000, 0.7359, 0.0000, 0.0000]],
 
         [[0.0000, 0.0000, 0.0000, 0.6301, 0.0000, 0.3699, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.4766, 0.0000, 0.0000, 0.0000, 0.5234],
          [0.0000, 0.0000, 0.0000, 0.6815, 0.0000, 0.0000, 0.3185, 0.0000],
          [0.4482, 0.5518, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]]],
        grad_fn=<SoftmaxBackward0>),
 tensor([[[2, 0],
          [1, 0],
          [2, 1],
          [5, 3]],
 
         [[3, 5],
          [7, 3],
          [3, 6],
          [1, 0]]]))
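
One simple way to look at load balance is to count how many token slots each expert receives. The sketch below does that in plain Python for the `indices` tensor printed above, copied in as nested lists. It is only a diagnostic for this one random run, not part of the makeMoE code.

```python
from collections import Counter

# Top-k indices copied from the example output above
# (2 sequences x 4 tokens x top_k = 2).
indices = [
    [[2, 0], [1, 0], [2, 1], [5, 3]],
    [[3, 5], [7, 3], [3, 6], [1, 0]],
]
num_experts = 8

load = Counter(e for seq in indices for token in seq for e in token)
tokens_per_expert = [load.get(e, 0) for e in range(num_experts)]
print(tokens_per_expert)  # [3, 3, 2, 4, 0, 2, 1, 1]
```

In this small sample, expert 3 is selected four times and expert 4 not at all. Noisy gating is meant to keep this kind of imbalance from hardening during training.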

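Creating a sparse mixture of experts module

The central aspect of this process is the output of the gating network. Once these results are obtained, the top-k values are selectively multiplied with the outputs of the corresponding top-k experts for a given token. This selective multiplication forms a weighted sum, which is the output of the SparseMoE block. The critical and challenging part is avoiding unnecessary multiplications. It is essential to run the forward pass only for the top-k experts and then compute this weighted sum. Running the forward pass for every expert would defeat the purpose of a sparse MoE, because it would no longer be sparse.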

class SparseMoE(nn.Module):
    def __init__(self, n_embed, num_experts, top_k):
        super(SparseMoE, self).__init__()
        self.router = NoisyTopkRouter(n_embed, num_experts, top_k)
        self.experts = nn.ModuleList([Expert(n_embed) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):
        gating_output, indices = self.router(x)
        final_output = torch.zeros_like(x)

        # Reshape inputs for batch processing
        flat_x = x.view(-1, x.size(-1))
        flat_gating_output = gating_output.view(-1, gating_output.size(-1))

        # Process the experts one at a time; each one sees only the tokens routed to it
        for i, expert in enumerate(self.experts):
            # Create a mask for the inputs where the current expert is in top-k
            expert_mask = (indices == i).any(dim=-1)
            flat_mask = expert_mask.view(-1)

            if flat_mask.any():
                expert_input = flat_x[flat_mask]
                expert_output = expert(expert_input)

                # Extract and apply gating scores
                gating_scores = flat_gating_output[flat_mask, i].unsqueeze(1)
                weighted_output = expert_output * gating_scores

                # Update final output
                # Add the weighted outputs back at their original token positions, so the
                # contributions of all top-k experts accumulate instead of overwriting each other
                final_output[expert_mask] += weighted_output

        return final_output.view_as(x)
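
The weighted sum is easier to follow for a single token. The sketch below uses hypothetical stand-in experts (plain functions that scale the token and record that they ran) and a made-up router output. It is meant to show two things: only experts with a non-zero gate are called, and their weighted outputs are accumulated.

```python
def make_expert(scale, calls, name):
    # Hypothetical stand-in for an Expert MLP: scales the token and logs the call.
    def expert(token):
        calls.append(name)
        return [scale * v for v in token]
    return expert

def sparse_moe_token(token, gates, experts):
    """Weighted sum of expert outputs, skipping experts whose gate is zero."""
    out = [0.0] * len(token)
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue  # not in the top-k for this token: no forward pass at all
        y = expert(token)
        out = [o + gate * v for o, v in zip(out, y)]  # accumulate, never overwrite
    return out

calls = []
experts = [make_expert(s, calls, f"expert{i}") for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
gates = [0.0, 0.25, 0.0, 0.75]  # made-up router output with top_k = 2
result = sparse_moe_token([1.0, -2.0], gates, experts)
print(result)  # [3.5, -7.0]
print(calls)   # ['expert1', 'expert3']
```

The batched PyTorch version does the same thing per expert rather than per token, which lets each expert process all of its tokens in one call.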

It helps to test the implementation above with a sample input. Running the following code shows that it does work!

import torch
import torch.nn as nn

#Let's test this out
num_experts = 8
top_k = 2
n_embd = 16
dropout=0.1

mh_output = torch.randn(4, 8, n_embd)  # Example multi-head attention output
sparse_moe = SparseMoE(n_embd, num_experts, top_k)
final_output = sparse_moe(mh_output)
print("Shape of the final output:", final_output.shape)

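It is also important to appreciate the magnitudes of the top-k values output by the router/gating network, as shown in the code above. The top-k indices determine which experts are activated, while the magnitudes of the values in those top-k dimensions determine their respective weights. The diagram below further emphasizes this idea of a weighted sum.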
![[Pasted image 20240129141553.png]]

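Putting it all together

Multi-head self-attention and the sparse mixture of experts are combined into a sparse mixture of experts Transformer block. As in a vanilla Transformer block, skip connections are added to keep training stable and avoid problems such as vanishing gradients. Layer normalization is also used to further stabilize learning.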

#Create a self attention + mixture of experts block, that may be repeated several number of times 
class Block(nn.Module):
    """ Mixture of Experts Transformer block: communication followed by computation (multi-head self attention + SparseMoE) """

    def __init__(self, n_embed, n_head, num_experts, top_k):
        # n_embed: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.smoe = SparseMoE(n_embed, num_experts, top_k)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.smoe(self.ln2(x))
        return x
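
The pattern `x + sublayer(LayerNorm(x))` used twice in `Block.forward` can be sketched for one token vector in plain Python. This is a simplified illustration: the layer norm below has no learned scale or shift, and `zero_sublayer` is a hypothetical stand-in for attention or the MoE layer.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one token vector to zero mean and unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual(x, sublayer):
    # Pre-norm residual step: x + sublayer(LayerNorm(x)).
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

def zero_sublayer(v):
    # Stand-in for a sublayer that contributes nothing.
    return [0.0 for _ in v]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
x_same = residual(x, zero_sublayer)
print([round(v, 3) for v in normed])
print(x_same)  # [1.0, 2.0, 3.0, 4.0]
```

When the sublayer outputs zeros, the input passes through unchanged. This identity path is what lets gradients flow through a deep stack of blocks.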

Finally, putting everything together to form a sparse mixture of experts language model

class SparseMoELanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embedding tables, followed by the stack of MoE Transformer blocks
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(*[Block(n_embed, n_head=n_head, num_experts=num_experts,top_k=top_k) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embed) # final layer norm
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

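Initialization matters for efficient training of deep neural networks. Kaiming He initialization is used here because of the ReLU activations in the experts. Feel free to experiment with Glorot initialization, which is more commonly used in Transformers and may therefore be an opportunity to improve model performance. Jeremy Howard's Fastai Part 2 has an excellent lecture that implements these methods from scratch (https://course.fast.ai/Lessons/lesson17.html).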

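from torch.nn import init

def kaiming_init_weights(m):
    # Apply Kaiming (He) normal initialization to the weights of every linear layer
    if isinstance(m, nn.Linear):
        init.kaiming_normal_(m.weight)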

model = SparseMoELanguageModel()
model.apply(kaiming_init_weights)
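
To get a feel for how the two schemes differ, the standard deviations they would use for the first linear layer of an expert can be computed directly. The sketch assumes the default settings of `kaiming_normal_` (fan-in mode with a gain of sqrt(2)) and of `xavier_normal_` (gain of 1). The layer size `n_embd = 64` is an illustrative assumption, and the helper names are made up for this example.

```python
import math

def kaiming_normal_std(fan_in):
    # Kaiming (He) normal with fan-in mode and gain sqrt(2): std = sqrt(2 / fan_in).
    return math.sqrt(2.0 / fan_in)

def xavier_normal_std(fan_in, fan_out):
    # Xavier (Glorot) normal with gain 1: std = sqrt(2 / (fan_in + fan_out)).
    return math.sqrt(2.0 / (fan_in + fan_out))

n_embd = 64  # illustrative
fan_in, fan_out = n_embd, 4 * n_embd  # shape of nn.Linear(n_embd, 4 * n_embd)

kaiming_std = kaiming_normal_std(fan_in)
xavier_std = xavier_normal_std(fan_in, fan_out)
print(round(kaiming_std, 4), round(xavier_std, 4))  # 0.1768 0.0791
```

For this layer, Kaiming initialization draws weights with roughly twice the spread of Glorot initialization, so swapping one for the other changes the scale of the initial activations noticeably.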

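I use MLflow to track and log the important metrics and training hyperparameters. The training loop shown here includes that code. If you would rather train without MLflow, the notebooks in the makeMoE GitHub repo also have code blocks that do not use it. Personally, I find tracking parameters and metrics very convenient, especially when experimenting.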

#Using MLFlow
import mlflow
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
#mlflow.set_experiment("makeMoE")
with mlflow.start_run():
    #If you use mlflow.autolog() this will be automatically logged. I chose to explicitly log here for completeness
    params = {"batch_size": batch_size , "block_size" : block_size, "max_iters": max_iters, "eval_interval": eval_interval,
              "learning_rate": learning_rate, "device": device, "eval_iters": eval_iters, "dropout" : dropout, "num_experts": num_experts, "top_k": top_k }
    mlflow.log_params(params)
    for iter in range(max_iters):

        # every once in a while evaluate the loss on train and val sets
        if iter % eval_interval == 0 or iter == max_iters - 1:
            losses = estimate_loss()
            print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
            metrics = {"train_loss": losses['train'], "val_loss": losses['val']}
            mlflow.log_metrics(metrics, step=iter)


        # sample a batch of data
        xb, yb = get_batch('train')

        # evaluate the loss
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

8.996545 M parameters
step 0: train loss 5.3223, val loss 5.3166
step 100: train loss 2.7351, val loss 2.7429
step 200: train loss 2.5125, val loss 2.5233
.
.
.

step 4999: train loss 1.5712, val loss 1.7508

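Logging the training and validation losses gives a good picture of how training is going. The plot shows that I probably should have stopped training at around 4,500 steps, when the validation loss ticks up slightly.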
![[Pasted image 20240129142310.png]]
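
Picking the stopping point by eye works, but the same decision can be automated from the logged validation losses. The sketch below is one possible approach, not something the makeMoE training loop does: it finds the evaluation step with the lowest validation loss and applies a simple patience rule. The function names and the loss values are made up for illustration and are not the run logged above.

```python
def best_checkpoint(history):
    """Return the (step, val_loss) pair with the lowest validation loss."""
    return min(history, key=lambda item: item[1])

def should_stop(history, patience):
    """True once the last `patience` evaluations all failed to beat the earlier best."""
    if len(history) <= patience:
        return False
    best_before = min(loss for _, loss in history[:-patience])
    return all(loss >= best_before for _, loss in history[-patience:])

# Made-up (step, val_loss) pairs for illustration only.
history = [(0, 5.0), (100, 3.0), (200, 2.0), (300, 2.1), (400, 2.2)]

print(best_checkpoint(history))            # (200, 2.0)
print(should_stop(history, patience=2))    # True
print(should_stop(history[:3], patience=2))  # False
```

In a real run, `history` would be filled inside the evaluation branch of the training loop, using the same `losses['val']` value that is already logged to MLflow.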

Now we can use the model to generate text autoregressively, one character at a time. For a sparsely activated ~9M parameter model, I can't complain.

# generate from the model. Not great. Not too bad either
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

DUKE VINCENVENTIO:
If it ever fecond he town sue kigh now,
That thou wold'st is steen 't.

SIMNA:
Angent her; no, my a born Yorthort,
Romeoos soun and lawf to your sawe with ch a woft ttastly defy,
To declay the soul art; and meart smad.

CORPIOLLANUS:
Which I cannot shall do from by born und ot cold warrike,
What king we best anone wrave's going of heard and good
Thus playvage; you have wold the grace.
...

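I hope this explanation helps you understand the architecture of sparse mixture of experts models and how it all fits together. In this implementation, I mainly referred to the following publications:
Mixtral of Experts: https://arxiv.org/pdf/2401.04088.pdf
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer: https://arxiv.org/pdf/1701.06538.pdf
The original makemore implementation from Andrej Karpathy: https://github.com/karpathy/makemore
The code was developed entirely on Databricks using a single A100. If you run it on Databricks, you can scale it out on an arbitrarily large GPU cluster on the cloud provider of your choice without trouble. I chose MLflow (which comes pre-installed on Databricks; it is fully open source and you can easily pip install it elsewhere) because I find it helpful for tracking and logging all the necessary metrics. This is entirely optional. Note that the implementation emphasizes readability and hackability over performance, so there are many ways to improve it. With that in mind, here are a few things you could try:
Make the mixture of experts module more efficient. I believe the sparse activation of the correct experts in the implementation above can be improved substantially.
Try different neural network initialization strategies. The source I listed (Fastai Part 2) is excellent.
Move from character-level to sub-word tokenization.
Run a Bayesian hyperparameter search over the number of experts and top_k (the number of experts activated per token). This could be classified as neural architecture search.
Expert capacity is not discussed or implemented here. It is definitely worth exploring.
Given the attention on both mixture of experts and multimodality, it will also be interesting to see what develops at the intersection of the two. Happy hacking!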