A code-level comparison of glm4, qwen, and MiniCPM-Llama3-V

路人与大师

Tags: algorithms, language models, big data

Article link: https://blog.csdn.net/weixin_41046245/article/details/139775402

Let's walk through the RotaryEmbedding class in glm4 and explain how the code works, line by line.

Class definition and initialization

class RotaryEmbedding(nn.Module):
    def __init__(self, dim, rope_ratio=1, original_impl=False, device=None, dtype=None):
        super().__init__()
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).to(dtype=dtype) / dim))
        self.register_buffer("inv_freq", inv_freq)
        self.dim = dim
        self.original_impl = original_impl
        self.rope_ratio = rope_ratio

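class RotaryEmbedding(nn.Module):
Defines the class RotaryEmbedding, which inherits from nn.Module.
def __init__(self, dim, rope_ratio=1, original_impl=False, device=None, dtype=None):
The constructor. It takes the following parameters:
- dim: the dimension of the rotary embedding.
- rope_ratio: a multiplier applied to the frequency base, default 1.
- original_impl: whether to use the original implementation, default False.
- device: the device (CPU or GPU) on which the tensors are created.
- dtype: the data type.
super().__init__()
Calls the initializer of the parent class nn.Module.
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).to(dtype=dtype) / dim))
Computes the inverse frequencies inv_freq used to build the rotary position embedding:
- torch.arange(0, dim, 2, device=device) generates the sequence 0, 2, 4, ... up to but not including dim.
- The sequence is converted to dtype and then divided by dim.
- The result is 1.0 / (10000 ** (sequence / dim)), one frequency per pair of dimensions.
self.register_buffer("inv_freq", inv_freq)
Registers inv_freq as a buffer. A buffer is not a parameter, so the optimizer does not update it, but it moves with the module across devices and is saved in the state dict.
self.dim = dim
Stores the dimension.
self.original_impl = original_impl
Stores the flag that says whether to use the original implementation.
self.rope_ratio = rope_ratio
Stores the rope_ratio parameter.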
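
Written as a formula, the inverse frequencies computed in the constructor look like this. The index $i$ is introduced here only for illustration and does not appear in the code; the range shown assumes dim is even.

```latex
\text{inv\_freq}_i = \frac{1}{10000^{\,2i/\text{dim}}}, \qquad i = 0, 1, \dots, \frac{\text{dim}}{2} - 1
```

For example, with dim = 8 the exponents are 0, 1/4, 1/2, and 3/4, so the four values are 1, 0.1, 0.01, and 0.001: the first pair of dimensions rotates fastest and each later pair rotates ten times more slowly.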

The `forward_impl` method

def forward_impl(
        self, seq_len: int, n_elem: int, dtype: torch.dtype, device: torch.device, base: int = 10000
):
    base = base * self.rope_ratio
    theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, dtype=torch.float, device=device) / n_elem))

    seq_idx = torch.arange(seq_len, dtype=torch.float, device=device)

    idx_theta = torch.outer(seq_idx, theta).float()

    cache = torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1)

    if dtype in (torch.float16, torch.bfloat16, torch.int8):
        cache = cache.bfloat16() if dtype == torch.bfloat16 else cache.half()
    return cache

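def forward_impl(self, seq_len: int, n_elem: int, dtype: torch.dtype, device: torch.device, base: int = 10000):
Defines the forward_impl method. It takes the following parameters:
- seq_len: the sequence length.
- n_elem: the number of dimensions the rotation covers (forward passes self.dim); theta ends up with n_elem / 2 entries.
- dtype: the data type.
- device: the device.
- base: the frequency base, default 10000.
base = base * self.rope_ratio
Scales the base by rope_ratio.
theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, dtype=torch.float, device=device) / n_elem))
Computes the rotation frequencies theta:
- torch.arange(0, n_elem, 2, dtype=torch.float, device=device) generates 0, 2, 4, ... up to but not including n_elem, as float32 on the given device.
- The sequence is divided by n_elem.
- The result is 1.0 / (base ** (sequence / n_elem)).
seq_idx = torch.arange(seq_len, dtype=torch.float, device=device)
Generates the position indices 0 to seq_len - 1, with the given data type and device.
idx_theta = torch.outer(seq_idx, theta).float()
Takes the outer product of the position indices and theta, giving a matrix of shape [seq_len, n_elem / 2] whose entry (m, i) is the angle m * theta_i.
cache = torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1)
Computes the cos and sin of every angle and stacks them along a new last dimension, giving a tensor of shape [seq_len, n_elem / 2, 2].
if dtype in (torch.float16, torch.bfloat16, torch.int8):
Checks whether the data type is float16, bfloat16, or int8.
cache = cache.bfloat16() if dtype == torch.bfloat16 else cache.half()
Casts the cache to bfloat16 when dtype is bfloat16 and to float16 otherwise (that is, for float16 and int8). For any other dtype the cache stays in float32.
return cache
Returns the computed cache.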
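
The steps of forward_impl can be sketched with the standard library alone, which makes the shape of the cache and the effect of rope_ratio easy to inspect. This is a minimal sketch, not the model's code: the function name rope_cache is made up for this example, nested lists stand in for tensors, and the final dtype cast is left out.

```python
import math

def rope_cache(seq_len, n_elem, base=10000.0, rope_ratio=1.0):
    """Plain-Python mirror of forward_impl: cache[m][i] = (cos(m*theta_i), sin(m*theta_i))."""
    base = base * rope_ratio
    # One frequency per pair of dimensions: exponents 0/n, 2/n, 4/n, ...
    theta = [1.0 / (base ** (i / n_elem)) for i in range(0, n_elem, 2)]
    # Outer product of positions and frequencies, then cos/sin of each angle.
    return [[(math.cos(m * t), math.sin(m * t)) for t in theta] for m in range(seq_len)]

cache = rope_cache(seq_len=4, n_elem=8)
scaled = rope_cache(seq_len=4, n_elem=8, rope_ratio=100.0)

print(len(cache), len(cache[0]))   # positions, frequencies per position
print(cache[1][1], scaled[1][1])   # same position and index, smaller angle after scaling
```

Position 0 always maps to (1.0, 0.0), the identity rotation. A rope_ratio above 1 enlarges the base, so every frequency except the first shrinks and the same position corresponds to a smaller rotation angle.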

The `forward` method

def forward(self, max_seq_len, offset=0):
    return self.forward_impl(
        max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device
    )

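def forward(self, max_seq_len, offset=0):
Defines the forward method. It takes the following parameters:
- max_seq_len: the maximum sequence length.
- offset: an offset (default 0; this implementation does not use it).
return self.forward_impl(max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device)
Calls forward_impl with max_seq_len, self.dim, self.inv_freq.dtype, and self.inv_freq.device, and returns the result. Note that inv_freq is used here only for its dtype and device; the frequencies themselves are recomputed inside forward_impl.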

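Summary

This RotaryEmbedding class implements the table-building part of rotary position embedding (RoPE): for every position it precomputes the cos and sin of the rotation angles. Rotating query and key vectors by these angles makes their dot product depend on the relative distance between positions, which lets a Transformer capture relative position information in sequence data such as text in natural language processing tasks. The class itself only returns the cache; the rotation is applied elsewhere in the model.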
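
The relative-position claim can be checked numerically on a single pair of dimensions. The sketch below is an illustration under simplifying assumptions: it rotates one 2-D pair with one arbitrary frequency theta = 0.1, and the helper names rotate_pair and rope_score are made up for this example. How glm4 pairs up dimensions and applies the cache is not shown in the code above.

```python
import math

def rotate_pair(x, y, angle):
    """Rotate the 2-D point (x, y) counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, m, n, theta=0.1):
    """Dot product of q rotated to position m and k rotated to position n."""
    qx, qy = rotate_pair(q[0], q[1], m * theta)
    kx, ky = rotate_pair(k[0], k[1], n * theta)
    return qx * kx + qy * ky

q, k = (1.0, 2.0), (0.5, -1.0)
near = rope_score(q, k, m=5, n=3)      # positions 5 and 3, gap of 2
far = rope_score(q, k, m=105, n=103)   # shifted by 100, same gap of 2
other = rope_score(q, k, m=5, n=4)     # gap of 1

print(abs(near - far) < 1e-9)    # same gap, same score up to rounding
print(abs(near - other) > 1e-3)  # different gap, different score
```

Shifting both positions by the same amount leaves the score unchanged, because the two rotations combine into a single rotation by the difference of the angles.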

Next, let's walk through the RMSNorm class in glm4, again line by line.

Class definition and initialization

class RMSNorm(torch.nn.Module):
    def __init__(self, normalized_shape, eps=1e-5, device=None, dtype=None, **kwargs):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.empty(normalized_shape, device=device, dtype=dtype))
        self.eps = eps

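class RMSNorm(torch.nn.Module):
Defines the class RMSNorm, which inherits from torch.nn.Module. RMSNorm is a normalization layer that normalizes by the root mean square (RMS).
def __init__(self, normalized_shape, eps=1e-5, device=None, dtype=None, **kwargs):
The constructor. It takes the following parameters:
- normalized_shape: the shape to normalize over, usually the size of the last dimension of the input tensor.
- eps: a small value that prevents division by zero, default 1e-5.
- device: the device (CPU or GPU) on which the parameter is created.
- dtype: the data type.
- **kwargs: any other arguments (not used here).
super().__init__()
Calls the initializer of the parent class torch.nn.Module.
self.weight = torch.nn.Parameter(torch.empty(normalized_shape, device=device, dtype=dtype))
Defines a trainable parameter weight of shape normalized_shape. torch.empty only allocates memory and leaves the values uninitialized, so the weight has to be filled in afterwards, for example by loading a checkpoint.
self.eps = eps
Stores the eps parameter.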

The `forward` method

def forward(self, hidden_states: torch.Tensor):
    input_dtype = hidden_states.dtype
    variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
    hidden_states = hidden_states * torch.rsqrt(variance + self.eps)

    return (self.weight * hidden_states).to(input_dtype)

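def forward(self, hidden_states: torch.Tensor):
Defines the forward method. It takes one parameter:
- hidden_states: the input tensor.
input_dtype = hidden_states.dtype
Stores the original data type of the input tensor.
variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
Computes the mean of the squared elements. The variable is named variance, but no mean is subtracted, so it is the mean square rather than a true variance. The steps are:
- hidden_states.to(torch.float32) converts the input to float32 to avoid numerical instability.
- .pow(2) squares each element.
- .mean(-1, keepdim=True) averages over the last dimension and keeps that dimension with size 1.
hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
Applies the RMS normalization. The steps are:
- variance + self.eps adds eps to the mean square to prevent division by zero.
- torch.rsqrt(...) computes the reciprocal square root of that sum.
- hidden_states * ... multiplies the input by the reciprocal square root. Because that factor is float32, the product is float32 as well.
return (self.weight * hidden_states).to(input_dtype)
Returns the normalized tensor, converted back to the original data type. The steps are:
- self.weight * hidden_states multiplies the normalized tensor by the weight parameter, which scales each feature.
- .to(input_dtype) converts the result back to the original data type of the input.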
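
The arithmetic of forward can be sketched on a single vector with the standard library. This is a minimal sketch that assumes a plain list in place of a tensor and ignores the dtype handling; the function name rms_norm is made up for this example.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Normalize one vector by its root mean square, then scale element-wise by weight."""
    mean_square = sum(v * v for v in x) / len(x)   # what the code calls `variance`
    scale = 1.0 / math.sqrt(mean_square + eps)     # torch.rsqrt(variance + eps)
    return [w * v * scale for v, w in zip(x, weight)]

out = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
constant = rms_norm([1.0, 1.0], weight=[1.0, 1.0])

print(out)       # the output has a root mean square of about 1
print(constant)  # a constant vector stays close to [1.0, 1.0]
```

The second call shows the difference from Layer Normalization: since no mean is subtracted, a constant input is not mapped to zeros.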

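Summary

This RMSNorm class implements a normalization layer that normalizes the input tensor by its root mean square (RMS). Unlike traditional Batch Normalization and Layer Normalization, RMSNorm does not subtract a mean: it uses only the mean of the squared elements (the quantity the code calls variance), and it has a scale weight but no bias. Skipping the mean computation makes it cheaper, and in some cases it gives better results.

The key steps are:

At initialization, create a trainable weight parameter weight.
In forward, compute the mean square of the input over the last dimension and normalize the input by its square root, the RMS.
Scale the normalized tensor by the trainable weight and return the result.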

Let's walk through the code of the CoreAttention class and explain how it works, line by line.

Class definition and initialization

class CoreAttention(torch.nn.Module):
    def __init__(self, config: ChatGLMConfig, layer_number):
        super(CoreAttention, self).__init__()

        self.apply_query_key_layer_scaling = config.apply_query_key_layer_scaling
        self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32
        if self.apply_query_key_layer_scaling:
            self.attention_softmax_in_fp32 = True
        self.layer_number = max(1, layer_number)

        projection_size = config.kv_channels * config.num_attention_heads

        # Per attention head and per partition values.
        self.hidden_size_per_partition = projection_size
        self.hidden_size_per_attention_head = projection_size // config.num_attention_heads
        self.num_attention_heads_per_partition = config.num_attention_heads

        coeff = None
        self.norm_factor = math.sqrt(self.hidden_size_per_attention_head)
        if self.apply_query_key_layer_scaling:
            coeff = self.layer_number
            self.norm_factor *= coeff
        self.coeff = coeff

        self.attention_dropout = torch.nn.Dropout(config.attention_dropout)

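class CoreAttention(torch.nn.Module):
Defines the class CoreAttention, which inherits from torch.nn.Module and implements the core attention computation.
def __init__(self, config: ChatGLMConfig, layer_number):
The constructor takes two parameters:
- config: the model configuration object.
- layer_number: the index of the current attention layer.
super(CoreAttention, self).__init__()
Calls the initializer of the parent class torch.nn.Module.
self.apply_query_key_layer_scaling = config.apply_query_key_layer_scaling
Reads from the config whether query-key layer scaling is applied.
self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32
Reads from the config whether softmax is computed in FP32 (32-bit floating point).
if self.apply_query_key_layer_scaling:
If query-key layer scaling is applied, softmax is forced to run in FP32.
self.layer_number = max(1, layer_number)
Ensures the layer number is at least 1.
projection_size = config.kv_channels * config.num_attention_heads
Computes the projection size: the number of channels per head (kv_channels) times the number of attention heads.
self.hidden_size_per_partition = projection_size
The hidden size per partition, here simply the full projection size.
self.hidden_size_per_attention_head = projection_size // config.num_attention_heads
The hidden size per attention head, which equals kv_channels.
self.num_attention_heads_per_partition = config.num_attention_heads
The number of attention heads per partition.
coeff = None
Initializes the coefficient to None.
self.norm_factor = math.sqrt(self.hidden_size_per_attention_head)
Computes the normalization factor, the square root of the per-head hidden size.
if self.apply_query_key_layer_scaling:
If query-key layer scaling is applied, coeff is set to the layer number and norm_factor is multiplied by it.
self.coeff = coeff
Stores the coefficient.
self.attention_dropout = torch.nn.Dropout(config.attention_dropout)
Defines the dropout layer applied to the attention probabilities.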
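
In the pre-2.0 branch of forward, the raw scores are multiplied by 1 / norm_factor inside torch.baddbmm and later multiplied by coeff before the softmax, so the layer number cancels out. The sketch below traces that arithmetic for a single score; scaled_score is a made-up helper, not part of the model. The usual motivation for such a detour, which the post does not state, is presumably to keep half-precision scores small until they are rescaled in FP32.

```python
import math

def scaled_score(raw_score, head_dim, layer_number, apply_layer_scaling):
    """Trace one attention score through norm_factor and coeff as set up in __init__."""
    norm_factor = math.sqrt(head_dim)
    coeff = None
    if apply_layer_scaling:
        coeff = max(1, layer_number)
        norm_factor *= coeff
    score = raw_score / norm_factor   # alpha = 1 / norm_factor in baddbmm
    if coeff is not None:
        score = score * coeff         # multiplied back before the softmax
    return score

plain = scaled_score(64.0, head_dim=64, layer_number=7, apply_layer_scaling=False)
layered = scaled_score(64.0, head_dim=64, layer_number=7, apply_layer_scaling=True)
print(plain, layered)  # both are the raw score divided by sqrt(head_dim), up to rounding
```

With or without layer scaling, the value that reaches the softmax is the raw score divided by the square root of the per-head size.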

The `forward` method

def forward(self, query_layer, key_layer, value_layer, attention_mask):
    pytorch_major_version = int(torch.__version__.split('.')[0])
    if pytorch_major_version >= 2:
        if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]:
            context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
                                                                             is_causal=True)
        else:
            if attention_mask is not None:
                attention_mask = ~attention_mask
            context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
                                                                             attention_mask)
        context_layer = context_layer.transpose(1, 2).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
        context_layer = context_layer.reshape(*new_context_layer_shape)
    else:
        # Raw attention scores

        # [b, np, sq, sk]
        output_size = (query_layer.size(0), query_layer.size(1), query_layer.size(2), key_layer.size(2))

        # [b, np, sq, hn] -> [b * np, sq, hn]
        query_layer = query_layer.view(output_size[0] * output_size[1], output_size[2], -1)
        # [b, np, sk, hn] -> [b * np, sk, hn]
        key_layer = key_layer.view(output_size[0] * output_size[1], output_size[3], -1)

        # preallocating input tensor: [b * np, sq, sk]
        matmul_input_buffer = torch.empty(
            output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query_layer.dtype,
            device=query_layer.device
        )

        # Raw attention scores. [b * np, sq, sk]
        matmul_result = torch.baddbmm(
            matmul_input_buffer,
            query_layer,  # [b * np, sq, hn]
            key_layer.transpose(1, 2),  # [b * np, hn, sk]
            beta=0.0,
            alpha=(1.0 / self.norm_factor),
        )

        # change view to [b, np, sq, sk]
        attention_scores = matmul_result.view(*output_size)

        # ===========================
        # Attention probs and dropout
        # ===========================

        # attention scores and attention mask [b, np, sq, sk]
        if self.attention_softmax_in_fp32:
            attention_scores = attention_scores.float()
        if self.coeff is not None:
            attention_scores = attention_scores * self.coeff
        if attention_mask is None and attention_scores.shape[2] == attention_scores.shape[3]:
            attention_mask = torch.ones(output_size[0], 1, output_size[2], output_size[3],
                                        device=attention_scores.device, dtype=torch.bool)
            attention_mask.tril_()
            attention_mask = ~attention_mask
        if attention_mask is not None:
            attention_scores = attention_scores.masked_fill(attention_mask, float("-inf"))
        attention_probs = F.softmax(attention_scores, dim=-1)
        attention_probs = attention_probs.type_as(value_layer)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.attention_dropout(attention_probs)

        # query layer shape: [b * np, sq, hn]
        # value layer shape: [b, np, sk, hn]
        # attention shape: [b, np, sq, sk]
        # context layer shape: [b, np, sq, hn]
        output_size = (value_layer.size(0), value_layer.size(1), query_layer.size(1), value_layer.size(3))
        # change view [b * np, sk, hn]
        value_layer = value_layer.view(output_size[0] * output_size[1], value_layer.size(2), -1)
        # change view [b * np, sq, sk]
        attention_probs = attention_probs.view(output_size[0] * output_size[1], output_size[2], -1)
        # matmul: [b * np, sq, hn]
        context_layer = torch.bmm(attention_probs, value_layer)
        # change view [b, np, sq, hn]
        context_layer = context_layer.view(*output_size)
        # [b, np, sq, hn] --> [b, sq, np, hn]
        context_layer = context_layer.transpose(1, 2).contiguous()
        # [b, sq, np, hn] --> [b, sq, hp]
        new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
        context_layer = context_layer.reshape(*new_context_layer_shape)

    return context_layer

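Main method in detail

def forward(self, query_layer, key_layer, value_layer, attention_mask):
- query_layer, key_layer, value_layer: the query, key, and value tensors of the attention mechanism.
- attention_mask: the attention mask, used to block out some attention weights.
Version check and branch selection
- pytorch_major_version = int(torch.__version__.split('.')[0])
  Reads the major version number of PyTorch.
- The attention is then computed in one of two ways; PyTorch 2.0 and later use torch.nn.functional.scaled_dot_product_attention.
Implementation for PyTorch 2.0 and later
- if pytorch_major_version >= 2:
  - If no attention_mask is given and query_layer and key_layer have the same sequence length, causal attention is used (is_causal=True).
  - Otherwise, if an attention_mask is given, it is inverted first (~attention_mask): in this code True marks a masked position, whereas a boolean mask passed to scaled_dot_product_attention uses True for positions that may be attended to.
  - scaled_dot_product_attention computes the context layer. This branch uses neither norm_factor, coeff, nor attention_dropout.
- The context layer is then transposed and reshaped from [b, np, sq, hn] to [b, sq, hp].

Implementation for PyTorch versions below 2.0
- Compute the raw attention scores:
  - output_size: determines the size of the output, [b, np, sq, sk].
  - Reshape query_layer and key_layer so that a batched matrix multiplication can be applied.
  - Preallocate the input tensor matmul_input_buffer and run torch.baddbmm with beta=0.0 and alpha=1 / norm_factor.
  - Reshape the result into attention_scores.
- Compute the attention probabilities and apply dropout:
  - Convert the scores to FP32 for the softmax (if required).
  - Apply the layer scaling coefficient (if present).
  - If no attention_mask is given and the query and key lengths match, build a lower-triangular mask and invert it.
  - Fill the masked positions with -inf and compute the attention probabilities with softmax.
  - Randomly drop part of the attention probabilities with the dropout layer.
- Compute the context layer:
  - Reshape value_layer and attention_probs.
  - Run the batched matrix multiplication torch.bmm to get the context layer.
  - Reshape the context layer to [b, sq, hp].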
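
The pre-2.0 branch can be illustrated for a single head with the standard library. This is a minimal sketch under stated assumptions: one head, no batch dimension, equal query and key lengths (the case in which the code builds the lower-triangular mask), no dropout, and no layer scaling coefficient. The function name causal_attention is made up for this example.

```python
import math

def causal_attention(q, k, v):
    """Single-head attention over lists of vectors with a lower-triangular (causal) mask."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Scores against keys 0..i only; later keys correspond to the -inf entries.
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in range(i + 1)]
        peak = max(scores)                          # subtract the max for a stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        out.append([sum(p * v[j][c] for j, p in enumerate(probs)) for c in range(len(v[0]))])
    return out

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
context = causal_attention(q, k, v)
print(context[0])  # the first position can only see the first value vector
print(context[1])  # the second position mixes both value vectors
```

Skipping the later keys is equivalent to filling their scores with -inf before the softmax, since exp(-inf) contributes zero weight.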

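Summary

This CoreAttention class implements the core attention computation: it takes the query, key, and value tensors and produces the attention weights and the context representation. In short:

At initialization, it sets configuration values such as whether query-key layer scaling is applied and which data type the softmax uses.
In forward, it picks one of two ways to compute attention based on the PyTorch version.
With PyTorch 2.0 or later, it uses the built-in scaled_dot_product_attention function.
Otherwise, it computes the attention scores and the context layer by hand, including mask handling and dropout.

In this way, the CoreAttention class adapts to different PyTorch versions and still provides an efficient attention implementation.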

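Finally, I brought in chatgpt4o to summarize the process above.

OK, below I describe each part of the code in detail and express the math of each step with LaTeX formulas.

The GLMBlock class

This class inherits from torch.nn.Module and represents a single Transformer layer. The layer takes an input of size [s, b, h] and returns an output of the same size.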

class GLMBlock(torch.nn.Module):
    """A single transformer layer.

    Transformer layer takes input with size [s, b, h] and returns an
    output of the same size.
    """

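Initialization method

The initializer defines the components of the layer: the layer normalization on the input, the self-attention mechanism, the layer normalization on the attention output, and the MLP. Both normalization layers are RMSNorm when config.rmsnorm is set and LayerNorm otherwise.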

def __init__(self, config: ChatGLMConfig, layer_number, device=None):
    super(GLMBlock, self).__init__()
    self.layer_number = layer_number

    self.apply_residual_connection_post_layernorm = config.apply_residual_connection_post_layernorm
    self.fp32_residual_connection = config.fp32_residual_connection

    LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm
    self.input_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
                                         dtype=config.torch_dtype)

    self.self_attention = SelfAttention(config, layer_number, device=device)
    self.hidden_dropout = config.hidden_dropout

    self.post_attention_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
                                                  dtype=config.torch_dtype)

    self.mlp = MLP(config, device=device)

Forward method

The forward method defines how data flows through the components.

def forward(self, hidden_states, attention_mask, rotary_pos_emb, kv_cache=None, use_cache=True):
    # hidden_states: [s, b, h]

1. Input layer normalization

Apply layer normalization to the input.

$\text{layernorm\_output} = \text{LayerNorm}(\text{hidden\_states})$

layernorm_output = self.input_layernorm(hidden_states)

2. Self-attention

Pass the normalized output to the self-attention layer and get back the attention output and the updated cache.

$\text{attention\_output}, \text{kv\_cache} = \text{SelfAttention}(\text{layernorm\_output}, \text{attention\_mask}, \text{rotary\_pos\_emb}, \text{kv\_cache}=\text{kv\_cache}, \text{use\_cache}=\text{use\_cache})$

attention_output, kv_cache = self.self_attention(
    layernorm_output,
    attention_mask,
    rotary_pos_emb,
    kv_cache=kv_cache,
    use_cache=use_cache
)

3. Residual connection

The config decides where the residual is taken from.

$\text{residual} = \begin{cases} \text{layernorm\_output} & \text{if apply\_residual\_connection\_post\_layernorm} \\ \text{hidden\_states} & \text{otherwise} \end{cases}$

if self.apply_residual_connection_post_layernorm:
    residual = layernorm_output
else:
    residual = hidden_states

4. Apply Dropout and the second layer normalization

$\text{layernorm\_input} = \text{Dropout}(\text{attention\_output}, p=\text{self.hidden\_dropout})$
$\text{layernorm\_input} = \text{residual} + \text{layernorm\_input}$
$\text{layernorm\_output} = \text{LayerNorm}(\text{layernorm\_input})$

layernorm_input = torch.nn.functional.dropout(attention_output, p=self.hidden_dropout, training=self.training)
layernorm_input = residual + layernorm_input

layernorm_output = self.post_attention_layernorm(layernorm_input)

5. MLP layer

$\text{mlp\_output} = \text{MLP}(\text{layernorm\_output})$

mlp_output = self.mlp(layernorm_output)

6. Second residual connection and output Dropout

$\text{residual} = \begin{cases} \text{layernorm\_output} & \text{if apply\_residual\_connection\_post\_layernorm} \\ \text{layernorm\_input} & \text{otherwise} \end{cases}$
$\text{output} = \text{Dropout}(\text{mlp\_output}, p=\text{self.hidden\_dropout})$
$\text{output} = \text{residual} + \text{output}$

if self.apply_residual_connection_post_layernorm:
    residual = layernorm_output
else:
    residual = layernorm_input

output = torch.nn.functional.dropout(mlp_output, p=self.hidden_dropout, training=self.training)
output = residual + output

Return the output and the cache

return output, kv_cache
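
The data flow of the six steps can be sketched with ordinary functions standing in for the sub-layers. This is a minimal sketch under stated assumptions: the sub-layers are passed in as callables on plain numbers, the kv cache is left out, and dropout is omitted because torch.nn.functional.dropout with training=False returns its input unchanged. The name glm_block and its parameters are made up for this example.

```python
def glm_block(hidden, norm1, attention, norm2, mlp, post_layernorm_residual=False):
    """Data flow of one block: norm -> attention -> residual -> norm -> MLP -> residual."""
    normed = norm1(hidden)
    attn_out = attention(normed)
    # First residual: taken after the norm only when the config flag is set.
    residual = normed if post_layernorm_residual else hidden
    norm_input = residual + attn_out
    normed = norm2(norm_input)
    mlp_out = mlp(normed)
    # Second residual: same choice, one stage later.
    residual = normed if post_layernorm_residual else norm_input
    return residual + mlp_out

# Toy stand-ins so the two residual placements can be told apart.
halve = lambda x: x / 2
add_one = lambda x: x + 1
triple = lambda x: x * 3

pre = glm_block(4.0, halve, add_one, halve, triple)
post = glm_block(4.0, halve, add_one, halve, triple, post_layernorm_residual=True)
print(pre, post)
```

With the flag off, the un-normalized input is carried along the residual path; with the flag on, the normalized value is carried instead, which changes the result.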

Summary

In this way, the GLMBlock class implements one Transformer layer made up of layer normalization, self-attention, residual connections, Dropout, and an MLP. The steps, written as LaTeX formulas, are:

Input layer normalization:
$\text{layernorm\_output} = \text{LayerNorm}(\text{hidden\_states})$
Self-attention:
$\text{attention\_output}, \text{kv\_cache} = \text{SelfAttention}(\text{layernorm\_output}, \text{attention\_mask}, \text{rotary\_pos\_emb}, \text{kv\_cache}=\text{kv\_cache}, \text{use\_cache}=\text{use\_cache})$
Residual connection:
$\text{residual} = \begin{cases} \text{layernorm\_output} & \text{if apply\_residual\_connection\_post\_layernorm} \\ \text{hidden\_states} & \text{otherwise} \end{cases}$
Dropout and the second layer normalization:
$\text{layernorm\_input} = \text{Dropout}(\text{attention\_output}, p=\text{self.hidden\_dropout})$
$\text{layernorm\_input} = \text{residual} + \text{layernorm\_input}$
$\text{layernorm\_output} = \text{LayerNorm}(\text{layernorm\_input})$
MLP layer:
$\text{mlp\_output} = \text{MLP}(\text{layernorm\_output})$
Second residual connection and output Dropout:
$\text{residual} = \begin{cases} \text{layernorm\_output} & \text{if apply\_residual\_connection\_post\_layernorm} \\ \text{layernorm\_input} & \text{otherwise} \end{cases}$
$\text{output} = \text{Dropout}(\text{mlp\_output}, p=\text{self.hidden\_dropout})$
$\text{output} = \text{residual} + \text{output}$

I could not find a model file for llama3.

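However, the recently popular model from 面壁智能 (ModelBest) can serve as a reference.

The code defines a class named MiniCPMV, which inherits from MiniCPMVPreTrainedModel. It appears to be a pretrained model for multimodal data (text and images). Here is a reading of its parts:

__init__ method: the constructor, which initializes the components of the model.
- self.llm: a language model component for text generation.
- self.vpm: the vision module, which processes image data.
- self.vision_dim and self.embed_dim: the embedding dimensions of the vision module and the language model.
- self.resampler: a resampler that adjusts the dimension of the vision embeddings to match the text embeddings.
- self.transform: a sequence of image transforms for preprocessing.
init_vision_module method: initializes the vision module using the Idefics2VisionTransformer class.
init_resampler and init_transform methods: initialize the resampler and the image transform sequence.
get_input_embeddings and set_input_embeddings methods: get and set the input embeddings of the model.
get_vllm_embedding method: builds the input embeddings for the vision-language model (the vllm in the name presumably stands for vision LLM). It processes the image data and combines the result with the text embeddings.
forward method: defines the forward pass; it takes the data and returns the output of the language model.
_convert_to_tensors method: converts the input text and images to tensors.
_process_list method: processes the input lists and converts them to the format the model needs.
_decode and _decode_stream methods: generate text output. _decode is the standard generation method and _decode_stream does streaming generation.
_decode_text method: converts the generated token IDs to readable text.
slice_image and get_slice_image_placeholder methods: slice an image and build the image placeholders.
reshape_by_patch method: reshapes the image tensor according to the patch size.
generate method: the main generation method; it takes text and image input and produces text output.
chat method: a chat-specific method; it takes an image and messages and produces a reply.

Overall, this code is an implementation of a multimodal model that combines text and image data for generation. It relies on deep learning libraries such as PyTorch together with custom classes and methods for the multimodal parts.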

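OK, let's look at how the MiniCPMV class is implemented in more detail:

Initializing the model components (__init__ method):
- The constructor first calls the base-class constructor and passes the config.
- Initializes the language model (self.llm), usually a pretrained model for text generation.
- Initializes the vision module (self.vpm) for image data, usually a Vision Transformer.
- Records the embedding dimension of the vision module (self.vision_dim) and of the language model (self.embed_dim).
- Initializes the resampler (self.resampler), which adjusts the dimension of the features from the two modalities so they can be fused.
- Initializes the image transform sequence (self.transform) for preprocessing such as normalization.
Vision module setup (init_vision_module method):
- Creates the vision transformer instance from the config.
- If the config asks for it, removes the last layer of the vision module, possibly to reduce overfitting or to adapt to a particular dataset.
Resampler setup (init_resampler method):
- Creates a Resampler instance that adjusts the vision features so they can be fused with the text features.
- Its arguments include the number of queries, the embedding dimension, the number of heads, the key-value dimension, and whether to use the adaptive variant.
Image preprocessing setup (init_transform method):
- Uses transforms.Compose to chain image transforms such as ToTensor and Normalize.
- The normalization uses the mean and standard deviation of the ImageNet dataset.
Getting and setting the input embeddings (get_input_embeddings and set_input_embeddings methods):
- Provide an interface to get the current input embedding layer of the language model.
- Let the user replace or update the input embedding layer.
Multimodal embeddings (get_vllm_embedding method):
- The core method: it processes text and image data and builds the input embeddings for the vision-language model.
- Preprocesses the image data, including slicing, transforming, and reshaping.
- Runs the vision module on the images and adjusts the dimension with the resampler.
- Converts the text IDs to embeddings and fuses them with the vision embeddings.
Forward pass (forward method):
- Defines the forward logic of the model.
- Takes the processed text and vision embeddings together with the position IDs.
- Passes them to the language model for text generation.
Converting text and images to tensors (_convert_to_tensors method):
- Converts the text IDs and image data to PyTorch tensors so the model can process them.
Processing input lists (_process_list method):
- Batches the lists of text IDs and images, including padding so that all sequences have the same length.
Text generation (_decode and _decode_stream methods):
- _decode generates the text output and supports several stopping conditions.
- _decode_stream generates in a streaming fashion and emits text step by step, which suits real-time use.
Text decoding (_decode_text method):
- Converts the token IDs produced by the model into a readable text string.
Image slicing and reshaping (slice_image and reshape_by_patch methods):
- slice_image splits the image into several patches that the vision module can handle.
- reshape_by_patch reshapes the image patches into the format the model expects.
Generating text (generate method):
- Combines the methods above: it takes text and image input, runs the generation logic, and returns the generated text.
Chat (chat method):
- A method designed for chatbot use.
- Takes user and assistant messages, including text and images, and generates a reply.
Inference mode:
- The text generation methods use torch.inference_mode() so that the model runs in inference mode with gradient tracking turned off, which improves efficiency.
Error handling and assertions:
- assert statements check that the input is valid, for example that the input list is not empty and that the role labels are correct.

This walk-through covers the key steps of the MiniCPMV class, from model configuration and data preprocessing through multimodal feature fusion to text generation and the chat interface.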
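
The fusion step described for get_vllm_embedding, where vision embeddings are combined with text embeddings, can be pictured as overwriting the placeholder positions of the text sequence. The sketch below is only an illustration: the function merge_vision_embeddings, the image_bounds format of (start, end) ranges, and the use of nested lists are all assumptions made here, not names or data layouts confirmed by the post.

```python
def merge_vision_embeddings(text_embeds, vision_embeds, image_bounds):
    """Overwrite placeholder rows of text_embeds with rows of vision_embeds.

    image_bounds is a list of (start, end) index ranges, end exclusive;
    this format is a hypothetical choice made for the sketch.
    """
    merged = [row[:] for row in text_embeds]   # copy so the input stays untouched
    rows = iter(vision_embeds)
    for start, end in image_bounds:
        for pos in range(start, end):
            merged[pos] = list(next(rows))     # vision rows must already match the text width
    return merged

text = [[0.0, 0.0], [9.0, 9.0], [9.0, 9.0], [1.0, 1.0]]   # rows 1 and 2 are image placeholders
vision = [[5.0, 5.0], [6.0, 6.0]]                          # output of the resampler, same width
merged = merge_vision_embeddings(text, vision, image_bounds=[(1, 3)])
print(merged)
```

The sketch also shows why the resampler matters: the vision rows can only be dropped into the text sequence once they have the same width as the text embeddings.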

In the end we still have to go back to the llama source code,
which in the transformers framework is at src/transformers/models/llama/modeling_llama.py
