A Code-Level Deep Dive into the Qwen2-VL Model Architecture

Latest recommended article published on 2025-05-01 09:00:00

Python_金钱豹

Article tags: deep learning, artificial intelligence, computer vision, transformer, python, neo4j, robotics

Article link: https://blog.csdn.net/Python_cocola/article/details/142184940

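Let's get started. This post walks through a few key details of the overall Qwen2-VL architecture, so that you can quickly see how it differs from other network architectures.

1. Key components of an MLLM

Every MLLM consists of the following components; models differ in how each one is implemented:

1. chat_template

Converts the user's input into the standard format the model expects, for example Qwen's ChatML format.

2. image processor

Preprocesses input images into the format the model expects. For example, the patches that LLaVA needs are cut in this step.

3. processor

Uses the image processor to handle images.
Uses the tokenizer to handle the prompt.
May reserve positions (placeholders) for images in the prompt ahead of time, as MiniCPM does.

4. model

vision_model: encodes the preprocessed image into vision embeddings.
scatter: inserts the vision embeddings into the text embeddings. Both llava onevision and MiniCPM use the scatter approach.
llm: models the merged sequence with a large language model.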
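The scatter step can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of any particular model: the function name `scatter_vision_embeddings`, the token id `99`, and the tiny embedding size are all made up for the example.

```python
import numpy as np

def scatter_vision_embeddings(input_ids, text_embeds, vision_embeds, image_token_id):
    """Overwrite the rows of text_embeds at placeholder positions with vision_embeds."""
    input_ids = np.asarray(input_ids)
    mask = input_ids == image_token_id  # True where the prompt holds an image placeholder
    if mask.sum() != len(vision_embeds):
        raise ValueError("placeholder count and vision embedding count differ")
    merged = text_embeds.copy()
    merged[mask] = vision_embeds  # rows are filled in sequence order
    return merged

IMAGE_TOKEN_ID = 99  # hypothetical placeholder id, for illustration only
input_ids = [1, 99, 99, 99, 7, 8]
text_embeds = np.zeros((6, 4))
vision_embeds = np.arange(12, dtype=float).reshape(3, 4) + 1.0
merged = scatter_vision_embeddings(input_ids, text_embeds, vision_embeds, IMAGE_TOKEN_ID)
print(merged)
```

The key requirement is that the processor has already reserved exactly as many placeholder tokens as the vision model will produce embeddings, which is why the placeholder handling in the processor matters.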

2. How to quickly support a new architecture

In real inference and training code, all of the components above are usually loaded through an AutoX.from_pretrained method (below, AutoX is shorthand for any of the Auto classes):

config

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained(model_path)

model

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=dtype,
        **model_args,
    )

tokenizer

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        model_max_length=model_max_length,
        padding_side=padding_side,
    )

processor

    import json
    from pathlib import Path

    from transformers import AutoProcessor

    # Loading the processor also loads the image processor and the tokenizer
    processor = AutoProcessor.from_pretrained(model_path)
    template_path = Path(model_path) / 'chat_template.json'
    if template_path.exists():
        with open(template_path) as f:
            template = json.load(f)
        tokenizer.chat_template = template['chat_template']
    processor.tokenizer = tokenizer

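When a new architecture comes out, you may need to upgrade transformers to get official support for it. However:

Official support may lag behind (for example, as of 2024-09-01 llava-onevision had not yet been merged into the master branch).
Support may be merged into master but not yet released (for example, as of 09-01 qwen2-vl had been merged into master but not released).
Some models, such as MiniCPM, ship their code together with the weights (remote_code) and are not merged into transformers.

If you want to try the model early without upgrading transformers, you can make AutoX recognize the architecture as follows: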

2.1 Add the model code

First, create a transformers-like directory layout inside your own project for the architecture you want to support, and put all of the qwen2vl code in it:

    - models
        - qwen2vl
            - __init__.py
            - configuration_qwen2_vl.py
            - formatter.py
            - image_processing_qwen2_vl.py
            - modeling_qwen2_vl.py
            - processing_qwen2_vl.py

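2.2 Fix the relative imports

Next, rewrite the relative imports in the copied code so that it runs outside the transformers package. For example, the original line:

    from ...image_processing_utils import BaseImageProcessor, BatchFeature

becomes:

    from transformers.image_processing_utils import BaseImageProcessor, BatchFeature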

2.3 Register the architecture

Finally, add the following to __init__.py so that the various AutoX classes recognize the new architecture:

    from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, AutoProcessor, AutoImageProcessor
    from transformers.models.qwen2 import Qwen2TokenizerFast
    from transformers.models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES

    from .processing_qwen2_vl import Qwen2VLProcessor
    from .image_processing_qwen2_vl import Qwen2VLImageProcessor
    from .configuration_qwen2_vl import Qwen2VLConfig
    from .modeling_qwen2_vl import Qwen2VLForConditionalGeneration

    # Register the processor
    Qwen2VLProcessor.register_for_auto_class('AutoProcessor')
    # Register the image processor
    Qwen2VLImageProcessor.register_for_auto_class('AutoImageProcessor')

    # Register the processor again; this is probably close to the calls above, I simply did both
    AutoProcessor.register(Qwen2VLConfig, Qwen2VLProcessor)
    AutoImageProcessor.register(Qwen2VLConfig, Qwen2VLImageProcessor)

    # Register the config
    AutoConfig.register('qwen2_vl', config=Qwen2VLConfig)
    # Register the CausalLM
    AutoModelForCausalLM.register(Qwen2VLConfig, model_class=Qwen2VLForConditionalGeneration)
    # Register the tokenizer
    AutoTokenizer.register(Qwen2VLConfig, fast_tokenizer_class=Qwen2TokenizerFast)
    # Add the model to MODEL_FOR_CAUSAL_LM_MAPPING_NAMES; this variable may be needed if you use the label smoother
    MODEL_FOR_CAUSAL_LM_MAPPING_NAMES['qwen2_vl'] = "Qwen2VLForConditionalGeneration"

2.4 One small detail

You may still hit an error after registering every component as described above. The reason is that when AutoProcessor creates a processor, the following method (defined on ProcessorMixin, the processor base class) is called:

    class ProcessorMixin:
        @classmethod
        def _get_arguments_from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
            import transformers
            args = []
            for attribute_name in cls.attributes:
                class_name = getattr(cls, f"{attribute_name}_class")
                if isinstance(class_name, tuple):
                    classes = tuple(getattr(transformers, n) if n is not None else None for n in class_name)
                    use_fast = kwargs.get("use_fast", True)
                    if use_fast and classes[1] is not None:
                        attribute_class = classes[1]
                    else:
                        attribute_class = classes[0]
                else:
                    # Note: when the processor builds its image processor, it looks up
                    # the preset image processor class by name on the transformers package
                    attribute_class = getattr(transformers, class_name)
                args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
            return args

Put simply, every processor declares which image processor class and which tokenizer class it uses. The original qwen2vl code says:

    class Qwen2VLProcessor(ProcessorMixin):
        attributes = ["image_processor", "tokenizer"]
        valid_kwargs = ["chat_template"]
        image_processor_class = "Qwen2VLImageProcessor"
        tokenizer_class = ("Qwen2Tokenizer", "Qwen2TokenizerFast")

As you can see, qwen2vl assumes that calling AutoProcessor.from_pretrained will in turn call Qwen2VLImageProcessor.from_pretrained to build its own image_processor. But when _get_arguments_from_pretrained runs in our setup, Qwen2VLImageProcessor cannot be found on the installed transformers package, i.e. the following import fails, so an error may be raised:

    from transformers import Qwen2VLImageProcessor

The fix is simple. Change:

    class Qwen2VLProcessor(ProcessorMixin):
        ...
        image_processor_class = "Qwen2VLImageProcessor"

to:

    class Qwen2VLProcessor(ProcessorMixin):
        ...
        image_processor_class = "AutoImageProcessor"

That's all.

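2.5 Make the registration take effect

Finally, import the package so that the registration code runs (the package path follows the models/qwen2vl directory layout from 2.1):

    from models.qwen2vl import *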

3. Some details of Qwen2VL

Next, a quick tour of Qwen2VL's details. Following the four MLLM components listed in Section 1, I will briefly describe what Qwen2VL changes in each of them.

3.1 The chat_template used by Qwen2VL

The template is shown here with line breaks and indentation added for readability:

    {% set image_count = namespace(value=0) %}
    {% set video_count = namespace(value=0) %}
    {% for message in messages %}
        {% if loop.first and message['role'] != 'system' %}
            <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n
        {% endif %}
        <|im_start|>{{ message['role'] }}\n
        {% if message['content'] is string %}
            {{ message['content'] }}<|im_end|>\n
        {% else %}
            {% for content in message['content'] %}
                {% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}
                    {% set image_count.value = image_count.value + 1 %}
                    {% if add_vision_id %}
                        Picture {{ image_count.value }}: 
                    {% endif %}
                    <|vision_start|><|image_pad|><|vision_end|>
                {% elif content['type'] == 'video' or 'video' in content %}
                    {% set video_count.value = video_count.value + 1 %}
                    {% if add_vision_id %}
                        Video {{ video_count.value }}: 
                    {% endif %}
                    <|vision_start|><|video_pad|><|vision_end|>
                {% elif 'text' in content %}
                    {{ content['text'] }}
                {% endif %}
            {% endfor %}
            <|im_end|>\n
        {% endif %}
    {% endfor %}
    {% if add_generation_prompt %}
        <|im_start|>assistant\n
    {% endif %}

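For example, suppose we have the following question:

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ]

As the chat_template shows, qwen2vl encodes each image as <|vision_start|><|image_pad|><|vision_end|>.

If add_vision_id is set, a label such as 'Picture 1: ' is also prepended. (This option is not actually used.)

After the chat_template is applied, the conversation above becomes:

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    print(prompt)

    <|im_start|>system
    You are a helpful assistant.<|im_end|>
    <|im_start|>user
    <|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>
    <|im_start|>assistant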
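To make the template logic easier to follow, here is a plain-Python re-implementation of the same branches. It is only a reading aid under the assumption that the template behaves as laid out above; `render_chatml` is a hypothetical helper name, and real code should use the Jinja template through the processor.

```python
def render_chatml(messages, add_generation_prompt=True, add_vision_id=False):
    """Mirror the branches of the Qwen2VL chat template in plain Python."""
    parts = []
    image_count = 0
    video_count = 0
    for i, message in enumerate(messages):
        # A default system prompt is injected when the conversation has none
        if i == 0 and message["role"] != "system":
            parts.append("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n")
        parts.append(f"<|im_start|>{message['role']}\n")
        content = message["content"]
        if isinstance(content, str):
            parts.append(content)
        else:
            for item in content:
                if item.get("type") == "image" or "image" in item or "image_url" in item:
                    image_count += 1
                    if add_vision_id:
                        parts.append(f"Picture {image_count}: ")
                    parts.append("<|vision_start|><|image_pad|><|vision_end|>")
                elif item.get("type") == "video" or "video" in item:
                    video_count += 1
                    if add_vision_id:
                        parts.append(f"Video {video_count}: ")
                    parts.append("<|vision_start|><|video_pad|><|vision_end|>")
                elif "text" in item:
                    parts.append(item["text"])
        parts.append("<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
prompt = render_chatml(conversation)
print(prompt)
```

Note that at this stage each image is still a single <|image_pad|> token; it is expanded to the real number of image tokens later, in the processor.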

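3.2 How the image processor handles images

In this step qwen2vl preprocesses the input images.

Overall, qwen2vl does two things here:

Resize the image so that both h and w are integer multiples of the patch size, i.e. the image becomes (count_h * patch_size) x (count_w * patch_size).
Flatten the image into (count_h * count_w) small patches of size (patch_size x patch_size).
In other words, the transformer input is a sequence of shape T x h:
each small patch has resolution (patch_size x patch_size);
the sequence length is T = count_h * count_w.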

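3.2.1 Basic processing details

Let's look at the details.

To feed an image into the ViT, we need to resize it. The resize has two goals:

Make the height and width integer multiples of patch_size (qwen2vl uses a patch size of 14).
Keep the number of flattened patches within what the ViT can handle. An ordinary pretrained ViT can only handle fairly small images, but the ViT used by qwen2vl accepts very large ones (that is, it allows a large number of flattened patches).

In image_processing_qwen2_vl.py we can see the upper bound on the number of image patches that qwen2 allows:

    def smart_resize(
        height: int, width: int,
        factor: int = 28,  # why factor is 28 rather than 14 is explained later
        min_pixels: int = 56 * 56,
        max_pixels: int = 14 * 14 * (4 * 1280),  # why 4 * 1280 rather than 5120 is also explained later
    ):
        ...

So up to 4 * 1280 = 5120 patches are allowed here, which is already a very large number.

But the preprocessor config shipped with the officially released 7B model sets max_pixels even more aggressively:

preprocessor_config.json

    {
      "min_pixels": 3136,
      "max_pixels": 12845056
    }

Here 12845056 / 14 / 14 = 65536 = 4 * 16384 patches.

In other words, the model lets you feed a very large image into the ViT with almost no downscaling: it is only resized slightly and cut into patches.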
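The article only shows the signature of smart_resize. The sketch below fills in a body to illustrate the two goals of the resize; it is modeled on my understanding of the rounding logic in image_processing_qwen2_vl.py and should be treated as an assumption rather than a copy of the library code (the real function also validates its inputs).

```python
import math

def smart_resize(height, width, factor=28, min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Round both sides to multiples of `factor` while keeping the pixel count in range."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

h, w = smart_resize(1080, 1920)
num_patches = (h // 14) * (w // 14)      # patches seen by the ViT
num_llm_tokens = (h // 28) * (w // 28)   # tokens seen by the LLM after 2 x 2 merging
print(h, w, num_patches, num_llm_tokens)  # 728 1316 4888 1222
```

With the default max_pixels, a 1080 x 1920 frame is scaled down so that the ViT sees at most 5120 patches and the LLM at most 1280 image tokens.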

With these steps in mind, let's look at two special points in the image processor:

1. Why is there an extra "4" above?

2. How is the temporal dimension introduced?

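3.2.2 Introducing the temporal dimension

None of the operations above looks particularly complicated, so let's go one level deeper.

What if the input is not an image but a video? What do we need to do then?

In short, we follow the same approach. But when we build a patch, it is no longer a 3D patch (3 x patch_size x patch_size): the patch must also split the original data along the time (temporal) dimension.

    # self.temporal_patch_size = 2
    channel * self.temporal_patch_size * self.patch_size * self.patch_size

For a video, the number of patches sent to the ViT is:

    grid_t * grid_h * grid_w

So in the image processor you will find:

    # This is the key to understanding what the image processor does
    flatten_patches = patches.reshape(
        grid_t * grid_h * grid_w, channel * self.temporal_patch_size * self.patch_size * self.patch_size
    )

This raises a new problem. For the ViT to handle both videos and images, you want an image and a video to use the same patch shape:

    3 x 2 x 14 x 14

But an image has no information along the temporal dimension.

Qwen2VL therefore stacks every image into two frames: an image is treated as a small "video" with two identical frames, so that images and videos can share the same patch layout:

    if patches.shape[0] == 1:
        # Duplicate the image along the time axis, turning it into a two-frame clip
        patches = np.tile(patches, (self.temporal_patch_size, 1, 1, 1))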
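Putting the pieces of this subsection together, the following NumPy sketch shows how an image and a video end up with the same per-patch size of 3 x 2 x 14 x 14 = 1176 values. The wrapper function `flatten_patches` and the input sizes are chosen for illustration; the reshape and transpose follow the image processor code quoted in this article.

```python
import numpy as np

def flatten_patches(frames, patch_size=14, temporal_patch_size=2, merge_size=2):
    """frames: (num_frames, channel, height, width), already resized by smart_resize."""
    if frames.shape[0] == 1:
        # An image becomes a clip of two identical frames
        frames = np.tile(frames, (temporal_patch_size, 1, 1, 1))
    num_frames, channel, height, width = frames.shape
    grid_t = num_frames // temporal_patch_size
    grid_h, grid_w = height // patch_size, width // patch_size
    patches = frames.reshape(
        grid_t, temporal_patch_size, channel,
        grid_h // merge_size, merge_size, patch_size,
        grid_w // merge_size, merge_size, patch_size,
    )
    patches = patches.transpose(0, 3, 6, 4, 7, 2, 1, 5, 8)
    flat = patches.reshape(
        grid_t * grid_h * grid_w,
        channel * temporal_patch_size * patch_size * patch_size,
    )
    return flat, (grid_t, grid_h, grid_w)

rng = np.random.default_rng(0)
image = rng.random((1, 3, 56, 84))   # one frame
video = rng.random((4, 3, 56, 84))   # four frames
image_patches, image_grid = flatten_patches(image)
video_patches, video_grid = flatten_patches(video)
print(image_patches.shape, image_grid)  # (24, 1176) (1, 4, 6)
print(video_patches.shape, video_grid)  # (48, 1176) (2, 4, 6)
```

Each row of the result is one patch laid out as (channel, temporal, patch_h, patch_w), so for an image the two temporal slices of every patch are identical.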

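3.2.3 The extra 4

Why does smart_resize above use

    factor: int = 28; max_pixels: int = 14 * 14 * (4 * 1280) = 28 * 28 * 1280

The reason is that in the ViT designed for Qwen2vl, the features extracted by the ViT are not fed directly into the LLM.

Qwen2vl aggregates the extracted features so that every 4 adjacent patches are merged into a single feature (PatchMerger).

Suppose the resized image is split into the following patches:

    patch_1 , patch_2 , patch_3 , patch_4 ,
    patch_5 , patch_6 , patch_7 , patch_8 ,
    patch_9 , patch_10 , patch_11 , patch_12 ,
    patch_13 , patch_14 , patch_15 , patch_16

The model flattens these patches and then obtains their features through the ViT.

Note that when qwen2vl flattens the patches, it uses the following order (every 4 neighboring patches of a 2 x 2 block are placed together):

    feature = [
        f1 , f2 , f5 , f6 , f3 , f4 , f7 , f8 , f9 , f10 , f13 , f14 , f11 , f12 , f15 , f16
    ]

The patch merger then aggregates every 4 consecutive features with an MLP and projects them to the dimension the LLM accepts:

    # h1 = mlp(f1 , f2 , f5 , f6)
    # h2 = mlp(f3 , f4 , f7 , f8)
    # h3 = mlp(f9 , f10 , f13 , f14)
    # h4 = mlp(f11 , f12 , f15 , f16)
    feature = [h1 , h2 , h3 , h4]

As a result, the sequence that actually enters the LLM is 1/4 the length of the ViT's patch sequence. This is where the "4" in 4 * 1280 and the factor of 28 = 2 * 14 come from.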
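The merging step can be sketched as follows. This is a simplified stand-in for PatchMerger, not the library module: it assumes a two-layer MLP with a GELU in between, omits any normalization, and uses toy sizes and random weights. The names `merge_patches`, `vit_dim` and `llm_dim` are made up for the example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def merge_patches(features, w1, b1, w2, b2, merge_size=2):
    """Concatenate every merge_size**2 consecutive patch features and project them."""
    group = merge_size ** 2
    num_patches, dim = features.shape
    if num_patches % group != 0:
        raise ValueError("number of patches must be a multiple of merge_size ** 2")
    # Because of the special flatten order, consecutive rows belong to one 2 x 2 block
    grouped = features.reshape(num_patches // group, group * dim)
    return gelu(grouped @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
vit_dim, llm_dim = 8, 12              # toy sizes for illustration
features = rng.normal(size=(16, vit_dim))
w1 = rng.normal(size=(4 * vit_dim, 4 * vit_dim))
b1 = np.zeros(4 * vit_dim)
w2 = rng.normal(size=(4 * vit_dim, llm_dim))
b2 = np.zeros(llm_dim)
merged = merge_patches(features, w1, b1, w2, b2)
print(merged.shape)  # (4, 12)
```

The reshape only works as a spatial merge because the image processor has already placed the four patches of each 2 x 2 block next to each other.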

I am not making this ordering up. Let's run an example through the original preprocessing code:

    >>> import numpy as np
    >>> from easydict import EasyDict
    >>> self = EasyDict()
    >>> self.temporal_patch_size = 1
    >>> self.merge_size = 2
    >>> self.patch_size = 1
    >>> channel = 1
    >>> grid_h, grid_w = 6 // self.patch_size, 6 // self.patch_size
    >>> grid_t = 1
    >>> patches = np.array(range(36)).reshape(6, 6)

patches looks like this:

    >>> patches
    array([[ 0,  1,  2,  3,  4,  5],
           [ 6,  7,  8,  9, 10, 11],
           [12, 13, 14, 15, 16, 17],
           [18, 19, 20, 21, 22, 23],
           [24, 25, 26, 27, 28, 29],
           [30, 31, 32, 33, 34, 35]])

After the following processing:

    patches = patches.reshape(
        grid_t,
        self.temporal_patch_size,
        channel,
        grid_h // self.merge_size,
        self.merge_size,
        self.patch_size,
        grid_w // self.merge_size,
        self.merge_size,
        self.patch_size,
    )
    patches = patches.transpose(0, 3, 6, 4, 7, 2, 1, 5, 8)
    flatten_patches = patches.reshape(
        grid_t * grid_h * grid_w, channel * self.temporal_patch_size * self.patch_size * self.patch_size
    )

it becomes this (note the order):

    >>> flatten_patches.T
    array([[ 0,  1,  6,  7,  2,  3,  8,  9,  4,  5, 10, 11, 12, 13, 18, 19,
            14, 15, 20, 21, 16, 17, 22, 23, 24, 25, 30, 31, 26, 27, 32, 33,
            28, 29, 34, 35]], dtype=int64)

3.3 How the processor handles the input

Here Qwen2VL takes an approach similar to MiniCPM: the positions that image tokens will fill are reserved in the inputs ahead of time. (The latest llava onevision implementation uses the same style, so it looks likely to be widely adopted.)

    merge_length = self.image_processor.merge_size**2
    index = 0
    for i in range(len(text)):
        # Expand each <|image_pad|> into one token per merged patch; the temporary
        # <|placeholder|> keeps the while loop from matching the expanded tokens again
        while "<|image_pad|>" in text[i]:
            text[i] = text[i].replace(
                "<|image_pad|>", "<|placeholder|>" * (image_grid_thw[index].prod() // merge_length), 1
            )
            index += 1
        text[i] = text[i].replace("<|placeholder|>", "<|image_pad|>")

    text_inputs = self.tokenizer(
        text, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
    )
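A standalone version of the placeholder expansion makes the arithmetic concrete. This sketch handles a single prompt string and uses plain tuples for the grid sizes; `expand_image_placeholders` is a hypothetical helper name, not a library function.

```python
import math

def expand_image_placeholders(text, image_grid_thw, merge_size=2, pad_token="<|image_pad|>"):
    """Replace the i-th pad token with (t * h * w) // merge_size**2 copies of itself."""
    merge_length = merge_size ** 2
    parts = text.split(pad_token)
    if len(parts) - 1 != len(image_grid_thw):
        raise ValueError("number of pad tokens and number of images differ")
    out = [parts[0]]
    for grid, tail in zip(image_grid_thw, parts[1:]):
        out.append(pad_token * (math.prod(grid) // merge_length))
        out.append(tail)
    return "".join(out)

prompt = "<|vision_start|><|image_pad|><|vision_end|>Describe this image."
# A 56 x 84 image gives a (t, h, w) grid of (1, 4, 6): 24 patches, 6 LLM tokens
expanded = expand_image_placeholders(prompt, [(1, 4, 6)])
print(expanded.count("<|image_pad|>"))  # 6
```

After this step the token sequence has exactly one <|image_pad|> per merged vision feature, so the scatter step in the model can fill them one to one.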

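3.4 modeling_qwen2vl

The steps above have prepared the model inputs. Next, the VL model extracts features from them.

Compared with other models, Qwen2VL has two special design choices in its network structure:

The ViT also uses rotary (relative) position encoding; see VisionRotaryEmbedding.
The LLM replaces the traditional rope with multimodal_rope; see apply_multimodal_rotary_pos_emb.

We will skip the ViT changes for now and look at the LLM changes first, then come back to the ViT.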

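3.4.1 A quick recap of RoPE

When using RoPE, we actually do the following:

Step 1: assign a position id to every position.

    position_ids = [0, 1, 2, ..., n]

Step 2: for the vector at each position m, build a transformation matrix that transforms the embedding at position m.

Step 3: simplify this transformation into the following operation:

[Formula image missing from the source]

That is, for each position we prepare two matrices (cos and sin) and use them to transform the original embedding vector:

[Formula image missing from the source]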
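For reference, the standard RoPE formulation in the rotate_half convention (the one used by the Hugging Face-style code quoted later in this article) can be written as below. The notation, including the base $b$ and the symbol names, is my own choice and not taken from the original figures.

```latex
\theta_i = b^{-2i/d}, \qquad i = 0, \dots, \tfrac{d}{2} - 1, \qquad b = 10000

\cos_m = \bigl(\cos m\theta_0, \dots, \cos m\theta_{d/2-1},\ \cos m\theta_0, \dots, \cos m\theta_{d/2-1}\bigr)

\sin_m = \bigl(\sin m\theta_0, \dots, \sin m\theta_{d/2-1},\ \sin m\theta_0, \dots, \sin m\theta_{d/2-1}\bigr)

\operatorname{rotate\_half}(x) = \bigl(-x_{d/2}, \dots, -x_{d-1},\ x_0, \dots, x_{d/2-1}\bigr)

\operatorname{RoPE}(x, m) = x \odot \cos_m + \operatorname{rotate\_half}(x) \odot \sin_m
```

Each pair $(x_i, x_{i+d/2})$ is rotated by the angle $m\theta_i$, which is why two vectors, $\cos_m$ and $\sin_m$, are enough to apply the full rotation matrix.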

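I won't go into the motivation here. In short, RoPE is designed so that when the processed vectors at any two positions are dotted together, the result contains the difference between their rotation angles:

[Formula image missing from the source]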
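The relative-position property is easy to check numerically. The sketch below implements plain 1D RoPE in NumPy using the rotate_half convention; the helper names and the base of 10000 are assumptions made for the illustration.

```python
import numpy as np

def rotate_half(x):
    half = x.shape[-1] // 2
    return np.concatenate([-x[..., half:], x[..., :half]], axis=-1)

def rope_cos_sin(position, dim, base=10000.0):
    # One frequency per pair of channels, duplicated to cover all `dim` channels
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = position * inv_freq
    angles = np.concatenate([angles, angles])
    return np.cos(angles), np.sin(angles)

def apply_rope(x, position):
    cos, sin = rope_cos_sin(position, x.shape[-1])
    return x * cos + rotate_half(x) * sin

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
score_a = apply_rope(q, 3) @ apply_rope(k, 7)      # positions 3 and 7
score_b = apply_rope(q, 103) @ apply_rope(k, 107)  # both shifted by 100
print(np.isclose(score_a, score_b))  # True
```

Shifting both positions by the same offset leaves the attention score unchanged, so only the difference between the two position ids matters.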

3.4.2 How do other works use RoPE?

In earlier works, the vectors produced by the ViT are inserted directly into the original text sequence, and position_ids are then assigned to every element as usual.

For example, given the following image patch vectors and a piece of text, position ids are assigned to each patch like this:

    patches = [
        [patch_1 , patch_2 , patch_3],
        [patch_4 , patch_5 , patch_6],
    ]
    input = '<image> 描述这张图片'

    # step 1: flatten the patch features and splice them into the text
    input = [embed_p1 , embed_p2 , embed_p3 , embed_p4 , embed_p5 , embed_p6 , embed_描述 , embed_这张 , embed_图片]

    # step 2: assign position_ids
    position_ids = [0, 1, 2, 3, 4, 5, 6, 7, 8]

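3.4.3 The problem with this approach

In the scheme above, patch1 through patch6 are assigned position_ids 0-5. This assignment has a problem:

Along the y axis, (patch1, patch4) are vertically adjacent, yet their ids are further apart (3) than those of (patch1, patch3) (2).
Along the x axis, (patch1, patch5) are closer than (patch3, patch4), yet their ids are further apart (4 versus 1).

In other words, assigning position ids to the flattened patches loses the relative positions of the patches in the 2D plane.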
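The mismatch can be made concrete with a few lines of Python for the 2 x 3 grid used above. The helper names are made up for this illustration.

```python
GRID_W = 3  # the example grid is 2 rows x 3 columns, patches numbered 1..6 row by row

def flat_distance(a, b):
    """Distance between two patches measured in flattened position ids."""
    return abs((a - 1) - (b - 1))

def grid_distance(a, b):
    """(row distance, column distance) between two patches in the 2D grid."""
    row_a, col_a = divmod(a - 1, GRID_W)
    row_b, col_b = divmod(b - 1, GRID_W)
    return abs(row_a - row_b), abs(col_a - col_b)

for pair in [(1, 4), (1, 3), (1, 5), (3, 4)]:
    print(pair, "flat:", flat_distance(*pair), "grid (dy, dx):", grid_distance(*pair))
```

patch_1 and patch_4 are direct vertical neighbors but sit 3 ids apart, while patch_3 and patch_4 lie at opposite ends of the grid yet are only 1 id apart.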

The fix is also simple: why not use several different assignment schemes at once and give every position multiple position_ids?

    # Option 1: position_ids describing the order in the sequence
    position_ids = [0, 1, 2, 3, 4, 5, 6, 7, 8]

    # Option 2: position_ids describing where each patch sits along the y axis
    # Each patch takes its y coordinate as its position id:
    patches = [
        [patch_1 (0), patch_2 (0), patch_3 (0)],
        [patch_4 (1), patch_5 (1), patch_6 (1)],
    ]
    # The text position_ids continue counting up from the image position ids
    position_ids = [0, 0, 0, 1, 1, 1, 2, 3, 4]

    # Option 3: position_ids describing where each patch sits along the x axis
    # Each patch takes its x coordinate as its position id:
    patches = [
        [patch_1 (0), patch_2 (1), patch_3 (2)],
        [patch_4 (0), patch_5 (1), patch_6 (2)],
    ]
    # The text position_ids continue counting up from the image position ids
    position_ids = [0, 1, 2, 0, 1, 2, 3, 4, 5]

Assigning x / y embeddings to each position like this shows up in many places:

LayoutLM, in the OCR field.
The DETR family, which adds an x position embedding and a y position embedding to every position.

Now, for every patch, you have 3 positional ids with different meanings:

    patches = [
        [patch_1 , patch_2 , patch_3],
        [patch_4 , patch_5 , patch_6],
    ]
    input = '<image> 描述这张图片'

    # step 1: flatten the patch features and splice them into the text
    input = [embed_p1 , embed_p2 , embed_p3 , embed_p4 , embed_p5 , embed_p6 , embed_描述 , embed_这张 , embed_图片]

    position_ids   = [0, 1, 2, 3, 4, 5, 6, 7, 8]
    position_ids_y = [0, 0, 0, 1, 1, 1, 2, 3, 4]
    position_ids_x = [0, 1, 2, 0, 1, 2, 3, 4, 5]

Note, however, that the actual Qwen2VL implementation has one more set of position_ids (recall the temporal dimension mentioned earlier). The official figure describes how these position ids are built:

[Figure: official Qwen2-VL illustration of the position ids, with a zoomed-in detail; image missing from the source]

Note that Qwen2VL ends up using only the position_ids along the width / height / time axes; the plain sequential position_ids are not used.

3.4.4 How do you apply rope with three position_ids?

It's simple. As described above, RoPE essentially builds two vectors for each position id (a cos vector and a sin vector) and multiplies them element-wise with the original embedding.

Now every position has three position_ids, and the goal is still to build a single cos vector and a single sin vector to apply to the original vector.

Let's look at a concrete implementation; the way it is done is quite clever.

The positional ids are built in:

    Qwen2VLForConditionalGeneration.get_rope_index

The function's docstring gives a detailed example:

    input_ids: [V V V V V V V V V V V V T T T T T], here V is for vision.
    vision temporal position_ids: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
    vision height position_ids: [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
    vision width position_ids: [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
    text temporal position_ids: [3, 4, 5, 6, 7]
    text height position_ids: [3, 4, 5, 6, 7]
    text width position_ids: [3, 4, 5, 6, 7]
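The docstring example can be reproduced with a small pure-Python function. This is a simplified sketch for a single vision block followed by text, not the library's get_rope_index (which also handles leading text, several images and batching); `mrope_position_ids` is a hypothetical name, and the grid sizes are the ones after 2 x 2 merging.

```python
def mrope_position_ids(grid_t, grid_h, grid_w, num_text_tokens, start=0):
    """Return [temporal_ids, height_ids, width_ids] for vision tokens followed by text."""
    t_ids, h_ids, w_ids = [], [], []
    for t in range(grid_t):
        for h in range(grid_h):
            for w in range(grid_w):
                t_ids.append(start + t)
                h_ids.append(start + h)
                w_ids.append(start + w)
    # Text tokens share one id on all three axes, starting after the largest vision id
    text_start = max(t_ids + h_ids + w_ids) + 1 if t_ids else start
    text_ids = list(range(text_start, text_start + num_text_tokens))
    return [t_ids + text_ids, h_ids + text_ids, w_ids + text_ids]

temporal, height, width = mrope_position_ids(grid_t=3, grid_h=2, grid_w=2, num_text_tokens=5)
print(temporal)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 4, 5, 6, 7]
print(height)    # [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 3, 4, 5, 6, 7]
print(width)     # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 3, 4, 5, 6, 7]
```

For pure text the three rows are identical, so multimodal rope reduces to ordinary 1D rope.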

In Qwen2VL, the rope vectors are actually produced and applied in:

    # Qwen2VLAttention.forward
    # Step 1: produce one cos vector and one sin vector for every position
    # cos.shape : T x 128
    # sin.shape : T x 128
    # q, k, v shape : 1 x 28 (heads) x T x 128
    cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
    # Step 2: apply the cos / sin embeddings
    # position_ids.shape : 3 x 1 x T
    query_states, key_states = apply_multimodal_rotary_pos_emb(
        query_states, key_states, cos, sin, position_ids, self.rope_scaling["mrope_section"]
    )

Inside apply_multimodal_rotary_pos_emb:

    def apply_multimodal_rotary_pos_emb(q, k, cos, sin, position_ids, mrope_section, unsqueeze_dim=1):
        # cos : T x 128 -> 3 x 1 x T x 128
        cos = cos[position_ids]
        # sin : T x 128 -> 3 x 1 x T x 128
        sin = sin[position_ids]

        # mrope_section : [16, 24, 24] -> [16, 24, 24, 16, 24, 24]
        # mrope_section is a configured constant; after doubling, its entries sum to 128:
        # Multimodal rope section is for channel dimension of temporal, height and width in rope calculation.
        mrope_section = mrope_section * 2

        # This operation is explained in detail below
        cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
            unsqueeze_dim
        )
        sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
            unsqueeze_dim
        )
        q_embed = (q * cos) + (rotate_half(q) * sin)
        k_embed = (k * cos) + (rotate_half(k) * sin)
        return q_embed, k_embed

The core of it is the following two statements. Let's use cos as the example:

    cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
        unsqueeze_dim
    )
    sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
        unsqueeze_dim
    )

Let's go through the operation step by step (here T = 4242):

    1. cos.split(mrope_section, dim=-1)
       # gives
       # [torch.Size([3, 1, 4242, 16]),
       #  torch.Size([3, 1, 4242, 24]),
       #  torch.Size([3, 1, 4242, 24]),
       #  torch.Size([3, 1, 4242, 16]),
       #  torch.Size([3, 1, 4242, 24]),
       #  torch.Size([3, 1, 4242, 24])]

    2. [m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))]
       # gives
       # [torch.Size([1, 4242, 16]),
       #  torch.Size([1, 4242, 24]),
       #  torch.Size([1, 4242, 24]),
       #  torch.Size([1, 4242, 16]),
       #  torch.Size([1, 4242, 24]),
       #  torch.Size([1, 4242, 24])]

    3. torch.cat produces the final vector
       # torch.Size([1, 4242, 128])

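What does this mean?

You can think of it this way: every position in the sequence has three different position_ids, so it corresponds to three different cos vectors (128-dimensional each):

    cos_t = [a0 , a1 , a2 , ... a127]
    cos_h = [b0 , b1 , b2 , ... b127]
    cos_w = [c0 , c1 , c2 , ... c127]

The final cos vector is assembled by the following rule:

    mrope_section = [16, 24, 24, 16, 24, 24]
    For the 128 slots of the resulting vector:
    slots 0-15    come from the temporal cos vector
    slots 16-39   come from the height cos vector
    slots 40-63   come from the width cos vector
    slots 64-79   come from the temporal cos vector
    slots 80-103  come from the height cos vector
    slots 104-127 come from the width cos vector

That is:

    final_cos = [a0 , ... a15 , b16 , ... b39 , c40 , ... c63 , a64 , ... a79 , b80 , ... b103 , c104 , ... c127]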
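The section selection can be reproduced in NumPy to verify which axis each of the 128 slots comes from. In this sketch the three cos vectors are filled with the constants 0, 1 and 2 (temporal, height, width) so that the source of every slot is visible; `select_mrope_sections` is a made-up name and the batch dimension is dropped for brevity.

```python
import numpy as np

def select_mrope_sections(cos, mrope_section):
    """cos: (3, T, D) holding the temporal / height / width cos vectors -> (T, D)."""
    sections = list(mrope_section) * 2            # [16, 24, 24] -> [16, 24, 24, 16, 24, 24]
    split_points = np.cumsum(sections)[:-1]       # [16, 40, 64, 80, 104]
    chunks = np.split(cos, split_points, axis=-1)
    # Chunk i is taken from axis i % 3: temporal, height, width, temporal, ...
    return np.concatenate([chunk[i % 3] for i, chunk in enumerate(chunks)], axis=-1)

T, D = 5, 128
cos = np.stack([np.full((T, D), axis_id, dtype=float) for axis_id in range(3)])
mixed = select_mrope_sections(cos, [16, 24, 24])
print(mixed.shape)  # (5, 128)
print(mixed[0, [0, 15, 16, 39, 40, 63, 64, 79, 80, 103, 104, 127]])
```

The second print shows the axis id at each section boundary, matching the slot ranges listed above.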

3.5 Changes to the ViT

In the discussion above, we saw that the model orders the patches in a special way:

    patch_1 , patch_2 , patch_3 , patch_4 ,
    patch_5 , patch_6 , patch_7 , patch_8 ,
    patch_9 , patch_10 , patch_11 , patch_12 ,
    patch_13 , patch_14 , patch_15 , patch_16

    ->

    patch_1 , patch_2 , patch_5 , patch_6 , patch_3 , patch_4 , patch_7 , patch_8 , patch_9 , patch_10 , patch_13 , patch_14 , patch_11 , patch_12 , patch_15 , patch_16

Clearly, to support this special patch order, the ViT can no longer use the traditional, simple way of assigning position ids.

Following the 3D positional ids used in the LLM stage, the ViT stage also assigns two sets of position ids to every patch:

    The ViT only considers position ids along the x / y dimensions.
    See Qwen2VisionTransformerPretrainedModel.rot_pos_emb

    patches:

    patch_1 , patch_2 , patch_3 , patch_4 ,
    patch_5 , patch_6 , patch_7 , patch_8 ,
    patch_9 , patch_10 , patch_11 , patch_12 ,
    patch_13 , patch_14 , patch_15 , patch_16

    Assign every position its coordinates along h / w:

    [
    patch_1(0,0) , patch_2(0,1) , patch_3(0,2) , patch_4(0,3) ,
    patch_5(1,0) , patch_6(1,1) , patch_7(1,2) , patch_8(1,3) ,
    patch_9(2,0) , patch_10(2,1) , patch_11(2,2) , patch_12(2,3) ,
    patch_13(3,0) , patch_14(3,1) , patch_15(3,2) , patch_16(3,3)
    ]

This gives the h_position_ids and w_position_ids. The outputs below are for a 6 x 6 grid, like the earlier flatten example:

    hpos_ids
    Out[13]:
    tensor([0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 2, 2, 3, 3,
            4, 4, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5])

    wpos_ids
    Out[14]:
    tensor([0, 1, 0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 0, 1, 0, 1, 2, 3, 2, 3, 4, 5, 4, 5,
            0, 1, 0, 1, 2, 3, 2, 3, 4, 5, 4, 5])

    For the flattened patch sequence, this yields the following 2D position sequence:
    torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1)

    pos_ids[0].T
    Out[16]:
    tensor([[0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 2, 2, 3, 3, 4, 4, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5],
            [0, 1, 0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 0, 1, 0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 0, 1, 0, 1, 2, 3, 2, 3, 4, 5, 4, 5]])
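The h / w position ids above can be generated with the same reshape-and-transpose trick that orders the patches. The NumPy sketch below is my own rendering of that idea, not the library's rot_pos_emb; `vision_pos_ids` is a hypothetical name.

```python
import numpy as np

def vision_pos_ids(grid_h, grid_w, merge_size=2):
    """Return a (grid_h * grid_w, 2) array of (h, w) ids in the merged flatten order."""
    h_ids = np.broadcast_to(np.arange(grid_h)[:, None], (grid_h, grid_w))
    w_ids = np.broadcast_to(np.arange(grid_w)[None, :], (grid_h, grid_w))

    def to_merge_order(ids):
        # Group the grid into merge_size x merge_size blocks and walk block by block
        ids = ids.reshape(grid_h // merge_size, merge_size, grid_w // merge_size, merge_size)
        return ids.transpose(0, 2, 1, 3).reshape(-1)

    return np.stack([to_merge_order(h_ids), to_merge_order(w_ids)], axis=-1)

pos_ids = vision_pos_ids(6, 6)
print(pos_ids[:, 0])  # h ids, starts with 0 0 1 1 0 0 1 1 ...
print(pos_ids[:, 1])  # w ids, starts with 0 1 0 1 2 3 2 3 ...
```

Because the ids travel with the patches, every patch keeps its true 2D coordinates even though the sequence order is no longer row by row.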

In practice, what rope does is produce a (cos, sin) pair of vectors for every position, which is then applied to the original vector.

In Qwen2VL the actual sequence of operations is:

    1. In the ViT, call rotary_pos_emb_full = self.rotary_pos_emb(max_grid_size)
       This returns a tensor of shape max_grid_size x 20

    2. For every patch, rotary_pos_emb = rotary_pos_emb_full[pos_ids].flatten(1)
       Each patch has two position ids, so each patch maps to a 40-dimensional vector

    3. VisionAttention.forward
       # calls apply_rotary_pos_emb_vision
       q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)

    4. apply_rotary_pos_emb_vision

    def apply_rotary_pos_emb_vision(tensor: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
        orig_dtype = tensor.dtype
        # tensor: batch_size x T x 16 (num_heads) x 80
        tensor = tensor.float()
        # rope angles: T x 40
        cos = freqs.cos()
        sin = freqs.sin()
        # The rope dim (40) does not match the tensor dim (80), so the vectors are repeated
        cos = cos.unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
        sin = sin.unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
        # This is the key step; it matches the RoPE formula
        output = (tensor * cos) + (rotate_half(tensor) * sin)
        output = output.to(orig_dtype)
        return output
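The four steps above can be traced end to end in NumPy. The sketch assumes the sizes mentioned in the comments (16 heads, head dimension 80, 20 frequencies per axis) and a base of 10000; the function name `vision_rope` and the small position list are made up for the example.

```python
import numpy as np

def rotate_half(x):
    half = x.shape[-1] // 2
    return np.concatenate([-x[..., half:], x[..., :half]], axis=-1)

def vision_rope(q, pos_ids, rotary_dim=40, theta=10000.0):
    """q: (T, num_heads, head_dim); pos_ids: (T, 2) holding the (h, w) id of each patch."""
    inv_freq = 1.0 / (theta ** (np.arange(0, rotary_dim, 2) / rotary_dim))  # 20 frequencies
    max_grid_size = int(pos_ids.max()) + 1
    table = np.outer(np.arange(max_grid_size), inv_freq)       # max_grid_size x 20
    freqs = table[pos_ids].reshape(len(pos_ids), -1)           # T x 40: h angles, then w angles
    cos = np.tile(np.cos(freqs)[:, None, :], (1, 1, 2))        # T x 1 x 80
    sin = np.tile(np.sin(freqs)[:, None, :], (1, 1, 2))
    return q * cos + rotate_half(q) * sin

rng = np.random.default_rng(0)
pos_ids = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # one 2 x 2 block of patches
q = rng.normal(size=(4, 16, 80))
q_rot = vision_rope(q, pos_ids)
print(q_rot.shape)  # (4, 16, 80)
```

Half of the rotated channel pairs encode the h coordinate and the other half the w coordinate, which is the 2D counterpart of the three-way split used in the LLM.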
