VLM Series: Phi-3.5-Vision Paper Walkthrough

TigerZ*

Last modified: 2024-08-29 10:25:40


First published: 2024-08-29 10:07:05

Article link: https://blog.csdn.net/u012863603/article/details/141671210


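I. Overview

1. What it is

The paper, "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone", describes a family of large language models (LLMs) and multimodal large language models (MLLMs). The LLMs are phi-3-mini (3.8B), phi-3-small (7B), and phi-3-medium (14B); phi-3-mini can easily run inference locally on a modern phone. The multimodal model is phi-3-vision (4.2B), built on phi-3-mini and CLIP ViT-L/14. This post focuses on the multimodal phi-3-vision model. It handles several data types, including text and images, and supports image captioning, single-image question answering, multi-image dialogue, video understanding dialogue, JSON-formatted output, high-resolution OCR parsing, and table understanding (the paper does not yet mention code writing and debugging or function calling). Because the base model was trained mainly on English, the paper notes, and my own tests confirm, that Chinese image recognition and dialogue are only mediocre.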

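2. Highlights

It is one of the smallest on-device multimodal models available today. (A 2.8B mini monkey model has also come out recently.)

In this release the model gains multi-frame image understanding and reasoning, added in response to customer feedback. Highlighted multi-frame uses include detailed image comparison, multi-image summarization/storytelling, and video summarization, all of which apply widely in office scenarios.

PS

Not recommended for Chinese OCR or Chinese dialogue: the results really are mediocre, while English works well. As usual, the paper also says nothing about how the post-training datasets were composed or cleaned, even though that matters a great deal.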

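II. Model

1. Model architecture

The model has four parts: an image encoder, a connector, a projector, and the large language model, 4.2B parameters in total.

Image encoder

It uses CLIP ViT-L/14, specifically openai/clip-vit-large-patch14-336.

As the figure above shows, an image is processed into two parts, the whole image and its slices:

*Whole image: the global view. The image is resized to a fixed size (336 × 336). This gives a macro-level understanding of the image, which in practice is essential for an LVLM to interpret the image correctly.

*Slices: the local view. Given a maximum number of partitions H, an image x of size [h, w] is resized and padded to a new image x' of size [ph × 336, pw × 336], where pw and ph are the number of patches per row and per column, respectively, with ph × pw ≤ H. x' is then split into ph × pw non-overlapping patches. Each patch is a 336 × 336 sub-image and is fed to the ViT as a separate input. "HD-H" denotes the high-resolution setting with a limit of H patches. For example, HD-9 allows at most 9 patches and covers a range of resolutions such as 1008×1008 (3×3), 672×1344 (2×4), and 336×3024 (1×9).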
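The resize-and-pad step above can be sketched as a small grid calculation. This is only an illustration under stated assumptions: the rule used here (scale the longer side as far as the patch budget allows while keeping the aspect ratio, then pad the shorter side up to a multiple of 336) is one plausible reading of the description, and `hd_grid` and `PATCH` are hypothetical names, not the model's actual preprocessing API.

```python
import math

PATCH = 336  # side length of one slice, as stated in the text


def hd_grid(height, width, max_crops):
    """Return (ph, pw): slices per column and slices per row.

    Assumption: keep the aspect ratio, grow the longer side to as many
    336-pixel slices as the budget allows, then pad the shorter side up
    to the next multiple of 336.
    """
    if height <= 0 or width <= 0 or max_crops < 1:
        raise ValueError("height, width and max_crops must be positive")
    long_side, short_side = max(height, width), min(height, width)
    ratio = long_side / short_side  # always >= 1
    long_crops = 1
    # Grow the long side while the resulting grid still fits the budget.
    while (long_crops + 1) * math.ceil((long_crops + 1) / ratio) <= max_crops:
        long_crops += 1
    short_pixels = int(long_crops * PATCH / ratio)
    short_crops = max(1, math.ceil(short_pixels / PATCH))  # padding rounds up
    if height > width:
        return long_crops, short_crops
    return short_crops, long_crops


for size in [(1008, 1008), (672, 1344), (336, 3024)]:
    ph, pw = hd_grid(*size, max_crops=9)
    print(size, "->", f"{ph}x{pw} slices, padded to {ph * PATCH}x{pw * PATCH}")
```

With a budget of 9 this reproduces the HD-9 examples from the text; with a budget of 16, a 1344×1344 image maps to a 4×4 grid.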

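Features are extracted from each patch produced by the dynamic image partitioning strategy described above. The per-patch feature maps are then reassembled into one large feature map, and after a simple token-merging step the map is flattened into the final local features.

Note how the ViT outputs are combined. An image has a 2D structure, and because the aspect ratio is dynamic, the number of tokens per row varies from image to image. This can confuse the LVLM, which then cannot tell which tokens belong to the same row and which start the next one. That in turn hampers its grasp of the 2D structure, which is essential for structured content such as documents, charts, and tables. To address this, a learnable newline token ('\n') is appended to the end of each row of image features before flattening. Finally, the global and local views are concatenated, with a special 'separate' token inserted between them to tell the two views apart.

Note also that, to reduce the number of tokens fed to the LLM, every 4 spatially adjacent ViT features (non-overlapping 2×2 blocks) are merged into one. For a 336×336 image and a ViT patch size of 14 (not to be confused with the image slices above), the raw ViT output is (336/14)² = 24×24 = 576 tokens of dimension 1024; after merging it becomes (24/2) × (24/2) = 144 tokens of dimension 1024×2×2 = 4096.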
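The token arithmetic can be checked with a few lines. The sketch below follows the shapes documented in the source excerpt in this post (12×12 merged tokens per slice, one newline token per row, one separator, and a global view with its own newlines); `image_token_count` is a hypothetical helper name, not part of the model code.

```python
def image_token_count(ph, pw, vit_grid=24, merge=2):
    """Tokens one image contributes to the LLM input.

    ph, pw: slices per column / per row in the local view.
    vit_grid: ViT tokens per slice side (336 / 14 = 24).
    merge: side of the non-overlapping block merged into one token.
    """
    side = vit_grid // merge                      # 12 merged tokens per slice side
    local_tokens = (ph * side) * (pw * side + 1)  # +1 newline token per row
    global_tokens = side * (side + 1)             # global view is a 1x1 grid
    separator = 1                                 # token between the two views
    return local_tokens + separator + global_tokens


print(image_token_count(1, 1))  # 313
print(image_token_count(3, 3))  # 1489
print(image_token_count(4, 4))  # 2509
```

Without the 2×2 merge the same 4×4 grid would need roughly four times as many tokens, which is the motivation for the merge.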

Relevant source code: https://huggingface.co/microsoft/Phi-3.5-vision-instruct/blob/main/modeling_phi3_v.py

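# Excerpt: methods of the image embedding module in modeling_phi3_v.py
# (relies on `torch`, `torch.nn as nn`, and the module-level MAX_INPUT_ID)
def forward(
    self, input_ids: torch.LongTensor, pixel_values: torch.FloatTensor, image_sizes=None
) -> torch.FloatTensor: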
    input_shape = input_ids.size()
    input_ids = input_ids.view(-1, input_shape[-1])

    # positions for image tokens
    positions = torch.nonzero((input_ids < 0) & (input_ids > -MAX_INPUT_ID), as_tuple=True)
    has_image = len(positions[0].tolist()) > 0
    input_ids = input_ids.clamp_min(0).clamp_max(self.vocab_size).detach()
    hidden_states = self.wte(input_ids)

    if has_image:
        assert self.use_hd_transform
        num_images, num_crops, c, h, w = pixel_values.shape
        assert c == 3 and h == w == 336
        img_features = self.get_img_features(pixel_values.flatten(0, 1)).reshape(
            num_images, num_crops, -1, self.image_dim_out
        )
        image_features_proj = self.hd_feature_transform(img_features, image_sizes)
        hidden_states = hidden_states.index_put(
            positions, image_features_proj, accumulate=False
        )

    if self.drop is not None:
        hidden_states = self.drop(hidden_states)

    return hidden_states


def hd_feature_transform(self, image_features, image_sizes):
    """
    image_features: (num_images, num_crops+1, 24*24, 1024)
    """
    assert (
            self.hd_transform_order == 'sub_glb'
    ), f'hd_transform_order `{self.hd_transform_order}` not implemented'
    if isinstance(self.img_projection, nn.Sequential):
        target_device = self.img_projection[0].bias.device
        target_dtype = self.img_projection[0].bias.dtype
    else:  # It's a single nn.Linear layer
        target_device = self.img_projection.bias.device
        target_dtype = self.img_projection.bias.dtype

    global_image_features = image_features[:, 0]  # (num_images, 24*24, 1024)
    # global feature can be viewed as a special HD case with num_crops 1x1
    global_image_features_hd = self.reshape_hd_patches_2x2merge(global_image_features, 1, 1)
    global_image_features_hd_newline = self.add_image_newline(global_image_features_hd)

    all_image_embeddings = []
    # need a for loop to process each image because of different image sizes
    # (patch arrangement is different for each image)
    for i, img_size in enumerate(image_sizes):
        h, w = img_size
        h_crop = h // 336
        w_crop = w // 336
        num_crops = h_crop * w_crop

        # NOTE: real num_crops is padded
        # (num_crops, 24*24, 1024)
        sub_image_features = image_features[i, 1: 1 + num_crops]
        sub_image_features_hd = self.reshape_hd_patches_2x2merge(
            sub_image_features, h_crop, w_crop
        )
        sub_image_features_hd_newline = self.add_image_newline(sub_image_features_hd)

        # [sub features, separator, global features]
        all_image_embeddings.extend(
            [
                sub_image_features_hd_newline.squeeze(0),  # (h_crop*12*(w_crop*12+1), 4096)
                self.glb_GN.squeeze(0),
                global_image_features_hd_newline[i],
            ]
        )

    image_features_proj = self.img_projection(
        torch.cat(all_image_embeddings, dim=0).to(target_device).to(target_dtype)
    )

    return image_features_proj


def reshape_hd_patches_2x2merge(self, image_features, h_crop, w_crop):
    """
    image_features: (num_images*num_crops, 24*24, 1024)
    output: (num_images, h_crop*12, w_crop*12, 4096), h_crop*w_crop == num_crops
    """
    N, L, C = image_features.shape
    assert L == 24 * 24 and C == 1024 and N % (h_crop * w_crop) == 0
    num_images = N // (h_crop * w_crop)
    H = int(L ** 0.5)
    image_features_hd = (
        image_features.reshape(N, H, H, C)  # N, 24, 24, 1024
        .reshape(N, H // 2, 2, H // 2, 2, C)  # N, 12, 2, 12, 2, 1024
        .permute(0, 1, 3, 2, 4, 5)  # N, 12, 12, 2, 2, 1024
        .reshape(N, -1, 4 * C)  # N, 144, 4096
        .reshape(
            num_images, h_crop, w_crop, H // 2, H // 2, -1
        )  # n_img, h_crop, w_crop, 12, 12, 4096
        .permute(0, 1, 3, 2, 4, 5)  # n_img, h_crop, 12, w_crop, 12, 4096
        .reshape(
            num_images, h_crop * H // 2, w_crop * H // 2, 4 * C
        )  # n_img, h_crop*12, w_crop*12, 4096
    )

    # alternative implementation using einops
    # from einops import rearrange
    # image_features_nhwc = rearrange(
    #     image_features,
    #     'N (H W) c -> N H W c',
    #     H=H,
    #     W=H,
    # )
    # image_features_2x2merge = rearrange(
    #     image_features_nhwc,
    #     'N (h h_pool) (w w_pool) c -> N h w (h_pool w_pool c)',
    #     h_pool=2,
    #     w_pool=2,
    # )
    # image_features_hd = rearrange(
    #     image_features_2x2merge,
    #     '(n_img h_crop w_crop) h w C -> n_img (h_crop h) (w_crop w) C',
    #     h_crop=h_crop,
    #     w_crop=w_crop,
    # )

    return image_features_hd


def add_image_newline(self, image_features_hd):
    """
    image_features_hd: (num_images, h_crop*12, w_crop*12, 4096)
    output: (num_images, (h_crop*12) * (w_crop*12+1), 4096)
    """
    num_images, h, w, hid_dim = image_features_hd.shape
    # add the newline token to the HD image feature patches
    newline_embeddings = self.sub_GN.expand(num_images, h, -1, -1)  # (n_img, h, 1, hid_dim)
    image_features_hd_newline = torch.cat(
        [image_features_hd, newline_embeddings], dim=2
    ).reshape(num_images, -1, hid_dim)
    return image_features_hd_newline
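To make the reshape and permute chain easier to follow, here is a toy version on plain Python lists. It is a sketch, not the model code: a 4×4 grid of token indices stands in for the 24×24 ViT output, a tuple stands in for the channel-wise concatenation into 4096 dimensions, and the function names are hypothetical.

```python
def merge_2x2(grid):
    """Group each non-overlapping 2x2 block of a square grid into one cell.

    The four members are ordered top-left, top-right, bottom-left,
    bottom-right, which matches the (h_pool, w_pool) order produced by
    the permute in reshape_hd_patches_2x2merge.
    """
    n = len(grid)
    if n % 2 or any(len(row) != n for row in grid):
        raise ValueError("expected a square grid with an even side length")
    return [
        [
            (grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, n, 2)
        ]
        for r in range(0, n, 2)
    ]


def add_newline_and_flatten(merged, newline="\n"):
    """Append a newline marker after every row, then flatten row by row."""
    flat = []
    for row in merged:
        flat.extend(row)
        flat.append(newline)
    return flat


side = 4
grid = [[r * side + c for c in range(side)] for r in range(side)]
merged = merge_2x2(grid)
print(merged[0])  # [(0, 1, 4, 5), (2, 3, 6, 7)]
print(add_newline_and_flatten(merged))
```

The flattened sequence has rows × (columns + 1) entries, the same pattern as the `(h_crop*12) * (w_crop*12+1)` shape in the docstring of `add_image_newline`.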

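Connector & projector

Two fully connected layers, of size 4096×3072 and 3072×3072.

The image encoder above produces 4096-dimensional features, which the two fully connected layers map to the 3072 dimensions the LLM expects. In the source excerpt above, this is the img_projection module applied at the end of hd_feature_transform.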
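As a rough size check, the projector's parameter count follows directly from the two layer shapes. The sketch assumes both layers carry a bias (the source excerpt reads `img_projection[0].bias`, so at least the first one does) and that nothing else in the projector has parameters; `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Layer shapes as given in the text: 4096 -> 3072 -> 3072.
layers = [(4096, 3072), (3072, 3072)]
total = sum(linear_params(i, o) for i, o in layers)
print(f"{total:,}")  # 22,026,240
```

At about 22M parameters, the projector is a very small share of the 4.2B total.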

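Large language model

phi-3-mini-128K-instruct with 3.8B parameters (see my separate post on the LLM part for details): vocabulary size 32064, hidden dimension 3072, 32 heads, 32 layers.

The full configuration is in config.json in the model repository.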
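The 3.8B figure can be roughly reproduced from these numbers. The estimate below rests on several assumptions that are not stated in this post: a feed-forward intermediate size of 8192, a gated MLP with three weight matrices, four hidden×hidden attention projections, untied input and output embeddings, a 32064-entry vocabulary, and no biases or norm weights. `decoder_params` is a hypothetical helper.

```python
def decoder_params(vocab, hidden, layers, intermediate, tied_embeddings=False):
    """Approximate weight count of a decoder-only transformer.

    Assumptions (not taken from the post): Q, K, V, O projections are each
    hidden x hidden; the MLP is gated (gate, up, down); biases and norm
    weights are ignored.
    """
    attention = 4 * hidden * hidden
    mlp = 3 * hidden * intermediate
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)
    return layers * (attention + mlp) + embeddings


# intermediate=8192 is an assumed value, see the note above.
total = decoder_params(vocab=32064, hidden=3072, layers=32, intermediate=8192)
print(f"{total / 1e9:.2f}B")  # 3.82B
```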

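2. Model highlights

The architecture is conventional. Note that the LLM used here does not use block sparse attention; see my separate post on the LLM part for details.

The way images are split and processed.

PS

It really does feel like there is not much innovation left these days...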

III. Data

1. Data format

Single image

<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n

Multi-turn dialogue

<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n

Multiple images (numbered from 1)

<|user|>\n<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\n{prompt}<|end|>\n<|assistant|>\n
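The three templates above follow one pattern, which a small helper can generate. This is a minimal sketch that assumes all image tags go at the start of the first user turn, as in the templates; `build_prompt` is a hypothetical function, not part of the model's processor.

```python
def build_prompt(turns, num_images=0):
    """Build a chat prompt string from alternating user/assistant turns.

    turns: list of strings, starting and ending with a user turn.
    num_images: number of <|image_i|> tags (numbered from 1) to place at
    the start of the first user turn.
    """
    if not turns or len(turns) % 2 == 0:
        raise ValueError("expected an odd number of turns ending with a user turn")
    image_tags = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    parts = []
    for index, text in enumerate(turns):
        if index % 2 == 0:  # user turn
            prefix = image_tags if index == 0 else ""
            parts.append(f"<|user|>\n{prefix}{text}<|end|>\n")
        else:  # assistant turn
            parts.append(f"<|assistant|>\n{text}<|end|>\n")
    parts.append("<|assistant|>\n")  # generation starts here
    return "".join(parts)


print(build_prompt(["Describe the image."], num_images=1))
print(build_prompt(["What is this?", "A cat.", "What color is it?"], num_images=1))
```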

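2. Data composition

Pre-training

The paper mentions only the following categories and does not detail the specific datasets; 0.5T tokens in total.

*Obelics: interleaved image-text documents

*FLD-5B: image-text pairs

*OCR: synthetic data derived from optical character recognition (OCR) of PDF files

*Chart/table understanding

*Text-only data

Post-training

In addition to text-only RAI datasets, it also includes various in-house multimodal (MM) RAI datasets, which cover the harm categories identified in public and internal MM RAI benchmarks.

3. Data cleaning

Not mentioned.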

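IV. Strategy

1. Training process

The paper only briefly mentions two stages, pre-training and post-training, and gives no details such as hyperparameter settings.

Pre-training

The next-token prediction objective is applied to text tokens, while the loss on image tokens is ignored at this stage. Pre-training covers 0.5T tokens in total, mixing visual and text elements. During pre-training the maximum image resolution is capped at 1344×1344, since most training images are smaller than that.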
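The paper gives no implementation details, but the idea of ignoring the loss on image tokens can be sketched in plain Python as a masked average of negative log-likelihoods. The function name and the toy numbers are hypothetical; in PyTorch the same effect is usually obtained by labeling masked positions with the value that `cross_entropy`'s `ignore_index` skips.

```python
import math


def masked_next_token_loss(log_probs, targets, is_image_token):
    """Average negative log-likelihood over text positions only.

    log_probs[t][v]: log-probability of vocabulary id v at position t.
    targets[t]: id of the token to be predicted at position t.
    is_image_token[t]: True if that target is an image token (skipped).
    """
    kept = [
        -log_probs[t][targets[t]]
        for t in range(len(targets))
        if not is_image_token[t]
    ]
    return sum(kept) / len(kept) if kept else 0.0


# Toy sequence: two image positions followed by two text positions, vocab of 2.
log_probs = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.5), math.log(0.5)],
    [math.log(0.5), math.log(0.5)],
]
targets = [1, 0, 0, 1]
mask = [True, True, False, False]

print(f"{masked_next_token_loss(log_probs, targets, mask):.4f}")  # 0.6931
print(f"{masked_next_token_loss(log_probs, targets, [False] * 4):.4f}")
```

The first value depends only on the two text positions; the second, unmasked value is higher because the poorly predicted image positions are included.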

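Post-training

Safety post-training is applied in both the supervised fine-tuning (SFT) stage and the direct preference optimization (DPO) stage.

2. Inference process

Open question: is there any post-processing at inference time, and so on?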

V. Results

1. Multi-dimensional comparison

Common tasks

Image tasks

Video tasks

2. Ablation studies

None yet.

VI. Usage

Inference: https://huggingface.co/microsoft/Phi-3.5-vision-instruct

Fine-tuning: https://github.com/microsoft/Phi-3CookBook/blob/main/md/04.Fine-tuning/FineTuning_Vision.md

VII. Open issues

None yet.

VIII. References

https://huggingface.co/microsoft/Phi-3.5-vision-instruct

MLLM | InternLM-XComposer2-4KHD: a large vision-language model supporting resolutions from 336 pixels up to 4K HD (CSDN blog)
