Last year Meta released Llama 3.2-11B-Vision, its first open-source multimodal large model in the Llama 3.2 line. Yet there are still almost no blog posts that dissect how the model is actually constructed, which leaves readers confused about its principles and architecture and makes the model harder to learn. This series walks through the model step by step from the very beginning, without glossing over any stage, and should suit readers who are new to large models. The series outline is:
- Image preprocessing in detail (this post)
- Text preprocessing in detail (coming soon…)
- The vision encoder: detailed structure and steps (coming soon…)
- The text encoder: detailed structure and steps (coming soon…)
- Text fusion: detailed structure and steps (coming soon…)
- The output stage: detailed structure and steps (coming soon…)
- A complete, high-resolution flowchart of Llama 3.2-11B-Vision inference (this series' bonus ^_^, coming soon…)
Detailed Steps of Llama 3.2-11B-Vision Image Preprocessing
0. Llama 3.2-11B-Vision Inference Code
Below is the complete inference code as provided officially. I chose a grayscale image: a 28×28 image from the MNIST dataset, cropped on one side down to 28×24. Read with PIL, it is simply a 28×24 matrix whose entries are unsigned 8-bit integers in the range 0–255. The image is then passed into the processor call, the algorithm this post examines in detail; everything else belongs to text preprocessing and will be covered in the next post.
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
from time import time
import numpy as np
model_dir = "./models/llama3.2_11b"
model = MllamaForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
model.tie_weights()
# Initialization: this detects the tokenizer and the image-tiling configuration
processor = AutoProcessor.from_pretrained(model_dir)
if __name__ == '__main__':
    # Official sample image (kept for reference, not used below):
    url = "https://www.modelscope.cn/models/LLM-Research/Llama-3.2-11B-Vision/resolve/master/rabbit.jpg"
    # image = Image.open("./data/1995.jpg")  # image path
    image = Image.open("./data/000000.png")
    # image = Image.open("./data/bicycle_bigger.png")
    image_array = np.array(image)  # not needed downstream; every value is an unsigned 8-bit integer in 0-255
    query = "What digit is shown in the image?"
    # query = "What vehicle is shown in the image?"
    # query = "What is the person in the image doing?"
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": query},  # fill in the question here
            ],
        }
    ]
    s1 = time()
    # Wrap the query in the chat template; this step runs locally
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    # This is the __call__ step, i.e. the preprocessing invocation itself
    inputs = processor(image, input_text, return_tensors="pt").to(model.device)
    for k in inputs.data:  # dump the processor outputs for inspection
        # save_tensor_to_txt is the author's own helper (definition not shown here)
        save_tensor_to_txt(inputs.data[k], f"./data/tensor_output_0{k}.txt")
    print(processor.decode(inputs["input_ids"][0]))
    # The merging of image and text above is the fiddly part
    output = model.generate(**inputs, max_new_tokens=1000)
    print(time() - s1)
    print(processor.decode(output[0]))
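As a quick sanity check on the claim above about what PIL returns, the following standalone snippet (using the same local path as the script above; substitute any cropped MNIST digit you have on disk) prints the dtype, shape, and value range of the loaded image:

from PIL import Image
import numpy as np

img = Image.open("./data/000000.png")  # the cropped 28x24 MNIST digit from the script above
arr = np.array(img)
# Expected output: uint8 (28, 24) with values inside [0, 255]
print(arr.dtype, arr.shape, arr.min(), arr.max())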
1. Overview of the Image-Preprocessing Code
The code below is the official example framework. The input 11100.png is the 28×24 grayscale image described above; color images are handled the same way, since every image ends up as a three-channel pixel array, so the procedure and the underlying principle are identical. A simple grayscale image is used here purely to keep the explanation easy to follow. The preprocess method of the MllamaImageProcessor class is the focus of this post.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@文件:image_prerocess.py
@时间:2025/1/6 下午11:39
@作者:料理码王
@邮箱:1289590668@qq.com
@功能:输入一个28*24的灰度image,然后进行预处理!
"""
from PIL import Image
import numpy as np
from transformers import MllamaImageProcessor
# Load the image
image_path = "./data/11100.png"
image = Image.open(image_path)

# Create and configure a MllamaImageProcessor instance
image_processor = MllamaImageProcessor(
    do_convert_rgb=True,   # each grayscale pixel is simply replicated three times into a 3-channel list
    do_normalize=True,     # per channel: subtract the mean, divide by the standard deviation
    do_pad=True,
    do_rescale=True,
    do_resize=True,
    image_mean=[0.48145466, 0.4578275, 0.40821073],  # preset constants
    image_std=[0.26862954, 0.26130258, 0.27577711],  # preset constants
    max_image_tiles=4,
    resample=2,  # 2 corresponds to PILImageResampling.BILINEAR, i.e. resize with bilinear interpolation
    rescale_factor=0.00392156862745098,  # 1/255, determined by the per-pixel value range
    size={"height": 560, "width": 560},
)
# Run the preprocessing
processed_image = image_processor.preprocess(
    images=image,
    return_tensors="pt",  # return PyTorch tensors
)
# Pull out pixel_values and the other outputs
pixel_values = processed_image["pixel_values"]
aspect_ratio_ids = processed_image["aspect_ratio_ids"]
aspect_ratio_mask = processed_image["aspect_ratio_mask"]
num_tiles = processed_image["num_tiles"]
# Print the results (or feed the processed data to the model)
print("Pixel Values Shape:", pixel_values.shape)
print("Aspect Ratio IDs:", aspect_ratio_ids)
print("Aspect Ratio Mask:", aspect_ratio_mask)
print("Number of Tiles:", num_tiles)
2. The MllamaImageProcessor Class
Below is the beginning of the class, which declares every parameter it accepts. The source docstring explains each parameter carefully (you can also ask a large model such as ERNIE Bot for a fuller account of what each one does). Apart from that, the initializer mainly validates the incoming parameters so that conflicting settings are caught early; the short sketch right below illustrates the kind of conflict this guards against, and the class source follows it.
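For example, the class docstring states that the tile height and width should be equal, so a non-square size should presumably be rejected by this validation. A minimal, hedged illustration (requires a transformers version that ships Mllama; the exact exception type and message are the library's own and are not guaranteed here):

from transformers import MllamaImageProcessor

# Assumption based on the docstring ("The height and width values should be
# equal"): a non-square tile size should fail validation at construction time.
try:
    MllamaImageProcessor(size={"height": 560, "width": 448}, max_image_tiles=4)
except ValueError as err:
    print("Rejected as expected:", err)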
class MllamaImageProcessor(BaseImageProcessor):
    """
    Constructs a Mllama image processor.

    Args:
        do_convert_rgb (`bool`, *optional*, defaults to `True`):
            Whether to convert the image to RGB. This is useful if the input image is of a different format e.g. RGBA.
            Only has an effect if the input image is in the PIL format.
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the image.
        size (`Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}`):
            Size of the image tile. Should be a dictionary containing 'height' and 'width' keys, both with integer values.
            The height and width values should be equal.
        resample (`int`, *optional*, defaults to `Resampling.BILINEAR`):
            Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
            has an effect if `do_resize` is set to `True`.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image.
        rescale_factor (`float`, *optional*, defaults to `1 / 255`):
            Rescale factor to rescale the image by if `do_rescale` is set to `True`.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the image.
        image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN`):
            Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
        image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`):
            Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
            `True`.
        do_pad (`bool`, *optional*, defaults to `True`):
            Whether or not to pad the images to the largest height and width in the batch.
        max_image_tiles (`int`, *optional*, defaults to 4):
            The maximum number of tiles to split the image into.
    """
    model_input_names = ["pixel_values", "num_tiles", "aspect_ratio_ids", "aspect_ratio_mask"]

    def __init__(
        self,
        do_convert_rgb: bool = True,
        do_resize: bool = True,
        size: Optional[Dict[str, int]] = None,
        resample: PILImageResampling = PILImageResampling.BILINEAR,
        do_rescale: bool = True,
        rescale_factor: float = 1 / 255,  # preset from the per-pixel value range of 0-255
        do_normalize: bool = True,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_pad: bool = True,
        max_image_tiles: int = 4,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.do_convert_rgb = do_convert_rgb
        self.do_resize = do_resize
        self.size = size if size is not None else {"height": 224, "width": 224}
        self.resample = resample
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
        self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
        self.do_pad = do_pad
        self.max_image_tiles = max_image_tiles
        _validate_mllama_preprocess_arguments(self.do_resize, self.size, self.do_pad, self.max_image_tiles)
    def preprocess(
        self,
        images: ImageInput,
        do_convert_rgb: Optional[bool] = None,
        do_resize: Optional[bool] = None,
        size: Optional[Dict[str, int]] = None,
        resample: Optional[PILImageResampling] = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[float] = None,
        do_normalize: Optional[bool] = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_pad: Optional[bool] = None,
        max_image_tiles: Optional[int] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
    ):
        """
        Preprocess a batch of images.

        Args:
            images (`ImageInput`):
                A list of images to preprocess.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image tile. Should be a dictionary containing 'height' and 'width' keys, both with integer values.
                The height and width values should be equal.
            resample (`int`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
                has an effect if `do_resize` is set to `True`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
                `True`.
            do_pad (`bool`, *optional*, defaults to `self.do_pad`):
                Whether or not to pad the images to the largest height and width in the batch.
            max_image_tiles (`int`, *optional*, defaults to `self.max_image_tiles`):
                The maximum number of tiles to split the image into.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:
                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.

        Returns:
            `BatchFeature` of the following structure:
                - **pixel_values** (`TensorType`): The preprocessed pixel values.
                - **aspect_ratio_ids** (`TensorType`): The aspect ratio ids of the images.
                - **aspect_ratio_mask** (`TensorType`): The aspect ratio mask of the images.
                - **num_tiles** (`List[List[int]]`): The number of tiles for each image in the batch.
        """
        do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb