The Llama 3.2-11B-Vision multimodal model in detail (down to individual operators): the image preprocessing steps

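Meta released Llama 3.2-11B-Vision, its first open-source multimodal large model, last year. Even so, almost no blog posts examine how the model is actually built, which leaves readers confused about both its principles and its structure and makes large models harder to learn. This blog series walks through the model step by step from the very beginning, without glossing over any stage, so it should suit readers who are new to large models. The series is organized as follows: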

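  • Image preprocessing in detail (this post)
  • Text preprocessing in detail (coming soon)
  • Vision encoder: detailed structure and steps (coming soon)
  • Text encoder: detailed structure and steps (coming soon)
  • Text fusion: detailed structure and steps (coming soon)
  • Output: detailed structure and steps (coming soon)
  • A complete high-resolution flowchart of Llama 3.2-11B-Vision inference (the bonus of this series ^_^, coming soon)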

0. Inference code for the Llama 3.2-11B-Vision multimodal model

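The complete inference code below follows the official example. As input I chose a grayscale image: a 28×28 image from the MNIST dataset with part of the top cropped off, leaving a 28×24 image. Once read with PIL, the image is effectively a 28×24 matrix in which every value is an unsigned 8-bit integer between 0 and 255. It is then passed to the processor function, which is the algorithm this post covers in detail. Everything else belongs to text preprocessing and will be explained in the next post.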

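import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
from time import time
import numpy as np
import pandas as pd


def save_tensor_to_txt(tensor, path):
    # The script calls this helper but the post never shows it. This is a minimal
    # stand-in (an assumption, not the author's version) that writes the flattened
    # values one per line, with the original shape recorded in a header comment.
    array = tensor.detach().cpu().float().numpy().reshape(-1)
    np.savetxt(path, array, header=f"shape={tuple(tensor.shape)}")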

model_dir = "./models/llama3.2_11b"

model = MllamaForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
model.tie_weights()
# Initialization: this detects the tokenizer and the image tiling configuration
processor = AutoProcessor.from_pretrained(model_dir)


if __name__ == '__main__':
    url = "https://www.modelscope.cn/models/LLM-Research/Llama-3.2-11B-Vision/resolve/master/rabbit.jpg"
    while 1:  # loops until interrupted, as in the original script
        # image = Image.open("./data/1995.jpg")  # image path
        image = Image.open("./data/000000.png")
        # image = Image.open("./data/bicycle_bigger.png")
        image_array = np.array(image)  # not used later; every value is an 8-bit unsigned integer in 0~255
        query = "图中的数字是几?"  # "What digit is in the image?"
        # query = "图中的交通工具是什么?"  # "What vehicle is in the image?"
        # query = "图中的人在干嘛?"  # "What is the person in the image doing?"
        messages = [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": query}  # fill in the question
            ]}
        ]
        s1 = time()
        input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
        # The line above only wraps the query in a text template; it can run locally.
        # This is presumably the __call__ step, i.e. where the processor is actually invoked.
        inputs = processor(image, input_text, return_tensors="pt").to(model.device)
        for k in inputs.data:  # save every processor output to disk
            save_tensor_to_txt(inputs.data[k], f'./data/tensor_output_0{k}.txt')
        print(processor.decode(inputs["input_ids"][0]))
        # The fusion step above is the tricky part.
        output = model.generate(**inputs, max_new_tokens=1000)
        print(time() - s1)
        print(processor.decode(output[0]))


1. Overview of the image preprocessing code

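The code below follows the framework of the official example. The input, 11100.png, is the 28×24 grayscale image mentioned above. Color images are handled the same way: every input ends up as a three-channel pixel map, so the process and the principle are identical. A simple grayscale image is used here only to keep the explanation easy to follow. The preprocess method of the MllamaImageProcessor class is the main subject of this post.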

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@File: image_prerocess.py
@Time: 2025/1/6, 11:39 PM
@Author: 料理码王
@Email: 1289590668@qq.com
@Purpose: take a 28*24 grayscale image as input and preprocess it
"""
from PIL import Image
import numpy as np
from transformers import MllamaImageProcessor

# Load the image
image_path = "./data/11100.png"
image = Image.open(image_path)

# Create and configure a MllamaImageProcessor instance
image_processor = MllamaImageProcessor(
    do_convert_rgb=True,  # a grayscale pixel is replicated three times, once per RGB channel
    do_normalize=True,  # per channel: subtract the mean, then divide by the standard deviation
    do_pad=True,
    do_rescale=True,
    do_resize=True,
    image_mean=[0.48145466, 0.4578275, 0.40821073],  # preset constants
    image_std=[0.26862954, 0.26130258, 0.27577711],  # preset constants
    max_image_tiles=4,
    resample=2,  # 2 corresponds to PILImageResampling.BILINEAR: images are resized with bilinear interpolation
    rescale_factor=0.00392156862745098,  # 1/255, presumably chosen from the 0~255 range of a single pixel
    size={"height": 560, "width": 560}
)

# Call the preprocessing method
processed_image = image_processor.preprocess(
    images=image,
    return_tensors="pt",  # return PyTorch tensors ("np" would return NumPy arrays)
)

# Fetch the preprocessed pixel_values and the accompanying metadata
pixel_values = processed_image["pixel_values"]
aspect_ratio_ids = processed_image["aspect_ratio_ids"]
aspect_ratio_mask = processed_image["aspect_ratio_mask"]
num_tiles = processed_image["num_tiles"]

# Print the results (or use the processed data further downstream)
print("Pixel Values Shape:", pixel_values.shape)
print("Aspect Ratio IDs:", aspect_ratio_ids)
print("Aspect Ratio Mask:", aspect_ratio_mask)
print("Number of Tiles:", num_tiles)
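
To make the do_convert_rgb, do_rescale, and do_normalize settings above concrete, here is a minimal sketch of the arithmetic they imply for a single grayscale pixel. It is an illustration only, not the library's implementation: the function name preprocess_gray_pixel is hypothetical, the constants are copied from the configuration above, and the resize, pad, and tiling steps are left out.

```python
# Per-channel statistics and rescale factor copied from the configuration above.
IMAGE_MEAN = [0.48145466, 0.4578275, 0.40821073]
IMAGE_STD = [0.26862954, 0.26130258, 0.27577711]
RESCALE_FACTOR = 1 / 255


def preprocess_gray_pixel(value):
    """Map one 8-bit grayscale value to three normalized channel values.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    # do_convert_rgb: the gray value is replicated into R, G and B.
    rgb = [value, value, value]
    # do_rescale: map the 0~255 integer range onto 0~1.
    rescaled = [channel * RESCALE_FACTOR for channel in rgb]
    # do_normalize: subtract the channel mean, divide by the channel std.
    return [
        (channel - mean) / std
        for channel, mean, std in zip(rescaled, IMAGE_MEAN, IMAGE_STD)
    ]


black = preprocess_gray_pixel(0)    # darkest possible input pixel
white = preprocess_gray_pixel(255)  # brightest possible input pixel
print("black:", black)
print("white:", white)
```

Because the mean and standard deviation differ per channel, a single gray value produces three different numbers after normalization, even though the three channels started out identical.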

2. The MllamaImageProcessor class

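This section shows the opening part of the class, which lists all the parameters it accepts. The docstrings in the source explain each parameter in detail, and an LLM assistant such as ERNIE Bot can help unpack what each one does. The initializer also validates the arguments so that different parameters do not conflict with one another.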
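
As one concrete illustration of these parameters, the sketch below enumerates the tile grids that fit within max_image_tiles=4, the value used in the configuration above. It assumes that tiles are laid out in a rectangular grid and that any grid with at most max_image_tiles tiles is allowed. The function name supported_tile_arrangements is hypothetical and is not taken from the library source shown in this post.

```python
def supported_tile_arrangements(max_image_tiles):
    """List (tiles_wide, tiles_high) grids using at most max_image_tiles tiles.

    Hypothetical helper for illustration; assumes a rectangular tile grid.
    """
    arrangements = []
    for tiles_wide in range(1, max_image_tiles + 1):
        for tiles_high in range(1, max_image_tiles + 1):
            # Keep only grids whose total tile count stays within the budget.
            if tiles_wide * tiles_high <= max_image_tiles:
                arrangements.append((tiles_wide, tiles_high))
    return arrangements


arrangements = supported_tile_arrangements(4)
print(len(arrangements), "arrangements:", arrangements)
```

Under this assumption, a small input such as the 28×24 image fits into the single-tile (1, 1) arrangement, while larger or elongated images would need one of the multi-tile grids.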

class MllamaImageProcessor(BaseImageProcessor):
    """
    Constructs a Mllama image processor.

    Args:
        do_convert_rgb (`bool`, *optional*, defaults to `True`):
            Whether to convert the image to RGB. This is useful if the input image is of a different format e.g. RGBA.
            Only has an effect if the input image is in the PIL format.
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the image.
        size (`Dict[str, int]`, *optional*, defaults to `self.size`):
            Size of the image tile. Should be a dictionary containing 'height' and 'width' keys, both with integer values.
            The height and width values should be equal.
        resample (`int`, *optional*, defaults to `Resampling.BILINEAR`):
            Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
            has an effect if `do_resize` is set to `True`.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image.
        rescale_factor (`float`, *optional*, defaults to 0.0):
            Rescale factor to rescale the image by if `do_rescale` is set to `True`.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the image.
        image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
            Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
        image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
            Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
            `True`.
        do_pad (`bool`, *optional*, defaults to `True`):
            Whether or not to pad the images to the largest height and width in the batch.
        max_image_tiles (`int`, *optional*, defaults to 4):
            The maximum number of tiles to split the image into.
    """

    model_input_names = ["pixel_values", "num_tiles", "aspect_ratio_ids", "aspect_ratio_mask"]

    def __init__(
        self,
        do_convert_rgb: bool = True,
        do_resize: bool = True,
        size: Optional[Dict[str, int]] = None,
        resample: PILImageResampling = PILImageResampling.BILINEAR,
        do_rescale: bool = True,
        rescale_factor: float = 1 / 255, # also preset from the 0~255 value range of a single pixel
        do_normalize: bool = True,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_pad: bool = True,
        max_image_tiles: int = 4,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.do_convert_rgb = do_convert_rgb
        self.do_resize = do_resize
        self.size = size if size is not None else {"height": 224, "width": 224}
        self.resample = resample
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
        self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
        self.do_pad = do_pad
        self.max_image_tiles = max_image_tiles

        _validate_mllama_preprocess_arguments(self.do_resize, self.size, self.do_pad, self.max_image_tiles)

    def preprocess(
        self,
        images: ImageInput,
        do_convert_rgb: Optional[bool] = None,
        do_resize: Optional[bool] = None,
        size: Optional[Dict[str, int]] = None,
        resample: Optional[PILImageResampling] = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[float] = None,
        do_normalize: Optional[bool] = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_pad: Optional[bool] = None,
        max_image_tiles: Optional[int] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
    ):
        """
        Preprocess a batch of images.

        Args:
            images (`ImageInput`):
                A list of images to preprocess.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image tile. Should be a dictionary containing 'height' and 'width' keys, both with integer values.
                The height and width values should be equal.
            resample (`int`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
                has an effect if `do_resize` is set to `True`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
                `True`.
            do_pad (`bool`, *optional*, defaults to `self.do_pad`):
                Whether or not to pad the images to the largest height and width in the batch.
            max_image_tiles (`int`, *optional*, defaults to `self.max_image_tiles`):
                The maximum number of tiles to split the image into.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:
                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.

        Returns:
            `BatchFeature` of the following structure:
                - **pixel_values** (`TensorType`): The preprocessed pixel values.
                - **aspect_ratio_ids** (`TensorType`): The aspect ratio ids of the images.
                - **num_tiles** (`List[List[int]]`): The number of tiles for each image in the batch.
        """
        do_convert