Last year Meta released Llama 3.2-11B-Vision, its first open-source multimodal large model in the Llama 3.2 line. Yet there are still almost no blog posts that dissect how the model is actually constructed, which leaves readers confused about its principles and architecture and makes the model harder to learn. This series walks through the model step by step from the very beginning, without glossing over any stage, and should suit readers who are new to large models. The series outline is:
- Image preprocessing in detail (this post)
- Text preprocessing in detail (coming soon…)
- The vision encoder: detailed structure and steps (coming soon…)
- The text encoder: detailed structure and steps (coming soon…)
- Text fusion: detailed structure and steps (coming soon…)
- The output stage: detailed structure and steps (coming soon…)
- A complete, high-resolution flowchart of Llama 3.2-11B-Vision inference (this series' bonus ^_^, coming soon…)
Detailed Steps of Llama 3.2-11B-Vision Image Preprocessing
0. Llama 3.2-11B-Vision Inference Code
Below is the complete inference code as provided officially. I chose a grayscale image: a 28×28 image from the MNIST dataset, cropped on one side down to 28×24. Read with PIL, it is simply a 28×24 matrix whose entries are unsigned 8-bit integers in the range 0–255. The image is then passed into the processor call, the algorithm this post examines in detail; everything else belongs to text preprocessing and will be covered in the next post.
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
from time import time
import numpy as np
model_dir = "./models/llama3.2_11b"
model = MllamaForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
model.tie_weights()
# Initialization: this detects the tokenizer and the image-tiling configuration
processor = AutoProcessor.from_pretrained(model_dir)
if __name__ == '__main__':
    # Official sample image (kept for reference, not used below):
    url = "https://www.modelscope.cn/models/LLM-Research/Llama-3.2-11B-Vision/resolve/master/rabbit.jpg"
    # image = Image.open("./data/1995.jpg")  # image path
    image = Image.open("./data/000000.png")
    # image = Image.open("./data/bicycle_bigger.png")
    image_array = np.array(image)  # not needed downstream; every value is an unsigned 8-bit integer in 0-255
    query = "What digit is shown in the image?"
    # query = "What vehicle is shown in the image?"
    # query = "What is the person in the image doing?"
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": query},  # fill in the question here
            ],
        }
    ]
    s1 = time()
    # Wrap the query in the chat template; this step runs locally
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    # This is the __call__ step, i.e. the preprocessing invocation itself
    inputs = processor(image, input_text, return_tensors="pt").to(model.device)
    for k in inputs.data:  # dump the processor outputs for inspection
        # save_tensor_to_txt is the author's own helper (definition not shown here)
        save_tensor_to_txt(inputs.data[k], f"./data/tensor_output_0{k}.txt")
    print(processor.decode(inputs["input_ids"][0]))
    # The merging of image and text above is the fiddly part
    output = model.generate(**inputs, max_new_tokens=1000)
    print(time() - s1)
    print(processor.decode(output[0]))
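As a quick sanity check on the claim above about what PIL returns, the following standalone snippet (using the same local path as the script above; substitute any cropped MNIST digit you have on disk) prints the dtype, shape, and value range of the loaded image:

from PIL import Image
import numpy as np

img = Image.open("./data/000000.png")  # the cropped 28x24 MNIST digit from the script above
arr = np.array(img)
# Expected output: uint8 (28, 24) with values inside [0, 255]
print(arr.dtype, arr.shape, arr.min(), arr.max())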
1. Overview of the Image-Preprocessing Code
The code below is the official example framework. The input 11100.png is the 28×24 grayscale image described above; color images are handled the same way, since every image ends up as a three-channel pixel array, so the procedure and the underlying principle are identical. A simple grayscale image is used here purely to keep the explanation easy to follow. The preprocess method of the MllamaImageProcessor class is the focus of this post.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@文件:image_prerocess.py
@时间:2025/1/6 下午11:39
@作者:料理码王
@邮箱:1289590668@qq.com
@功能:输入一个28*24的灰度image,然后进行预处理!
"""
from PIL import Image
import numpy as np
from transformers import MllamaImageProcessor
# Load the image
image_path = "./data/11100.png"
image = Image.open(image_path)

# Create and configure a MllamaImageProcessor instance
image_processor = MllamaImageProcessor(
    do_convert_rgb=True,   # each grayscale pixel is simply replicated three times into a 3-channel list
    do_normalize=True,     # per channel: subtract the mean, divide by the standard deviation
    do_pad=True,
    do_rescale=True,
    do_resize=True,
    image_mean=[0.48145466, 0.4578275, 0.40821073],  # preset constants
    image_std=[0.26862954, 0.26130258, 0.27577711],  # preset constants
    max_image_tiles=4,
    resample=2,  # 2 corresponds to PILImageResampling.BILINEAR, i.e. resize with bilinear interpolation
    rescale_factor=0.00392156862745098,  # 1/255, determined by the per-pixel value range
    size={"height": 560, "width": 560},
)
# Run the preprocessing
processed_image = image_processor.preprocess(
    images=image,
    return_tensors="pt",  # return PyTorch tensors
)
# Pull out pixel_values and the other outputs
pixel_values = processed_image["pixel_values"]
aspect_ratio_ids = processed_image["aspect_ratio_ids"]
aspect_ratio_mask = processed_image["aspect_ratio_mask"]
num_tiles = processed_image["num_tiles"]
# Print the results (or feed the processed data to the model)
print("Pixel Values Shape:", pixel_values.shape)
print("Aspect Ratio IDs:", aspect_ratio_ids)
print("Aspect Ratio Mask:", aspect_ratio_mask)
print("Number of Tiles:", num_tiles)
2. The MllamaImageProcessor Class
Below is the beginning of the class, which declares every parameter it accepts. The source docstring explains each parameter carefully (you can also ask a large model such as ERNIE Bot for a fuller account of what each one does). Apart from that, the initializer mainly validates the incoming parameters so that conflicting settings are caught early; the short sketch right below illustrates the kind of conflict this guards against, and the class source follows it.
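For example, the class docstring states that the tile height and width should be equal, so a non-square size should presumably be rejected by this validation. A minimal, hedged illustration (requires a transformers version that ships Mllama; the exact exception type and message are the library's own and are not guaranteed here):

from transformers import MllamaImageProcessor

# Assumption based on the docstring ("The height and width values should be
# equal"): a non-square tile size should fail validation at construction time.
try:
    MllamaImageProcessor(size={"height": 560, "width": 448}, max_image_tiles=4)
except ValueError as err:
    print("Rejected as expected:", err)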
class MllamaImageProcessor(BaseImageProcessor):
    """
    Constructs a Mllama image processor.

    Args:
        do_convert_rgb (`bool`, *optional*, defaults to `True`):
            Whether to convert the image to RGB. This is useful if the input image is of a different format e.g. RGBA.
            Only has an effect if the input image is in the PIL format.
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the image.
        size (`Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}`):
            Size of the image tile. Should be a dictionary containing 'height' and 'width' keys, both with integer values.
            The height and width values should be equal.
        resample (`int`, *optional*, defaults to `Resampling.BILINEAR`):
            Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
            has an effect if `do_resize` is set to `True`.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image.
        rescale_factor (`float`, *optional*, defaults to `1 / 255`):
            Rescale factor to rescale the image by if `do_rescale` is set to `True`.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the image.
        image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN`):
            Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
        image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`):
            Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
            `True`.
        do_pad (`bool`, *optional*, defaults to `True`):
            Whether or not to pad the images to the largest height and width in the batch.
        max_image_tiles (`int`, *optional*, defaults to 4):
            The maximum number of tiles to split the image into.
    """
    model_input_names = ["pixel_values", "num_tiles", "aspect_ratio_ids", "aspect_ratio_mask"]

    def __init__(
        self,
        do_convert_rgb: bool = True,
        do_resize: bool = True,
        size: Optional[Dict[str, int]] = None,
        resample: PILImageResampling = PILImageResampling.BILINEAR,
        do_rescale: bool = True,
        rescale_factor: float = 1 / 255,  # preset from the per-pixel value range of 0-255
        do_normalize: bool = True,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_pad: bool = True,
        max_image_tiles: int = 4,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.do_convert_rgb = do_convert_rgb
        self.do_resize = do_resize
        self.size = size if size is not None else {"height": 224, "width": 224}
        self.resample = resample
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
        self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
        self.do_pad = do_pad
        self.max_image_tiles = max_image_tiles
        _validate_mllama_preprocess_arguments(self.do_resize, self.size, self.do_pad, self.max_image_tiles)
    def preprocess(
        self,
        images: ImageInput,
        do_convert_rgb: Optional[bool] = None,
        do_resize: Optional[bool] = None,
        size: Optional[Dict[str, int]] = None,
        resample: Optional[PILImageResampling] = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[float] = None,
        do_normalize: Optional[bool] = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_pad: Optional[bool] = None,
        max_image_tiles: Optional[int] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
    ):
        """
        Preprocess a batch of images.

        Args:
            images (`ImageInput`):
                A list of images to preprocess.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image tile. Should be a dictionary containing 'height' and 'width' keys, both with integer values.
                The height and width values should be equal.
            resample (`int`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
                has an effect if `do_resize` is set to `True`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
                `True`.
            do_pad (`bool`, *optional*, defaults to `self.do_pad`):
                Whether or not to pad the images to the largest height and width in the batch.
            max_image_tiles (`int`, *optional*, defaults to `self.max_image_tiles`):
                The maximum number of tiles to split the image into.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:
                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.

        Returns:
            `BatchFeature` of the following structure:
                - **pixel_values** (`TensorType`): The preprocessed pixel values.
                - **aspect_ratio_ids** (`TensorType`): The aspect ratio ids of the images.
                - **aspect_ratio_mask** (`TensorType`): The aspect ratio mask of the images.
                - **num_tiles** (`List[List[int]]`): The number of tiles for each image in the batch.
        """
        do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb