Getting Started with Running the Qwen2.5-VL Multimodal Model

Ziegler Han

Last modified 2025-03-12 10:24:21

First published 2025-03-11 20:53:26

Article link: https://blog.csdn.net/weixin_40677588/article/details/146188678

Qwen2.5-VL Multimodal Model: A Guide to Running It

Environment Setup

# Create and activate a conda virtual environment
conda create --name my-qwen2.5-vl python=3.10
conda activate my-qwen2.5-vl

Installing Dependencies

# Install the core dependencies
pip install git+https://github.com/huggingface/transformers accelerate
pip install "qwen-vl-utils[decord]==0.0.8"

# Install PyTorch (the CUDA 12.1 build is recommended)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
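
Before moving on, it can help to confirm that the packages from the commands above actually landed in the active environment. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical, and the list of distribution names is an assumption based on the pip commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names assumed from the pip install commands above
    required = ["torch", "torchvision", "transformers", "accelerate", "qwen-vl-utils"]
    for name, version in installed_versions(required).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any line reporting NOT INSTALLED suggests the corresponding pip command should be rerun inside the activated conda environment.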

Preparing the Code

Create a file named qwen-vl.py and paste in the following:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Basic model loading
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", 
    torch_dtype="auto",
    device_map="auto"
)

# (Optional) Enable flash_attention_2 for faster inference (requires a CUDA-capable GPU and the flash-attn package)
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-3B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto"
# )

# Initialize the processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Build the multimodal input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "test.jpg"},
            {"type": "text", "text": "描述这张图片."}  # Chinese prompt: "Describe this image."
        ]
    }
]

# Preprocessing
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(model.device)  # follow the device chosen by device_map="auto" instead of hard-coding "cuda"

# Generate the description, then strip the prompt tokens from each output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, 
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)
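
The slicing step in the script is easy to misread, so here is a small standalone sketch of it. It assumes, as the script does, that each generated sequence starts with the prompt tokens followed by the new tokens. The function name trim_prompt_tokens and the token ids are made up for illustration; plain lists stand in for tensors.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence."""
    return [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


# Made-up token ids for illustration only
prompt_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 44]]

trimmed = trim_prompt_tokens(prompt_batch, generated_batch)
print(trimmed)  # [[42, 43, 44]]
```

Without this step, decoding would echo the chat template and the prompt before the model's answer.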

Preparing a Test Image

Download any image.
Rename it to test.jpg.
Place it in the same directory as the code file.
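
A missing or misnamed image is a likely first stumbling block, so one option is to check for the file before building the messages list. The sketch below is an assumption about how one might do that; build_image_message is a hypothetical helper, not part of transformers or qwen_vl_utils, and it returns the same structure as the messages list in the script.

```python
from pathlib import Path


def build_image_message(image_path, prompt):
    """Return a one-turn messages list after checking that the image file exists."""
    path = Path(image_path)
    if not path.is_file():
        # Show the absolute path so a wrong working directory is easy to spot
        raise FileNotFoundError(f"Test image not found: {path.resolve()}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": str(path)},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

In the script, messages = build_image_message("test.jpg", "描述这张图片.") could then replace the hand-written list. Note that a relative path such as "test.jpg" is resolved against the directory the program is launched from.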

Running the Program

python qwen-vl.py

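Expected Output

The model prints a description of the image. Because the prompt is in Chinese, the output is too. An English translation of the sample output: "This image shows a young woman interacting with her dog on a beach. She is sitting on the sand, wearing a plaid shirt and black pants, and smiling as she looks at her dog. The dog wears a colorful collar and is shaking hands with her using its front paw. The background is a blurred ocean and sky, giving a sense of calm and warmth. Sunlight falls on them, creating a warm and harmonious atmosphere." The raw console output looks like this: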

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.00s/it]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
['这张图片展示了一位年轻女子和她的狗在海滩上互动的场景。她坐在沙滩上，穿着格子衬衫和黑色裤子，面带微笑地看着她的狗。她的狗戴着彩色的项圈，正在用前爪与她握手。背景是模糊的海洋和天空，给人一种宁静和温暖的感觉。阳光洒在她们身上，营造出一种温馨和谐的氛围。']

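Advanced Configuration

Adjusting the Visual Token Range

# Pass these arguments when initializing the processor (unit: pixels)
min_pixels = 256 * 28 * 28  # corresponds to 256 visual tokens
max_pixels = 1280 * 28 * 28 # corresponds to 1280 visual tokens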
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels
)
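
To get a feel for what these limits mean, the arithmetic in the comments above (one visual token per 28 x 28 pixels) can be turned into a rough estimator. This is a simplified sketch, not the processor's actual resizing logic: it only clamps the total pixel count to the configured range and ignores any rounding of width and height. The function name estimate_visual_tokens is hypothetical.

```python
PATCH_PIXELS = 28 * 28  # pixels per visual token, per the comments above


def estimate_visual_tokens(width, height,
                           min_pixels=256 * PATCH_PIXELS,
                           max_pixels=1280 * PATCH_PIXELS):
    """Rough estimate: clamp the pixel count to the range, then divide by 28 * 28."""
    pixels = width * height
    clamped = max(min_pixels, min(pixels, max_pixels))
    return clamped // PATCH_PIXELS


for size in [(224, 224), (1024, 768), (4000, 3000)]:
    print(size, estimate_visual_tokens(*size))
```

Under this approximation a small 224 x 224 image is raised to the 256-token floor, a 4000 x 3000 photo is capped at 1280 tokens, and sizes in between scale with their pixel count. Lowering max_pixels trades image detail for less memory and faster generation.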