Introduction
The open-source Qwen model family recently gained QVQ-72B-Preview, an experimental research model aimed at improving visual reasoning. Built on Qwen2-VL-72B, it is designed to handle complex tasks that involve both text and images.
Performance
QVQ-72B-Preview performs strongly across a range of benchmarks, demonstrating cross-disciplinary understanding and reasoning. Key highlights:
- MMMU benchmark: a score of 70.3%, reflecting solid multidisciplinary understanding.
- MathVision: marked gains on mathematical reasoning tasks, outperforming the other models compared below.
- OlympiadBench: improved problem-solving ability on challenging olympiad-style problems.
| Benchmark | QVQ-72B-Preview | o1-2024-12-17 | gpt-4o-2024-05-13 | Claude 3.5 Sonnet-20241022 | Qwen2-VL-72B |
|---|---|---|---|---|---|
| MMMU (val) | 70.3 | 77.3 | 69.1 | 70.4 | 64.5 |
| MathVista (mini) | 71.4 | 71.0 | 63.8 | 65.3 | 70.5 |
| MathVision (full) | 35.9 | – | 30.4 | 35.6 | 25.9 |
| OlympiadBench | 20.4 | – | 25.9 | – | 11.2 |
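Usage
QVQ-72B-Preview can be run with Hugging Face transformers together with the qwen-vl-utils helper package, which prepares image and video inputs for the processor. Install the helper first, then load the model and processor and run inference as shown below.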
pip install qwen-vl-utils
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# Default: load the model onto the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/QVQ-72B-Preview", torch_dtype="auto", device_map="auto"
)
# Default processor
processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")
# The default range for the number of visual tokens per image is 4-16384.
# Set min_pixels and max_pixels to balance speed and memory usage,
# e.g. a visual token range of 256-1280:
# min_pixels = 256 * 28 * 28
# max_pixels = 1280 * 28 * 28
# processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."}
],
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/QVQ/demo.png",
},
{"type": "text", "text": "What value should be filled in the blank space?"},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
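Besides HTTP(S) URLs, the image field in a message can also point to a local file or a base64-encoded image, which process_vision_info resolves in the same way (this mirrors the Qwen2-VL tooling; the path below is a hypothetical placeholder). A minimal sketch reusing the model and processor loaded above:

```python
# Sketch: querying the model with a local image instead of a URL.
# Assumes qwen-vl-utils resolves file:// paths as it does for Qwen2-VL;
# "/path/to/your/image.png" is a placeholder, not a real file.
local_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.png"},
            {"type": "text", "text": "What value should be filled in the blank space?"},
        ],
    }
]
text = processor.apply_chat_template(local_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(local_messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=8192)
```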
Limitations
While QVQ-72B-Preview shows encouraging results, its limitations must be acknowledged:
1. Language mixing: the model may mix languages or switch between them unexpectedly, which can reduce the clarity of its responses.
2. Recursive reasoning: it can fall into circular reasoning loops, producing long responses that may never reach a conclusion.
3. Safety and ethics: robust safety measures are needed for reliable performance, and users should exercise caution when deploying the model.
4. Performance: during multi-step visual reasoning the model may lose focus on the image content, and on basic recognition tasks it does not improve significantly over Qwen2-VL-72B.
Conclusion
Qwen's QVQ-72B-Preview is an important step toward stronger visual reasoning in large language models. While its performance is impressive, further development and research are needed to address its limitations and to ensure robust and safe use.