Introduction
The open-source Qwen model family recently gained QVQ-72B-Preview, an experimental research model aimed at improving visual reasoning. Built on Qwen2-VL-72B, it is designed to handle complex tasks that involve both text and images.
Performance
QVQ-72B-Preview performs strongly across a range of benchmarks, demonstrating cross-disciplinary understanding and reasoning. Key highlights:
- MMMU benchmark: a score of 70.3%, reflecting solid multidisciplinary understanding.
- MathVision: marked gains on mathematical reasoning tasks, outperforming the other models compared below.
- OlympiadBench: improved problem-solving ability on challenging olympiad-style problems.
| Benchmark | QVQ-72B-Preview | o1-2024-12-17 | gpt-4o-2024-05-13 | Claude 3.5 Sonnet-20241022 | Qwen2-VL-72B |
|---|---|---|---|---|---|
| MMMU (val) | 70.3 | 77.3 | 69.1 | 70.4 | 64.5 |
| MathVista (mini) | 71.4 | 71.0 | 63.8 | 65.3 | 70.5 |
| MathVision (full) | 35.9 | – | 30.4 | 35.6 | 25.9 |
| OlympiadBench | 20.4 | – | 25.9 | – | 11.2 |
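Usage
QVQ-72B-Preview can be run with Hugging Face transformers together with the qwen-vl-utils helper package, which prepares image and video inputs for the processor. Install the helper first, then load the model and processor and run inference as shown below.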
pip install qwen-vl-utils
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# Default: load the model onto the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/QVQ-72B-Preview", torch_dtype="auto", device_map="auto"
)
# Default processor
processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")
# The default range for the number of visual tokens per image is 4-16384.
# Set min_pixels and max_pixels to balance speed and memory usage,
# e.g. a visual token range of 256-1280:
# min_pixels = 256 * 28 * 28
# max_pixels = 1280 * 28 * 28
# processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."}
],
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/QVQ/demo.png",
},
{"type": "text", "text": "What value should be filled in the blank space?"},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
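Besides HTTP(S) URLs, the image field in a message can also point to a local file or a base64-encoded image, which process_vision_info resolves in the same way (this mirrors the Qwen2-VL tooling; the path below is a hypothetical placeholder). A minimal sketch reusing the model and processor loaded above:

```python
# Sketch: querying the model with a local image instead of a URL.
# Assumes qwen-vl-utils resolves file:// paths as it does for Qwen2-VL;
# "/path/to/your/image.png" is a placeholder, not a real file.
local_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.png"},
            {"type": "text", "text": "What value should be filled in the blank space?"},
        ],
    }
]
text = processor.apply_chat_template(local_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(local_messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=8192)
```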
Limitations
While QVQ-72B-Preview shows encouraging results, its limitations must be acknowledged:
1. Language mixing: the model may mix languages or switch between them unexpectedly, which can reduce the clarity of its responses.
2. Recursive reasoning: it can fall into circular reasoning loops, producing long responses that may never reach a conclusion.
3. Safety and ethics: robust safety measures are needed for reliable performance, and users should exercise caution when deploying the model.
4. Performance: during multi-step visual reasoning the model may lose focus on the image content, and on basic recognition tasks it does not improve significantly over Qwen2-VL-72B.
Conclusion
Qwen's QVQ-72B-Preview is an important step toward stronger visual reasoning in large language models. While its performance is impressive, further development and research are needed to address its limitations and to ensure robust and safe use.