Alibaba Cloud's Tongyi Open-Sources Qwen2.5-VL, a Vision AI That Surpasses Claude 3.5

Alibaba Cloud's Tongyi Qianwen team has open-sourced its new vision-language model, Qwen2.5-VL, released in three sizes: 3B, 7B, and 72B.

The flagship Qwen2.5-VL-72B took first place in visual understanding across 13 authoritative benchmarks, surpassing GPT-4o and Claude 3.5. According to Alibaba Cloud, the new Qwen2.5-VL parses image content more accurately and, as a breakthrough, supports understanding of videos longer than one hour. The model can search a video for specific events and summarize the key points of different time segments, helping users extract key information from video quickly and efficiently.

In addition, Qwen2.5-VL can act, without any fine-tuning, as a visual agent that operates phones and computers, carrying out multi-step tasks such as sending greetings to a specific friend, retouching images on a PC, or booking tickets on a phone. Beyond recognizing common objects such as flowers, birds, fish, and insects, Qwen2.5-VL can analyze text, charts, icons, graphics, and layout within images. Alibaba Cloud has also upgraded its OCR, strengthening text recognition and text localization across multiple scenes, languages, and orientations.

Its information-extraction capability has also been substantially strengthened to meet growing digitalization and automation needs in areas such as credential review and financial and business workflows.

Quick Start

Install Dependencies

pip install git+https://github.com/huggingface/transformers accelerate
pip install qwen-vl-utils[decord]==0.0.8
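# Note: the [decord] extra uses decord for faster video loading; if decord wheels are
# not available for your platform, a plain `pip install qwen-vl-utils` falls back to torchvision.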

Chat with 🤗 Transformers

The default setup will blow past 24 GB of GPU memory; it is recommended to run with Qwen's official FlashAttention-2 (FA2) configuration instead.
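FlashAttention-2 requires the flash-attn package, which is not pulled in by the dependencies above; the usual install (assuming a CUDA toolchain is present) is:

pip install flash-attn --no-build-isolation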

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
#)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
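# Each visual token covers a 28x28 pixel patch, which is why the bounds are written as
# multiples of 28*28: min_pixels = 256*28*28 allows at least ~256 tokens per image and
# max_pixels = 1280*28*28 caps it at ~1280 tokens.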

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
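# output_text is a list of decoded strings, one per input sequence (here, a single-element list).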

Multi-Image Inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video Inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video url and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Qwen2.5-VL, frame rate information is also fed into the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # carries the fps metadata returned by process_vision_info
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video URL compatibility depends largely on the versions of the third-party libraries installed. If you do not want to use the default backend, you can switch it with FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.
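For example, to force the decord backend for a single run (the script name below is just a placeholder):

FORCE_QWENVL_VIDEO_READER=decord python your_script.py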

Batch Inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
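One caveat for batched generation with decoder-only models: prompts of different lengths should be left-padded so that generation continues from the last real token of each prompt. A one-line sketch of that setting, which should be applied before building the batch inputs above:

processor.tokenizer.padding_side = "left"  # pad on the left before calling generate()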

How to Call the Alibaba Cloud Qwen 2.5-VL API

For application developers who want to use the Alibaba Cloud Qwen 2.5-VL model for specific tasks, knowing how to call the API is essential, and this is easy to do with plain HTTP requests.

Preparation

Make sure you have obtained your access keys (AccessKey ID and AccessKey Secret); these are the authentication credentials required for any request sent to Alibaba Cloud[^1].

Building the HTTP Request

A valid HTTP POST request to the Qwen 2.5-VL API specifies the following parameters:

- URL: https://api.qwen.damo.ac.cn/v1/completions is the target endpoint for text-generation requests.
- Headers:
  - Content-Type: set to application/json to indicate a JSON body.
  - Authorization: provide credentials as a Bearer token, i.e. Bearer {your_access_token}.
- Body (payload): a JSON object containing the prompt, plus optional settings such as max_tokens, for example:

{
    "prompt": "Hello, world",
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9
}

This structured data is sent as part of the POST request body; the server processes it and returns a response[^2].

Python Example

The script below demonstrates how to make the API call described above:

import requests
import json

url = 'https://api.qwen.damo.ac.cn/v1/completions'
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YOUR_ACCESS_TOKEN'  # replace with your own token
}
data = {
    "prompt": "Write a short poem about autumn.",
    "max_tokens": 100,   # maximum number of output tokens
    "temperature": 0.8,  # sampling temperature controlling randomness
    "top_p": 0.9         # nucleus-sampling probability threshold
}

response = requests.post(url=url, headers=headers, data=json.dumps(data))
print(response.json())

This snippet shows the complete flow, from defining the required headers to assembling the payload to parsing the answer returned by the remote service[^3].
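Separately from the raw-HTTP approach above, Alibaba Cloud's DashScope platform documents an OpenAI-compatible endpoint for the Qwen-VL series. A minimal sketch, assuming a DashScope API key and taking the base URL and model name from DashScope's public docs (verify both before use):

import os
from openai import OpenAI

# Assumptions: DASHSCOPE_API_KEY is set in the environment, and the base URL and
# model name below match DashScope's current OpenAI-compatible mode.
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-vl-plus",  # assumed model name; check the DashScope model list
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(completion.choices[0].message.content)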