Alibaba International Open-Sources the Ovis2 Series of Multimodal Large Language Models in Six Versions

On February 21, 2025, Alibaba's international business team announced that its new multimodal large language model series, Ovis2, is now open source.

Ovis2 is the latest release in the Ovis series developed by Alibaba's international team. Compared with the previous 1.6 release, Ovis2 brings significant improvements in both data construction and training methodology. It not only increases the capability density of its smaller models, but also substantially strengthens chain-of-thought (CoT) reasoning through instruction tuning and preference learning. In addition, Ovis2 introduces video and multi-image processing, and improves multilingual support and OCR in complex scenes, making the models considerably more practical.

The open-sourced Ovis2 series comprises six versions: 1B, 2B, 4B, 8B, 16B, and 34B, each reaching SOTA (state-of-the-art) results among models of the same size. Ovis2-34B performs particularly well on the authoritative OpenCompass leaderboard: it ranks second among all open-source models on the multimodal general-capability leaderboard, surpassing many 70B open-source flagship models with less than half their parameters, and ranks first among all open-source models on the multimodal math-reasoning leaderboard, with the other sizes also showing strong reasoning ability. These results demonstrate both the effectiveness of the Ovis architecture and the open-source community's potential to advance multimodal large models.

The Ovis2 architecture is designed to address the mismatch between visual and textual embedding strategies. It consists of three key components: a visual tokenizer, a visual embedding table, and an LLM. The visual tokenizer splits the input image into patches, extracts features with a vision Transformer, and maps those features onto "visual words" through a visual head, producing probabilistic visual tokens. The visual embedding table stores an embedding vector for each visual word, and the LLM concatenates the visual embeddings with the text embeddings and processes them to generate the text output, completing the multimodal task.
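
To make the embedding-alignment idea concrete, here is a minimal, self-contained sketch of a probabilistic visual tokenizer feeding a visual embedding table. It is illustrative only, not the released implementation; module and parameter names (`vit_backbone`, `vocab_size`, `llm_hidden_size`) are assumptions.

import torch
import torch.nn as nn

class ProbabilisticVisualTokenizer(nn.Module):
    """Sketch of the Ovis-style idea: patch features are softly assigned to a
    vocabulary of 'visual words', and each patch embedding is the probability-
    weighted mix of rows of a learnable visual embedding table, mirroring how
    text tokens index a text embedding table."""

    def __init__(self, vit_backbone: nn.Module, feat_dim: int, vocab_size: int, llm_hidden_size: int):
        super().__init__()
        self.vit = vit_backbone                                   # returns (B, num_patches, feat_dim)
        self.visual_head = nn.Linear(feat_dim, vocab_size)        # patch feature -> logits over visual words
        self.visual_embedding_table = nn.Embedding(vocab_size, llm_hidden_size)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        patch_features = self.vit(pixel_values)                   # (B, N, feat_dim)
        probs = self.visual_head(patch_features).softmax(dim=-1)  # probabilistic visual tokens, (B, N, vocab_size)
        visual_embeds = probs @ self.visual_embedding_table.weight  # (B, N, llm_hidden_size)
        return visual_embeds

# The LLM then consumes the concatenation of visual and text embeddings, e.g.:
#   inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
#   outputs = llm(inputs_embeds=inputs_embeds, attention_mask=...)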

For training, Ovis2 adopts a four-stage strategy to fully unlock its multimodal understanding. Stage 1 freezes most of the LLM and ViT parameters and trains the visual module to learn the mapping from visual features to embeddings. Stage 2 further strengthens the visual module's feature extraction, improving high-resolution image understanding, multilingual ability, and OCR. Stage 3 aligns the visual embeddings with the LLM's dialogue format using conversational visual caption data. Stage 4 performs multimodal instruction tuning and preference learning to further improve instruction following and output quality across modalities.
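
As a rough illustration only (this is not the team's training code, and the exact set of trainable parameters per stage is an assumption inferred from the description above), a stage-wise freezing schedule could be expressed like this:

# Hypothetical stage -> trainable-module mapping inferred from the four-stage description.
STAGE_TRAINABLE_PREFIXES = {
    1: ["visual_tokenizer.visual_head", "visual_embedding_table"],   # learn visual-feature-to-embedding mapping
    2: ["visual_tokenizer"],                                         # strengthen visual features (high-res, multilingual, OCR)
    3: ["visual_tokenizer", "visual_embedding_table"],               # align visual embeddings with the LLM dialogue format
    4: ["visual_tokenizer", "visual_embedding_table", "llm"],        # multimodal instruction tuning + preference learning
}

def configure_stage(model, stage: int) -> None:
    """Freeze all parameters, then unfreeze only the modules trained in the given stage."""
    prefixes = STAGE_TRAINABLE_PREFIXES[stage]
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in prefixes)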

To improve video understanding, Ovis2 introduces an innovative keyframe selection algorithm that picks the most useful frames based on frame-text relevance, diversity among the selected frames, and their temporal order. Using conditional similarity computation in a high-dimensional space, a determinantal point process (DPP), and a Markov decision process (MDP), the algorithm selects keyframes efficiently within a limited visual context, improving video understanding performance.
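
The post does not include the algorithm itself; below is a minimal sketch of the relevance-plus-diversity part using greedy DPP-style selection (the MDP/sequentiality component is omitted). `frame_feats` and `text_feat` are assumed to be L2-normalized embeddings from any vision-language encoder.

import numpy as np

def select_keyframes(frame_feats: np.ndarray, text_feat: np.ndarray, k: int) -> list:
    """Greedy DPP-style keyframe selection balancing text relevance and frame diversity.
    frame_feats: (num_frames, dim) L2-normalized frame embeddings.
    text_feat:   (dim,) L2-normalized text embedding.
    """
    relevance = frame_feats @ text_feat                # cosine similarity of each frame to the text
    quality = np.exp(relevance)                        # per-frame quality scores
    similarity = frame_feats @ frame_feats.T           # frame-frame similarity (diversity term)
    kernel = quality[:, None] * similarity * quality[None, :]   # DPP kernel: quality x similarity x quality

    selected, remaining = [], list(range(len(frame_feats)))
    for _ in range(min(k, len(remaining))):
        # greedily add the frame that maximizes the determinant of the selected sub-kernel
        best_i, best_det = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            det = np.linalg.det(kernel[np.ix_(idx, idx)])
            if det > best_det:
                best_i, best_det = i, det
        selected.append(best_i)
        remaining.remove(best_i)
    return sorted(selected)                            # return chosen frames in temporal order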

The Ovis2 series stands out on the OpenCompass multimodal leaderboards, with models of different sizes achieving SOTA results across multiple benchmarks. For example, Ovis2-34B ranks second on the multimodal general-capability leaderboard and first on the math-reasoning leaderboard. Ovis2 also leads on the video-understanding leaderboard, further confirming its strength across multimodal tasks.

The Alibaba international team states that open source is a key driver of progress in AI. By sharing the Ovis2 research openly, the team hopes to explore the frontier of multimodal large models together with developers worldwide and to inspire more innovative applications. The Ovis2 code is available on GitHub, the models can be downloaded from Hugging Face and ModelScope, an online demo is provided, and the accompanying paper is published on arXiv for developers and researchers to consult.

We evaluate Ovis2 with VLMEvalKit, the toolkit used by the OpenCompass multimodal and reasoning leaderboards.
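
For reference, a typical VLMEvalKit invocation is sketched below in the same style as the install commands later in this post; the registered model name (`Ovis2-8B`) and the dataset choices are assumptions, so check VLMEvalKit's model and dataset configs for the exact identifiers.

git clone https://github.com/open-compass/VLMEvalKit.git && cd VLMEvalKit
pip install -e .
python run.py --data MMBench_DEV_EN MMStar --model Ovis2-8B --verbose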


Image Benchmarks

| Benchmark | Qwen2.5-VL-3B | SAIL-VL-2B | InternVL2.5-2B-MPO | Ovis1.6-3B | InternVL2.5-1B-MPO | Ovis2-1B | Ovis2-2B |
|---|---|---|---|---|---|---|---|
| MMBench-V1.1 (test) | 77.1 | 73.6 | 70.7 | 74.1 | 65.8 | 68.4 | 76.9 |
| MMStar | 56.5 | 56.5 | 54.9 | 52.0 | 49.5 | 52.1 | 56.7 |
| MMMU (val) | 51.4 | 44.1 | 44.6 | 46.7 | 40.3 | 36.1 | 45.6 |
| MathVista (testmini) | 60.1 | 62.8 | 53.4 | 58.9 | 47.7 | 59.4 | 64.1 |
| HallusionBench | 48.7 | 45.9 | 40.7 | 43.8 | 34.8 | 45.2 | 50.2 |
| AI2D | 81.4 | 77.4 | 75.1 | 77.8 | 68.5 | 76.4 | 82.7 |
| OCRBench | 83.1 | 83.1 | 83.8 | 80.1 | 84.3 | 89.0 | 87.3 |
| MMVet | 63.2 | 44.2 | 64.2 | 57.6 | 47.2 | 50.0 | 58.3 |
| MMBench (test) | 78.6 | 77 | 72.8 | 76.6 | 67.9 | 70.2 | 78.9 |
| MMT-Bench (val) | 60.8 | 57.1 | 54.4 | 59.2 | 50.8 | 55.5 | 61.7 |
| RealWorldQA | 66.5 | 62 | 61.3 | 66.7 | 57 | 63.9 | 66.0 |
| BLINK | 48.4 | 46.4 | 43.8 | 43.8 | 41 | 44.0 | 47.9 |
| QBench | 74.4 | 72.8 | 69.8 | 75.8 | 63.3 | 71.3 | 76.2 |
| ABench | 75.5 | 74.5 | 71.1 | 75.2 | 67.5 | 71.3 | 76.6 |
| MTVQA | 24.9 | 20.2 | 22.6 | 21.1 | 21.7 | 23.7 | 25.6 |

| Benchmark | Qwen2.5-VL-7B | InternVL2.5-8B-MPO | MiniCPM-o-2.6 | Ovis1.6-9B | InternVL2.5-4B-MPO | Ovis2-4B | Ovis2-8B |
|---|---|---|---|---|---|---|---|
| MMBench-V1.1 (test) | 82.6 | 82.0 | 80.6 | 80.5 | 77.8 | 81.4 | 83.6 |
| MMStar | 64.1 | 65.2 | 63.3 | 62.9 | 61 | 61.9 | 64.6 |
| MMMU (val) | 56.2 | 54.8 | 50.9 | 55 | 51.8 | 49.0 | 57.4 |
| MathVista (testmini) | 65.8 | 67.9 | 73.3 | 67.3 | 64.1 | 69.6 | 71.8 |
| HallusionBench | 56.3 | 51.7 | 51.1 | 52.2 | 47.5 | 53.8 | 56.3 |
| AI2D | 84.1 | 84.5 | 86.1 | 84.4 | 81.5 | 85.7 | 86.6 |
| OCRBench | 87.7 | 88.2 | 88.9 | 83 | 87.9 | 91.1 | 89.1 |
| MMVet | 66.6 | 68.1 | 67.2 | 65 | 66 | 65.5 | 65.1 |
| MMBench (test) | 83.4 | 83.2 | 83.2 | 82.7 | 79.6 | 83.2 | 84.9 |
| MMT-Bench (val) | 62.7 | 62.5 | 62.3 | 64.9 | 61.6 | 65.2 | 66.6 |
| RealWorldQA | 68.8 | 71.1 | 68.0 | 70.7 | 64.4 | 71.1 | 72.5 |
| BLINK | 56.1 | 56.6 | 53.9 | 48.5 | 50.6 | 53.0 | 54.3 |
| QBench | 77.9 | 73.8 | 78.7 | 76.7 | 71.5 | 78.1 | 78.9 |
| ABench | 75.6 | 77.0 | 77.5 | 74.4 | 75.9 | 77.5 | 76.4 |
| MTVQA | 28.5 | 27.2 | 23.1 | 19.2 | 28 | 29.4 | 29.7 |

| Benchmark | Qwen2.5-VL-72B | InternVL2.5-38B-MPO | InternVL2.5-26B-MPO | Ovis1.6-27B | LLaVA-OV-72B | Ovis2-16B | Ovis2-34B |
|---|---|---|---|---|---|---|---|
| MMBench-V1.1 (test) | 87.8 | 85.4 | 84.2 | 82.2 | 84.4 | 85.6 | 86.6 |
| MMStar | 71.1 | 70.1 | 67.7 | 63.5 | 65.8 | 67.2 | 69.2 |
| MMMU (val) | 67.9 | 63.8 | 56.4 | 60.3 | 56.6 | 60.7 | 66.7 |
| MathVista (testmini) | 70.8 | 73.6 | 71.5 | 70.2 | 68.4 | 73.7 | 76.1 |
| HallusionBench | 58.8 | 59.7 | 52.4 | 54.1 | 47.9 | 56.8 | 58.8 |
| AI2D | 88.2 | 87.9 | 86.2 | 86.6 | 86.2 | 86.3 | 88.3 |
| OCRBench | 88.1 | 89.4 | 90.5 | 85.6 | 74.1 | 87.9 | 89.4 |
| MMVet | 76.7 | 72.6 | 68.1 | 68 | 60.6 | 68.4 | 77.1 |
| MMBench (test) | 88.2 | 86.4 | 85.4 | 84.6 | 85.6 | 87.1 | 87.8 |
| MMT-Bench (val) | 69.1 | 69.1 | 65.7 | 68.2 | - | 69.2 | 71.2 |
| RealWorldQA | 75.9 | 74.4 | 73.7 | 72.7 | 73.9 | 74.1 | 75.6 |
| BLINK | 62.3 | 63.2 | 62.6 | 48 | - | 59.0 | 60.1 |
| QBench | - | 76.1 | 76.0 | 77.7 | - | 79.5 | 79.8 |
| ABench | - | 78.6 | 79.4 | 76.5 | - | 79.4 | 78.7 |
| MTVQA | - | 31.2 | 28.7 | 26.5 | - | 30.3 | 30.6 |

Video Benchmarks

| Benchmark | Qwen2.5-VL-3B | InternVL2.5-2B | InternVL2.5-1B | Ovis2-1B | Ovis2-2B |
|---|---|---|---|---|---|
| VideoMME (wo/w subs) | 61.5/67.6 | 51.9/54.1 | 50.3/52.3 | 48.6/49.5 | 57.2/60.8 |
| MVBench | 67.0 | 68.8 | 64.3 | 60.32 | 64.9 |
| MLVU (M-Avg/G-Avg) | 68.2/- | 61.4/- | 57.3/- | 58.5/3.66 | 68.6/3.86 |
| MMBench-Video | 1.63 | 1.44 | 1.36 | 1.26 | 1.57 |
| TempCompass | 64.4 | - | - | 51.43 | 62.64 |

| Benchmark | Qwen2.5-VL-7B | InternVL2.5-8B | LLaVA-OV-7B | InternVL2.5-4B | Ovis2-4B | Ovis2-8B |
|---|---|---|---|---|---|---|
| VideoMME (wo/w subs) | 65.1/71.6 | 64.2/66.9 | 58.2/61.5 | 62.3/63.6 | 64.0/66.3 | 68.0/71.6 |
| MVBench | 69.6 | 72.0 | 56.7 | 71.6 | 68.45 | 68.15 |
| MLVU (M-Avg/G-Avg) | 70.2/- | 68.9/- | 64.7/- | 68.3/- | 70.8/4.23 | 76.4/4.25 |
| MMBench-Video | 1.79 | 1.68 | - | 1.73 | 1.69 | 1.85 |
| TempCompass | 71.7 | - | - | - | 67.02 | 69.28 |

| Benchmark | Qwen2.5-VL-72B | InternVL2.5-38B | InternVL2.5-26B | LLaVA-OneVision-72B | Ovis2-16B | Ovis2-34B |
|---|---|---|---|---|---|---|
| VideoMME (wo/w subs) | 73.3/79.1 | 70.7/73.1 | 66.9/69.2 | 66.2/69.5 | 70.0/74.4 | 71.2/75.6 |
| MVBench | 70.4 | 74.4 | 75.2 | 59.4 | 68.6 | 70.3 |
| MLVU (M-Avg/G-Avg) | 74.6/- | 75.3/- | 72.3/- | 68.0/- | 77.7/4.44 | 77.8/4.59 |
| MMBench-Video | 2.02 | 1.82 | 1.86 | - | 1.92 | 1.98 |
| TempCompass | 74.8 | - | - | - | 74.16 | 75.97 |

Usage

pip install torch==2.4.0 transformers==4.46.2 numpy==1.25.0 pillow==10.3.0
pip install flash-attn==2.7.0.post2 --no-build-isolation

Ovis2-1B

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()

text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# single-image input
image_path = '/data/images/example_1.jpg'
images = [Image.open(image_path)]
max_partition = 9
text = 'Describe the image.'
query = f'<image>\n{text}'

## cot-style input
# cot_suffix = "Provide a step-by-step solution to the problem, and conclude with 'the answer is' followed by the final solution."
# image_path = '/data/images/example_1.jpg'
# images = [Image.open(image_path)]
# max_partition = 9
# text = "What's the area of the shape?"
# query = f'<image>\n{text}\n{cot_suffix}'

## multiple-images input
# image_paths = [
#     '/data/images/example_1.jpg',
#     '/data/images/example_2.jpg',
#     '/data/images/example_3.jpg'
# ]
# images = [Image.open(image_path) for image_path in image_paths]
# max_partition = 4
# text = 'Describe each image.'
# query = '\n'.join([f'Image {i+1}: <image>' for i in range(len(images))]) + '\n' + text

## video input (requires `pip install moviepy==1.0.3`)
# from moviepy.editor import VideoFileClip
# video_path = '/data/videos/example_1.mp4'
# num_frames = 12
# max_partition = 1
# text = 'Describe the video.'
# with VideoFileClip(video_path) as clip:
#     total_frames = int(clip.fps * clip.duration)
#     if total_frames <= num_frames:
#         sampled_indices = range(total_frames)
#     else:
#         stride = total_frames / num_frames
#         sampled_indices = [min(total_frames - 1, int((stride * i + stride * (i + 1)) / 2)) for i in range(num_frames)]
#     frames = [clip.get_frame(index / clip.fps) for index in sampled_indices]
#     frames = [Image.fromarray(frame, mode='RGB') for frame in frames]
# images = frames
# query = '\n'.join(['<image>'] * len(images)) + '\n' + text

## text-only input
# images = []
# max_partition = None
# text = 'Hello'
# query = text

# format conversation
prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
if pixel_values is not None:
    pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
pixel_values = [pixel_values]

# generate output
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
    output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
    print(f'Output:\n{output}')

Batch Inference

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# preprocess inputs
batch_inputs = [
    ('/data/images/example_1.jpg', 'What colors dominate the image?'),
    ('/data/images/example_2.jpg', 'What objects are depicted in this image?'),
    ('/data/images/example_3.jpg', 'Is there any text in the image?')
]

batch_input_ids = []
batch_attention_mask = []
batch_pixel_values = []

for image_path, text in batch_inputs:
    image = Image.open(image_path)
    query = f'<image>\n{text}'
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image], max_partition=9)
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    batch_input_ids.append(input_ids.to(device=model.device))
    batch_attention_mask.append(attention_mask.to(device=model.device))
    batch_pixel_values.append(pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))

batch_input_ids = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_input_ids], batch_first=True,
                                                  padding_value=0.0).flip(dims=[1])
batch_input_ids = batch_input_ids[:, -model.config.multimodal_max_length:]
batch_attention_mask = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_attention_mask],
                                                       batch_first=True, padding_value=False).flip(dims=[1])
batch_attention_mask = batch_attention_mask[:, -model.config.multimodal_max_length:]

# generate outputs
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(batch_input_ids, pixel_values=batch_pixel_values, attention_mask=batch_attention_mask,
                                **gen_kwargs)

for i in range(len(batch_inputs)):
    output = text_tokenizer.decode(output_ids[i], skip_special_tokens=True)
    print(f'Output {i + 1}:\n{output}\n')

References

Code: https://github.com/AIDC-AI/Ovis

Model (Hugging Face): https://huggingface.co/AIDC-AI/Ovis2-34B

Model (ModelScope): https://modelscope.cn/collections/Ovis2-1e2840cb4f7d45

Demo:https://huggingface.co/spaces/AIDC-AI/Ovis2-16B

arXiv: https://arxiv.org/abs/2405.20797
