Moonshot AI Open-Sources Kimi-VL-A3B-Thinking: A Multimodal Reasoning Model with 2.8B Activated Parameters

Open-source-AI

Article link: https://blog.csdn.net/weixin_52582710/article/details/147170676

Introduction to the Kimi-VL-A3B-Thinking Model

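1. Model Overview

Kimi-VL is an efficient open-source vision-language model (VLM). Its language decoder (Kimi-VL-A3B) activates only 2.8B parameters per token, yet the model performs well on multimodal reasoning, long-context understanding, and agent tasks.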

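1.1 Key Features

- Multimodal reasoning: handles complex vision and language tasks such as college-level image and video understanding, optical character recognition (OCR), and mathematical reasoning.
- Long-context processing: supports an extended 128K context window, enough for long videos and long documents.
- Efficiency: keeps compute cost low while maintaining strong performance.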

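1.2 Use Cases

- Multi-turn agent interaction: tasks such as OSWorld.
- Vision-language tasks: image and video understanding, OCR, mathematical reasoning, and similar workloads.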

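2. Model Architecture

The Kimi-VL architecture consists of three main components:
1. Mixture-of-Experts (MoE) language model: improves efficiency through sparse activation.
2. Native-resolution vision encoder (MoonViT): processes ultra-high-resolution visual input.
3. MLP projector: maps visual features into the same space as the language features.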

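2.1 MoE Language Model

- Sparse activation: only 2.8B parameters are activated per token, which significantly reduces compute cost.
- Efficient inference: performs well on multimodal tasks despite the small activated parameter count.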
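To make sparse activation concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not Kimi-VL's actual router: the expert count, the value of k, and the toy scalar "experts" are all assumptions chosen for illustration. The point is only that each token runs through k experts while the remaining experts are never evaluated.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_logits, k):
    """Weighted sum of the k selected experts; unselected experts are not called."""
    out = 0.0
    for idx, weight in top_k_route(router_logits, k):
        out += weight * experts[idx](x)
    return out


# Hypothetical toy setup (not Kimi-VL's real configuration):
# 8 scalar "experts", 2 of them active per token.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, s=s: s * x for s in range(1, NUM_EXPERTS + 1)]
rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

routes = top_k_route(router_logits, TOP_K)
y = moe_layer(1.0, experts, router_logits, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS  # share of experts evaluated per token
print(routes, y, active_fraction)
```

In this toy setup a quarter of the experts run per token. The same idea, at a much larger scale and with learned routers, is what lets a model keep a large total parameter count while paying compute only for the activated subset.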

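2.2 MoonViT Vision Encoder

- Native-resolution support: processes ultra-high-resolution images and video without resizing them to a fixed input size.
- Low compute cost: stays efficient on ordinary vision tasks.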
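One way to see why native-resolution encoding keeps ordinary inputs cheap is to count patches. The sketch below assumes a patch size of 14 pixels, a common ViT choice that the article does not confirm for MoonViT, and the helper names are hypothetical. It shows that the number of visual tokens scales with the image's own size instead of being fixed.

```python
import math


def patch_grid(width, height, patch_size=14):
    """Rows and columns of patches when an image is split at its own resolution."""
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    return rows, cols


def num_patches(width, height, patch_size=14):
    """Total patch count, i.e. the visual sequence length before any merging."""
    rows, cols = patch_grid(width, height, patch_size)
    return rows * cols


# patch_size=14 is an assumption for illustration, not a documented MoonViT value.
small = num_patches(448, 448)     # 32 x 32 grid
large = num_patches(1792, 1344)   # 96 x 128 grid
print(small, large)  # 1024 12288
```

A small image yields a short visual sequence and a large one yields a proportionally longer sequence, so the cost is paid only when the input actually carries that much detail.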

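3. Model Performance

Kimi-VL performs well across multiple benchmarks. It is competitive with existing efficient multimodal models such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, and it surpasses GPT-4o in some areas.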

3.1 Key Performance Metrics

Long video and long document processing:
- LongVideoBench: 64.5
- MMLongBench-Doc: 35.1
Visual understanding:
- InfoVQA: 83.2
- ScreenSpot-Pro: 34.5
Multimodal and mathematical reasoning:
- MMMU: 61.7
- MathVision: 36.8
- MathVista: 71.3

3.2 Performance Comparison

Benchmark	GPT-4o	GPT-4o-mini	Qwen2.5-VL-72B	Qwen2.5-VL-7B	Gemma-3-27B	Kimi-VL-Thinking
MathVision (Pass@1)	30.4	-	38.1	25.1	35.5	36.8
MathVista-mini (Pass@1)	63.8	56.7	74.8	68.2	62.3	71.3
MMMU (val) (Pass@1)	69.1	60.0	74.8	58.6	64.8	61.7
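The claim that Kimi-VL surpasses GPT-4o "in some areas" can be checked directly from the GPT-4o and Kimi-VL-Thinking columns. The short sketch below copies only those two columns from the table and computes the margin per benchmark; the variable names are illustrative.

```python
# Scores copied from the GPT-4o and Kimi-VL-Thinking columns of the table.
scores = {
    "MathVision": {"GPT-4o": 30.4, "Kimi-VL-Thinking": 36.8},
    "MathVista-mini": {"GPT-4o": 63.8, "Kimi-VL-Thinking": 71.3},
    "MMMU (val)": {"GPT-4o": 69.1, "Kimi-VL-Thinking": 61.7},
}

# Positive margin means Kimi-VL-Thinking scores higher than GPT-4o.
margins = {
    bench: round(row["Kimi-VL-Thinking"] - row["GPT-4o"], 1)
    for bench, row in scores.items()
}
wins = [bench for bench, margin in margins.items() if margin > 0]

print(margins)  # {'MathVision': 6.4, 'MathVista-mini': 7.5, 'MMMU (val)': -7.4}
print(wins)     # ['MathVision', 'MathVista-mini']
```

So the advantage over GPT-4o holds on the two math-focused benchmarks, while GPT-4o remains ahead on MMMU.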

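4. Model Variants

Kimi-VL is released in two main variants that target different use cases:

- Kimi-VL-A3B-Instruct: for general multimodal perception and understanding tasks.
- Kimi-VL-A3B-Thinking: for advanced text and multimodal reasoning tasks such as mathematical reasoning.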

4.1 Model Parameters

Model	Total parameters	Activated parameters	Context length	Download link
Kimi-VL-A3B-Instruct	16B	3B	128K	https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct
Kimi-VL-A3B-Thinking	16B	3B	128K	https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking
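The two parameter columns answer different questions: activated parameters drive per-token compute, while total parameters drive how much memory the weights need, because every expert must be loaded even if only some run. The following back-of-the-envelope sketch uses the 16B total from the table and the 2.8B activated figure quoted earlier in the article (rounded to 3B in the table). It assumes 2 bytes per parameter (bf16/fp16) and ignores the KV cache, activations, and framework overhead, so treat the result as a rough lower bound.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 16e9       # from the table above
ACTIVATED_PARAMS = 2.8e9  # figure quoted in the overview; the table rounds it to 3B

# Assumption: 16-bit weights (2 bytes each); other overheads are ignored.
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2)
activated_share = ACTIVATED_PARAMS / TOTAL_PARAMS

print(f"weights in 16-bit: ~{bf16_gb:.0f} GB")         # ~32 GB
print(f"activated per token: {activated_share:.1%}")   # 17.5%
```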

4.2 Recommended Settings

- Kimi-VL-A3B-Thinking: recommended Temperature = 0.6.
- Kimi-VL-A3B-Instruct: recommended Temperature = 0.2.
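As a sketch of how these values might be passed to Hugging Face Transformers, the helper below builds keyword arguments for `model.generate`. The helper name and the lookup table are hypothetical. Note that in Transformers, `temperature` only takes effect when `do_sample=True`; with the default greedy decoding it is ignored.

```python
# Recommended temperatures from the list above, keyed by model repository name.
RECOMMENDED_TEMPERATURE = {
    "moonshotai/Kimi-VL-A3B-Thinking": 0.6,
    "moonshotai/Kimi-VL-A3B-Instruct": 0.2,
}


def generation_kwargs(model_path, max_new_tokens=2048):
    """Build sampling arguments for model.generate (hypothetical helper)."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,  # required for temperature to have any effect
        "temperature": RECOMMENDED_TEMPERATURE[model_path],
    }


kwargs = generation_kwargs("moonshotai/Kimi-VL-A3B-Thinking")
print(kwargs)

# With a loaded model and prepared inputs, this would be used as:
# generated_ids = model.generate(**inputs, **kwargs)
```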

5. Using the Model

5.1 Inference with Hugging Face Transformers

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Thinking"

# Load the model and processor
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare the input images and messages
image_paths = ["./figures/demo1.png", "./figures/demo2.png"]
images = [Image.open(path) for path in image_paths]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path} for image_path in image_paths
        ] + [{"type": "text", "text": "Please infer step by step who this manuscript belongs to and what it records."}]
    }
]

# Process the inputs and generate a response
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
# Strip the prompt tokens so that only newly generated tokens are decoded
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(response)
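The trimming step in the script above is easy to get wrong, so here it is in isolation on plain Python lists. `generate` returns the prompt tokens followed by the new tokens for each sequence, and slicing by the prompt length keeps only the new part. The token IDs below are made up for illustration, and `trim_prompt` is a hypothetical helper name.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt from each generated sequence."""
    return [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


# Toy token IDs (not real tokenizer output): a 4-token prompt, 3 new tokens.
input_ids = [[101, 7, 8, 9]]
generated_ids = [[101, 7, 8, 9, 42, 43, 102]]

trimmed = trim_prompt(input_ids, generated_ids)
print(trimmed)  # [[42, 43, 102]]
```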

5.2 Inference with vLLM

A pull request (#16387) adding Kimi-VL support has been submitted to vLLM, and the model can be deployed from the corresponding branch.
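Once a vLLM build with Kimi-VL support is running, one plausible way to query it is through vLLM's OpenAI-compatible chat endpoint. The sketch below only builds the request body and does not send it. It assumes the server was started from a branch that includes the Kimi-VL support, that it listens on a hypothetical `http://localhost:8000`, and that images are passed as base64 data URLs. The helper names are illustrative, and the placeholder bytes are not a real image.

```python
import base64
import json


def image_to_data_url(image_bytes, mime="image/png"):
    """Encode raw image bytes as a data URL for an image_url content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_chat_request(model, image_bytes, prompt, temperature=0.6, max_tokens=2048):
    """Build an OpenAI-style chat completion body with one image and one text part."""
    return {
        "model": model,
        "temperature": temperature,  # 0.6 is the value recommended for the Thinking variant
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": image_to_data_url(image_bytes)}},
                {"type": "text", "text": prompt},
            ],
        }],
    }


# Placeholder bytes stand in for the contents of a real image file.
payload = build_chat_request(
    "moonshotai/Kimi-VL-A3B-Thinking",
    b"placeholder image bytes",
    "Describe this image step by step.",
)
body = json.dumps(payload)
print(body[:120])

# Assumed endpoint, to be adjusted for the actual deployment:
# POST body to http://localhost:8000/v1/chat/completions
```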

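6. Summary

Kimi-VL is an efficient and capable multimodal model suited to a wide range of complex vision and language tasks. Its results on long-context processing, visual understanding, and mathematical reasoning are especially strong, and they set a high bar for efficient multimodal models. Users can choose the variant that fits their needs (Instruct or Thinking) and run inference through Hugging Face Transformers or vLLM.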