The scarcity of multimodal reward models has become a major bottleneck holding back progress in multimodal reinforcement learning. We open-source Skywork-VL-Reward, a 7B multimodal reward model, to inject new momentum into the field and open a new chapter for multimodal reinforcement learning.

Skywork-VL-Reward is built on the Qwen2.5-VL-7B-Instruct architecture with an additional value head for reward-model training. It achieves a state-of-the-art score of 73.1 on VL-RewardBench and a strong 90.1 on RewardBench. In addition, the MPO training we performed for Skywork-R1V-2.0 further validates the model's effectiveness. We hope this multimodal reward model contributes to the open-source community! For more details, please refer to our technical report.
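Conceptually, the value head is a small scalar-output layer attached to the backbone's token-level hidden states; the released weights ship separately in value_head.safetensors and are loaded through trl's AutoModelForCausalLMWithValueHead (see the usage section below). The sketch here is only an illustration of that structure, not the exact released implementation.

```python
# Illustrative sketch of a value head, not the released implementation.
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Maps each hidden state of the backbone VLM to a scalar value."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.summary = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> values: (batch, seq_len)
        # The value at the last response token is used as the sequence-level reward.
        return self.summary(hidden_states).squeeze(-1)
```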
Technical Report
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
Evaluation
VL-RewardBench
| Model Name | Model Size | General | Hallucination | Reasoning | Overall Accuracy | Macro Average |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| Claude-3.5-Sonnet (2024-06-22) | - | 43.4 | 55.0 | 62.3 | 55.3 | 53.6 |
| Gemini-1.5-Flash (2024-09-24) | - | 47.8 | 59.6 | 58.4 | 57.6 | 55.3 |
| GPT-4o (2024-08-06) | - | 49.1 | 67.6 | 70.5 | 65.8 | 62.4 |
| Gemini-1.5-Pro (2024-09-24) | - | 50.8 | 72.5 | 64.2 | 67.2 | 62.5 |
| Gemini-2.0-flash-exp (2024-12) | - | 50.8 | 72.6 | 70.1 | 68.8 | 64.5 |
| **Open-Source Models** | | | | | | |
| Qwen2-VL-7B-Instruct | 7B | 31.6 | 19.1 | 51.1 | 28.3 | 33.9 |
| MAmmoTH-VL-8B | 8B | 36.0 | 40.0 | 52.0 | 42.2 | 42.7 |
| Qwen2.5-VL-7B-Instruct | 7B | 43.4 | 42.0 | 63.0 | 48.0 | 49.5 |
| InternVL3-8B | 8B | 60.6 | 44.0 | 62.3 | 57.0 | 55.6 |
| IXC-2.5-Reward-7B | 7B | 80.3 | 65.3 | 60.4 | 66.3 | 68.6 |
| Qwen2-VL-72B-Instruct | 72B | 38.1 | 32.8 | 58.0 | 39.5 | 43.0 |
| Molmo-72B-0924 | 72B | 33.9 | 42.3 | 54.9 | 44.1 | 43.7 |
| QVQ-72B-Preview | 72B | 41.8 | 46.2 | 51.2 | 46.4 | 46.4 |
| Qwen2.5-VL-72B-Instruct | 72B | 47.8 | 46.8 | 63.5 | 51.6 | 52.7 |
| InternVL3-78B | 78B | 67.8 | 52.5 | 64.5 | 63.3 | 61.6 |
| Skywork-VL Reward (Ours) | 7B | 66.0 | 80.0 | 61.0 | 73.1 | 69.0 |
RewardBench
| Model Name | Chat | Chat Hard | Safety | Reasoning | Score |
| --- | --- | --- | --- | --- | --- |
| **Language-Only Reward Models** | | | | | |
| InternLM2-7B-Reward | 99.2 | 69.5 | 87.2 | 94.5 | 87.6 |
| Skywork-Reward-Llama3.1-8B | 95.8 | 87.3 | 90.8 | 96.2 | 92.5 |
| Skywork-Reward-Llama-3.1-8B-v0.2 | 94.7 | 88.4 | 92.7 | 96.7 | 93.1 |
| QRM-Llama3.1-8B-v2 | 96.4 | 86.8 | 92.6 | 96.8 | 93.1 |
| **Multi-Modal Reward Models** | | | | | |
| Qwen2-VL-7B-Instruct | 65.1 | 50.9 | 55.8 | 68.3 | 60.0 |
| InternVL3-8B | 97.2 | 50.4 | 83.6 | 83.9 | 78.8 |
| Qwen2.5-VL-7B-Instruct | 94.3 | 63.8 | 84.1 | 86.2 | 82.1 |
| IXC-2.5-Reward-7B | 90.8 | 83.8 | 87.8 | 90.0 | 88.1 |
| Skywork-VL Reward (Ours) | 90.0 | 87.5 | 91.1 | 91.8 | 90.1 |
Usage
Set Up the Environment
```bash
conda create -n vl-reward python=3.11
conda activate vl-reward
bash setup.sh
```
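If you prefer to install the dependencies by hand, the inference script below needs roughly the following packages. This list is inferred from the script's imports, not from the contents of setup.sh, which may pin specific versions or add extras:

```bash
# Inferred from the imports of the inference script below; not the contents of setup.sh.
pip install torch transformers accelerate trl qwen-vl-utils safetensors
```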
Run the Inference Code
```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from trl import AutoModelForCausalLMWithValueHead
from qwen_vl_utils import process_vision_info
from transformers.utils import cached_file
from safetensors import safe_open
processor = AutoProcessor.from_pretrained("Skywork/Skywork-VL-Reward-7B")
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Skywork/Skywork-VL-Reward-7B", min_pixels=min_pixels, max_pixels=max_pixels)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Skywork/Skywork-VL-Reward-7B",
device_map="auto",
torch_dtype=torch.bfloat16,
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving
# pip install flash-attn --no-build-isolation
#
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
# "Skywork/Skywork-VL-Reward-7B",
# device_map="auto",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# )
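# Wrap the backbone with a value head (trl) and load the value-head weights,
# which are shipped separately in value_head.safetensors.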
model = AutoModelForCausalLMWithValueHead.from_pretrained(model)
vhead_file = cached_file(
path_or_repo_id="Skywork/Skywork-VL-Reward-7B", filename="value_head.safetensors"
)
with safe_open(vhead_file, framework="pt", device="cpu") as f:
vhead_params = {key: f.get_tensor(key) for key in f.keys()}
model.load_state_dict(vhead_params, strict=False)
model.requires_grad_(False)
model.eval()
# score: 23.89
# if you use flash_attention_2, the score will be 23.76
demo_image = "demo.jpg"
demo_question = "Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.\nQuestion: Is Purple the highest value?\nChoices:\n(A) no\n(B) yes"
demo_answer = "The answer is: B"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": demo_image,
},
{
"type": "text",
"text": demo_question,
},
],
},
{
"role": "assistant",
"content": demo_answer,
},
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=False
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
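# The value head returns one scalar per token; take the value at the last
# non-padded position (the final token of the assistant answer) as the reward.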
values = model(**inputs, return_dict=True, use_cache=False)[-1]
scores = values.gather(
dim=-1, index=(inputs["attention_mask"].sum(dim=-1, keepdim=True) - 1)
)
score = scores[0].item()
print("Reward Score is: ", score)
```

```
Reward Score is: 24.22561264038086
```
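As a further usage sketch (not part of the official example), the same scoring path can be wrapped in a helper to compare candidate answers pairwise, which is the typical way a reward model is used for preference ranking or rejection sampling. The code reuses `model`, `processor`, and `process_vision_info` from the snippet above; the image path and candidate answers are illustrative placeholders.

```python
# A sketch that reuses `model`, `processor`, and `process_vision_info` from above.
# The image path and candidate answers are illustrative placeholders.
def score_response(image_path: str, question: str, answer: str) -> float:
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        },
        {"role": "assistant", "content": answer},
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to("cuda")
    with torch.no_grad():
        values = model(**inputs, return_dict=True, use_cache=False)[-1]
    # Reward = value at the last non-padded token of the assistant answer.
    last_token = inputs["attention_mask"].sum(dim=-1, keepdim=True) - 1
    return values.gather(dim=-1, index=last_token)[0].item()

# Rank two candidate answers to the same question; the correct one should score higher.
question = (
    "Hint: Please answer the question and provide the correct option letter, "
    "e.g., A, B, C, D, at the end.\nQuestion: Is Purple the highest value?\n"
    "Choices:\n(A) no\n(B) yes"
)
chosen = score_response("demo.jpg", question, "The answer is: B")
rejected = score_response("demo.jpg", question, "The answer is: A")
print(f"chosen={chosen:.2f} rejected={rejected:.2f} preferred_is_higher={chosen > rejected}")
```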