Exploring Skywork-VL-Reward-7B: A New Frontier in AI Reasoning

The lack of multimodal reward models has become a major bottleneck constraining the development of multimodal reinforcement learning. We have open-sourced Skywork-VL-Reward, a 7B multimodal reward model, injecting new momentum into the field and opening a new chapter for multimodal reinforcement learning.

Skywork-VL-Reward is built on the Qwen2.5-VL-7B-Instruct architecture with an additional value-head structure for reward-model training. It achieves a state-of-the-art 73.1 on VL-RewardBench and a strong 90.1 on RewardBench. In addition, the MPO training we ran on Skywork-R1V-2.0 further validates the model's effectiveness. We hope this multimodal reward model contributes to the open-source community! For more details, please refer to our technical report.
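The technical report describes the exact training recipe. As a rough illustration only: the value head maps the model's final hidden state to a scalar reward, and such reward models are commonly trained on preference pairs with a Bradley-Terry style objective. A minimal sketch follows (the function name, shapes, and dummy values are assumptions, not taken from the report):

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: push the scalar reward of the preferred (chosen)
    # response above that of the rejected response for each preference pair.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Dummy rewards for a batch of two preference pairs.
loss = pairwise_reward_loss(torch.tensor([1.2, 0.7]), torch.tensor([0.3, 0.9]))
print(loss.item())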


Technical Report

Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

Evaluation

VL-RewardBench

| Model Name | Model Size | General | Hallucination | Reasoning | Overall Accuracy | Macro Average |
|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | |
| Claude-3.5-Sonnet (2024-06-22) | - | 43.4 | 55.0 | 62.3 | 55.3 | 53.6 |
| Gemini-1.5-Flash (2024-09-24) | - | 47.8 | 59.6 | 58.4 | 57.6 | 55.3 |
| GPT-4o (2024-08-06) | - | 49.1 | 67.6 | 70.5 | 65.8 | 62.4 |
| Gemini-1.5-Pro (2024-09-24) | - | 50.8 | 72.5 | 64.2 | 67.2 | 62.5 |
| Gemini-2.0-flash-exp (2024-12) | - | 50.8 | 72.6 | 70.1 | 68.8 | 64.5 |
| Open-Source Models | | | | | | |
| Qwen2-VL-7B-Instruct | 7B | 31.6 | 19.1 | 51.1 | 28.3 | 33.9 |
| MAmmoTH-VL-8B | 8B | 36.0 | 40.0 | 52.0 | 42.2 | 42.7 |
| Qwen2.5-VL-7B-Instruct | 7B | 43.4 | 42.0 | 63.0 | 48.0 | 49.5 |
| InternVL3-8B | 8B | 60.6 | 44.0 | 62.3 | 57.0 | 55.6 |
| IXC-2.5-Reward-7B | 7B | 80.3 | 65.3 | 60.4 | 66.3 | 68.6 |
| Qwen2-VL-72B-Instruct | 72B | 38.1 | 32.8 | 58.0 | 39.5 | 43.0 |
| Molmo-72B-0924 | 72B | 33.9 | 42.3 | 54.9 | 44.1 | 43.7 |
| QVQ-72B-Preview | 72B | 41.8 | 46.2 | 51.2 | 46.4 | 46.4 |
| Qwen2.5-VL-72B-Instruct | 72B | 47.8 | 46.8 | 63.5 | 51.6 | 52.7 |
| InternVL3-78B | 78B | 67.8 | 52.5 | 64.5 | 63.3 | 61.6 |
| Skywork-VL Reward (Ours) | 7B | 66.0 | 80.0 | 61.0 | 73.1 | 69.0 |

RewardBench

| Model Name | Chat | Chat Hard | Safety | Reasoning | Score |
|---|---|---|---|---|---|
| Language-Only Reward Models | | | | | |
| InternLM2-7B-Reward | 99.2 | 69.5 | 87.2 | 94.5 | 87.6 |
| Skywork-Reward-Llama3.1-8B | 95.8 | 87.3 | 90.8 | 96.2 | 92.5 |
| Skywork-Reward-Llama-3.1-8B-v0.2 | 94.7 | 88.4 | 92.7 | 96.7 | 93.1 |
| QRM-Llama3.1-8B-v2 | 96.4 | 86.8 | 92.6 | 96.8 | 93.1 |
| Multi-Modal Reward Models | | | | | |
| Qwen2-VL-7B-Instruct | 65.1 | 50.9 | 55.8 | 68.3 | 60.0 |
| InternVL3-8B | 97.2 | 50.4 | 83.6 | 83.9 | 78.8 |
| Qwen2.5-VL-7B-Instruct | 94.3 | 63.8 | 84.1 | 86.2 | 82.1 |
| IXC-2.5-Reward-7B | 90.8 | 83.8 | 87.8 | 90.0 | 88.1 |
| Skywork-VL Reward (Ours) | 90.0 | 87.5 | 91.1 | 91.8 | 90.1 |


Usage

Set Up the Environment

conda create -n vl-reward python=3.11
conda activate vl-reward
bash setup.sh
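
setup.sh is expected to install the dependencies imported by the inference script below (an assumption about its contents, which are not shown here). A quick Python sanity check can confirm the environment is ready:

# Sanity check: confirm the packages used by the inference script import cleanly
# and that a CUDA device is visible.
import torch
import transformers
import trl
import qwen_vl_utils  # vision-input helpers used by the Qwen2.5-VL examples

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| trl:", trl.__version__)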

Run the Inference Code

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from trl import AutoModelForCausalLMWithValueHead
from qwen_vl_utils import process_vision_info
from transformers.utils import cached_file
from safetensors import safe_open


processor = AutoProcessor.from_pretrained("Skywork/Skywork-VL-Reward-7B")
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Skywork/Skywork-VL-Reward-7B", min_pixels=min_pixels, max_pixels=max_pixels)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Skywork/Skywork-VL-Reward-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving
# pip install flash-attn --no-build-isolation
#
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Skywork/Skywork-VL-Reward-7B",
#     device_map="auto",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
# )

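# Wrap the base model with trl's value head so the forward pass also returns
# per-token scalar values, then load the trained value-head weights that ship
# separately in value_head.safetensors.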
model = AutoModelForCausalLMWithValueHead.from_pretrained(model)
vhead_file = cached_file(
    path_or_repo_id="Skywork/Skywork-VL-Reward-7B", filename="value_head.safetensors"
)
with safe_open(vhead_file, framework="pt", device="cpu") as f:
    vhead_params = {key: f.get_tensor(key) for key in f.keys()}
model.load_state_dict(vhead_params, strict=False)
model.requires_grad_(False)
model.eval()

# score: 23.89
# if you use flash_attention_2 the score will be 23.76
demo_image = "demo.jpg"
demo_question = "Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.\nQuestion: Is Purple the highest value?\nChoices:\n(A) no\n(B) yes"
demo_answer = "The answer is: B"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": demo_image,
            },
            {
                "type": "text",
                "text": demo_question,
            },
        ],
    },
    {
        "role": "assistant",
        "content": demo_answer,
    },
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=False
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
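# The value-head model's forward returns (lm_logits, loss, values); take the
# per-token values and read the entry at the last non-padded position
# (attention_mask.sum - 1) as the reward score for this (image, question, answer).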
values = model(**inputs, return_dict=True, use_cache=False)[-1]
scores = values.gather(
    dim=-1, index=(inputs["attention_mask"].sum(dim=-1, keepdim=True) - 1)
)
score = scores[0].item()
print("Reward Score is: ", score)

Reward Score is: 24.22561264038086
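
In practice, a reward model like this is used to compare candidate responses, for example in best-of-N selection or when building preference pairs for MPO. The sketch below reuses the processor, model, and demo inputs from the script above; the score_response helper is a hypothetical wrapper, not part of the released code:

def score_response(question, image, answer):
    # Hypothetical helper: repeats the scoring steps above for one
    # (image, question, answer) triple and returns the scalar reward.
    msgs = [
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ]},
        {"role": "assistant", "content": answer},
    ]
    text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
    image_inputs, video_inputs = process_vision_info(msgs)
    batch = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to("cuda")
    values = model(**batch, return_dict=True, use_cache=False)[-1]
    last = batch["attention_mask"].sum(dim=-1, keepdim=True) - 1
    return values.gather(dim=-1, index=last)[0].item()

# Rank two candidate answers for the demo question and keep the higher-scoring one.
candidates = ["The answer is: B", "The answer is: A"]
best = max(candidates, key=lambda a: score_response(demo_question, demo_image, a))
print("Preferred answer:", best)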
