The scarcity of multimodal reward models has become a major bottleneck holding back progress in multimodal reinforcement learning. We open-source Skywork-VL-Reward, a 7B multimodal reward model, to inject new momentum into the field and open a new chapter for multimodal reinforcement learning.

Skywork-VL-Reward is built on the Qwen2.5-VL-7B-Instruct architecture with an additional value head for reward-model training. It achieves a state-of-the-art score of 73.1 on VL-RewardBench and a strong 90.1 on RewardBench. In addition, the MPO training we performed for Skywork-R1V-2.0 further validates the model's effectiveness. We hope this multimodal reward model contributes to the open-source community! For more details, please refer to our technical report.
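Conceptually, the value head is a small scalar-output layer attached to the backbone's token-level hidden states; the released weights ship separately in value_head.safetensors and are loaded through trl's AutoModelForCausalLMWithValueHead (see the usage section below). The sketch here is only an illustration of that structure, not the exact released implementation.

```python
# Illustrative sketch of a value head, not the released implementation.
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Maps each hidden state of the backbone VLM to a scalar value."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.summary = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> values: (batch, seq_len)
        # The value at the last response token is used as the sequence-level reward.
        return self.summary(hidden_states).squeeze(-1)
```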
Technical Report
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
Evaluation
VL-RewardBench
| Model Name | Model Size | General | Hallucination | Reasoning | Overall Accuracy | Macro Average |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| Claude-3.5-Sonnet (2024-06-22) | - | 43.4 | 55.0 | 62.3 | 55.3 | 53.6 |
| Gemini-1.5-Flash (2024-09-24) | - | 47.8 | 59.6 | 58.4 | 57.6 | 55.3 |
| GPT-4o (2024-08-06) | - | 49.1 | 67.6 | 70.5 | 65.8 | 62.4 |
| Gemini-1.5-Pro (2024-09-24) | - | 50.8 | 72.5 | 64.2 | 67.2 | 62.5 |
| Gemini-2.0-flash-exp (2024-12) | - | 50.8 | 72.6 | 70.1 | 68.8 | 64.5 |
| **Open-Source Models** | | | | | | |
| Qwen2-VL-7B-Instruct | 7B | 31.6 | 19.1 | 51.1 | 28.3 | 33.9 |
| MAmmoTH-VL-8B | 8B | 36.0 | 40.0 | 52.0 | 42.2 | 42.7 |
| Qwen2.5-VL-7B-Instruct | 7B | 43.4 | 42.0 | 63.0 | 48.0 | 49.5 |
| InternVL3-8B | 8B | 60.6 | 44.0 | 62.3 | 57.0 | 55.6 |
| IXC-2.5-Reward-7B | 7B | 80.3 | 65.3 | 60.4 | 66.3 | 68.6 |
| Qwen2-VL-72B-Instruct | 72B | 38.1 | 32.8 | 58.0 | 39.5 | 43.0 |
| Molmo-72B-0924 | 72B | 33.9 | 42.3 | 54.9 | 44.1 | 43.7 |
| QVQ-72B-Preview | 72B | 41.8 | 46.2 | 51.2 | 46.4 | 46.4 |
| Qwen2.5-VL-72B-Instruct | 72B | 47.8 | 46.8 | 63.5 | 51.6 | 52.7 |
| InternVL3-78B | 78B | 67.8 | 52.5 | 64.5 | 63.3 | 61.6 |
| Skywork-VL Reward (Ours) | 7B | 66.0 | 80.0 | 61.0 | 73.1 | 69.0 |
RewardBench
| Model Name | Chat | Chat Hard | Safety | Reasoning | Score |
| --- | --- | --- | --- | --- | --- |
| **Language-Only Reward Models** | | | | | |
| InternLM2-7B-Reward | 99.2 | 69.5 | 87.2 | 94.5 | 87.6 |
| Skywork-Reward-Llama3.1-8B | 95.8 | 87.3 | 90.8 | 96.2 | 92.5 |
| Skywork-Reward-Llama-3.1-8B-v0.2 | 94.7 | 88.4 | 92.7 | 96.7 | 93.1 |
| QRM-Llama3.1-8B-v2 | 96.4 | 86.8 | 92.6 | 96.8 | 93.1 |
| **Multi-Modal Reward Models** | | | | | |
| Qwen2-VL-7B-Instruct | 65.1 | 50.9 | 55.8 | 68.3 | 60.0 |
| InternVL3-8B | 97.2 | 50.4 | 83.6 | 83.9 | 78.8 |
| Qwen2.5-VL-7B-Instruct | 94.3 | 63.8 | 84.1 | 86.2 | 82.1 |
| IXC-2.5-Reward-7B | 90.8 | 83.8 | 87.8 | 90.0 | 88.1 |
| Skywork-VL Reward (Ours) | 90.0 | 87.5 | 91.1 | 91.8 | 90.1 |
Usage
Set Up the Environment
```bash
conda create -n vl-reward python=3.11
conda activate vl-reward
bash setup.sh
```
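If you prefer to install the dependencies by hand, the inference script below needs roughly the following packages. This list is inferred from the script's imports, not from the contents of setup.sh, which may pin specific versions or add extras:

```bash
# Inferred from the imports of the inference script below; not the contents of setup.sh.
pip install torch transformers accelerate trl qwen-vl-utils safetensors
```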
Run the Inference Code
```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from trl import AutoModelForCausalLMWithValueHead
from qwen_vl_utils import process_vision_info
from transformers.utils import cached_file
from safetensors import safe_open
processor = AutoProcessor.from_pretrained("Skywork/Skywork-VL-Reward-7B")
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Skywork/Skywork-VL-Reward-7B", min_pixels=min_pixels, max_pixels=max_pixels)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Skywork/Skywork-VL-Reward-7B",
device_map="auto",
torch_dtype=torch.bfloat16,
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving
# pip install flash-attn --no-build-isolation
#
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
# "Skywork/Skywork-VL-Reward-7B",
# device_map="auto",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# )
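# Wrap the backbone with a value head (trl) and load the value-head weights,
# which are shipped separately in value_head.safetensors.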
model = AutoModelForCausalLMWithValueHead.from_pretrained(model)
vhead_file = cached_file(
path_or_repo_id="Skywork/Skywork-VL-Reward-7B", filename="value_head.safetensors"
)
with safe_open(vhead_file, framework="pt", device="cpu") as f:
vhead_params = {key: f.get_tensor(key) for key in f.keys()}
model.load_state_dict(vhead_params, strict=False)
model.requires_grad_(False)
model.eval()
# score: 23.89
# if you use flash_attention_2, the score will be 23.76
demo_image = "demo.jpg"
demo_question = "Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.\nQuestion: Is Purple the highest value?\nChoices:\n(A) no\n(B) yes"
demo_answer = "The answer is: B"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": demo_image,
},
{
"type": "text",
"text": demo_question,
},
],
},
{
"role": "assistant",
"content": demo_answer,
},
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=False
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
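# The value head returns one scalar per token; take the value at the last
# non-padded position (the final token of the assistant answer) as the reward.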
values = model(**inputs, return_dict=True, use_cache=False)[-1]
scores = values.gather(
dim=-1, index=(inputs["attention_mask"].sum(dim=-1, keepdim=True) - 1)
)
score = scores[0].item()
print("Reward Score is: ", score)
```

```
Reward Score is: 24.22561264038086
```
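As a further usage sketch (not part of the official example), the same scoring path can be wrapped in a helper to compare candidate answers pairwise, which is the typical way a reward model is used for preference ranking or rejection sampling. The code reuses `model`, `processor`, and `process_vision_info` from the snippet above; the image path and candidate answers are illustrative placeholders.

```python
# A sketch that reuses `model`, `processor`, and `process_vision_info` from above.
# The image path and candidate answers are illustrative placeholders.
def score_response(image_path: str, question: str, answer: str) -> float:
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        },
        {"role": "assistant", "content": answer},
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to("cuda")
    with torch.no_grad():
        values = model(**inputs, return_dict=True, use_cache=False)[-1]
    # Reward = value at the last non-padded token of the assistant answer.
    last_token = inputs["attention_mask"].sum(dim=-1, keepdim=True) - 1
    return values.gather(dim=-1, index=last_token)[0].item()

# Rank two candidate answers to the same question; the correct one should score higher.
question = (
    "Hint: Please answer the question and provide the correct option letter, "
    "e.g., A, B, C, D, at the end.\nQuestion: Is Purple the highest value?\n"
    "Choices:\n(A) no\n(B) yes"
)
chosen = score_response("demo.jpg", question, "The answer is: B")
rejected = score_response("demo.jpg", question, "The answer is: A")
print(f"chosen={chosen:.2f} rejected={rejected:.2f} preferred_is_higher={chosen > rejected}")
```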