Summary - NeurIPS 2023 - Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Paper title: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Authors: Rafael Rafailov¹, Archit Sharma¹, Eric Mitchell¹
Affiliation: ¹Stanford University
Venue: NeurIPS 2023
Paper link: https://arxiv.org/abs/2305.18290
Code link: https://github.com/eric-mitchell/direct-preference-optimization/tree/main


Contributions

With the DPO (Direct Preference Optimization) algorithm, human preferences are optimized directly through a simple classification loss, avoiding the complex reinforcement learning pipeline and addressing the complexity, high computational cost, and training instability of RLHF. The comparison below summarizes DPO against RLHF-based methods:

| Method | Helpfulness (capability) | Harmlessness (safety) | Honesty (safety) | Computational cost (efficiency) |
| --- | --- | --- | --- | --- |
| InstructGPT (PPO) | | | | |
| Claude (PPO) | | | | |
| DPO | | | | |

Links for the papers above:
InstructGPT (PPO): https://arxiv.org/abs/2203.02155
Claude (PPO): https://arxiv.org/abs/2204.05862
DPO: https://arxiv.org/abs/2305.18290
Explanation of the evaluation criteria:
Helpfulness: performance on NLP tasks, including whether the model responds "appropriately" to adversarial inputs while still giving useful answers rather than resorting to evasion.
Harmlessness: avoiding bias, inflammatory content, model jailbreaks, and so on.
Honesty: reducing hallucinations, including intrinsic hallucinations (the answer contradicts the dialogue context) and extrinsic hallucinations (the answer contradicts facts).

Method

1. Core idea:
The core idea of DPO is to directly optimize the language model's policy so that it better matches human preferences, instead of first training a reward model and then optimizing the policy with reinforcement learning as in traditional RLHF. DPO reparameterizes the preference model so that human preference data can be fit as a simple classification task on the language model itself, avoiding the instability and high computational cost of reinforcement learning.
2. Method pipeline:
The main steps of DPO are as follows:

  • Collect human preference data: DPO requires a dataset of human preference pairs. Each pair contains a prompt and two candidate responses, one preferred and one non-preferred (rejected).
  • Optimize against the preferences: the language model's policy is optimized directly by maximizing the probability of the human-preferred response relative to the rejected one. This is equivalent to a classification task in which the model must distinguish preferred from non-preferred responses.
  • Loss function design: DPO uses a simple binary cross-entropy loss to optimize the language model. The loss raises the probability of preferred responses while lowering that of rejected ones, and an implicit KL-divergence constraint keeps the model from drifting too far from the reference (pretrained/SFT) model, preventing degeneration; the objective is written out below, after the comparison with RLHF.
3. Differences from traditional RLHF:
  • No reward model: traditional RLHF first trains a reward model, whereas DPO optimizes the policy directly.
  • Lower computational cost: DPO avoids the complex sampling and iterative optimization of reinforcement learning, greatly reducing compute requirements.
  • Better stability: by recasting the reinforcement learning problem as a simple classification task, DPO avoids the instability of RL training.
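
Concretely, the objective optimized by DPO (as given in the paper) is a binary cross-entropy loss over preference triples $(x, y_w, y_l)$, where $y_w$ is the preferred and $y_l$ the rejected response, $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen reference (SFT) model, and $\beta$ controls the strength of the implicit KL constraint:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

A minimal PyTorch sketch of this loss, assuming the summed token log-probabilities of each response under the policy and the reference model have already been computed (the function and argument names are illustrative, not the paper's reference implementation):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Binary cross-entropy form of the DPO objective over a batch of preference pairs."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for chosen and rejected responses
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the reward margin, averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()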

Performance Analysis

Evaluation setup

Main baseline: the PPO algorithm
Evaluator: GPT-4
Existing research shows that LMs can be better automatic evaluators than existing metrics:
https://arxiv.org/abs/2304.00723

Results

Win rate in single-turn dialogue as judged by GPT-4 (for the same prompt, each model produces a response and GPT-4 judges which model's output is better).
On the Anthropic-HH dialogue dataset, DPO performs well.

GPT-4-judged comparison of how well DPO and PPO generalize on the summarization task.
Compared with PPO, DPO generalizes better.

GPT-4's judgments agree with human evaluation results, which supports the paper's choice of GPT-4 as the evaluator.


Concise Implementation (Code)

Fine-tuning LLaMA-2 with DPO, based on the TRL (Transformer Reinforcement Learning) library.

TRL is a library to post-train LLMs and diffusion models with methods such as Supervised Fine-tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO).
The library is built on top of 🤗 Transformers and is compatible with any model architecture available there.

Step 1: install packages

# Install the necessary packages
!pip install transformers==4.40.1
!pip install gitpython==3.1.43
!pip install auto-gptq==0.7.1
!pip install optimum==1.19.1
!pip install bitsandbytes==0.43.1
!pip install datasets==2.19.0
!pip install peft==0.10.0
!pip install trl==0.8.6
!pip install accelerate==0.29.3
import torch
import re
import json
import gdown
from datasets import Dataset
import pandas as pd
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig, GenerationConfig
from tqdm.auto import tqdm
from trl import DPOTrainer
import datasets
import matplotlib.pyplot as plt

Step 2: load Anthropic HH dataset

from tqdm.auto import tqdm
from typing import Dict,Union,List,Tuple
def extract_anthropic_prompt(prompt_and_response):
    """Extract the anthropic prompt from a prompt and response pair."""
    search_term = '\n\nAssistant:'
    search_term_idx = prompt_and_response.rfind(search_term)
    assert search_term_idx != -1, f"Prompt and response does not contain '{search_term}'"
    return prompt_and_response[:search_term_idx + len(search_term)]

def get_hh(split: str, silent: bool = False, cache_dir: str = None) -> Dataset:
    """Load the Anthropic Helpful-Harmless (HH) dataset from Huggingface and convert it to the format expected by DPOTrainer.

       The returned `datasets.Dataset` has three text columns per example:
           'prompt'   : the dialogue history, ending with '\n\nAssistant:'
           'chosen'   : the preferred response
           'rejected' : the non-preferred response

       Prompts should be structured as follows:
         \n\nHuman: <prompt>\n\nAssistant:
       Multiple turns are allowed, but the prompt should always start with \n\nHuman: and end with \n\nAssistant:.
    """
    print(f'Loading HH dataset ({split} split) from Huggingface...')
    dataset = datasets.load_dataset('Anthropic/hh-rlhf', split=split, cache_dir=cache_dir)
    print('done')

    def split_prompt_and_responses(ex):
        prompt = extract_anthropic_prompt(ex['chosen'])
        chosen_response = ex['chosen'][len(prompt):]
        rejected_response = ex['rejected'][len(prompt):]
        return prompt, chosen_response, rejected_response

    prompt_list = []
    chosen_list = []
    rejected_list = []
    for row in tqdm(dataset, desc='Processing HH', disable=silent):
        prompt, chosen, rejected = split_prompt_and_responses(row)
        prompt_list.append(prompt)
        chosen_list.append(chosen)
        rejected_list.append(rejected)
    train_dataset=Dataset.from_dict({'prompt':prompt_list,'chosen':chosen_list,'rejected':rejected_list})

    return train_dataset
train_dataset=get_hh('train')
print(train_dataset[0])
pd.DataFrame(train_dataset).rename(columns={"chosen": "preferred", "rejected": "non-preferred"})
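
As a quick sanity check (an illustrative snippet, not part of the original notebook): TRL's DPOTrainer expects each example to provide 'prompt', 'chosen', and 'rejected' text fields, which the processed dataset above should now contain.

# Verify the processed dataset has the three text fields DPOTrainer consumes.
example = train_dataset[0]
assert {'prompt', 'chosen', 'rejected'} <= set(example.keys())
print(example['prompt'][-200:])    # dialogue history, ending with '\n\nAssistant:'
print(example['chosen'][:200])     # preferred continuation
print(example['rejected'][:200])   # non-preferred continuation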

Step 3: load LLaMA-2 pretrained weights

MODEL_NAME = 'LLaMA-2-7B'
model_path = 'TheBloke/Llama-2-7B-GPTQ'

# Construct the language model specified by MODEL_NAME
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    revision='gptq-4bit-32g-actorder_True',
    cache_dir='/content/drive/MyDrive/Colab Notebooks/GenAI',
    device_map='auto'
)

# Construct the corresponding tokenizer which converts each word into the corresponding index in the vocabulary.
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    legacy=False
)

print(f'*** Load {MODEL_NAME} successfully!! ***')
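
Optionally, a quick generation check confirms the quantized model loads and runs before DPO training. A minimal sketch (the prompt below is illustrative), using the GenerationConfig class imported in Step 1:

# Quick generation check on the loaded GPTQ model (optional).
prompt = "\n\nHuman: What is Direct Preference Optimization?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
generation_config = GenerationConfig(max_new_tokens=64, do_sample=False)
with torch.no_grad():
    output_ids = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))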

Step 4: align LLaMA-2 with DPO

tokenizer.pad_token = tokenizer.eos_token
training_args = TrainingArguments(
    output_dir='./',
    per_device_train_batch_size=1,
    num_train_epochs=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=False,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps = 1,
    warmup_ratio = 0.1,
    report_to = 'none'
)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
# To train on a subset rather than the full dataset, use the select() method provided by the datasets library (e.g. train_dataset.select(range(1000))) instead of slicing it like a Python list.
dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
dpo_trainer.train()
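
After training, the LoRA adapter can be saved and the loss curve inspected; a sketch based on the standard Trainer API (the output path is illustrative):

# Save the LoRA adapter produced by DPO training (output path is illustrative).
dpo_trainer.save_model('./llama2-dpo-lora')

# Plot the DPO training loss recorded in the trainer's log history.
losses = [log['loss'] for log in dpo_trainer.state.log_history if 'loss' in log]
plt.plot(losses)
plt.xlabel('logging step')
plt.ylabel('DPO loss')
plt.show()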