TRL Reinforcement Learning Training: GRPO as an Example

I. Overview

1. Training a model with TRL + LoRA + Transformers
2. Deployment and prediction
3. Model merging
4. vLLM deployment
5. Code walkthrough
6. GRPO trainer parameters

II. Training

1. Training a model with TRL + LoRA + Transformers

Training data format: each record must contain a `prompt` field; an answer field is not required for training itself (it is only useful if a reward function checks correctness). A minimal record is sketched below.
Note: the per-batch data processing (chat templating, tokenization, generation) is encapsulated in the GRPO trainer's compute_loss.
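As a rough sketch of such a record (the original screenshot is unavailable; the values here are invented for illustration and follow the conversational format built by get_gsm8k_questions below):

# Hypothetical prompt-only record (illustration, not from the original post)
example = {
    "prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},   # SYSTEM_PROMPT is defined just below
        {"role": "user", "content": "Natalia sold clips to 48 of her friends in April..."},
    ],
    "answer": "72",   # optional: only consumed by reward functions that check correctness
}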
Custom input preparation:

# Load and prep dataset
from datasets import load_dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str):
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


# Build conversational prompts from a local copy of GSM8K
def get_gsm8k_questions(split = "train"):
    data = load_dataset('E://openaigsm8k/main')[split].select(range(200)) # type: ignore
    data = data.map(lambda x: { # type: ignore — build the custom prompt format
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()
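The prep code above defines XML_COT_FORMAT and extract_xml_answer, but the training script below only uses a simple length-based reward. A correctness reward that reuses these helpers could look like the following sketch (not part of the original script; it assumes conversational prompts and an `answer` column as prepared above):

# Sketch: correctness reward reusing extract_xml_answer (illustrative, not from the original script)
def correctness_reward_func(prompts, completions, answer, **kwargs):
    # With conversational prompts, each completion is a list of messages
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

It would be passed to the trainer via reward_funcs=[correctness_reward_func] (GRPOTrainer accepts a single function or a list).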

Training script:

import argparse
from dataclasses import dataclass, field
from typing import Optional

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer

from trl import GRPOConfig, GRPOTrainer, ModelConfig, ScriptArguments, TrlParser, get_peft_config


@dataclass
class GRPOScriptArguments(ScriptArguments):
    """
    Script arguments for the GRPO training script.

    Args:
        reward_model_name_or_path (`str` or `None`):
            Reward model id of a pretrained model hosted inside a model repo on huggingface.co or local path to a
            directory containing model weights saved using [`~transformers.PreTrainedModel.save_pretrained`].
    """

    reward_model_name_or_path: Optional[str] = field(
        default=None,
        metadata={
            "help": "Reward model id of a pretrained model hosted inside a model repo on huggingface.co or "
            "local path to a directory containing model weights saved using `PreTrainedModel.save_pretrained`."
        },
    )


def main(script_args, training_args, model_args):
    # Load a pretrained model
    model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
    )

    def reward_len(completions, **kwargs):
        return [-abs(20 - len(completion)) for completion in completions]
    
    # Load the dataset
    dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)

    # Initialize the GRPO trainer
    trainer = GRPOTrainer(
        model=model,
        reward_funcs=reward_len,
        args=training_args,
        train_dataset=dataset[script_args.dataset_train_split].select(range(100)),
        eval_dataset=dataset[script_args.dataset_test_split].select(range(100)) if training_args.eval_strategy != "no" else None,
        processing_class=tokenizer,
        peft_config=get_peft_config(model_args),
    )

    # Train and push the model to the Hub
    trainer.train()

    # Save and push to hub
    trainer.save_model(training_args.output_dir)
    # if training_args.push_to_hub:
    #     trainer.push_to_hub(dataset_name=script_args.dataset_name)


def make_parser(subparsers: argparse._SubParsersAction = None):
    dataclass_types = (GRPOScriptArguments, GRPOConfig, ModelConfig)
    if subparsers is not None:
        parser = subparsers.add_parser("grpo", help="Run the GRPO training script", dataclass_types=dataclass_types)
    else:
        parser = TrlParser(dataclass_types)
    return parser


if __name__ == "__main__":
    parser = make_parser()
    script_args, training_args, model_args = parser.parse_args_and_config()
    main(script_args, training_args, model_args)

    '''
    CUDA_VISIBLE_DEVICES=0 python test.py --model_name_or_path /home/grpo_test//Qwen2.5-0.5B-Instruct --dataset_name /home/grpo_test/trl-libtldr \
        --learning_rate 2.0e-4 --num_train_epochs 1 --per_device_train_batch_size 2 \
        --gradient_accumulation_steps 2 --eval_strategy no --logging_steps 2 --use_peft 1 --lora_r 32 --lora_alpha 16 --output_dir Qwen2-0.5B-GRPO
    '''
# To speed up generation during training, vLLM can be used (see the sketch below)
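A minimal sketch of the relevant config (assumption: your TRL version exposes use_vllm and related fields on GRPOConfig; newer releases may instead require a separate `trl vllm-serve` process, so treat the exact parameter names as version-dependent):

# Sketch: vLLM-backed generation during GRPO training (parameter names vary across TRL versions)
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",
    use_vllm=True,                     # generate completions with vLLM instead of model.generate
    vllm_gpu_memory_utilization=0.3,   # leave room on the GPU for the training process
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    logging_steps=2,
)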

2. Deployment and prediction

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    "/home/grpo_test/qwen2-grpo-lora", # YOUR MODEL YOU USED FOR TRAINING
    load_in_4bit = True,               # 4-bit loading requires bitsandbytes
)
tokenizer = AutoTokenizer.from_pretrained("/home/grpo_test/qwen2-grpo-lora")

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},   # reuse the system prompt the model was trained with
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

inputs = tokenizer(
[
   text,
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
outputs = model.generate(**inputs, max_new_tokens = 128)
print(outputs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
print(tokenizer.decode(outputs[0]))


'''
As an artificial intelligence language model, I do not calculate or execute numerical operations. However, I can provide you with the mathematical formula to calculate pi. The approximate value of pi is 3.14159, and the full mathematical formula is:

π = ∑ (2 × sin(1/n) × n^2) / (n × (n + 1) × (2n + 1))

Please note that this is an approximate value, and the accuracy can vary depending on the precision required.
'''

3. Model merging

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

lora_model = AutoPeftModelForCausalLM.from_pretrained(
    "/home/grpo_test/qwen2-grpo-lora", # YOUR MODEL YOU USED FOR TRAINING
    torch_dtype="auto",
    device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained("/home/grpo_test/qwen2-grpo-lora")

# Merge the LoRA weights into the base model and save a standalone checkpoint
model2 = lora_model.merge_and_unload()
print(model2)
model2.save_pretrained("merged-model")
tokenizer.save_pretrained("merged-model")   # save the tokenizer too, so vLLM can load "merged-model" directly


# Reload the merged checkpoint from disk to verify it matches the in-memory merged model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "merged-model"

model1 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="cuda:0"
)
print(model1)


# Collect the named parameters of both models and compare them key by key
model1_dict = dict()
model2_dict = dict()

for k, v in model1.named_parameters():
    model1_dict[k] = v
        
for k, v in model2.named_parameters():
    model2_dict[k] = v

keys = model1_dict.keys()
keys2 = model2_dict.keys()
print(set(keys) == set(keys2))
for k in keys:
    print(f'{k}: {torch.allclose(model1_dict[k], model2_dict[k])},')

4. vLLM deployment

import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8000, 1
model_name = "merged-model"
#prompt = [{"role": "user", "content": "Hello"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype=torch.float16,
    gpu_memory_utilization=0.35,     # cap GPU memory usage
    # If you run into OOM (the original note referred to GLM-4-9B-Chat-1M), consider enabling:
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
# Custom input: build the prompt text with the chat template once, then pass it straight to vLLM
prompt = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

sampling_params = SamplingParams(temperature=0.95, max_tokens = 1000)

outputs = llm.generate(prompts=[prompt], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
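To pull the final answer out of the structured completion, the extract_xml_answer helper from the data-prep section can be reused; a small sketch (assumes extract_xml_answer is in scope):

# Sketch: extract the <answer> block from the generated text
generated = outputs[0].outputs[0].text
if "<answer>" in generated:
    print("answer:", extract_xml_answer(generated))
else:
    print("no <answer> tag found in the completion")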
III. Code walkthrough

1. Data processing
# Data processing: this is all handled inside the GRPO trainer's compute_loss
device = self.accelerator.device
prompts = [x["prompt"] for x in inputs]   # pull the prompt from each example
# maybe_apply_chat_template converts each example into prompt text;
# it decides automatically whether the chat template needs to be applied
prompts_text = [maybe_apply_chat_template(example, self.processing_class)["prompt"] for example in inputs]

# Tokenize the prompts
prompt_inputs = self.processing_class(
    prompts_text, return_tensors="pt", padding=True, padding_side="left", add_special_tokens=False
)
# Move the tensors to the training device
prompt_inputs = super()._prepare_inputs(prompt_inputs)

# Keep only the last max_prompt_length tokens of each prompt (left truncation)
if self.max_prompt_length is not None:
    prompt_inputs["input_ids"] = prompt_inputs["input_ids"][:, -self.max_prompt_length :]
    prompt_inputs["attention_mask"] = prompt_inputs["attention_mask"][:, -self.max_prompt_length :]

def maybe_apply_chat_template(
    example: dict[str, list[dict[str, str]]],
    tokenizer: PreTrainedTokenizer,
    tools: Optional[list[Union[dict, Callable]]] = None,
) -> dict[str, str]:
    r"""
    Example:
    ```python
    >>> from transformers import AutoTokenizer
    >>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
    >>> example = {
    ...     "prompt": [{"role": "user", "content": "What color is the sky?"}],
    ...     "completion": [{"role": "assistant", "content": "It is blue."}]
    ... }
    >>> apply_chat_template(example, tokenizer)
    {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n', 'completion': 'It is blue.<|end|>\n<|endoftext|>'}
    ```
    """
    if is_conversational(example):    # check whether the example is conversational, i.e. [{"role": "user", "content": "What color is the sky?"}]
        return apply_chat_template(example, tokenizer, tools)
    else:
        return example
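A quick standalone check of the two branches (a sketch; it assumes maybe_apply_chat_template is importable from trl, as in recent releases, and uses a small instruct model's chat template purely for illustration):

# Sketch: conversational input gets templated, plain text is returned unchanged
from transformers import AutoTokenizer
from trl import maybe_apply_chat_template

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

conversational = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
plain = {"prompt": "What color is the sky?"}

print(maybe_apply_chat_template(conversational, tok)["prompt"])   # chat template applied
print(maybe_apply_chat_template(plain, tok)["prompt"])            # returned as-is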
2. Loss computation
# Log-probs under the reference model
with torch.inference_mode():
    if self.ref_model is not None:
        ref_per_token_logps = get_per_token_logps(self.ref_model, prompt_completion_ids, num_logits_to_keep)
    else:
        # With LoRA, disabling the adapter recovers the base model, which then serves as the reference model
        with self.accelerator.unwrap_model(model).disable_adapter():
            ref_per_token_logps = get_per_token_logps(model, prompt_completion_ids, num_logits_to_keep)

# Compute the per-token KL divergence between the model and the reference model
# (the k3 estimator exp(d) - d - 1 with d = ref_logp - logp, which is non-negative)
per_token_kl = torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1

# Compute the rewards: repeat each prompt num_generations times so it lines up with its completions
prompts = [prompt for prompt in prompts for _ in range(self.num_generations)]

rewards_per_func = torch.zeros(len(prompts), len(self.reward_funcs), device=device)
for i, (reward_func, reward_processing_class) in enumerate(
    zip(self.reward_funcs, self.reward_processing_classes)
):
    if isinstance(reward_func, PreTrainedModel):
        if is_conversational(inputs[0]):
            messages = [{"messages": p + c} for p, c in zip(prompts, completions)]
            texts = [apply_chat_template(x, reward_processing_class)["text"] for x in messages]
        else:
            texts = [p + c for p, c in zip(prompts, completions)]
        reward_inputs = reward_processing_class(
            texts, return_tensors="pt", padding=True, padding_side="right", add_special_tokens=False
        )
        reward_inputs = super()._prepare_inputs(reward_inputs)
        with torch.inference_mode():
            rewards_per_func[:, i] = reward_func(**reward_inputs).logits[:, 0]  # Shape (B*G,)
    else:
        # Repeat all input columns (except "prompt" and "completion") to match the number of generations
        reward_kwargs = {key: [] for key in inputs[0].keys() if key not in ["prompt", "completion"]}
        for key in reward_kwargs:
            for example in inputs:
                # Repeat each value in the column for `num_generations` times
                reward_kwargs[key].extend([example[key]] * self.num_generations)
        output_reward_func = reward_func(prompts=prompts, completions=completions, **reward_kwargs)
        rewards_per_func[:, i] = torch.tensor(output_reward_func, dtype=torch.float32, device=device)

# Sum the rewards from all reward functions
rewards = rewards_per_func.sum(dim=1)

# Compute group-wise reward statistics: mean and std over each prompt's num_generations completions
mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)
std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)

# Normalize the rewards within each group to obtain the advantages
mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
advantages = (rewards - mean_grouped_rewards) / (std_grouped_rewards + 1e-4)   # group-normalized advantage

# Final loss
# x - x.detach() keeps the value at zero while preserving the gradient of x,
# so exp(per_token_logps - per_token_logps.detach()) equals 1 but still carries the gradient of the log-prob
per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
per_token_loss = -(per_token_loss - self.beta * per_token_kl)
loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()

# On per_token_logps - per_token_logps.detach(): the policy ratio is 1 here because the completions
# were sampled from the current policy; the trick only re-attaches the gradient of the log-prob

1. per_token_logps - per_token_logps.detach() evaluates to zero, so its exponential is exactly one in the forward pass, but detach() blocks gradients through the second term. The per-token loss therefore takes the value of the advantage (minus the KL penalty), while its gradient is the advantage-weighted gradient of the log-probability, which is exactly the policy-gradient update GRPO needs; the sketch below verifies this.
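A minimal check of this claim (invented values, standalone PyTorch):

# Sketch: the x - x.detach() trick — ratio is exactly 1, gradient is the advantage-weighted grad of log-prob
import torch

logp = torch.tensor([-1.2, -0.7], requires_grad=True)   # per-token log-probs of the sampled completion
advantage = torch.tensor([0.5, 0.5])                     # per-token advantage (constant over the completion)

ratio = torch.exp(logp - logp.detach())   # forward value is exactly 1.0, but gradients still flow through logp
loss = -(ratio * advantage).sum()
loss.backward()

print(ratio)       # tensor([1., 1.], grad_fn=...)
print(logp.grad)   # tensor([-0.5000, -0.5000]) == -advantage, the policy-gradient direction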