SFT: Training Only on the Specified Part (Completions Only)

Reference for this post: the "Train on completions only" section of Hugging Face's TRL library documentation.

To train on only a specified part of each sequence, the `labels` returned by the tokenizer must be specially marked, and this marking is implemented in the `DataCollatorForCompletionOnlyLM` class. Let's read this class's source code.
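As a quick illustration of the underlying idea (a hand-made sketch with made-up numbers): PyTorch's cross-entropy loss skips every position whose label equals the ignore index of -100, so marking prompt positions with -100 removes them from the loss.

import torch
import torch.nn.functional as F

logits = torch.randn(5, 32)                      # 5 token positions, vocabulary of 32
labels = torch.tensor([-100, -100, 7, 3, -100])  # only positions 2 and 3 contribute
loss = F.cross_entropy(logits, labels, ignore_index=-100)
print(loss)  # averaged over the two non-ignored positions only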

import warnings
from typing import Any, Dict, List, Optional, Union

import numpy as np

from transformers import DataCollatorForLanguageModeling


class DataCollatorForCompletionOnlyLM(DataCollatorForLanguageModeling):
    """
    Data collator used for completion tasks. It ensures that all the tokens of the labels are set to an 'ignore_index'
    when they do not come from the assistant. This ensures that the loss is only
    calculated on the completion made by the assistant.

    Args:
        response_template (`Union[str, List[int]]`): the template form that indicates the start of the response, typically something like
            '### Response:\n'. It can also be passed as tokenized ids, which can be useful when using a tokenizer that encodes the response
            differently if it does not have proper context.
        instruction_template (`Union[str, List[int]]`): the template form that indicates the start of the human instruction, typically something like
            '### Human:\n'. Useful for assistant-style conversation datasets. It can also be passed as tokenized ids.
        mlm (`bool`, *optional*, defaults to `False`): Whether or not to use masked language modeling in the underlying
            `DataCollatorForLanguageModeling` class. Note that this option currently has no effect but is present
             for flexibility and backwards-compatibility.
        ignore_index (`int`, *optional*, defaults to `-100`):
            The index to use to ignore the initial tokens with
    """

    def __init__(
        self,
        response_template: Union[str, List[int]],
        instruction_template: Optional[Union[str, List[int]]] = None,
        *args,
        mlm: bool = False,
        ignore_index: int = -100,
        **kwargs,
    ):
        super().__init__(*args, mlm=mlm, **kwargs)

        self.instruction_template = instruction_template
        if isinstance(instruction_template, str):
            # The user provides a string, must tokenize
            self.instruction_token_ids = self.tokenizer.encode(self.instruction_template, add_special_tokens=False)
        else:
            # The user already provides the token ids
            self.instruction_token_ids = instruction_template

        self.response_template = response_template
        if isinstance(response_template, str):
            # The user provides a string, must tokenize
            self.response_token_ids = self.tokenizer.encode(self.response_template, add_special_tokens=False)
        else:
            # The user already provides the token ids
            self.response_token_ids = response_template

        if not self.mlm and self.instruction_template and self.tokenizer.pad_token_id == self.tokenizer.eos_token_id:
            warnings.warn(
                "The pad_token_id and eos_token_id values of this tokenizer are identical. "
                "If you are planning for multi-turn training, "
                "it can result in the model continuously generating questions and answers without eos token. "
                "To avoid this, set the pad_token_id to a different value."
            )

        self.ignore_index = ignore_index

    def torch_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:
        batch = super().torch_call(examples)

        if self.instruction_template is None:
            for i in range(len(examples)):
                response_token_ids_start_idx = None

                for idx in np.where(batch["labels"][i] == self.response_token_ids[0])[0]:
                    # `response_token_ids` is `'### Response:\n'`, here we are just making sure that the token IDs match
                    if (
                        self.response_token_ids
                        == batch["labels"][i][idx : idx + len(self.response_token_ids)].tolist()
                    ):
                        response_token_ids_start_idx = idx

                if response_token_ids_start_idx is None:
                    warnings.warn(
                        f"Could not find response key `{self.response_template}` in the "
                        f'following instance: {self.tokenizer.decode(batch["input_ids"][i])} '
                        f"This instance will be ignored in loss calculation. "
                        f"Note, if this happens often, consider increasing the `max_seq_length`."
                    )
                    batch["labels"][i, :] = self.ignore_index
                else:
                    response_token_ids_end_idx = response_token_ids_start_idx + len(self.response_token_ids)

                    # Make pytorch loss function ignore all tokens up through the end of the response key
                    batch["labels"][i, :response_token_ids_end_idx] = self.ignore_index

        else:
            for i in range(len(examples)):
                response_token_ids_idxs = []
                human_token_ids_idxs = []

                for assistant_idx in np.where(batch["labels"][i] == self.response_token_ids[0])[0]:
                    # find the indexes of the start of a response.
                    if (
                        self.response_token_ids
                        == batch["labels"][i][assistant_idx : assistant_idx + len(self.response_token_ids)].tolist()
                    ):
                        response_token_ids_idxs.append(assistant_idx + len(self.response_token_ids))

                if len(response_token_ids_idxs) == 0:
                    warnings.warn(
                        f"Could not find response key `{self.response_template}` in the "
                        f'following instance: {self.tokenizer.decode(batch["input_ids"][i])} '
                        f"This instance will be ignored in loss calculation. "
                        f"Note, if this happens often, consider increasing the `max_seq_length`."
                    )
                    batch["labels"][i, :] = self.ignore_index

                human_token_ids = self.instruction_token_ids
                for human_idx in np.where(batch["labels"][i] == human_token_ids[0])[0]:
                    # find the indexes of the start of a human answer.
                    if human_token_ids == batch["labels"][i][human_idx : human_idx + len(human_token_ids)].tolist():
                        human_token_ids_idxs.append(human_idx)

                if len(human_token_ids_idxs) == 0:
                    warnings.warn(
                        f"Could not find instruction key `{self.instruction_template}` in the "
                        f'following instance: {self.tokenizer.decode(batch["input_ids"][i])} '
                        f"This instance will be ignored in loss calculation. "
                        f"Note, if this happens often, consider increasing the `max_seq_length`."
                    )
                    batch["labels"][i, :] = self.ignore_index

                if (
                    len(human_token_ids_idxs) > 0
                    and len(response_token_ids_idxs) > 0
                    and human_token_ids_idxs[0] > response_token_ids_idxs[0]
                ):
                    human_token_ids_idxs = [0] + human_token_ids_idxs

                for idx, (start, end) in enumerate(zip(human_token_ids_idxs, response_token_ids_idxs)):
                    # Make pytorch loss function ignore all non response tokens
                    if idx != 0:
                        batch["labels"][i, start:end] = self.ignore_index
                    else:
                        batch["labels"][i, :end] = self.ignore_index

                if len(response_token_ids_idxs) < len(human_token_ids_idxs):
                    batch["labels"][i, human_token_ids_idxs[-1] :] = self.ignore_index

        return batch

The code above has two main parts: the constructor (`__init__()`) and `torch_call()`. Using VS Code's Go to Definition, we can see that `DataCollatorMixin`, the parent of `DataCollatorForCompletionOnlyLM`'s own parent class `DataCollatorForLanguageModeling`, defines the `__call__()` magic method:

class DataCollatorMixin:
    def __call__(self, features, return_tensors=None):
        if return_tensors is None:
            return_tensors = self.return_tensors
        if return_tensors == "tf":
            return self.tf_call(features)
        elif return_tensors == "pt":
            return self.torch_call(features)
        elif return_tensors == "np":
            return self.numpy_call(features)
        else:
            raise ValueError(f"Framework '{return_tensors}' not recognized!")

So when an object of this class is called directly, `torch_call()` is invoked (for the default `return_tensors="pt"`). `torch_call()` is therefore the method that implements our goal.
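As a small sketch of this dispatch (using the `collator` and the tokenized `examples` from the sample code further below; `return_tensors` defaults to `"pt"` in transformers' data collators):

batch = collator(examples)  # __call__ routes this to collator.torch_call(examples)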

Starting from the example Hugging Face gives, here is the sample code:

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)

trainer.train()

Let's take the line `collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)` as the entry point into the source. This call passes two arguments: `response_template = " ### Answer:"` and the `tokenizer`. Let's look at how `__init__()` initializes the object with them:

    def __init__(
        self,
        response_template: Union[str, List[int]],
        instruction_template: Optional[Union[str, List[int]]] = None,
        *args,
        mlm: bool = False,
        ignore_index: int = -100,
        **kwargs,
    ):
        super().__init__(*args, mlm=mlm, **kwargs)

        self.instruction_template = instruction_template
        if isinstance(instruction_template, str):
            # The user provides a string, must tokenize
            self.instruction_token_ids = self.tokenizer.encode(self.instruction_template, add_special_tokens=False)
        else:
            # The user already provides the token ids
            self.instruction_token_ids = instruction_template

        self.response_template = response_template
        if isinstance(response_template, str):
            # The user provides a string, must tokenize
            self.response_token_ids = self.tokenizer.encode(self.response_template, add_special_tokens=False)
        else:
            # The user already provides the token ids
            self.response_token_ids = response_template

        if not self.mlm and self.instruction_template and self.tokenizer.pad_token_id == self.tokenizer.eos_token_id:
            warnings.warn(
                "The pad_token_id and eos_token_id values of this tokenizer are identical. "
                "If you are planning for multi-turn training, "
                "it can result in the model continuously generating questions and answers without eos token. "
                "To avoid this, set the pad_token_id to a different value."
            )

        self.ignore_index = ignore_index

For this example's input, we can set aside what the `super().__init__()` call does (roughly, it stores the `tokenizer`, the `mlm` flag, and related fields on the instance) and focus on the rest: it simply encodes `response_template` into token IDs and binds them to the instance as `self.response_token_ids`. Remember that `instruction_template` is `None` here, so `self.instruction_token_ids` is `None` as well.
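To see what gets stored, here is a small sketch reusing the `facebook/opt-350m` tokenizer from the example (the printed IDs depend on the tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# What __init__ does internally when response_template is a string:
response_token_ids = tokenizer.encode(" ### Answer:", add_special_tokens=False)
print(response_token_ids)  # the token-ID sequence the collator searches for in the labels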

OK, now that we have seen the constructor, let's see what `torch_call()` does when the collator is invoked during training.

First, the very first line:

batch = super().torch_call(examples)

This line calls the parent class's `torch_call()`, which serves two functions (a sketch of the resulting data layout follows the list):

1. It assembles the batch. The `examples` it initially receives is a list of dicts, where each dict is the tokenizer's output for one sentence and contains `input_ids` and `attention_mask`.
2. It generates `labels` and, at the same time, sets every pad-token position in them to the `ignore_index` of -100.
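A hypothetical illustration of these shapes (the token IDs are made up):

examples = [
    {"input_ids": [2, 48, 7, 11], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [2, 9, 30], "attention_mask": [1, 1, 1]},
]
# After batch = super().torch_call(examples):
# batch["input_ids"]      -> LongTensor of shape (2, 4), the shorter row padded
# batch["attention_mask"] -> LongTensor of shape (2, 4), 0 at padded positions
# batch["labels"]         -> a copy of input_ids with pad positions set to -100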
Now look at the code that follows. In our example `self.instruction_template is None` evaluates to `True`, so execution enters the branch below. Keep in mind: `examples` is a list whose elements are dicts, one per tokenized sentence, while `batch` is a dict with three keys (`input_ids`, `attention_mask`, and `labels`) whose values are 2-D tensors stacking all the sentences in the batch.

for i in range(len(examples)):
    response_token_ids_start_idx = None

    for idx in np.where(batch["labels"][i] == self.response_token_ids[0])[0]:
        # `response_token_ids` is `'### Response:\n'`, here we are just making sure that the token IDs match
        if (
            self.response_token_ids
            == batch["labels"][i][idx : idx + len(self.response_token_ids)].tolist()
        ):
            response_token_ids_start_idx = idx

    if response_token_ids_start_idx is None:
        warnings.warn(
            f"Could not find response key `{self.response_template}` in the "
            f'following instance: {self.tokenizer.decode(batch["input_ids"][i])} '
            f"This instance will be ignored in loss calculation. "
            f"Note, if this happens often, consider increasing the `max_seq_length`."
        )
        batch["labels"][i, :] = self.ignore_index
    else:
        response_token_ids_end_idx = response_token_ids_start_idx + len(self.response_token_ids)

        # Make pytorch loss function ignore all tokens up through the end of the response key
        batch["labels"][i, :response_token_ids_end_idx] = self.ignore_index

The first `for` loop scans `labels[i]` for every position where the full `response_template` token sequence matches, keeping the start index of the last match as `response_token_ids_start_idx`. Every label from the beginning of the sequence up through the last token of the response template is then set to -100, so no loss is computed on the prompt or on the template itself, only on the answer tokens that follow.
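A hand-worked sketch of this masking (all token IDs are made up):

# Suppose the response template tokenizes to [101, 102] and one example's labels are:
labels = [5, 6, 7, 101, 102, 8, 9, -100]  # prompt, template, answer, padding
# The loop finds the template starting at index 3, so
# response_token_ids_end_idx = 3 + 2 = 5, and labels[:5] = -100 yields:
# [-100, -100, -100, -100, -100, 8, 9, -100]
# The loss is now computed only on the answer tokens 8 and 9.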

If `instruction_template` is not `None`, the collator supports a multi-turn conversation training setup: it locates the start of every instruction and every response, and masks everything except the assistant's responses, so that only the answer part of each turn contributes to the loss.
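A minimal multi-turn sketch, following the `### Human:` / `### Assistant:` convention shown in the TRL documentation:

from trl import DataCollatorForCompletionOnlyLM

instruction_template = "### Human:"
response_template = "### Assistant:"
collator = DataCollatorForCompletionOnlyLM(
    instruction_template=instruction_template,
    response_template=response_template,
    tokenizer=tokenizer,  # the same tokenizer as in the example above
    mlm=False,
)
# Within each multi-turn example, only the tokens after each "### Assistant:"
# (up to the next "### Human:") keep their labels; everything else is set to -100.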
