Details and Caveats of Large Model Training

lucky_append

Last modified: 2024-06-16 13:53:01

Tags: natural language processing

First published: 2024-04-29 11:37:29

Article link: https://blog.csdn.net/qq_41728178/article/details/138306463

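Tokens
- https://blog.csdn.net/yosemite1998/article/details/122306758
  - How it works at data-processing time: each record is converted into token ids using the vocabulary that has already been collected.
- When the model should learn knowledge from a new domain, first collect new vocabulary entries to extend the current vocabulary.
  - Following the format used in ColossalAI: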
```
{"piece": "你好"}
{"piece": "人工智能"}
```
  - Once enough entries have been collected, merge them into the original vocabulary or use them to replace it.
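The merge step can be illustrated with a minimal sketch. It assumes the vocabulary is a plain `dict` mapping piece to id; `load_pieces`, `merge_vocab`, and the toy `base_vocab` are hypothetical names used only for illustration, not part of ColossalAI. A real tokenizer stores its vocabulary differently, and the model's embedding matrix also has to be resized after the vocabulary grows.

```python
import json


def load_pieces(jsonl_lines):
    """Parse lines of the form {"piece": "..."} and return the pieces in order."""
    pieces = []
    for line in jsonl_lines:
        line = line.strip()
        if line:
            pieces.append(json.loads(line)["piece"])
    return pieces


def merge_vocab(base_vocab, new_pieces):
    """Return a copy of base_vocab with unseen pieces appended at the next free ids."""
    merged = dict(base_vocab)
    next_id = max(merged.values(), default=-1) + 1
    for piece in new_pieces:
        if piece not in merged:  # pieces already in the vocabulary keep their old id
            merged[piece] = next_id
            next_id += 1
    return merged


# Hypothetical toy vocabulary; a real tokenizer vocabulary is far larger.
base_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3}
lines = ['{"piece": "你好"}', '{"piece": "人工智能"}', '{"piece": "hello"}']
merged_vocab = merge_vocab(base_vocab, load_pieces(lines))
print(merged_vocab)
```

Appending new pieces at the end keeps every existing id stable, so the embeddings already learned for the original vocabulary remain valid.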
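Training workflow
- This follows the Colossal-LLaMA-2 training process. The main goal of that training is to improve the model's expressive ability and its handling of Chinese, and it is split into three stages.
- Large-scale pre-training stage:
  - The goal of large-scale pre-training is to build the model's foundational capabilities from scratch. An open question is which mix of foundational capabilities is reasonable and effective.
  - In terms of scale, this stage needs a dataset of no fewer than 1 trillion tokens.
- Chinese knowledge injection stage:
  - In this stage, Chinese knowledge is introduced into the model. It requires a high-quality dataset that is rich in comprehensive knowledge related to the Chinese language.
- Knowledge replay stage:
  - Knowledge is replayed through a question-answering mechanism, covering both the Chinese and the English domains.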
Composition of the training data
- Lessons borrowed from ColossalAI:
  
  Our experiments have revealed that the distributions within the training dataset, as well as the arrangement of various topic-related data points, significantly impact the overall performance of the model, particularly in the context of continual pre-training of LLaMA-2.
  
  In an effort to achieve a more balanced distribution and exert control over the dataset's ordering, we have adopted a method where we divide each sub-dataset into discrete bins. These bins are then combined to construct individual data buckets, with one bin contributed by each sub-dataset.
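  - In short: the distribution of the training data and the order in which topic-related data points appear both have a significant effect on overall model performance, especially during continual pre-training of LLaMA-2.
  - To get a more balanced distribution and to control the ordering, ColossalAI treats each category as one sub-dataset and splits every sub-dataset into discrete bins. The bins are then combined into data buckets, with each sub-dataset contributing one bin per bucket.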
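The bin-and-bucket idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not ColossalAI's implementation: `split_into_bins`, `build_buckets`, and the toy `sub_datasets` are hypothetical, and the sketch assumes contiguous bins of near-equal size with the same number of bins for every sub-dataset.

```python
def split_into_bins(items, num_bins):
    """Split a list into num_bins contiguous bins of near-equal size."""
    base, extra = divmod(len(items), num_bins)
    bins, start = [], 0
    for i in range(num_bins):
        # The first `extra` bins take one additional item each.
        end = start + base + (1 if i < extra else 0)
        bins.append(items[start:end])
        start = end
    return bins


def build_buckets(sub_datasets, num_bins):
    """Bucket k is the concatenation of bin k from every sub-dataset."""
    binned = {name: split_into_bins(data, num_bins) for name, data in sub_datasets.items()}
    buckets = []
    for k in range(num_bins):
        bucket = []
        for name in sub_datasets:  # one bin contributed by each sub-dataset
            bucket.extend(binned[name][k])
        buckets.append(bucket)
    return buckets


# Hypothetical toy data: one sub-dataset per category.
sub_datasets = {
    "sports": ["s0", "s1", "s2", "s3"],
    "riddle": ["r0", "r1"],
    "news": ["n0", "n1", "n2", "n3", "n4", "n5"],
}
buckets = build_buckets(sub_datasets, num_bins=2)
for k, bucket in enumerate(buckets):
    print(k, bucket)
```

Because every bucket holds one bin from each category, each bucket preserves roughly the same category mix as the whole dataset, and the order of buckets gives a handle on the order in which data is seen during training.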

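Pre-training: what the input is, what the output is, and how the loss is driven to converge

GPT training steps
- Note: this assumes the supervised fine-tuning dataset is a classification task. For a generation task, the process is the same as in the unsupervised pre-training stage.

Pre-training steps borrowed from ColossalAI

colossal-llama-2

Structure of the pre-training data: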

{"source": "", "target": "Lionel Andrés Messi(Spanish pronunciation: [ljoˈnel anˈdɾes ˈmesi] (i); born 24 June 1987), also known as Leo Messi, is an Argentine professional footballer who plays as a forward for and captains both Major League Soccer club Inter Miami and the Argentina national team.", "category": "sports"}
{"source": "猜谜语：一身卷卷细毛，吃的青青野草，过了数九寒冬，无私献出白毛。（打一动物）", "target": "白羊", "category": "riddle"}

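source (string): This part is ignored when calculating loss. It may be left empty, which is the default.
target (string): Loss is calculated on this part.
category (string): Tags for each data point, i.e. the category the data belongs to.

In practice, a non-empty source means the record is a question with its answer: source holds the question and target holds the answer.

Steps from the JSON data file to token ids: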

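    from copy import deepcopy  # needed for the label copy below

    # Body of the tokenization function: data_point, tokenizer, ignore_index, and the
    # module-level constant IGNORE_INDEX are provided by the surrounding code.
    if ignore_index is None:
        ignore_index = IGNORE_INDEX

    source_text = data_point["source"]  # `str`, the "source" field of the record
    target_text = data_point["target"]  # `str`, the "target" field of the record
    # The "category" field only tags the data type; it is metadata used while building the dataset.
    is_null_source = len(source_text) == 0  # empty source: plain knowledge text; otherwise a question

    source_text = tokenizer.bos_token + source_text
    target_text += tokenizer.eos_token
    sequence_text = source_text + target_text  # final layout: bos + source + target + eos

    tokenized = tokenizer([source_text, sequence_text])["input_ids"]  # convert both strings to token ids
    sequence_input_ids = tokenized[1]  # token ids of sequence_text
    sequence_labels = deepcopy(sequence_input_ids)

    # Labels start as a copy of the inputs; the positions covered by source are then
    # overwritten with the ignore marker so that they do not contribute to the loss.
    source_length = len(tokenized[0])
    if not is_null_source:
        sequence_labels[:source_length] = [ignore_index] * source_length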

sequence_input_ids is the input_ids of each training record.

sequence_labels is the labels of each training record.
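To make the role of the ignore marker concrete, here is a minimal pure-Python sketch of a masked next-token cross-entropy. It is an illustration under assumptions rather than the training code itself: `masked_next_token_loss` is a hypothetical helper, `IGNORE_INDEX = -100` is assumed (it matches the default `ignore_index` of PyTorch's `CrossEntropyLoss`), and the shift in which position t predicts token t + 1 is the usual causal language model convention, which the text above does not show.

```python
import math

IGNORE_INDEX = -100  # assumed value; matches PyTorch's CrossEntropyLoss default


def masked_next_token_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Average cross-entropy of predicting labels[t + 1] from logits[t].

    logits: one list of vocabulary scores per position.
    labels: one label id per position; ignore_index marks positions to skip.
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == ignore_index:
            continue  # masked positions (e.g. the source part) add nothing to the loss
        row = logits[t]
        row_max = max(row)
        # Numerically stable log-sum-exp over the vocabulary.
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[target]  # negative log-probability of the target token
        count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 3 tokens, uniform logits, first two label positions masked.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 1]
loss = masked_next_token_loss(logits, labels)
print(loss)
```

With uniform logits every unmasked target has probability 1/3, so the printed value equals ln 3. Training lowers this number by raising the probability the model assigns to the unmasked target tokens only.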

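Fine-tuning: what the input is, what the output is, and how the loss is driven to converge

Instruction fine-tuning steps borrowed from ColossalAI

Structure of the instruction dataset: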

{"messages": [{"from": "human", "content": "What are the three primary colors?"}, {"from": "assistant", "content": "The three primary colors are red, blue, and yellow."}]}
{"messages": [{"from": "human", "content": "解释个人电脑和服务器之间的区别。"}, {"from": "assistant", "content": "个人电脑和服务器是两种不同类型的计算机系统，它们的主要区别在于用途、硬件配置和性能。 个人电脑，顾名思义，是为个人使用而设计的计算机。它们通常用于日常的工作、娱乐和学习，可以运行各种各样的应用程序和游戏。个人电脑的硬件配置一般是按照标准配置来设计的，不过也可以根据个人需求进行定制。 而服务器是为了满足大量用户的需求而设计的计算机系统，它们通常用于为用户提供各种网络服务，如网站、电子邮件和文件传输等。服务器通常需要高性能的硬件配置，并且可以承受高负载和长时间的运行。由于服务器需要支持大量用户的访问，它们通常配备多核处理器、大容量内存和大容量硬盘驱动器，以提高系统的运行速度和稳定性。 总之，个人电脑和服务器之间的主要区别在于它们的用途、硬件配置和性能。个人电脑用于个人使用，而服务器用于支持大量用户的访问。服务器的硬件配置通常比个人电脑更高，以保证系统的性能和稳定性。"}]}

messages (list): A conversation between a human and an assistant. The length of messages can vary, and only the assistant's content is used as the target for calculating loss.

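import bisect
from copy import deepcopy

# Conversation and SeparatorStyle are defined in the Colossal-LLaMA-2 code base;
# data_point, tokenizer, max_length, and ignore_index are supplied by the surrounding function.
messages = data_point["messages"]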
template = Conversation(
    system="A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
    roles=("Human", "Assistant"),
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.ADD_BOS_EOS_TOKEN,
    seps=["<s>", "</s>"],
)
template.messages = []

# One `messages` list is one conversation record.
# Format conversion: system + Human: <s>{human content}</s> + Assistant: <s>{assistant content}</s>
for mess in messages:
    from_str = mess["from"]
    if from_str.lower() == "human":
        from_str = template.roles[0]
    elif from_str.lower() == "assistant":
        from_str = template.roles[1]
    else:
        raise ValueError(f"Unsupported role {from_str.lower()}")

    template.append_message(from_str, mess["content"])
    
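# Find the largest number of complete (human, assistant) turns whose prompt fits within max_length - 1 tokens.
# Note: the key= argument of bisect.bisect_right requires Python 3.10 or later.
turns = list(range(1, len(messages) // 2 + 1))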
target_turn_index = bisect.bisect_right(
    turns,
    max_length - 1,
    key=lambda x: len(tokenizer([template.get_prompt(2 * x)], add_special_tokens=False)["input_ids"][0]),
)

target_turn = turns[target_turn_index - 1]
prompt = template.get_prompt(2 * target_turn)
# prompt layout: system + Human: <s>{human content}</s> + Assistant: <s>{assistant content}</s>

# Get the corresponding list of token ids
tokenized = tokenizer([prompt], add_special_tokens=False)["input_ids"][0]

template.messages = template.messages[0 : 2 * target_turn]

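# Record the token positions of the BOS and EOS of every assistant span, in preparation for building the labels.
starts = []
ends = []
# True only if the first message is not from the human, i.e. the first BOS/EOS pair belongs to the assistant.
gpt_bos = template.messages[0][0] != template.roles[0]
gpt_eos = template.messages[0][0] != template.roles[0]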

for i, token_id in enumerate(tokenized):
    if token_id == tokenizer.bos_token_id:
        if gpt_bos:
            starts.append(i)
        gpt_bos = not gpt_bos
    elif token_id == tokenizer.eos_token_id:
        if gpt_eos:
            ends.append(i)
        gpt_eos = not gpt_eos

# tokenized is the complete conversation record
# labels keeps only the assistant's replies; every other position is ignore_index
tokenized = [tokenizer.bos_token_id] + tokenized
labels = [ignore_index] * len(tokenized)
for start, end in zip(starts, ends):
    labels[start + 1 : end + 2] = tokenized[start + 1 : end + 2]

labels_decode = deepcopy(labels)
for i, z in enumerate(labels_decode):
    if z == ignore_index:
        labels_decode[i] = tokenizer.unk_token_id
# The label token id list is now complete; labels_decode swaps ignore_index for unk so it can be decoded for inspection.
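The BOS/EOS toggling above is easier to follow on a toy sequence. The sketch below re-implements only the masking step as a standalone function. `mask_non_assistant` and the token ids are hypothetical, and it assumes that the conversation starts with the human and that every turn's content is wrapped in one BOS/EOS pair.

```python
def mask_non_assistant(tokenized, bos_id, eos_id, ignore_index=-100):
    """Return (input_ids, labels) where only assistant spans keep their token ids.

    tokenized: prompt token ids without the leading BOS; human speaks first.
    """
    starts, ends = [], []
    is_assistant_bos = False  # the first BOS belongs to the human turn
    is_assistant_eos = False  # the first EOS belongs to the human turn
    for i, token_id in enumerate(tokenized):
        if token_id == bos_id:
            if is_assistant_bos:
                starts.append(i)
            is_assistant_bos = not is_assistant_bos  # turns alternate human/assistant
        elif token_id == eos_id:
            if is_assistant_eos:
                ends.append(i)
            is_assistant_eos = not is_assistant_eos

    input_ids = [bos_id] + tokenized  # prepend the sequence-level BOS
    labels = [ignore_index] * len(input_ids)
    for start, end in zip(starts, ends):
        # The +1 accounts for the prepended BOS; the span runs from the assistant's BOS through its EOS.
        labels[start + 1 : end + 2] = input_ids[start + 1 : end + 2]
    return input_ids, labels


# Hypothetical ids: 1 = <s>, 2 = </s>, 10 = "Human:", 20 = "Assistant:", the rest are content tokens.
toy_tokens = [10, 1, 11, 12, 2, 20, 1, 21, 22, 2]
input_ids, labels = mask_non_assistant(toy_tokens, bos_id=1, eos_id=2)
print(input_ids)
print(labels)
```

Only the final `<s> 21 22 </s>` span survives in `labels`, so the role prefixes and the human turn contribute nothing to the loss.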

Other
