[Pretraining Large Language Models] Pretraining GPT-2 with the Transformers Library

人工智能小豪

Published 2024-08-06 17:39:16

Tags: artificial intelligence, natural language processing, python, langchain, NLP, large models, machine learning

Article link: https://blog.csdn.net/2301_81888214/article/details/140961853

This tutorial uses the HuggingFace Transformers library to run pretraining on Colab or Kaggle.

This tutorial covers pretraining on the English dataset wikitext-2 and on a code dataset.
Note: you can also upload your own dataset for training.

Goal: get the pretraining pipeline for an autoregressive language model running end to end.

1. Preparation

1.1 Installing dependencies

!pip install -U datasets
!pip install accelerate -U

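Note: when training on Colab, it is best to update datasets to the latest version (and then restart the kernel) to avoid errors caused by an outdated version.

Colab and Kaggle come with the transformers library preinstalled.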

1.2 Preparing the data

Load the data:

from datasets import load_dataset

datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

You can also use any public text dataset on the HuggingFace Hub, or data you have built yourself, by replacing the paths below with your own:

# datasets = load_dataset("text", data_files={"train": "path_to_train.txt", "validation": "path_to_validation.txt"})

To access an actual element, first select a split (the key) and then give an index.
Let's look at the format of the data:

datasets["train"][10].keys()

Each element of this dataset is a dictionary that contains only text:

dict_keys(['text'])

Look at an example:

datasets["train"][1]

{'text': ' =Valkyria Chronicles III = \n'}

Number of examples in the training and test sets:

print(len(datasets["train"]), len(datasets["test"]))

36718 4358

The following function displays a few random samples from the dataset:

from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(datasets["train"])

In the dataset, some entries are empty strings or titles, while others are complete paragraphs.

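2. Causal Language Modeling (CLM)

For causal language modeling, we first take all the texts in the dataset and concatenate their tokenized results.

We then split them into training examples of a fixed sequence length, so that the model receives contiguous chunks of text such as:

part of text 1

or

end of text 1 [BOS_TOKEN] beginning of text 2

Which of the two you get depends on whether the training example spans several original texts in the dataset:

An original text longer than the sequence length is split.
An original text shorter than the sequence length is concatenated with other texts.

The model's labels are the inputs shifted by one position, so that each position is trained to predict the next token.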
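The one-position shift between inputs and labels can be illustrated with a few lines of plain Python. This is only a sketch: the token ids are made up, and no tokenizer or model is involved.

```python
# Toy token ids standing in for one tokenized block (values are made up).
block = [101, 7, 42, 9, 55]

# Position i is trained to predict the token at position i + 1,
# so the usable inputs drop the last token and the targets drop the first.
inputs = block[:-1]
targets = block[1:]

for context_end, target in enumerate(targets, start=1):
    print(f"context={block[:context_end]} -> next token={target}")
```

Every prefix of the block therefore contributes one prediction target, which is why a single block of text yields many training signals.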

This example uses the gpt2 model.

model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"

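You can also choose any of the causal language model checkpoints listed at https://huggingface.co/models?filter=causal-lm.

To tokenize all the texts with the same vocabulary that was used when the model was trained, first download a pretrained tokenizer.

Load it directly with the AutoTokenizer class: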

from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Now we can tokenize all the texts.

First, define a function that tokenizes the text:

def tokenize_function(examples):
    return tokenizer(examples["text"])

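Then apply it to the datasets object, using batched=True and 4 processes to speed up preprocessing, and remove the text column, which is no longer needed afterwards.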

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

Looking at a sample of the tokenized dataset, the text has been converted to input_ids (the sequence of token IDs for the text) and attention_mask:

tokenized_datasets["train"][1]
{'input_ids': [238, 8576, 9441, 2987, 238, 252],
 'attention_mask': [1, 1, 1, 1, 1, 1]}

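Next, we need to concatenate all the tokenized texts and split them into chunks of a given block_size (the operation described at the start of Section 2; block_size is effectively the max_length of each example in a batch).

To do this, we use the map method again with batched=True. Different block_size values yield different numbers of examples.

In this way, one batch of examples is turned into a new batch of examples.

First, set the maximum sequence length used when pretraining the CLM model. It is set to 256 here so that your GPU memory does not blow up 💥.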

# block_size = tokenizer.model_max_length
block_size = 256

Then use a preprocessing function to group the training texts:

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # The small remainder of tokens is dropped here. If the model supports it, you could pad instead; customize this as needed.
    total_length = (total_length // block_size) * block_size
    
    # Split into chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
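To see what this grouping does to a batch, here is a small self-contained sketch that runs the same logic on made-up token ids. The only change from the function above is that block_size is passed as a parameter instead of being read from a global, which is an adjustment made for this illustration.

```python
def group_texts(examples, block_size):
    # Concatenate every column (input_ids, attention_mask, ...) into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three made-up "documents" of 3, 5 and 2 tokens: 10 tokens in total.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

grouped = group_texts(batch, block_size=4)
print(grouped["input_ids"])  # two full blocks; tokens 9 and 10 are dropped
print(grouped["labels"] == grouped["input_ids"])
```

Three input rows become two output rows, which is why the function has to be applied with a batched map: the number of examples changes.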

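First, note that we copy the inputs to use as the labels.

This is because the models in the 🤗 Transformers library shift the labels by one position internally when computing the loss, so we do not need to do it by hand.

Also note that, by default, the map method sends batches of 1,000 examples to the preprocessing function. The remainder is therefore dropped once per batch, so that the concatenated tokens of each batch are trimmed to a multiple of block_size. You can adjust this behavior by passing a larger batch size (which will also be processed more slowly). You can also use multiprocessing to speed up preprocessing: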

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=2000,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/4358 [00:00<?, ? examples/s]
Map (num_proc=4):   0%|          | 0/36718 [00:00<?, ? examples/s]
Map (num_proc=4):   0%|          | 0/3760 [00:00<?, ? examples/s]
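Because the remainder is dropped once per batch, the batch size affects how many tokens are lost. The sketch below illustrates this with made-up document lengths; dropped_tokens is a hypothetical helper written for this illustration and is not part of the datasets library.

```python
def dropped_tokens(doc_lengths, batch_size, block_size):
    """Count tokens lost when each batch is trimmed to a multiple of block_size."""
    dropped = 0
    for start in range(0, len(doc_lengths), batch_size):
        total = sum(doc_lengths[start : start + batch_size])
        dropped += total % block_size  # tail of this batch that fills no block
    return dropped


# Made-up token counts for six documents (40 tokens in total).
doc_lengths = [7, 3, 12, 5, 9, 4]

small_batches = dropped_tokens(doc_lengths, batch_size=2, block_size=4)
one_batch = dropped_tokens(doc_lengths, batch_size=6, block_size=4)
print(small_batches, one_batch)
```

With larger batches fewer tails are cut off, at the cost of concatenating longer lists in each call.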

Now we can check how the dataset has changed.
Each example now contains a contiguous block of block_size tokens, which may span several original texts.

tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' game and follows the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries, along with Valkyria Chronicles II director Takeshi Oz'

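With the pretraining corpus prepared, we can start training the model.

First, create a model. Loading the checkpoint with from_pretrained gives the pretrained gpt2 weights: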

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

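To pretrain from scratch instead, build the model from its configuration only, so that the weights are randomly initialized. Training itself is handled by the Trainer class from transformers: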

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_config(config)

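Training arguments

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    f"{model_checkpoint}-wikitext2",
    # Recent transformers releases rename this argument to eval_strategy
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    # push_to_hub=True
)

Train the model

# The original post does not show how trainer is built; this assumes the
# train and validation splits of lm_datasets created above
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)
trainer.train()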

Training log

[ 220/3375 02:11 < 31:43, 1.66 it/s, Epoch 0.19/3]

Evaluation results

import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 552.71

The perplexity is still quite high since for this demo we trained on a small dataset for a small number of epochs. For a real LM training, you would need a larger dataset and more epochs.
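The relationship between loss and perplexity used above can be checked on a toy case. The sketch below assumes the loss is the mean negative log-likelihood of the target tokens in nats, which is what math.exp(eval_loss) presupposes; the probabilities are made up.

```python
import math


def perplexity(target_probs):
    """Perplexity = exp(mean negative log-likelihood of the target tokens)."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))


# Made-up probabilities the model assigned to each correct next token.
probs = [0.5, 0.25, 0.125]
print(perplexity(probs))

# A model that spreads probability uniformly over V tokens has perplexity V.
vocab_size = 50257
print(perplexity([1 / vocab_size] * 10))

# Going the other way: the eval_loss that corresponds to a perplexity of 552.71.
print(math.log(552.71))
```

A perplexity of 552.71 corresponds to a loss of about 6.3, and it can be read as the model being roughly as uncertain as a uniform choice among a few hundred tokens, compared with 50257 for a model that knows nothing.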

1.5 A closer look at the tokenizer

tokens = tokenizer.tokenize("六朝何事")
tokens

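The result looks odd: the vocabulary contains almost no Chinese, so the text falls back to byte-level tokens. Each of these Chinese characters is 3 bytes in UTF-8, which gives 12 tokens for 4 characters.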

['å', 'ħ', 'Ń', 'æ', 'ľ', 'Ŀ', 'ä', '½', 'ķ', 'ä', 'º', 'ĭ']
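These symbols are not random. The sketch below assumes the tokenizer follows the GPT-2 style byte-level scheme, in which every UTF-8 byte is mapped to a printable character before BPE merges are applied; the bytes_to_unicode table is rewritten here for illustration rather than imported from the library.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points starting at 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


text = "六朝何事"
raw = text.encode("utf-8")
print(len(text), len(raw))  # characters vs. UTF-8 bytes

table = bytes_to_unicode()
symbols = [table[b] for b in raw]
print(symbols)
```

The same table maps the space byte to Ġ, which is why that character shows up in front of English tokens.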

Convert to token IDs

tokenizer.convert_tokens_to_ids(tokens)

Result

[150, 165, 193, 151, 188, 189, 149, 121, 181, 149, 118, 171]

Use encode to convert directly to token IDs

tokenizer.encode("六朝何事")

[150, 165, 193, 151, 188, 189, 149, 121, 181, 149, 118, 171]

Calling the tokenizer directly

tokenizer("六朝何事")

gives the same input_ids

{'input_ids': [150, 165, 193, 151, 188, 189, 149, 121, 181, 149, 118, 171], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Decoding (the reverse of tokenization)

tokenizer.decode(tokenizer("六朝何事")['input_ids'])

'六朝何事'

1.6 Inference

x = tokenizer("六朝何事", return_tensors="pt")
y = model.forward(x['input_ids'])
y

The output is very long:

CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[-0.5103, -0.3852, -0.0509,  ..., -0.1831,  0.7720, -0.2264],
         [-0.9077,  0.0660, -0.7552,  ...,  0.0428,  0.6765, -0.0024],
         [ 0.4458, -0.4124, -1.2314,  ...,  0.3847,  0.4391,  0.0402],
         ...,
         [ 0.3976,  0.0738, -0.7156,  ...,  0.1152,  0.8602,  0.0270],
         [ 0.6953,  0.7504,  0.0266,  ..., -0.6524,  1.1901,  0.1273],
         [-0.3004,  0.5009, -1.0164,  ..., -0.1076,  1.4422, -0.5940]]],
       grad_fn=<UnsafeViewBackward0>), past_key_values=((tensor([[[[-0.0165, -0.5414, -0.1960,  ...,  0.0751, -1.3083, -0.6204],
          [ 0.5249,  0.0685,  0.2652,  ..., -0.1789,  0.0868, -0.5673],
          [ 0.6694, -0.5541, -0.2543,  ...,  0.0981, -0.1687, -0.2084],

The predicted logits, y.logits, have shape torch.Size([1, 12, 50257]): one row of 50257 vocabulary scores for each of the 12 input tokens.

tensor([[[ 0.6202, -0.1432, -0.0364,  ..., -0.6025,  0.7150, -0.2145],
         [-0.3945, -0.0824, -0.5818,  ...,  0.0286,  0.6341, -0.2636],
         [ 0.2438,  0.5748, -0.9318,  ..., -0.4956,  0.5061, -0.3112],
         ...,
         [ 1.0054,  0.3126, -0.1491,  ..., -0.1764,  0.4643, -0.1376],
         [ 0.5537,  0.7263,  0.0582,  ..., -0.7386,  1.2950, -0.1308],
         [ 0.5036,  1.0895,  0.0722,  ..., -0.8044,  0.4085, -0.8951]]],
       grad_fn=<UnsafeViewBackward0>)

Because the tokens predicted for Chinese input do not decode properly, the rest of this section tests with English.

import torch
import numpy as np

inputs_text = "Hello "
x = tokenizer(inputs_text, return_tensors="pt")
y = model.forward(x['input_ids'])

# Greedy decoding: take the token with the highest score
next_token_id = int(np.argmax(y.logits[0][-1].detach().numpy()))
print(next_token_id)
next_token = tokenizer.convert_ids_to_tokens(next_token_id)
print(inputs_text + next_token)

Result
10391
Hello ĠBright
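The greedy step above takes the argmax over raw logits instead of probabilities. Because softmax preserves ordering, both choices select the same token. A minimal sketch with made-up logits for a 5-token vocabulary:

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a 5-token vocabulary at the last position.
last_logits = [0.5, -1.2, 2.3, 0.0, 1.1]
probs = softmax(last_logits)

greedy_from_logits = max(range(len(last_logits)), key=lambda i: last_logits[i])
greedy_from_probs = max(range(len(probs)), key=lambda i: probs[i])
print(greedy_from_logits, greedy_from_probs, round(sum(probs), 6))
```

Probabilities only become necessary once you sample from the distribution instead of always taking the top token.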

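Generation code (max_length sets the number of tokens to generate)

max_length = 20
inputs_text = "hello "

# encode() already returns a flat list of token ids, so no extra nesting or slicing is needed
input_ids = tokenizer.encode(inputs_text)
for i in range(max_length):
    # Add the batch dimension and run the model on the whole sequence so far
    outputs = model(torch.tensor([input_ids]))
    # Greedy decoding: highest-scoring token at the last position
    last_token_id = int(np.argmax(outputs.logits[0][-1].detach().numpy()))
    last_token = tokenizer.convert_ids_to_tokens(last_token_id)
    inputs_text += last_token
    input_ids.append(last_token_id)

# Decode all ids at once so that byte-level markers such as Ġ are turned back into spaces
print(tokenizer.decode(input_ids))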
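The loop structure can be tried without downloading any model. In the sketch below, toy_model, TOY_LOGITS and VOCAB are hypothetical stand-ins with made-up scores; only the control flow mirrors the generation loop above.

```python
# Hypothetical stand-in for model(...): maps the last token id to made-up
# next-token scores over a 4-token vocabulary.
TOY_LOGITS = {
    0: [0.1, 2.0, 0.3, 0.0],
    1: [0.0, 0.1, 1.5, 0.2],
    2: [0.2, 0.0, 0.1, 3.0],
    3: [1.0, 0.2, 0.0, 0.1],
}
VOCAB = ["hello", " world", " again", "!"]


def toy_model(input_ids):
    """Return one row of scores per position, like logits[0]."""
    return [TOY_LOGITS[t] for t in input_ids]


def greedy_generate(input_ids, max_new_tokens):
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(ids)
        last = logits[-1]  # scores for the token that follows the sequence
        next_id = max(range(len(last)), key=lambda i: last[i])
        ids.append(next_id)  # feed the prediction back in on the next step
    return ids


ids = greedy_generate([0], max_new_tokens=3)
generated_text = "".join(VOCAB[i] for i in ids)
print(ids, generated_text)
```

Each step re-runs the model on the full sequence and reads only the last row of scores, which is simple but redoes work on every step.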
