Hugging Face NLP Notes 7.2: Fine-tuning a masked language model

NJU_AI_NB

Article link: https://blog.csdn.net/aa12367/article/details/138496368

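For many NLP applications involving Transformer models, you can simply take a pretrained model from the Hugging Face Hub and fine-tune it directly on your data for the task at hand. Provided that the corpus used for pretraining is not too different from the corpus used for fine-tuning, transfer learning will usually produce good results.

However, there are a few cases where you'll want to first fine-tune the language model on your data, before training a task-specific head. For example, if your dataset contains legal contracts or scientific articles, a vanilla Transformer model like BERT will typically treat the domain-specific words in your corpus as rare tokens, and the resulting performance may be less than satisfactory. By fine-tuning the language model on in-domain data you can boost the performance of many downstream tasks, which means you usually only have to do this step once!

This process of fine-tuning a pretrained language model on in-domain data is usually called domain adaptation. It was popularized in 2018 by ULMFiT, one of the first neural architectures (based on LSTMs) to make transfer learning really work for NLP. The figure below shows an example of domain adaptation with ULMFiT; in this section we'll do something similar, but with a Transformer instead of an LSTM!
[Figure: domain adaptation with ULMFiT (image not available)]

By the end of this section you'll have a masked language model on the Hub that can autocomplete sentences.

Let's dive in!

🙋 If the terms "masked language modeling" and "pretrained model" sound unfamiliar to you, go check out Chapter 1, where we explain all these core concepts, complete with videos!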

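Choosing a pretrained model for masked language modeling

First, let's pick a suitable pretrained model for masked language modeling. As shown in the screenshot below, you can find a list of candidates by applying the "Fill-Mask" filter on the Hugging Face Hub:
[Figure: screenshot of Hub models filtered by "Fill-Mask" (image not available)]

Although the BERT and RoBERTa family of models are the most downloaded, we'll use a model called DistilBERT that can be trained much faster with little to no loss in downstream performance. This model was trained using a special technique called knowledge distillation, where a large "teacher model" like BERT is used to guide the training of a "student model" that has far fewer parameters. Explaining the details of knowledge distillation would take us too far afield in this section, but if you're interested you can read all about it in Natural Language Processing with Transformers (colloquially known as the Transformers textbook).

Let's go ahead and download DistilBERT using the AutoModelForMaskedLM class: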

from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

We can see how many parameters this model has by calling the num_parameters() method:

distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'

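With around 67 million parameters, DistilBERT is roughly 40% smaller than the BERT base model (110 million parameters), which translates into noticeably faster training. Nice! Now let's see what kinds of tokens this model predicts are the most probable completions of a small sample of text: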

text = "This is a great [MASK]."

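As humans, we can imagine many possibilities for the [MASK] token, such as "day", "ride", or "painting". For pretrained models, the predictions depend on the corpus the model was trained on, since it learns to pick up the statistical patterns present in the data. Like BERT, DistilBERT was pretrained on the English Wikipedia and BookCorpus datasets, so we expect the predictions for [MASK] to reflect these domains. To predict the mask we need DistilBERT's tokenizer to produce the inputs for the model, so let's download that from the Hub as well: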

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

With a tokenizer and a model, we can now pass our text example to the model, extract the logits, and print out the top 5 candidates:

import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'

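We can see from the outputs that the model's predictions refer to everyday terms, which is perhaps not surprising given the foundation of English Wikipedia. Let's see how we can change this domain to something a bit more niche: highly polarized movie reviews!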
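
To make the step from logits to ranked candidates more concrete, here is a minimal library-free sketch of what picking the top-k tokens at the [MASK] position amounts to. The vocabulary and logit values below are made up for illustration (they are not real DistilBERT outputs), and softmax and top_k are hypothetical helper names:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, vocab, k):
    """Return the k (token, probability) pairs with the highest logits."""
    probs = softmax(logits)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(vocab[i], probs[i]) for i in ranked[:k]]

# Hypothetical vocabulary and logits for the [MASK] position (made-up numbers)
toy_vocab = ["deal", "success", "adventure", "idea", "feat", "banana"]
toy_logits = [6.1, 5.8, 5.5, 5.4, 5.2, -1.0]

text = "This is a great [MASK]."
candidates = top_k(toy_logits, toy_vocab, k=5)
for token, prob in candidates:
    print(f"{text.replace('[MASK]', token)}  (p={prob:.3f})")
```

Ranking by logits or by softmax probabilities gives the same order, which is why the snippet above can call torch.topk directly on the logits.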

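The dataset

To showcase domain adaptation, we'll use the famous Large Movie Review Dataset (or IMDb for short), a corpus of movie reviews that is often used to benchmark sentiment analysis models. By fine-tuning DistilBERT on this corpus, we expect the language model to adapt from the factual data of Wikipedia that it was pretrained on to the more subjective elements of movie reviews. We can get the data from the Hugging Face Hub with the load_dataset() function from 🤗 Datasets: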

from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

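We can see that the train and test splits each consist of 25,000 reviews, while there is an unlabeled split called unsupervised that contains 50,000 reviews. Let's take a look at a few samples to get an idea of what kind of text we're dealing with. As we've done in previous chapters of the course, we'll chain the Dataset.shuffle() and Dataset.select() functions to create a random sample: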

sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")

'>>> Review: This is your typical Priyadarshan movie--a bunch of loony characters out on some silly mission. His signature climax has the entire cast of the film coming together and fighting each other in some crazy moshpit over hidden money. Whether it is a winning lottery ticket in Malamaal Weekly, black money in Hera Pheri, "kodokoo" in Phir Hera Pheri, etc., etc., the director is becoming ridiculously predictable. Don\'t get me wrong; as clichéd and preposterous his movies may be, I usually end up enjoying the comedy. However, in most his previous movies there has actually been some good humor, (Hungama and Hera Pheri being noteworthy ones). Now, the hilarity of his films is fading as he is using the same formula over and over again.<br /><br />Songs are good. Tanushree Datta looks awesome. Rajpal Yadav is irritating, and Tusshar is not a whole lot better. Kunal Khemu is OK, and Sharman Joshi is the best.'
'>>> Label: 0'

'>>> Review: Okay, the story makes no sense, the characters lack any dimensionally, the best dialogue is ad-libs about the low quality of movie, the cinematography is dismal, and only editing saves a bit of the muddle, but Sam" Peckinpah directed the film. Somehow, his direction is not enough. For those who appreciate Peckinpah and his great work, this movie is a disappointment. Even a great cast cannot redeem the time the viewer wastes with this minimal effort.<br /><br />The proper response to the movie is the contempt that the director San Peckinpah, James Caan, Robert Duvall, Burt Young, Bo Hopkins, Arthur Hill, and even Gig Young bring to their work. Watch the great Peckinpah films. Skip this mess.'
'>>> Label: 0'

'>>> Review: I saw this movie at the theaters when I was about 6 or 7 years old. I loved it then, and have recently come to own a VHS version. <br /><br />My 4 and 6 year old children love this movie and have been asking again and again to watch it. <br /><br />I have enjoyed watching it again too. Though I have to admit it is not as good on a little TV.<br /><br />I do not have older children so I do not know what they would think of it. <br /><br />The songs are very cute. My daughter keeps singing them over and over.<br /><br />Hope this helps.'
'>>> Label: 1'

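Yep, these are certainly movie reviews, and if you're old enough you may even understand the comment in the last review about owning a VHS version 😜! Although we won't need the labels for language modeling, we can already see that a 0 denotes a negative review, while a 1 corresponds to a positive one.

📝 Try it out! Create a random sample of the unsupervised split and verify that the labels are neither 0 nor 1. While you're at it, you could also check that the labels in the train and test splits are indeed 0 or 1. This is a useful sanity check that every NLP practitioner should perform at the start of a new project!

Now that we've had a quick look at the data, let's dive into preparing it for masked language modeling. As we'll see, there are some additional steps that one needs to take compared to the sequence classification tasks we saw in Chapter 3. Let's go!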

Preprocessing the data

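For both auto-regressive and masked language modeling, a common preprocessing step is to concatenate all the examples and then split the whole corpus into chunks of equal size. This is quite different from our usual approach, where we simply tokenize individual examples. Why concatenate everything together? The reason is that individual examples might get truncated if they're too long, and that would result in losing information that might be useful for the language modeling task!

So to get started, we'll first tokenize our corpus as usual, but without setting the truncation=True option in our tokenizer. We'll also grab the word IDs if they are available (which they will be if we're using a fast tokenizer, as described in Chapter 6), as we will need them later on to do whole word masking. We'll wrap this in a simple function, and while we're at it we'll remove the text and label columns since we don't need them any longer: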

def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['attention_mask', 'input_ids', 'word_ids'],
        num_rows: 50000
    })
})

Since DistilBERT is a BERT-like model, we can see that the encoded texts consist of the input_ids and attention_mask that we've seen in other chapters, as well as the word_ids we added.
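
To get a feel for what the word_ids column contains, here is a rough library-free approximation. It assumes WordPiece-style tokens where a "##" prefix marks the continuation of a word, and the token split shown is a hypothetical example; a real fast tokenizer derives word IDs from its own pre-tokenization, so treat toy_word_ids as an illustration only:

```python
def toy_word_ids(tokens, special_tokens=("[CLS]", "[SEP]")):
    """Assign a word index to each token; special tokens get None."""
    word_ids = []
    current_word = -1
    for token in tokens:
        if token in special_tokens:
            word_ids.append(None)
            continue
        if not token.startswith("##"):
            current_word += 1  # a token without the "##" prefix starts a new word
        word_ids.append(current_word)
    return word_ids

# Hypothetical tokenization of "bromwell high is a cartoon"
tokens = ["[CLS]", "bro", "##mwell", "high", "is", "a", "cartoon", "[SEP]"]
print(toy_word_ids(tokens))
```

Tokens that share a word index belong to the same word, which is the information whole word masking needs later on.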

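Now that we've tokenized our movie reviews, the next step is to group them all together and split the result into chunks. But how big should these chunks be? This will ultimately be determined by the amount of GPU memory that you have available, but a good starting point is to see what the model's maximum context size is. This can be inferred by inspecting the model_max_length attribute of the tokenizer: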

tokenizer.model_max_length

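This value is derived from the tokenizer_config.json file associated with a checkpoint; in this case we can see that the context size is 512 tokens, just like with BERT.

✏️ Try it out! Some Transformer models, like BigBird and Longformer, have a much longer context length than BERT and other early Transformer models. Instantiate the tokenizer for one of these checkpoints and verify that the model_max_length agrees with what's quoted on its model card.

So, in order to run our experiments on GPUs like those found on Google Colab, we'll pick something a bit smaller that can fit in memory: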

chunk_size = 128

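Note that using a small chunk size can be detrimental in real-world scenarios, so you should use a size that corresponds to the use case you will apply your model to.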
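
As a small aside on how the chunk size interacts with the amount of data, the sketch below counts how many full chunks a token stream yields and how many tokens are left in the final partial chunk. The corpus size of 10,000 tokens is a hypothetical number chosen for illustration, and chunk_stats is a hypothetical helper:

```python
def chunk_stats(total_tokens, chunk_size):
    """Return (number of full chunks, tokens left over in the final partial chunk)."""
    return total_tokens // chunk_size, total_tokens % chunk_size

# Hypothetical corpus size in tokens, chosen only for illustration
total_tokens = 10_000

for size in (64, 128, 256, 512):
    n_chunks, leftover = chunk_stats(total_tokens, size)
    print(f"chunk_size={size:>3}: {n_chunks} full chunks, {leftover} leftover tokens")
```

Smaller chunks give more (but shorter) training examples, while larger chunks give the model more context per example at a higher memory cost.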

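Now comes the fun part. To show how the concatenation works, let's take a few reviews from our tokenized training set and print out the number of tokens per review:

# Slicing produces a list of lists for each feature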
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 200'
'>>> Review 1 length: 559'
'>>> Review 2 length: 192'

We can then concatenate all these examples with a simple dictionary comprehension, as follows:

concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 951'
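
One design note on the concatenation step: sum(list_of_lists, []) builds a new list at every addition, so its cost can grow quadratically with the amount of data. A possible alternative, sketched here on made-up token IDs rather than the real tokenized_samples, is itertools.chain, which flattens in a single pass and gives the same result:

```python
from itertools import chain

def concatenate(list_of_lists):
    """Flatten a list of lists in a single pass."""
    return list(chain.from_iterable(list_of_lists))

# Toy stand-in for tokenized_samples: three "reviews" with made-up token IDs
toy_samples = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1]],
}

concatenated = {k: concatenate(v) for k, v in toy_samples.items()}
print(concatenated["input_ids"])
print(len(concatenated["input_ids"]))
```

For the batch sizes used in this section the difference is unlikely to matter much, so the simpler sum() form is a reasonable choice here.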

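Great, the total length checks out (200 + 559 + 192 = 951). Now let's split the concatenated reviews into chunks of the size given by chunk_size. To do so, we iterate over the features in concatenated_examples and use a list comprehension to create slices of each feature. The result is a dictionary of chunks for each feature: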

chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 55'

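As you can see in this example, the last chunk will generally be smaller than the maximum chunk size. There are two main strategies for dealing with this:

Drop the last chunk if it's smaller than chunk_size.
Pad the last chunk until its length equals chunk_size.

We'll take the first approach here, so let's wrap all of the above logic in a single function that we can apply to our tokenized datasets: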

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute the length of the concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

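Note that in the last step of group_texts() we create a new labels column which is a copy of the input_ids one. As we'll see shortly, that's because in masked language modeling the objective is to predict randomly masked tokens in the input batch, and by creating a labels column we provide the ground truth for our language model to learn from.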
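
For comparison, the second strategy (padding the last chunk instead of dropping it) could look roughly like the sketch below. This is not the approach used in this section; group_texts_padded and pad_values are hypothetical names, the padding ID of 0 is an assumption for illustration, and the toy token IDs are made up. Padded positions get an attention mask of 0 and a label of -100 so that they are ignored by the loss:

```python
def group_texts_padded(examples, chunk_size, pad_values):
    """Concatenate each feature, split it into chunks, and pad the final chunk."""
    result = {}
    for key, sequences in examples.items():
        flat = [item for seq in sequences for item in seq]
        chunks = [flat[i : i + chunk_size] for i in range(0, len(flat), chunk_size)]
        if chunks and len(chunks[-1]) < chunk_size:
            missing = chunk_size - len(chunks[-1])
            chunks[-1] = chunks[-1] + [pad_values[key]] * missing
        result[key] = chunks
    # Labels copy input_ids, except padded positions, which are set to -100
    result["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(result["input_ids"], result["attention_mask"])
    ]
    return result

# Toy batch with made-up token IDs; 0 is assumed to be the padding ID
toy_batch = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1]],
}
padded = group_texts_padded(
    toy_batch, chunk_size=4, pad_values={"input_ids": 0, "attention_mask": 0}
)
print(padded)
```

If a word_ids column is present, it would need its own padding value (for example None) in pad_values.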

Let's now apply group_texts() to our tokenized datasets using our trusty Dataset.map() function:

lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 61289
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 59905
    })
    unsupervised: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 122963
    })
})

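You can see that grouping and then chunking the texts has produced many more examples than our original 25,000 for the train and test splits. That's because we now have examples involving contiguous tokens that span across multiple examples from the original corpus. You can see this explicitly by looking for the special [SEP] and [CLS] tokens in one of the chunks: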

tokenizer.decode(lm_datasets["train"][1]["input_ids"])

".... at.......... high. a classic line : inspector : i'm here to sack one of your teachers. student : welcome to bromwell high. i expect that many adults of my age think that bromwell high is far fetched. what a pity that it isn't! [SEP] [CLS] homelessness ( or houselessness as george carlin stated ) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. most people think of the homeless"

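In this example you can see two overlapping movie reviews, one about a high school movie and the other about homelessness. Let's also check out what the labels look like for masked language modeling: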

tokenizer.decode(lm_datasets["train"][1]["labels"])

".... at.......... high. a classic line : inspector : i'm here to sack one of your teachers. student : welcome to bromwell high. i expect that many adults of my age think that bromwell high is far fetched. what a pity that it isn't! [SEP] [CLS] homelessness ( or houselessness as george carlin stated ) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. most people think of the homeless"

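As expected from our group_texts() function above, this looks identical to the decoded input_ids. But then how can our model possibly learn anything? We're missing a key step: inserting [MASK] tokens at random positions in the inputs! Let's see how we can do this on the fly during fine-tuning using a special data collator.

Fine-tuning DistilBERT with the Trainer API

Fine-tuning a masked language model is almost identical to fine-tuning a sequence classification model, like we did in Chapter 3. The only difference is that we need a special data collator that can randomly mask some of the tokens in each batch of texts. Fortunately, 🤗 Transformers comes prepared with a dedicated DataCollatorForLanguageModeling for just this task. We just have to pass it the tokenizer and an mlm_probability argument that specifies what fraction of the tokens to mask. We'll pick 15%, which is the amount used for BERT and a common choice in the literature: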

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

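To see how the random masking works, let's feed a few examples to the data collator. Since it expects a list of dicts, where each dict represents a single chunk of contiguous text, we first iterate over the dataset before feeding the batch to the collator. We remove the "word_ids" key for this data collator as it does not expect it: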

samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

'>>> [CLS] bromwell [MASK] is a cartoon comedy. it ran at the same [MASK] as some other [MASK] about school life, [MASK] as " teachers ". [MASK] [MASK] [MASK] in the teaching [MASK] lead [MASK] to believe that bromwell high\'[MASK] satire is much closer to reality than is " teachers ". the scramble [MASK] [MASK] financially, the [MASK]ful students whogn [MASK] right through [MASK] pathetic teachers\'pomp, the pettiness of the whole situation, distinction remind me of the schools i knew and their students. when i saw [MASK] episode in [MASK] a student repeatedly tried to burn down the school, [MASK] immediately recalled. [MASK]...'

'>>> .... at.. [MASK]... [MASK]... high. a classic line plucked inspector : i\'[MASK] here to [MASK] one of your [MASK]. student : welcome to bromwell [MASK]. i expect that many adults of my age think that [MASK]mwell [MASK] is [MASK] fetched. what a pity that it isn\'t! [SEP] [CLS] [MASK]ness ( or [MASK]lessness as george 宇in stated )公 been an issue for years but never [MASK] plan to help those on the street that were once considered human [MASK] did everything from going to school, [MASK], [MASK] vote for the matter. most people think [MASK] the homeless'

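Nice, it worked! We can see that the [MASK] token has been randomly inserted at various locations in our text. These will be the tokens which our model will have to predict during training, and the beauty of the data collator is that it will randomize the [MASK] insertion with every batch!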
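
The decoded output above also contains a few unexpected tokens in place of the originals, which hints at BERT-style masking: of the selected positions, about 80% become [MASK], about 10% become a random token, and about 10% are left unchanged. Here is a simplified pure-Python sketch of that scheme; it is not the collator's actual implementation, mask_tokens is a hypothetical helper, and the token IDs (101 for [CLS], 102 for [SEP], 103 for [MASK], a vocabulary of 30,522 entries) are assumed values for illustration:

```python
import random

def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    """Return (masked_input_ids, labels) for one sequence of token IDs."""
    if rng is None:
        rng = random.Random()
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions ignored by the loss
    for i, token_id in enumerate(input_ids):
        # Never select special tokens; select other tokens with mlm_probability
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id  # about 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # about 10%: random token
        # remaining about 10%: leave the token unchanged
    return masked, labels

# Made-up sequence; 101, 102 and 103 are assumed special-token IDs
toy_ids = [101, 2023, 2003, 1037, 2307, 3185, 1012, 102]
masked_ids, labels = mask_tokens(
    toy_ids, mask_token_id=103, vocab_size=30522,
    special_ids={101, 102}, rng=random.Random(0),
)
print(masked_ids)
print(labels)
```

Only the selected positions carry a real label, so the loss is computed on roughly 15% of the tokens in each batch.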

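✏️ Try it out! Run the code snippet above several times to see the random masking happen in front of your very eyes! Also replace the tokenizer.decode() method with tokenizer.convert_ids_to_tokens() to see that sometimes a single token from a given word is masked, and not the others.

One side effect of random masking is that our evaluation metrics will not be deterministic when using the Trainer, since we use the same data collator for the training and test sets. We'll see later, when we look at fine-tuning with 🤗 Accelerate, how we can use the flexibility of a custom evaluation loop to freeze the randomness.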
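
The difference between masking on the fly and masking once can be sketched without any libraries. In the toy example below, random_mask and premask are hypothetical helpers, the [MASK] ID of 103 is an assumed value, and the token IDs are made up. On-the-fly masking draws new random positions at every evaluation pass, while pre-masking with a fixed seed stores one masked copy that can be reused:

```python
import random

MASK_ID = 103  # assumed [MASK] id, for illustration only

def random_mask(input_ids, probability, rng):
    """Replace each token with MASK_ID independently with the given probability."""
    return [MASK_ID if rng.random() < probability else tok for tok in input_ids]

def premask(examples, seed, probability=0.15):
    """Mask every example once, using a seeded generator so the result is repeatable."""
    fixed_rng = random.Random(seed)
    return [random_mask(ex, probability, fixed_rng) for ex in examples]

eval_examples = [[11, 12, 13, 14, 15, 16], [21, 22, 23, 24, 25, 26]]  # toy token IDs

# On-the-fly masking: a new random draw at every evaluation pass
rng = random.Random()
pass_1 = [random_mask(ex, 0.15, rng) for ex in eval_examples]
pass_2 = [random_mask(ex, 0.15, rng) for ex in eval_examples]  # may differ from pass_1

# Pre-masking: draw the masks once, then reuse the stored result for every evaluation
frozen_eval_set = premask(eval_examples, seed=42)
print(frozen_eval_set)
```

Evaluating on frozen_eval_set at every epoch means that any change in the metric comes from the model rather than from the masks.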

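When training models for masked language modeling, one technique that can be used is to mask whole words together, not just individual tokens. This approach is called whole word masking. If we want to use whole word masking, we will need to build a data collator ourselves. A data collator is just a function that takes a list of samples and converts them into a batch, so let's do this now! We'll use the word IDs computed earlier to make a map between word indices and the corresponding tokens, then randomly decide which words to mask and apply that mask on the inputs. Note that the labels are all -100 (the value that PyTorch's cross-entropy loss ignores by default) except for the ones corresponding to masked words.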

import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
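
To see what the word-to-token mapping inside the collator looks like, here is the same loop extracted into a standalone function (build_word_mapping is a hypothetical name) and run on a made-up word_ids list for a chunk that spans two reviews, so the word IDs restart at 0 after the [SEP] [CLS] boundary:

```python
from collections import defaultdict

def build_word_mapping(word_ids):
    """Map a running word index to the list of token positions forming that word."""
    mapping = defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:  # special tokens such as [CLS] and [SEP]
            continue
        if word_id != current_word:
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)

# Made-up word IDs: the first review ends at index 6, then [SEP] [CLS], then a new review
word_ids = [None, 0, 0, 1, 2, 2, 2, None, None, 0, 1, 1]
print(build_word_mapping(word_ids))
```

Masking "word 2" then means masking token positions 4, 5 and 6 together. One edge case of this loop: if the last word ID before a boundary equals the first word ID after it, the two words are merged into one entry, which should be rare in practice.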

Next, we can try it on the same samples as before:

samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

'>>> [CLS] bromwell high is a cartoon comedy [MASK] it ran at the same time as some other programs about school life, such as " teachers ". my 35 years in the teaching profession lead me to believe that bromwell high\'s satire is much closer to reality than is " teachers ". the scramble to survive financially, the insightful students who can see right through their pathetic teachers\'pomp, the pettiness of the whole situation, all remind me of the schools i knew and their students. when i saw the episode in which a student repeatedly tried to burn down the school, i immediately recalled.....'

'>>> .... [MASK] [MASK] [MASK] [MASK]....... high. a classic line : inspector : i\'m here to sack one of your teachers. student : welcome to bromwell high. i expect that many adults of my age think that bromwell high is far fetched. what a pity that isn\'t! [SEP] [CLS] homelessness ( or houselessness as george carlin stated ) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. most people think of the homeless'

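✏️ Try it out! Run the code snippet above several times to see the random masking happen in front of your very eyes! Also replace the tokenizer.decode() method with tokenizer.convert_ids_to_tokens() to see that the tokens from a given word are always masked together.

Now that we have two data collators, the rest of the fine-tuning steps are standard. Training can take a while on Google Colab if you're not lucky enough to score a mythical P100 GPU 😭, so we'll first downsample the size of the training set to a few thousand examples. Don't worry, we'll still get a pretty decent language model! A quick way to downsample a dataset in 🤗 Datasets is via the Dataset.train_test_split() function that we saw in Chapter 5: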

train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 1000
    })
})

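This has automatically created new train and test splits, with the training set size set to 10,000 examples and the validation set to 10% of that. Feel free to increase this if you have a beefy GPU! The next thing we need to do is log in to the Hugging Face Hub. If you're running this code in a notebook, you can do so with the following utility function: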

from huggingface_hub import notebook_login

notebook_login()

This will display a widget where you can enter your credentials. Alternatively, you can run the following in your favorite terminal:

huggingface-cli login

Once we're logged in, we can specify the arguments for the Trainer:

from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)

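Here we tweaked a few of the default options, including logging_steps to ensure we track the training loss with each epoch. We've also used fp16=True to enable mixed-precision training, which gives us another boost in speed. By default, the Trainer will remove any columns that are not part of the model's forward() method. This means that if you're using the whole word masking collator, you'll also need to set remove_unused_columns=False to ensure we don't lose the word_ids column during training.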
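
The logging_steps arithmetic can be checked by hand. The sketch below assumes a single device, no gradient accumulation, and 3 training epochs (the number of epochs is not set explicitly in the arguments above, so 3 is an assumption based on the library default):

```python
import math

train_examples = 10_000  # size of the downsampled training set
batch_size = 64
num_epochs = 3  # assumption: not set explicitly in TrainingArguments above

# Same formula as in the snippet above: log roughly once per epoch
logging_steps = train_examples // batch_size

# The final, smaller batch still counts as an optimizer step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"logging_steps={logging_steps}")
print(f"steps_per_epoch={steps_per_epoch}")
print(f"total_steps={total_steps}")
```

Because the integer division rounds down, the log interval is slightly shorter than an epoch, so the logged training loss drifts a little relative to the epoch boundaries.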

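Note that you can specify the name of the repository you want to push to with the hub_model_id argument (in particular, you will have to use this argument to push to an organization). For example, when we pushed the model to the huggingface-course organization, we added hub_model_id="huggingface-course/distilbert-finetuned-imdb" to TrainingArguments. By default, the repository used will be in your namespace and named after the output directory you set, so in our case it will be "lewtun/distilbert-base-uncased-finetuned-imdb".

We now have all the ingredients to instantiate the Trainer. Here we just use the standard data_collator, but you can try the whole word masking collator and compare the results as an exercise: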

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

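We're now ready to run trainer.train(), but before doing so let's briefly look at perplexity, which is a common metric to evaluate the performance of language models.

Perplexity for language models

Unlike other tasks like text classification or question answering where we're given a labeled corpus to train on, with language modeling we don't have any explicit labels. So how do we determine what makes a good language model? Like with the autocorrect feature in your phone, a good language model is one that assigns high probabilities to sentences that are grammatically correct, and low probabilities to nonsense sentences. To give you a better idea of what this looks like, you can find whole sets of "autocorrect fails" online, where the model in a person's phone has produced some rather funny (and often inappropriate) completions!

Assuming our test set consists mostly of sentences that are grammatically correct, then one way to measure the quality of our language model is to calculate the probabilities it assigns to the next word in all the sentences of the test set. High probabilities indicate that the model is not "surprised" or "perplexed" by the unseen examples, and suggest it has learned the basic patterns of grammar in the language. There are various mathematical definitions of perplexity, but the one we'll use defines it as the exponential of the cross-entropy loss. Thus, we can calculate the perplexity of our pretrained model by using the Trainer.evaluate() function to compute the cross-entropy loss on the test set and then taking the exponential of the result: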

import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
>>> Perplexity: 21.75

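A lower perplexity score means a better language model, and we can see here that our starting model has a somewhat large value. Let's see if we can lower it by fine-tuning! To do that, we first run the training loop: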

trainer.train()

and then compute the resulting perplexity on the test set as before:

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
>>> Perplexity: 11.32

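Nice, this is quite a reduction in perplexity (from 21.75 to 11.32), which tells us the model has learned something about the domain of movie reviews!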
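
To connect the definition of perplexity with these numbers, here is a small sketch that computes perplexity from per-token probabilities. The probability lists are made up for illustration and perplexity is a hypothetical helper; the last two lines simply invert the relationship to recover the cross-entropy losses implied by the two scores reported above:

```python
import math

def perplexity(token_probabilities):
    """Exponential of the average negative log-probability of the correct tokens."""
    nll = [-math.log(p) for p in token_probabilities]
    cross_entropy = sum(nll) / len(nll)
    return math.exp(cross_entropy)

# A model that guesses uniformly among 4 options is "as perplexed as" a 4-way choice
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that puts high probability on the correct tokens has low perplexity
confident = perplexity([0.9, 0.8, 0.95, 0.85])

# Cross-entropy losses implied by the perplexities reported above
loss_before = math.log(21.75)
loss_after = math.log(11.32)

print(f"uniform={uniform:.2f}, confident={confident:.2f}")
print(f"loss before={loss_before:.3f}, loss after={loss_after:.3f}")
```

A perplexity of about 11 can be read loosely as the model being, on average, as uncertain as if it were choosing uniformly among about 11 tokens at each masked position.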

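Once training is finished, we can push the model card with the training information to the Hub (the checkpoints are saved during training itself):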

trainer.push_to_hub()

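✏️ Your turn! Run the training above after changing the data collator to the whole word masking collator. Do you get better results?

In our use case we didn't need to do anything special with the training loop, but in some cases you might need to implement some custom logic. For these applications, you can use 🤗 Accelerate. Let's take a look!

Fine-tuning DistilBERT with 🤗 Accelerate

As we saw with the Trainer, fine-tuning a masked language model is very similar to the text classification example from Chapter 3. In fact, the only subtlety is the use of a special data collator, and we've already covered that earlier in this section!

However, we saw that DataCollatorForLanguageModeling also applies random masking with each evaluation, so we'll see some fluctuations in our perplexity scores with each training run. One way to eliminate this source of randomness is to apply the masking once on the whole test set, and then use the default data collator in 🤗 Transformers to collect the batches during evaluation. To see how this works, let's implement a simple function that applies the masking on a batch, similar to our first encounter with DataCollatorForLanguageModeling: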

def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
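
The first line of insert_random_mask() is fairly dense: it converts the column-oriented batch that Dataset.map() passes in (one list per column) into the row-oriented list of dicts that a data collator expects. Here is the same expression on a toy batch with made-up token IDs:

```python
# Column-oriented batch, as passed to a function used with map(batched=True)
batch = {
    "input_ids": [[101, 7, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
    "labels": [[101, 7, 102], [101, 9, 102]],
}

# zip(*batch.values()) yields one tuple per example;
# zip(batch, row) pairs each column name with that example's value
features = [dict(zip(batch, row)) for row in zip(*batch.values())]

for feature in features:
    print(feature)
```

Iterating over a dict yields its keys, which is why zip(batch, row) lines up column names with values.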

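Next, we'll apply this function to our test set and drop the unmasked columns so we can replace them with the masked ones. You can use whole word masking by replacing the data_collator above with the appropriate one, in which case you should remove the first line here: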

downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

We can then set up the dataloaders as usual, but we'll use the default_data_collator from 🤗 Transformers for the evaluation set:

from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

From here, we follow the standard steps with 🤗 Accelerate. The first order of business is to load a fresh version of the pretrained model:

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Then we need to specify the optimizer; we'll use the standard AdamW:

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

With these objects, we can now prepare everything for training with the Accelerator object:

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that our model, optimizer, and dataloaders are configured, we can specify the learning rate scheduler as follows:

from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
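
As a rough picture of what the "linear" schedule with zero warmup steps does, the sketch below computes the learning rate at a given step by hand. linear_lr is a hypothetical helper that mirrors the usual linear warmup and decay rule, and the figure of 157 steps per epoch assumes 10,000 training chunks, a batch size of 64, and a single process:

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, (num_training_steps - step) / decay_span)

# Assumption: 157 batches per epoch (10,000 chunks, batch size 64, one process)
num_training_steps = 3 * 157

for step in (0, 157, 314, 471):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

With no warmup, the learning rate starts at the optimizer's value of 5e-5 and shrinks by the same amount at every step until it reaches zero at the end of training.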

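There is just one last thing to do before training: create a model repository on the Hugging Face Hub! We can use the 🤗 Hub library to first generate the full name of our repo: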

from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name
'lewtun/distilbert-base-uncased-finetuned-imdb-accelerate'

then create and clone the repository using the Repository class from 🤗 Hub:

from huggingface_hub import Repository

output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

With that done, it's just a simple matter of writing out the full training and evaluation loop:

from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )
>>> Epoch 0: Perplexity: 11.397545307900472
>>> Epoch 1: Perplexity: 10.904909330983092
>>> Epoch 2: Perplexity: 10.729503505340409

Cool, we've been able to evaluate perplexity with each epoch and ensure that multiple training runs are reproducible!
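
One detail of the evaluation loop that is easy to miss is the combination of loss.repeat(batch_size) and the truncation to len(eval_dataset). On a single process, repeating each batch loss batch_size times and then cutting the list to the dataset size gives each batch a weight equal to its number of examples, including the smaller last batch. The toy numbers below are made up, and both helper names are hypothetical:

```python
def mean_loss_repeat_truncate(batch_losses, batch_size, dataset_size):
    """Mimic loss.repeat(batch_size) per batch, then truncate to the dataset size."""
    repeated = []
    for loss in batch_losses:
        repeated.extend([loss] * batch_size)
    repeated = repeated[:dataset_size]
    return sum(repeated) / len(repeated)

def mean_loss_weighted(batch_losses, batch_sizes):
    """Average of batch losses weighted by the number of examples in each batch."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# Toy evaluation set: 5 examples with batch size 2, so batches of 2, 2 and 1
batch_losses = [2.0, 3.0, 4.0]
truncated_mean = mean_loss_repeat_truncate(batch_losses, batch_size=2, dataset_size=5)
weighted_mean = mean_loss_weighted(batch_losses, batch_sizes=[2, 2, 1])
print(truncated_mean, weighted_mean)
```

With several processes the gathered losses are interleaved across devices, and each batch loss is itself an average over masked tokens, so in general this should be read as an approximation of the per-example mean.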

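Using our fine-tuned model

You can interact with your fine-tuned model either by using its widget on the Hub or locally with the pipeline from 🤗 Transformers. Let's use the latter to download our model using the fill-mask pipeline: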

from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="huggingface-course/distilbert-base-uncased-finetuned-imdb"
)

We can then feed the pipeline our sample text "This is a great [MASK]." and see what the top 5 predictions are:

preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")
'>>> this is a great movie.'
'>>> this is a great film.'
'>>> this is a great story.'
'>>> this is a great movies.'
'>>> this is a great character.'