4.1-文本分类+超参搜索

最新推荐文章于 2023-01-30 09:32:54 发布

꧁ᝰ苏苏ᝰ꧂

最新推荐文章于 2023-01-30 09:32:54 发布

阅读量425

点赞数 1

分类专栏： nlp 文章标签：自然语言处理深度学习

nlp 专栏收录该内容

10 篇文章 3 订阅

订阅专栏

如果您正在本地打开这个notebook，请确保您已经进行上述依赖包的安装。
您也可以在这里找到本notebook的多GPU分布式训练版本。

pip install transformers datasets

微调预训练模型进行文本分类

我们将展示如何使用 🤗 Transformers代码库中的模型来解决文本分类任务，任务来源于GLUE Benchmark.

GLUE榜单包含了9个句子级别的分类任务，分别是：

CoLA (Corpus of Linguistic Acceptability) 鉴别一个句子是否语法正确.
MNLI (Multi-Genre Natural Language Inference) 给定一个假设，判断另一个句子与该假设的关系：entails, contradicts 或者 unrelated。
MRPC (Microsoft Research Paraphrase Corpus) 判断两个句子是否互为paraphrases.
QNLI (Question-answering Natural Language Inference) 判断第2句是否包含第1句问题的答案。
QQP (Quora Question Pairs2) 判断两个问句是否语义相同。
RTE (Recognizing Textual Entailment)判断一个句子是否与假设成entail关系。
SST-2 (Stanford Sentiment Treebank) 判断一个句子的情感正负向.
STS-B (Semantic Textual Similarity Benchmark) 判断两个句子的相似性（分数为1-5分）。
WNLI (Winograd Natural Language Inference) Determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not.

对于以上任务，我们将展示如何使用简单的Dataset库加载数据集，同时使用transformer中的Trainer接口对预训练模型进行微调。

GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

本notebook理论上可以使用各种各样的transformer模型（模型面板），解决任何文本分类分类任务。

如果您所处理的任务有所不同，大概率只需要很小的改动便可以使用本notebook进行处理。同时，您应该根据您的GPU显存来调整微调训练所需要的btach size大小，避免显存溢出。

task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

加载数据

我们将会使用🤗 Datasets库来加载数据和对应的评测方式。数据加载和评测方式加载只需要简单使用load_dataset和load_metric即可。

from datasets import load_dataset, load_metric

除了mnli-mm以外，其他任务都可以直接通过任务名字进行加载。数据加载之后会自动缓存。

actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)

这个datasets对象本身是一种DatasetDict数据结构. 对于训练集、验证集和测试集，只需要使用对应的key（train，validation，test）即可得到相应的数据。

dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

给定一个数据切分的key（train、validation或者test）和下标即可查看数据。

dataset["train"][0]

{'idx': 0,
 'label': 1,
 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}

为了能够进一步理解数据长什么样子，下面的函数将从数据集里随机选择几个例子进行展示。

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(dataset["train"])

	sentence	label	idx
0	The more I talk to Joe, the less about linguistics I am inclined to think Sally has taught him to appreciate.	acceptable	196
1	Have in our class the kids arrived safely?	unacceptable	3748
2	I gave Mary a book.	acceptable	5302
3	Every student, who attended the party, had a good time.	unacceptable	4944
4	Bill pounded the metal fiat.	acceptable	2178
5	It bit me on the leg.	acceptable	5908
6	The boys were made a good mother by Aunt Mary.	unacceptable	736
7	More of a man is here.	unacceptable	5403
8	My mother baked me a birthday cake.	acceptable	3761
9	Gregory appears to have wanted to be loyal to the company.	acceptable	4334

评估metic是datasets.Metric的一个实例:

metric

直接调用metric的compute方法，传入labels和predictions即可得到metric的值：

import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'matthews_correlation': 0.1513518081969605}

每一个文本分类任务所对应的metic有所不同，具体如下:

for CoLA: Matthews Correlation Coefficient
for MNLI (matched or mismatched): Accuracy
for MRPC: Accuracy and F1 score
for QNLI: Accuracy
for QQP: Accuracy and F1 score
for RTE: Accuracy
for SST-2: Accuracy
for STS-B: Pearson Correlation Coefficient and Spearman’s_Rank_Correlation_Coefficient
for WNLI: Accuracy

所以一定要将metric和任务对齐

数据预处理

在将数据喂入模型之前，我们需要对数据进行预处理。预处理的工具叫Tokenizer。Tokenizer首先对输入进行tokenize，然后将tokens转化为预模型中需要对应的token ID，再转化为模型需要的输入格式。

为了达到数据预处理的目的，我们使用AutoTokenizer.from_pretrained方法实例化我们的tokenizer，这样可以确保：

我们得到一个与预训练模型一一对应的tokenizer。
使用指定的模型checkpoint对应的tokenizer的时候，我们也下载了模型需要的词表库vocabulary，准确来说是tokens vocabulary。

这个被下载的tokens vocabulary会被缓存起来，从而再次使用的时候不会重新下载。

from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

注意：use_fast=True要求tokenizer必须是transformers.PreTrainedTokenizerFast类型，因为我们在预处理的时候需要用到fast tokenizer的一些特殊特性（比如多线程快速tokenizer）。如果对应的模型没有fast tokenizer，去掉这个选项即可。

几乎所有模型对应的tokenizer都有对应的fast tokenizer。我们可以在模型tokenizer对应表里查看所有预训练模型对应的tokenizer所拥有的特点。

tokenizer既可以对单个文本进行预处理，也可以对一对文本进行预处理，tokenizer预处理后得到的数据满足预训练模型输入格式

tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

取决于我们选择的预训练模型，我们将会看到tokenizer有不同的返回，tokenizer和预训练模型是一一对应的，更多信息可以在这里进行学习。

为了预处理我们的数据，我们需要知道不同数据和对应的数据格式，因此我们定义下面这个dict。

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

对数据格式进行检查:

sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: Our friends won't buy this analysis, let alone the next one we propose.

随后将预处理的代码放到一个函数中：

def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

预处理函数可以处理单个样本，也可以对多个样本进行处理。如果输入是多个样本，那么返回的是一个list：

preprocess_function(dataset['train'][:5])

{'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

接下来对数据集datasets里面的所有样本进行预处理，处理的方式是使用map函数，将预处理函数prepare_train_features应用到（map)所有样本上。

encoded_dataset = dataset.map(preprocess_function, batched=True)

更好的是，返回的结果会自动被缓存，避免下次处理的时候重新计算（但是也要注意，如果输入有改动，可能会被缓存影响！）。datasets库函数会对输入的参数进行检测，判断是否有变化，如果没有变化就使用缓存数据，如果有变化就重新处理。但如果输入参数不变，想改变输入的时候，最好清理调这个缓存。清理的方式是使用load_from_cache_file=False参数。另外，上面使用到的batched=True这个参数是tokenizer的特点，以为这会使用多线程同时并行对输入进行处理。

微调预训练模型

既然数据已经准备好了，现在我们需要下载并加载我们的预训练模型，然后微调预训练模型。既然我们是做seq2seq任务，那么我们需要一个能解决这个任务的模型类。我们使用AutoModelForSequenceClassification 这个类。和tokenizer相似，from_pretrained方法同样可以帮助我们下载并加载模型，同时也会对模型进行缓存，就不会重复下载模型啦。

需要注意的是：STS-B是一个回归问题，MNLI是一个3分类问题：

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

由于我们微调的任务是文本分类任务，而我们加载的是预训练的语言模型，所以会提示我们加载模型的时候扔掉了一些不匹配的神经网络参数（比如：预训练语言模型的神经网络head被扔掉了，同时随机初始化了文本分类的神经网络head）。

为了能够得到一个Trainer训练工具，我们还需要3个要素，其中最重要的是训练的设定/参数 TrainingArguments。这个训练设定包含了能够定义训练过程的所有属性。

metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"

args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

上面evaluation_strategy = "epoch"参数告诉训练代码：我们每个epcoh会做一次验证评估。

上面batch_size在这个notebook之前定义好了。

最后，由于不同的任务需要不同的评测指标，我们定一个函数来根据任务名字得到评价方法:

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)

全部传给 Trainer:

validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

开始训练:

trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running training *****
  Num examples = 8551
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2675

训练完成后进行评估:

trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16

[66/66 00:00]

{'epoch': 5.0,
 'eval_loss': 0.8822253346443176,
 'eval_matthews_correlation': 0.5138995234247261,
 'eval_runtime': 0.9319,
 'eval_samples_per_second': 1119.255,
 'eval_steps_per_second': 70.825}

To see how your model fared you can compare it to the GLUE Benchmark leaderboard.

超参数搜索

Trainer同样支持超参搜索，使用optuna or Ray Tune代码库。

反注释下面两行安装依赖：

! pip install optuna
! pip install ray[tune]

超参搜索时，Trainer将会返回多个训练好的模型，所以需要传入一个定义好的模型从而让Trainer可以不断重新初始化该传入的模型：

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

和之前调用 Trainer类似:

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


调用方法`hyperparameter_search`。注意，这个过程可能很久，我们可以先用部分数据集进行超参搜索，再进行全量训练。
比如使用1/10的数据进行搜索：


```python
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

hyperparameter_search会返回效果最好的模型相关的参数：

best_run

将Trainner设置为搜索到的最好参数，进行训练：

for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()

最后别忘了，查看如何上传模型，上传模型到](https://huggingface.co/transformers/model_sharing.html) 到🤗 Model Hub

꧁ᝰ苏苏ᝰ꧂

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
4.1-文本分类+超参搜索

本文涉及的jupter notebook在篇章4代码库中。也直接使用google colab notebook打开本教程，下载相关数据集和模型。如果您正在google的colab中打开这个notebook，您可能需要安装Transformers和????Datasets库。将以下命令取消注释即可安装。!pip install transformers datasets如果您正在本地打开这个notebook，请确保您已经进行上述依赖包的安装。您也可以在这里找到本notebook的多GPU分布式训练
复制链接

扫一扫

专栏目录