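Hugging Face in Practice, Tutorial Series 9: GLUE Dataset / Text Classification, Part 2 (NLP in Practice / Transformer in Practice / Pretrained Models / Tokenizers / Model Fine-Tuning / Automatic Model Selection / PyTorch Version / Line-by-Line Code Walkthrough)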

Author: 机器学习杨卓越

Last modified: 2023-09-03 22:44:44


First published: 2023-09-02 18:14:08

Article link: https://blog.csdn.net/weixin_50592077/article/details/132641131


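If you have any questions, feel free to leave a comment below.
All code in this article was run in Jupyter Notebook.
The code resources that accompany this article have been uploaded.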

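Previous part:
Hugging Face in Practice, Tutorial Series 8: GLUE Dataset / Text Classification, Part 1 (NLP in Practice / Transformer in Practice / Pretrained Models / Tokenizers / Model Fine-Tuning / Automatic Model Selection / PyTorch Version / Line-by-Line Code Walkthrough)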

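3 Model Training

When you play with a new tool, go in with questions. Some people are great at taking notes and will copy every single parameter into a notebook, but honestly that does very little for you. Real learning means looking things up, practicing, running into a problem, and then working through how to solve it.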

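3.1 Model parameters

First, open the API documentation:

The API documentation: when you actually use the library, always check your code against it.

The API documentation is the manual. Read it carefully; it has the answer to everything you want to know.

The first step is to import the training arguments from Transformers: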

from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

Once that is set up, print it to take a look:

print(training_args)

TrainingArguments(
_n_gpu=0, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, bf16=False, bf16_full_eval=False,
dataloader_drop_last=False, dataloader_num_workers=0,
dataloader_pin_memory=True, ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None, debug=[], deepspeed=None,
disable_tqdm=False, do_eval=False, do_predict=False, do_train=False,
eval_accumulation_steps=None, eval_steps=None,
evaluation_strategy=IntervalStrategy.NO, fp16=False,
fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1,
gradient_accumulation_steps=1, gradient_checkpointing=False,
greater_is_better=None, group_by_length=False,
half_precision_backend=auto, hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE, hub_token=<HUB_TOKEN>,
ignore_data_skip=False, label_names=None, label_smoothing_factor=0.0,
learning_rate=5e-05, length_column_name=length,
load_best_model_at_end=False, local_rank=-1, log_level=-1,
log_level_replica=-1, log_on_each_node=True,
logging_dir=test-trainer\runs\May26_10-08-48_WIN-BM410VRSBIO,
logging_first_step=False, logging_nan_inf_filter=True,
logging_steps=500, logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR, max_grad_norm=1.0,
max_steps=-1, metric_for_best_model=None, mp_parameters=,
no_cuda=False, num_train_epochs=3.0, optim=OptimizerNames.ADAMW_HF,
output_dir=test-trainer, overwrite_output_dir=False, past_index=-1,
per_device_eval_batch_size=8, per_device_train_batch_size=8,
prediction_loss_only=False, push_to_hub=False,
push_to_hub_model_id=None, push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>, remove_unused_columns=True,
report_to=['tensorboard', 'wandb'], resume_from_checkpoint=None,
run_name=test-trainer, save_on_each_node=False, save_steps=500,
save_strategy=IntervalStrategy.STEPS, save_total_limit=None, seed=42,
sharded_ddp=[], skip_memory_metrics=True, tf32=None,
tpu_metrics_debug=False, tpu_num_cores=None,
use_legacy_prediction_loop=False, warmup_ratio=0.0, warmup_steps=0,
weight_decay=0.0, xpu_backend=None, )

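Wow, that is a lot of parameters. Can all of them be changed?

Yes, all of them. These are the settings you specify when you train a model, and anything you leave alone keeps its default value.

Even if you memorized them, you would forget them again, so look them up as you go.

For example, how do I set the batch size? How do I set the number of epochs? (In the printout above, these are per_device_train_batch_size and num_train_epochs.)

Open the API documentation and see how nicely it is laid out.

Hover the mouse over the first parameter:

The first one is the output path. Read the description yourself: it is where the model gets saved, right? Go through the rest one by one in the same way.

Everything we printed above is a default value.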
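As a minimal sketch of changing a few defaults: the parameter names below all appear in the printout above, while the values (batch size 16, 5 epochs, learning rate 2e-5) and the helper build_training_args are illustrative assumptions, not settings used in this tutorial.

```python
# Keyword arguments that override some of the defaults printed above.
# The values are illustrative assumptions, not recommendations.
overrides = {
    "output_dir": "test-trainer",
    "per_device_train_batch_size": 16,   # batch size per device (default 8)
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 5,               # number of epochs (default 3.0)
    "learning_rate": 2e-5,               # default 5e-05
}


def build_training_args(**kwargs):
    """Hypothetical helper: create TrainingArguments from keyword arguments.

    transformers is imported lazily so the dict above can be inspected
    even on a machine where the library is not installed.
    """
    from transformers import TrainingArguments
    return TrainingArguments(**kwargs)


# In the notebook this would be used as:
# training_args = build_training_args(**overrides)
# print(training_args.per_device_train_batch_size)
```

Anything not listed in the dictionary keeps the default value shown in the printout.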

3.2 Loading the model

Next, load the model:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Loading the model prints some messages:

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification:
['cls.predictions.bias', 'cls.predictions.transform.dense.bias',
'cls.predictions.transform.LayerNorm.weight',
'cls.predictions.transform.dense.weight',
'cls.predictions.decoder.weight', 'cls.seq_relationship.bias',
'cls.seq_relationship.weight',
'cls.predictions.transform.LayerNorm.bias']

This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).

This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

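First decide what your task is. For classifying sequences, import AutoModelForSequenceClassification and choose a model checkpoint. What does num_labels=2 mean? We are replacing the output layer: it no longer comes from the pretrained model, it is a new two-class layer that we train ourselves.

So the messages above tell you two things. The weights of the pretraining heads (the cls.* entries) were not used, and the classification output layer (classifier.weight and classifier.bias) was newly initialized, because the checkpoint has no pretrained weights to load for it. That is exactly what we want.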
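To make the "newly initialized output layer" concrete, here is a rough NumPy sketch of what such a head computes. It is not the library's implementation: the random stand-in for the encoder output, the batch of 4, and the initialization scale of 0.02 (assumed here to mirror BERT's default initializer range) are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

hidden_size = 768   # hidden size of bert-base-uncased
num_labels = 2      # the value passed as num_labels above

# Stand-in for the sentence representation produced by the pretrained encoder,
# one row per example in a batch of 4 (random numbers, for shape only).
pooled_output = rng.normal(size=(4, hidden_size))

# The new head starts from random weights and a zero bias, because the
# checkpoint stores nothing under classifier.weight or classifier.bias.
classifier_weight = rng.normal(scale=0.02, size=(num_labels, hidden_size))
classifier_bias = np.zeros(num_labels)

# One linear layer maps each 768-dimensional vector to 2 logits.
logits = pooled_output @ classifier_weight.T + classifier_bias
print(logits.shape)  # (4, 2)
```

Until this layer is trained, the two logits per example are essentially noise, which is why the message recommends training on a downstream task first.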

3.3 Training the model
How do you train the model? Oh, it is easy, really ridiculously easy:

from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

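Whatever you are training, import Trainer and take a look at its parameters:

model: already defined above
training_args: the configuration we printed earlier; it is all defaults for now, but it can be changed, and we will cover how later
train_dataset: the training set, which you pick from the dataset dictionary defined earlier
eval_dataset: the validation set, which you pick from the dataset dictionary defined earlier
data_collator: the batching step mentioned earlier
tokenizer: also defined earlier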
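As a rough picture of what the batching step does, the sketch below pads every example in a batch to the length of the longest one. It assumes the data_collator defined in the previous part pads dynamically per batch; the function pad_batch, the token ids, and the pad id of 0 (BERT's [PAD] token) are illustrative assumptions, not code from this series.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every example to the length of the longest one in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch


# Made-up token ids for two sentences of different length
features = [
    {"input_ids": [101, 2023, 2003, 102]},
    {"input_ids": [101, 2023, 2003, 1037, 2936, 2742, 102]},
]

batch = pad_batch(features)
print(batch["input_ids"][0])       # [101, 2023, 2003, 102, 0, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 1, 0, 0, 0]
```

Padding per batch rather than to one global maximum keeps short batches short, which saves computation.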

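If this is unclear, no problem. Open the API documentation mentioned earlier again and search for Trainer; it takes a few seconds to show up:

Anything you do not understand, look it up in the API documentation:

Look how lovable this online API documentation is: hover the mouse over a parameter and the explanation pops up.

With the arguments set, just call .train() and training starts: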

trainer.train()

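During training the loss gets printed:

Look at training_args again: there is a parameter logging_steps=500, which means the loss is logged once every 500 steps.

It also tells you some of the settings that are in use: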

The following columns in the training set don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: sentence2, idx, sentence1.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377
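The numbers in this log can be checked by hand. The short calculation below assumes a single device, no gradient accumulation, and that the last, smaller batch is kept (dataloader_drop_last=False in the printed arguments).

```python
import math

num_examples = 3668   # "Num examples" in the log
batch_size = 8        # per_device_train_batch_size, one device
num_epochs = 3        # num_train_epochs
logging_steps = 500   # logging_steps in training_args

# The last batch of each epoch is smaller, so round up
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which the loss gets logged
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch)  # 459
print(total_steps)      # 1377
print(logged_at)        # [500, 1000]
```

So with the defaults, this run logs the loss only twice, at steps 500 and 1000.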

This task can actually run on a CPU, just slowly, so having a GPU is still the better option.

When the run finishes, there are more messages:

Saving model checkpoint to test-trainer\checkpoint-500
Configuration saved in test-trainer\checkpoint-500\config.json
Model weights saved in test-trainer\checkpoint-500\pytorch_model.bin
tokenizer config file saved in test-trainer\checkpoint-500\tokenizer_config.json
Special tokens file saved in test-trainer\checkpoint-500\special_tokens_map.json
Saving model checkpoint to test-trainer\checkpoint-1000
Configuration saved in test-trainer\checkpoint-1000\config.json
Model weights saved in test-trainer\checkpoint-1000\pytorch_model.bin
tokenizer config file saved in test-trainer\checkpoint-1000\tokenizer_config.json
Special tokens file saved in test-trainer\checkpoint-1000\special_tokens_map.json

Training completed. Do not forget to share your model on
huggingface.co/models =)

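These messages tell you where your model was saved. Once training finishes, you have your model:

These are the checkpoints saved at step 500 and at step 1000 (save_steps=500 in the training arguments). Open one of the folders and you will find pytorch_model.bin, which holds the weights of your trained model.

That is the whole training process.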
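To pick up the most recent of these checkpoint folders later, something like the following could work. The helper latest_checkpoint is a hypothetical name written for this illustration, and the demo runs on a throwaway folder that only mimics the layout shown in the log.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    best_step, best_path = -1, None
    for path in Path(output_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo on a temporary folder with the same layout as test-trainer
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    latest_name = latest_checkpoint(tmp).name

print(latest_name)  # checkpoint-1000

# With the real output folder, the result could then be reloaded, e.g.:
# model = AutoModelForSequenceClassification.from_pretrained(latest_checkpoint("test-trainer"))
```

The steps are compared as integers on purpose: sorted as plain strings, "checkpoint-500" would come after "checkpoint-1000".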

4 Model Testing

4.1 Testing the model

The model is trained, so let's check it on the validation set:

predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

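The output is (408, 2) (408,). These are shapes: 408 validation examples with 2 logits each, and 408 labels.

So far we have only seen loss values. Can we get a concrete evaluation score? The datasets module provides load_metric for exactly this purpose: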

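import numpy as np
# In recent releases of datasets, load_metric has been replaced by evaluate.load from the separate evaluate library
from datasets import load_metric

metric = load_metric("glue", "mrpc")
# Turn the (408, 2) logits into predicted class ids
preds = np.argmax(predictions.predictions, axis=-1)
metric.compute(predictions=preds, references=predictions.label_ids)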

Output:

{'accuracy': 0.8186274509803921, 'f1': 0.8754208754208753}

The metric only needs two arguments, predictions and references: the predicted labels and the true labels.
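For intuition about the two numbers in that result, here is a plain-Python sketch of accuracy and F1 on a toy example. It assumes F1 is computed for the positive class (label 1); the function accuracy_and_f1 and the toy labels are made up for illustration and are not the metric's actual implementation.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Accuracy over all examples, F1 for the positive class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Toy data: 4 of 6 predictions are right; 3 true positives,
# 1 false positive and 1 false negative
toy_preds = [1, 0, 1, 1, 0, 1]
toy_refs = [1, 0, 0, 1, 1, 1]

scores = accuracy_and_f1(toy_preds, toy_refs)
print(scores)  # accuracy is 4/6, f1 is 0.75
```

Accuracy counts every correct prediction, while F1 balances precision and recall on the positive class, so the two values can differ noticeably on unbalanced data.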

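4.2 Evaluation function for training

Can we have the evaluation metrics reported during training? For that, the evaluation needs to be wrapped in a function: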

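import numpy as np
from datasets import load_metric

def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    # eval_preds unpacks into the model's logits and the true labels
    logits, labels = eval_preds
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)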

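Line by line:

The function name does not matter
The metric is loaded the same way as before
The function takes a single argument, which has to be unpacked inside the function: logits, labels = eval_preds
labels are the true labels; logits are an intermediate result, not the final prediction, so we take the index of the largest logit (the class the model considers most probable)
The predictions and labels are then passed to the metric and the result is returned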
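The step from logits to predicted classes can be tried on a toy array. The logits below are made-up numbers; the sketch also shows why no softmax is needed before argmax: softmax keeps the ordering within each row, so the largest logit is also the largest probability.

```python
import numpy as np

# Toy logits for 3 examples and 2 classes (made-up numbers)
logits = np.array([[2.0, -1.0],
                   [0.3, 0.9],
                   [-0.5, -0.2]])

# Softmax turns logits into probabilities; subtracting the row maximum
# keeps exp() numerically stable without changing the result
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

from_logits = np.argmax(logits, axis=-1)
from_probs = np.argmax(probs, axis=-1)

print(from_logits)  # [0 1 1]
print(from_probs)   # [0 1 1]
```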
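Finally, pass the function above to the Trainer through compute_metrics, and set evaluation_strategy="epoch" in the training arguments so that evaluation runs at the end of every epoch: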

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)