The Guide to Multi-Tasking with the T5 Transformer

Original article: https://towardsdatascience.com/the-guide-to-multi-tasking-with-the-t5-transformer-90c70a08837b

The T5 (Text-To-Text Transfer Transformer) model was the product of a large-scale study (paper) conducted to explore the limits of transfer learning. It builds upon popular architectures like GPT, BERT, and RoBERTa (to name only a few), which have used transfer learning with great success. While BERT-like models can be fine-tuned to perform a variety of tasks, the constraints of the architecture mean that each fine-tuned model can perform only one task.

Typically, this is done by adding a task-specific layer on top of the Transformer model. For example, a BERT Transformer can be adapted for binary classification by adding a fully-connected layer with two output neurons (one for each class). The T5 model departs from this tradition by reframing all NLP tasks as text-to-text tasks. This results in a shared framework for any NLP task, because the input to the model and the output from the model are always strings. In the example of binary classification, the T5 model will simply output a string representation of the class (i.e. "0" or "1").

Since the input and output formats are identical for any NLP task, the same T5 model can be taught to perform multiple tasks! To specify which task should be performed, we can simply prepend a prefix (string) to the input of the model. The animation (shown below) from the Google AI Blog article demonstrates this concept.

[Animation: "Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer" (Google AI Blog)]

In this article, we’ll be using this technique to train a single T5 model capable of performing three NLP tasks: binary classification, multi-label classification, and regression.

All code can also be found on GitHub.

Task Specification

Binary Classification

The goal of binary classification in NLP is to classify a given text sequence into one of two classes. In our task, we will be using the Yelp Reviews dataset to classify the sentiment of the text as either positive ( "1" ) or negative ( "0" ).

Multi-label Classification

In multi-label classification, a given text sequence should be labeled with the correct subset of a set of pre-defined labels (note that the subset can be empty or can contain every label). For this, we will be using the Toxic Comments dataset, where each text can be labeled with any subset of the labels toxic, severe_toxic, obscene, threat, insult, identity_hate.

Regression

In regression tasks, the target variable is a continuous value. In our task, we will use the STS-B (Semantic Textual Similarity Benchmark) dataset where the goal is to predict the similarity of two sentences. The similarity is denoted by a continuous value between 0 and 5.

Data Preparation

Since we are going to be working with 3 datasets, we’ll put them in 3 separate subdirectories inside the data directory.

data/binary_classification
data/multilabel_classification
data/regression

Downloading

Download the Yelp Reviews Dataset.
Extract train.csv and test.csv to data/binary_classification.
Download the Toxic Comments dataset.
Extract the csv files to data/multilabel_classification.
Download the STS-B dataset.
Extract the csv files to data/regression.
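If you would rather create the directory layout from Python before extracting the files, a minimal sketch could look like the following. The helper name make_data_dirs is hypothetical; only the three subdirectory names come from the article.

```python
from pathlib import Path

# Subdirectory names are taken from the article; "data" is assumed to be
# relative to the current working directory.
TASK_DIRS = ["binary_classification", "multilabel_classification", "regression"]


def make_data_dirs(root="data"):
    """Create one subdirectory per task and return the created paths."""
    paths = [Path(root) / name for name in TASK_DIRS]
    for path in paths:
        # exist_ok=True makes the call safe to repeat
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling make_data_dirs() once from the project root should be enough; the downloaded archives still have to be extracted by hand.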

Combining the datasets

As mentioned earlier, the inputs and outputs of a T5 model are always text. A particular task is specified by using a prefix text that lets the model know what it should do with the input.

The input data format for a T5 model in Simple Transformers reflects this fact. The input is a Pandas dataframe with 3 columns: prefix, input_text, and target_text. This makes it quite easy to train the model on multiple tasks, as you just need to change the prefix.
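To make the layout concrete, here is a small sketch of such a dataframe with one row per task. The texts and target values are invented for illustration (apart from the sentence pair, which is the example used later in this article) and are not rows from the real datasets.

```python
import pandas as pd

# Illustrative rows only: the targets below are made up to show the layout.
rows = [
    {
        "prefix": "binary classification",
        "input_text": "Great food and friendly staff.",
        "target_text": "1",
    },
    {
        "prefix": "multilabel classification",
        "input_text": "Thanks for the helpful edit.",
        "target_text": "clean",
    },
    {
        "prefix": "similarity",
        "input_text": "sentence1: A man plays the guitar. "
        "sentence2: The man sang and played his guitar.",
        "target_text": "3.2",
    },
]

# Everything is cast to str because both inputs and targets are plain text
train_df = pd.DataFrame(rows, columns=["prefix", "input_text", "target_text"]).astype(str)
print(train_df["prefix"].tolist())
```

All three tasks share one dataframe; only the prefix column tells them apart.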

The notebook above loads each of the datasets, preprocesses them for T5, and finally combines them into a unified dataframe.

This gives us a dataframe with 3 unique prefixes, namely binary classification, multilabel classification, and similarity. Note that the prefixes themselves are fairly arbitrary; the important thing is to ensure that each task has its own unique prefix. The input to the model will take the following format:

 <prefix>: <input_text>

The ": " is automatically added when training.

A few other things to note:

The output of the multilabel classification task is a comma-separated list of the predicted labels (toxic, severe_toxic, obscene, threat, insult, identity_hate). If no label is predicted, the output should be clean.
The input_text for the similarity task includes both sentences, as shown in the following example:
sentence1: A man plays the guitar. sentence2: The man sang and played his guitar.
The output of the similarity task is a number (as a string) between 0.0 and 5.0, going by increments of 0.2 (e.g. 0.0, 0.4, 3.0, 5.0). This follows the same format used by the authors of the T5 paper.
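The conventions above can be sketched as three small helpers. This is not the code from the notebook: the function names are hypothetical, and joining labels with a bare "," (no space) is an assumption, since the article only says the list is comma-separated.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def multilabel_target(flags):
    """Turn a {label: 0/1} mapping into the comma-separated target string."""
    active = [label for label in LABELS if flags.get(label, 0) == 1]
    # "clean" stands for the empty label set
    return ",".join(active) if active else "clean"


def similarity_input(sentence1, sentence2):
    """Pack both sentences into a single input_text string."""
    return f"sentence1: {sentence1} sentence2: {sentence2}"


def similarity_target(score):
    """Round a 0-5 score to the nearest multiple of 0.2 and return it as text."""
    return f"{round(score * 5) / 5:.1f}"
```

For instance, multilabel_target({"toxic": 1, "insult": 1}) gives "toxic,insult", and similarity_target(3.75) gives "3.8".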

As you can see from the way the different inputs and outputs are represented, the T5 model’s text-to-text approach gives us a great deal of flexibility both in terms of representing various tasks and in terms of the actual tasks we can perform.

For example;

The only limitation is imagination! (Well, imagination and compute resources but that’s another story) 😅

Getting back to the data, running the notebook should have given you a train.tsv and an eval.tsv file which we’ll be using to train our model in the next section!

Setup

We will be using the Simple Transformers library (based on Hugging Face Transformers) to train the T5 model.

The instructions given below will install all the requirements.

Install Anaconda or Miniconda Package Manager from here.
Create a new virtual environment and install packages.
conda create -n simpletransformers python
conda activate simpletransformers
conda install pytorch cudatoolkit=10.1 -c pytorch
Install simpletransformers.
pip install simpletransformers

See installation docs

Training the T5 Model

As always, training the model with Simple Transformers is quite straightforward.

import pandas as pd
from simpletransformers.t5 import T5Model

# Load the combined datasets; every column is cast to str because T5 works on text
train_df = pd.read_csv("data/train.tsv", sep="\t").astype(str)
eval_df = pd.read_csv("data/eval.tsv", sep="\t").astype(str)

model_args = {
    "max_seq_length": 196,
    "train_batch_size": 16,
    "eval_batch_size": 64,
    "num_train_epochs": 1,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 15000,
    "evaluate_during_training_verbose": True,

    "use_multiprocessing": False,
    "fp16": False,

    "save_steps": -1,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,

    "reprocess_input_data": True,
    "overwrite_output_dir": True,

    "wandb_project": "T5 mixed tasks - Binary, Multi-Label, Regression",
}

# Start from the pre-trained t5-base checkpoint.
# Signature as used in the original article; newer Simple Transformers releases
# may expect the model type first, e.g. T5Model("t5", "t5-base", args=model_args).
model = T5Model("t5-base", args=model_args)

# Train on all three tasks at once, evaluating on eval_df along the way
model.train_model(train_df, eval_data=eval_df)

Most of the arguments used here are fairly standard.

max_seq_length: Chosen such that most samples are not truncated. Increasing the sequence length significantly affects the memory consumption of the model, so it’s usually best to keep it as short as possible (ideally without truncating the input sequences).
train_batch_size: The bigger the better (as long as it fits on your GPU).
eval_batch_size: Same deal as train_batch_size.
num_train_epochs: Training for more than 1 epoch would probably improve the model’s performance, but it would obviously increase the training time as well (about 7 hours per epoch on an RTX Titan).
evaluate_during_training: We’ll periodically test the model against the test data to see how it’s learning.
evaluate_during_training_steps: The aforementioned period at which the model is tested.
evaluate_during_training_verbose: Show us the results when a test is done.
use_multiprocessing: Using multiprocessing significantly reduces the time taken for tokenization (done before training starts); however, this currently causes issues with the T5 implementation. So, no multiprocessing for now. 😢
fp16: FP16 or mixed-precision training reduces the memory consumption of training the models (meaning larger batch sizes are possible). Unfortunately, fp16 training is not stable with T5 at the moment, so it’s turned off as well.
save_steps: Setting this to -1 means that checkpoints aren’t saved.
save_eval_checkpoints: By default, a model checkpoint will be saved when an evaluation is performed during training. Since this experiment is being done for demonstration only, let’s not waste space on saving these checkpoints either.
save_model_every_epoch: We only have 1 epoch, so no. Don’t need this one either.
reprocess_input_data: Controls whether the features are loaded from cache (saved to disk) or whether tokenization is done again on the input sequences. It only really matters when doing multiple runs.
overwrite_output_dir: This will overwrite any previously saved models if they are in the same output directory.
wandb_project: Used for visualization of training progress.

Speaking of visualization, you can check my training progress here. Shoutout to W&B for their awesome library!

Testing the T5 model

Considering the fact that we are dealing with multiple tasks, it’s a good idea to use suitable metrics to evaluate each task. With that in mind, we’ll be using the following metrics:

Binary Classification: F1 score and Accuracy score
Multilabel Classification: F1 score (Hugging Face SQuAD metrics implementation) and Exact matches (Hugging Face SQuAD metrics implementation)
Similarity: Pearson correlation coefficient and Spearman correlation
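To build some intuition for what F1 and exact match mean on comma-separated label strings, here is a simplified, label-level sketch. It is not the Hugging Face SQuAD implementation used in the evaluation script (that one works on normalized whitespace tokens); the helper names are hypothetical, and treating "clean" as the empty label set is an assumption.

```python
def label_set(text):
    """Split a comma-separated string into a set of labels ("clean" means none)."""
    labels = {part.strip() for part in text.split(",") if part.strip()}
    labels.discard("clean")
    return labels


def label_f1(truth, pred):
    """Harmonic mean of label precision and recall for one sample."""
    truth_set, pred_set = label_set(truth), label_set(pred)
    if not truth_set or not pred_set:
        # Both empty counts as a perfect match; only one empty counts as a miss
        return float(truth_set == pred_set)
    common = len(truth_set & pred_set)
    if common == 0:
        return 0.0
    precision = common / len(pred_set)
    recall = common / len(truth_set)
    return 2 * precision * recall / (precision + recall)


def label_exact(truth, pred):
    """1.0 if the predicted label set equals the true label set, else 0.0."""
    return float(label_set(truth) == label_set(pred))
```

With a truth of "toxic,obscene,insult" and a prediction of "toxic,insult", precision is 1.0 and recall is 2/3, so label_f1 comes out at 0.8 while label_exact is 0.0. This shows how F1 gives partial credit where exact match does not.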

import json
import os
from datetime import datetime
from statistics import mean

import pandas as pd
from scipy.stats import pearsonr, spearmanr
from simpletransformers.t5 import T5Model
from sklearn.metrics import accuracy_score, f1_score
from transformers.data.metrics.squad_metrics import compute_exact, compute_f1


def f1(truths, preds):
    return mean([compute_f1(truth, pred) for truth, pred in zip(truths, preds)])


def exact(truths, preds):
    return mean([compute_exact(truth, pred) for truth, pred in zip(truths, preds)])


def pearson_corr(preds, labels):
    return pearsonr(preds, labels)[0]


def spearman_corr(preds, labels):
    return spearmanr(preds, labels)[0]


model_args = {
    "overwrite_output_dir": True,
    "max_seq_length": 196,
    "eval_batch_size": 32,
    "num_train_epochs": 1,
    "use_multiprocessing": False,
    "num_beams": None,
    "do_sample": True,
    "max_length": 50,
    "top_k": 50,
    "top_p": 0.95,
    "num_return_sequences": 3,
}

# Load the trained model
model = T5Model("outputs", args=model_args)

# Load the evaluation data
df = pd.read_csv("data/eval.tsv", sep="\t").astype(str)

# Prepare the data for testing (the ": " separator has to be added manually here)
to_predict = [
    prefix + ": " + str(input_text)
    for prefix, input_text in zip(df["prefix"].tolist(), df["input_text"].tolist())
]
truth = df["target_text"].tolist()
tasks = df["prefix"].tolist()

# Get the model predictions (num_return_sequences candidates per input)
preds = model.predict(to_predict)

# A timestamp without spaces or colons keeps the file names portable
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Saving the predictions if needed
os.makedirs("predictions", exist_ok=True)
with open(f"predictions/predictions_{timestamp}.txt", "w") as f:
    for i, text in enumerate(df["input_text"].tolist()):
        f.write(str(text) + "\n\n")

        f.write("Truth:\n")
        f.write(truth[i] + "\n\n")

        f.write("Prediction:\n")
        for pred in preds[i]:
            f.write(str(pred) + "\n")
        f.write("_" * 80 + "\n")

# Taking only the first prediction
preds = [pred[0] for pred in preds]
df["predicted"] = preds

# Evaluating the tasks separately
output_dict = {
    "binary classification": {"truth": [], "preds": []},
    "multilabel classification": {"truth": [], "preds": []},
    "similarity": {"truth": [], "preds": []},
}

results_dict = {}

for task, truth_value, pred in zip(tasks, truth, preds):
    output_dict[task]["truth"].append(truth_value)
    output_dict[task]["preds"].append(pred)

print("-----------------------------------")
print("Results: ")
for task, outputs in output_dict.items():
    task_truth, task_preds = outputs["truth"], outputs["preds"]
    if not task_truth:
        continue  # no samples for this task in the evaluation data

    if task == "multilabel classification":
        results_dict[task] = {
            "F1 Score": f1(task_truth, task_preds),
            "Exact matches": exact(task_truth, task_preds),
        }
        print(f"Scores for {task}:")
        print(f"F1 score: {results_dict[task]['F1 Score']}")
        print(f"Exact matches: {results_dict[task]['Exact matches']}")
        print()
    elif task == "binary classification":
        try:
            # The model emits text, so a malformed prediction can fail to parse
            task_truth = [int(t) for t in task_truth]
            task_preds = [int(p) for p in task_preds]
        except ValueError as err:
            print(f"Skipping {task}: could not parse a label as an integer ({err})")
            continue
        results_dict[task] = {
            "F1 Score": f1_score(task_truth, task_preds),
            "Accuracy Score": accuracy_score(task_truth, task_preds),
        }
        print(f"Scores for {task}:")
        print(f"F1 score: {results_dict[task]['F1 Score']}")
        print(f"Accuracy Score: {results_dict[task]['Accuracy Score']}")
        print()
    elif task == "similarity":
        task_truth = [float(t) for t in task_truth]
        task_preds = [float(p) for p in task_preds]
        results_dict[task] = {
            "Pearson Correlation": pearson_corr(task_truth, task_preds),
            "Spearman Correlation": spearman_corr(task_truth, task_preds),
        }
        print(f"Scores for {task}:")
        print(f"Pearson Correlation: {results_dict[task]['Pearson Correlation']}")
        print(f"Spearman Correlation: {results_dict[task]['Spearman Correlation']}")
        print()

os.makedirs("results", exist_ok=True)
with open(f"results/result_{timestamp}.json", "w") as f:
    json.dump(results_dict, f)

Note that a ": " is inserted between the prefix and the input_text when preparing the data. This is done automatically when training but needs to be handled manually for prediction.

If you’d like to read more about the decoding arguments (num_beams, do_sample, max_length, top_k, top_p), please refer to this article.

Time to see how our model did!

-----------------------------------
Results:
Scores for binary classification:
F1 score: 0.96044512420231
Accuracy Score: 0.9605263157894737

Scores for multilabel classification:
F1 score: 0.923048001002632
Exact matches: 0.923048001002632

Scores for similarity:
Pearson Correlation: 0.8673017763553101
Spearman Correlation: 0.8644328787107548

The model performs quite well on each task, despite being trained on 3 separate tasks! We’ll take a quick look at how we can try to improve the performance of the model even more in the next section.

Closing Thoughts

Possible improvements

A potential issue that arises when mixing tasks is the discrepancy between the sizes of the datasets used for each task. We can see this issue in our dataset by taking a look at the training sample counts.

binary classification        560000
multilabel classification    143613
similarity                     5702

The dataset is substantially unbalanced with the plight of the similarity task seeming particularly dire! This can be clearly seen in the evaluation scores where the similarity task lags behind the others (although it’s important to note that we are not looking at the same metrics between the tasks).

A possible remedy to this problem would be to oversample the similarity task so that the model sees its examples more often during training.
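One simple way to oversample a task is to append randomly repeated copies of its rows to the training dataframe. The sketch below assumes the prefix / input_text / target_text layout described earlier; the helper oversample_task and its factor parameter are hypothetical, and this approach was not evaluated in the article.

```python
import pandas as pd


def oversample_task(df, prefix, factor, seed=42):
    """Return a copy of df in which one task has `factor` times as many rows."""
    task_rows = df[df["prefix"] == prefix]
    # Draw the additional rows with replacement from the task's own samples
    extra = task_rows.sample(
        n=len(task_rows) * (factor - 1), replace=True, random_state=seed
    )
    # Shuffle so the repeated rows are spread across the whole training set
    combined = pd.concat([df, extra])
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)
```

For example, oversample_task(train_df, "similarity", factor=10) would bring the similarity task to roughly 57,000 rows. A suitable factor is a hyperparameter to tune, since repeating a small dataset too often invites overfitting on it.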

In addition to this, increasing the number of training epochs (and tuning other hyperparameters) is also likely to improve the model.

Finally, tuning the decoding parameters could also lead to better results.

Wrapping up

The text-to-text format of the T5 model paves the way to apply Transformers and NLP to a wide variety of tasks with next to no customization necessary. The T5 model performs strongly even when the same model is used to perform multiple tasks!

Hopefully, this will lead to many innovative applications in the near future.
