Fine-Tuning a Hugging Face Model with a Custom Dataset
There are many articles about fine-tuning Hugging Face models with your own dataset. Many of them use PyTorch, some use TensorFlow. I had a task to implement sentiment classification based on a custom complaints dataset. I decided to go with Hugging Face transformers, as the results were not great with LSTM. Despite the large number of available articles, it took me significant time to bring all the bits together and implement my own model with Hugging Face, trained with TensorFlow. It seems like most, if not all, articles stop once training is explained. I thought it would be useful to share a complete scenario and explain how to save/load the trained model and execute inference. This post is based on the Hugging Face API for TensorFlow.
Your starting point should be the Hugging Face documentation. There is a very helpful section, Fine-tuning with custom datasets. To understand how to fine-tune a Hugging Face model with your own data for sentence classification, I would recommend studying the code under the section Sequence Classification with IMDb Reviews. The Hugging Face documentation provides examples for both PyTorch and TensorFlow, which is very convenient.
I'm using the TFDistilBertForSequenceClassification class to run sentence classification. About DistilBERT: DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
import json  # used below to load the sarcasm dataset
import tensorflow as tf

from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification
1. Import and prepare data
The example is based on sarcasm classification using newspaper headlines. The data was prepared by Laurence Moroney as part of his Coursera training (source code available on GitHub). I'm fetching the data directly from Laurence's blog:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json \
    -O /tmp/sarcasm.json
Then comes the data processing step: reading the data, splitting it into training/validation sets, and extracting the array of labels:
training_size = 20000

with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])

training_sentences = sentences[0:training_size]
validation_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
validation_labels = labels[training_size:]
There are 20000 entries for training and 6709 for validation.
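A one-line check (not part of the original walkthrough) confirms those counts, since the sarcasm dataset contains 26,709 headlines in total:

# 20000 headlines go to training, the remaining 6709 to validation
print(len(training_sentences), len(validation_sentences))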
2. Set up BERT and run training
Next, we load the tokenizer:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
Tokenize the training and validation sentences:
train_encodings = tokenizer(training_sentences,
                            truncation=True,
                            padding=True)
val_encodings = tokenizer(validation_sentences,
                          truncation=True,
                          padding=True)
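A small inspection sketch (not part of the original walkthrough) shows what the tokenizer produces; for DistilBERT the fast tokenizer returns input_ids and attention_mask, padded to a common length:

# Peek at the encoded training data
print(train_encodings.keys())                             # input_ids, attention_mask
print(train_encodings['input_ids'][0][:10])               # first 10 token ids of the first headline
print(tokenizer.decode(train_encodings['input_ids'][0]))  # round-trip back to text (plus padding tokens)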
Create TensorFlow datasets that we can feed to the TensorFlow fit function for training. Here we pair the sentences with their labels, so there is no need to pass the labels into the fit function separately:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    training_labels
))

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    validation_labels
))
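As a quick sanity check (again, an extra step beyond the original walkthrough), you can pull a single element from the dataset to confirm that each example is a (features, label) pair:

# features is a dict of tensors (input_ids, attention_mask); label is the 0/1 sarcasm flag
for features, label in train_dataset.take(1):
    print(features['input_ids'].shape, features['attention_mask'].shape)
    print(label.numpy())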
We need to get a pre-trained Hugging Face model, which we are going to fine-tune with our data:
# We classify two labels in this example. In case of multiclass
# classification, adjust num_labels value
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
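As the comment says, only num_labels needs to change for a multiclass setup; a purely hypothetical three-class variant would look like this:

# Hypothetical example: a three-class classifier built the same way,
# with num_labels adjusted to match the number of classes in your data
multiclass_model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=3)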
Fine-tune the model with our data by calling the TensorFlow fit function. It comes out of the box with the TFDistilBertForSequenceClassification model. You can play and experiment with the parameters, but the selected options already produce quite good results:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy'])

# the datasets are batched to 16 below, so no separate batch_size argument is needed
model.fit(train_dataset.shuffle(100).batch(16),
          epochs=3,
          validation_data=val_dataset.shuffle(100).batch(16))
In 3 epochs it reaches 0.0387 loss and 0.9875 accuracy, with 0.3315 validation loss and 0.9118 validation accuracy.
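If you want a separate evaluation pass after training (not shown in the original walkthrough), the standard Keras evaluate call works on the same batched validation dataset:

# evaluate() returns loss and accuracy, matching the metrics set in compile() above
eval_loss, eval_acc = model.evaluate(val_dataset.batch(16))
print(f"validation loss: {eval_loss:.4f}, validation accuracy: {eval_acc:.4f}")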
Save the fine-tuned model with the Hugging Face save_pretrained function. Saving with the Keras save function model.save does work, but such a model doesn't load back. That's why I'm using save_pretrained:
model.save_pretrained("/tmp/sentiment_custom_model")
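It is also convenient (an extra step beyond the original walkthrough) to save the tokenizer into the same folder, so that inference code can load both model and tokenizer from a single place:

# Save the tokenizer next to the model weights and config
tokenizer.save_pretrained("/tmp/sentiment_custom_model")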
Saving the model is an essential step: fine-tuning takes time to run, and you should save the result when training completes. Another option is to run the fine-tuning on a cloud GPU, save the model, and then run it locally for inference.
3. Load saved model and run predict function
I'm using the TFDistilBertForSequenceClassification class to load the saved model, by calling the Hugging Face from_pretrained function (point it to the folder where the model was saved):
loaded_model = TFDistilBertForSequenceClassification.from_pretrained("/tmp/sentiment_custom_model")
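If the tokenizer was saved into the same folder as suggested above, it can be loaded back the same way; the loaded_tokenizer name below is just for illustration, and the rest of this post keeps using the tokenizer variable created earlier:

# Load the tokenizer from the saved folder (assumes it was saved there);
# otherwise DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased') gives the same tokenizer
loaded_tokenizer = DistilBertTokenizerFast.from_pretrained("/tmp/sentiment_custom_model")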
Now we want to run the predict function and classify input using the fine-tuned model. To be able to execute inference, we need to tokenize the input sentence the same way as was done for the training/validation data. In order to be able to read the inference probabilities, pass the return_tensors="tf" flag to the tokenizer. Then call predict using the saved model:
test_sentence = "With their homes in ashes, residents share harrowing tales of survival after massive wildfires kill 15"
test_sentence_sarcasm = "News anchor hits back at viewer who sent her snarky note about ‘showing too much cleavage’ during broadcast"

# replace with the test_sentence_sarcasm variable, if you want to test sarcasm
predict_input = tokenizer.encode(test_sentence,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")

tf_output = loaded_model.predict(predict_input)[0]
The predict function running on top of the Hugging Face model returns logits (scores before SoftMax). We need to apply the SoftMax function to get the resulting probabilities:
tf_prediction = tf.nn.softmax(tf_output, axis=1).numpy()[0]
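To turn those probabilities into a readable prediction, take the argmax over the two classes. The label names below are my own mapping for this sarcasm dataset, following the is_sarcastic field used for training (0 = not sarcastic, 1 = sarcastic):

import numpy as np

# Map the higher of the two class probabilities back to a human-readable label
labels_map = ["not sarcastic", "sarcastic"]  # assumed mapping: index 1 == is_sarcastic
predicted_class = int(np.argmax(tf_prediction))
print(f"{labels_map[predicted_class]} (probability {tf_prediction[predicted_class]:.4f})")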
Conclusion
The goal of this post was to show a complete scenario for fine-tuning a Hugging Face model with custom data: from data processing and training to model save/load and inference execution.
Source code
Original article: https://towardsdatascience.com/fine-tuning-hugging-face-model-with-custom-dataset-82b8092f5333