1. Goal
1. Fine-tune DistilBERT (a distilled version of BERT that is small enough to run on small devices) on the IMDb dataset to determine whether a movie review is positive or negative (sentiment classification).
2. Use your fine-tuned model for inference.
2. Load IMDb dataset
from datasets import load_dataset
imdb = load_dataset("imdb")
Sample data
input:
from huggingface_hub import notebook_login
notebook_login()
from datasets import load_dataset
imdb = load_dataset("imdb")
print(imdb["test"][3])
output:
{
'text': "STAR RATING: ***** Saturday Night **** Friday Night *** Friday Morning ** Sunday Night * Monday Morning <br /><br />Former New Orleans homicide cop Jack Robideaux (Jean Claude Van Damme) is re-assigned to Columbus, a small but violent town in Mexico to help the police there with their efforts to stop a major heroin smuggling operation into their town. The culprits turn out to be ex-military, lead by former commander Benjamin Meyers (Stephen Lord, otherwise known as Jase from East Enders) who is using a special method he learned in Afghanistan to fight off his opponents. But Jack has a more personal reason for taking him down, that draws the two men into an explosive final showdown where only one will walk away alive.<br /><br />After Until Death, Van Damme appeared to be on a high, showing he could make the best straight to video films in the action market. While that was a far more drama oriented film, with The Shepherd he has returned to the high-kicking, no brainer action that first made him famous and has sadly produced his worst film since Derailed. It's nowhere near as bad as that film, but what I said still stands.<br /><br />A dull, predictable film, with very little in the way of any exciting action. What little there is mainly consists of some limp fight scenes, trying to look cool and trendy with some cheap slo-mo/sped up effects added to them that sadly instead make them look more desperate. Being a Mexican set film, director Isaac Florentine has tried to give the film a Robert Rodriguez/Desperado sort of feel, but this only adds to the desperation.<br /><br />VD gives a particularly uninspired performance and given he's never been a Robert De Niro sort of actor, that can't be good. As the villain, Lord shouldn't expect to leave the beeb anytime soon. He gets little dialogue at the beginning as he struggles to muster an American accent but gets mysteriously better towards the end. 
All the supporting cast are equally bland, and do nothing to raise the films spirits at all.<br /><br />This is one shepherd that's strayed right from the flock. *",
'label': 0}
There are two fields in this dataset:
text: the movie review text.
label: a value that is either 0 for a negative review or 1 for a positive review.
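As a small illustration of the label field, the integer labels can be mapped to readable names. This is only a sketch: the names "NEGATIVE" and "POSITIVE" and the helper `describe` are chosen here for illustration and are not part of the dataset.

```python
# Hypothetical readable names for the two integer labels (0 = negative, 1 = positive).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
# Reverse lookup, built from the mapping above.
label2id = {name: idx for idx, name in id2label.items()}


def describe(example):
    """Return the readable label name for one example dict with a 'label' key."""
    return id2label[example["label"]]


sample = {"text": "A dull, predictable film.", "label": 0}
print(describe(sample))
```

With the real dataset, `describe(imdb["test"][3])` would be called the same way.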
3. Preprocessing
The next step is to load a DistilBERT tokenizer to preprocess the text field.
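The notes do not show the loading step itself, so the following is a minimal sketch of one way to do it. It assumes the `transformers` library is installed and that the checkpoint is `distilbert-base-uncased`; the checkpoint name and the helper `load_tokenizer` are assumptions, not something stated above.

```python
def load_tokenizer(checkpoint="distilbert-base-uncased"):
    """Return the tokenizer that matches the given checkpoint (name is an assumption)."""
    # Imported inside the function so that defining the helper has no side effects.
    # The first call downloads the tokenizer files from the Hugging Face Hub.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(checkpoint)


# Uncomment to create the tokenizer (needs network access on the first run):
# tokenizer = load_tokenizer()
```

Calling `tokenizer = load_tokenizer()` once defines the `tokenizer` name that `preprocess_function` relies on.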
Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT’s maximum input length:
def preprocess_function(examples):
    # truncation=True cuts each review down to the model's maximum input length
    return tokenizer(examples["text"], truncation=True)
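To apply the preprocessing function to the whole dataset, use the map function and set batched=True (the name is apt: the examples are processed in batches rather than one at a time).
For example: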
tokenized_imdb = imdb.map(preprocess_function, batched=True)
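To make the effect of batched=True more concrete, here is a pure-Python sketch that does not use the Datasets library. The helpers `toy_tokenize` and `map_batched` are hypothetical stand-ins written for this note. They only illustrate the idea that, in batched mode, the function receives a dict of lists (one list per column) and returns new columns of the same length.

```python
def toy_tokenize(batch):
    """Hypothetical stand-in for preprocess_function: split each text on whitespace."""
    return {"tokens": [text.split() for text in batch["text"]]}


def map_batched(columns, fn, batch_size=2):
    """Apply fn to consecutive slices of a column-oriented dataset and merge the results."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Keep the original columns and add (or overwrite with) the new ones.
    return {**columns, **new_columns}


data = {"text": ["a b", "c", "d e f"], "label": [0, 1, 0]}
result = map_batched(data, toy_tokenize)
print(result["tokens"])  # [['a', 'b'], ['c'], ['d', 'e', 'f']]
```

The real map function is far more capable (caching, multiprocessing, on-disk storage); the sketch only mirrors the shape of the data passed to the function.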
Now create a batch of examples using DataCollatorWithPadding. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
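The efficiency argument for dynamic padding can be checked with a small pure-Python sketch. It does not call DataCollatorWithPadding. The helper `pad_batch`, the made-up token ids, and the fixed length of 512 (assumed here as DistilBERT's maximum input length) are all illustrative assumptions.

```python
def pad_batch(sequences, pad_id=0, length=None):
    """Pad each sequence to `length`, or to the longest sequence when length is None.

    Sequences longer than `length` are left unchanged (no truncation here).
    """
    target = length if length is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]


# Two made-up tokenized reviews of different lengths.
batch = [[101, 7, 102], [101, 7, 8, 9, 102]]
real_tokens = sum(len(s) for s in batch)

dynamic = pad_batch(batch)            # pad to the longest sequence in this batch
fixed = pad_batch(batch, length=512)  # pad to an assumed maximum length of 512

dynamic_pad = sum(len(row) for row in dynamic) - real_tokens
fixed_pad = sum(len(row) for row in fixed) - real_tokens
print(dynamic_pad, fixed_pad)  # 2 1016
```

For this toy batch, dynamic padding adds 2 padding tokens, while padding to a fixed length of 512 adds 1016, which is the kind of wasted computation that padding per batch avoids.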