学习笔记(使用transformer进行文本分类任务)

1.目标

1.Finetune DistilBERT (蒸馏NLP,可以在小型设备上运行的BERT)on the IMDb dataset to determine whether a movie review is positive or negative.(情感分类)

2.Use your finetuned model for inference.

2.Load IMDb dataset

from datasets import load_dataset

imdb = load_dataset("imdb")

 数据样例

input:

from huggingface_hub import notebook_login
notebook_login()
from datasets import load_dataset
imdb = load_dataset("imdb")
print(imdb["test"][3])

output:

'text': "STAR RATING: ***** Saturday Night **** Friday Night *** Friday Morning ** Sunday Night * Monday Morning <br /><br />Former New Orleans homicide cop Jack Robideaux (Jean Claude Van Damme) is re-assigned to Columbus, a small but violent town in Mexico to help the police there with their efforts to stop a major heroin smuggling operation into their town. The culprits turn out to be ex-military, lead by former commander Benjamin Meyers (Stephen Lord, otherwise known as Jase from East Enders) who is using a special method he learned in Afghanistan to fight off his opponents. But Jack has a more personal reason for taking him down, that draws the two men into an explosive final showdown where only one will walk away alive.<br /><br />After Until Death, Van Damme appeared to be on a high, showing he could make the best straight to video films in the action market. While that was a far more drama oriented film, with The Shepherd he has returned to the high-kicking, no brainer action that first made him famous and has sadly produced his worst film since Derailed. It's nowhere near as bad as that film, but what I said still stands.<br /><br />A dull, predictable film, with very little in the way of any exciting action. What little there is mainly consists of some limp fight scenes, trying to look cool and trendy with some cheap slo-mo/sped up effects added to them that sadly instead make them look more desperate. Being a Mexican set film, director Isaac Florentine has tried to give the film a Robert Rodriguez/Desperado sort of feel, but this only adds to the desperation.<br /><br />VD gives a particularly uninspired performance and given he's never been a Robert De Niro sort of actor, that can't be good. As the villain, Lord shouldn't expect to leave the beeb anytime soon. He gets little dialogue at the beginning as he struggles to muster an American accent but gets mysteriously better towards the end. All the supporting cast are equally bland, and do nothing to raise the films spirits at all.<br /><br />This is one shepherd that's strayed right from the flock. *",

 'label': 0}

There are two fields in this dataset:

  • text: the movie review text.
  • label: a value that is either 0 for a negative review or 1 for a positive review.

3.预处理

The next step is to load a DistilBERT tokenizer(标识器) to preprocess the text field。

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT’s maximum input length:

当我们尝试把预处理函数作用在整个数据集上时,使用map函数并且设置batched(成批的,取这个名字还是很合理滴!)=True

实例如下:

tokenized_imdb = imdb.map(preprocess_function, batched=True)

Now create a batch of examples using DataCollatorWithPadding. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. 


Hide Pytorch content

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值