This article introduces a framework called nlp-basictasks.
nlp-basictasks is a lightweight library built on the PyTorch deep learning framework. It is designed for quickly assembling models for basic NLP tasks such as classification, matching, sequence labeling, and semantic similarity computation.
Below, the framework is used to build a BERT model for a text classification task.
Import the packages
import sys,os
import pandas as pd
import random
import numpy as np
from nlp_basictasks.tasks import cls
from nlp_basictasks.evaluation import clsEvaluator
from nlp_basictasks.readers.cls import getExamplesFromData
import nlp_basictasks
print(nlp_basictasks.__version__)
Dataset
The dataset is a Weibo (microblog) sentiment classification corpus.
Source: https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/weibo_senti_100k/intro.ipynb
Load the data
data_path='weibo_senti_100k.csv'
pd_all = pd.read_csv(data_path)
print('Number of reviews (total): %d' % pd_all.shape[0])
print('Number of reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])
print('Number of reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])
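Only two columns are used below: label (0/1) and review (the comment text). Printing the first few rows confirms the layout:
print(pd_all.columns.tolist())
print(pd_all.head())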
Shuffle the dataset and split it into training and validation sets
print(len(pd_all))
random_idx=np.random.permutation(len(pd_all))
sentences=pd_all['review'].values[random_idx].tolist()
labels=pd_all['label'].values[random_idx].tolist()
print(len(sentences),len(labels))
label2id={'0':0,'1':1}
dev_ratio=0.2  # use 20% of the data as the validation set
dev_nums=int(len(sentences)*dev_ratio)
train_nums=len(sentences)-dev_nums
print(dev_nums)
train_sentences=sentences[:train_nums]
train_labels=labels[:train_nums]
dev_sentences=sentences[-dev_nums:]
dev_labels=labels[-dev_nums:]
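# Optional sanity check: the two slices cover the shuffled data exactly once,
# since train_nums + dev_nums equals the total number of sentences.
assert train_nums + dev_nums == len(sentences)
print(len(train_sentences), len(train_labels), len(dev_sentences), len(dev_labels))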
train_examples,max_seq_len=getExamplesFromData(sentences=train_sentences,labels=train_labels,label2id=label2id,mode='train',return_max_len=True)
dev_examples=getExamplesFromData(sentences=dev_sentences,labels=dev_labels,label2id=label2id,mode='dev')
Define the paths and load the model
# max_seq_len is the length of the longest sentence in the training set
model_path='' # model_path is the directory containing the downloaded pretrained BERT weights, e.g. 'chinese-roberta-wwm/'
print(max_seq_len)
max_seq_len=min(512,max_seq_len)  # BERT accepts at most 512 tokens, so cap the sequence length
cls_model=cls(model_path=model_path,label2id=label2id,max_seq_length=max_seq_len,device='cuda')
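If no GPU is available, the device can be chosen at runtime instead of hard-coding 'cuda'. A minimal sketch, reusing the same cls constructor as above and assuming the device argument also accepts 'cpu':
import torch
device='cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU when CUDA is not available
cls_model=cls(model_path=model_path,label2id=label2id,max_seq_length=max_seq_len,device=device)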
Define the DataLoader and evaluator
from torch.utils.data import DataLoader
batch_size=32
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
evaluator=clsEvaluator(sentences=dev_sentences,label_ids=dev_labels,write_csv=False,label2id=label2id)
Train the model
output_path='' # output_path is the directory where the fine-tuned model will be saved
cls_model.fit(is_pairs=False,train_dataloader=train_dataloader,evaluator=evaluator,output_path=output_path)
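After training finishes, the fine-tuned weights are saved under output_path. Presumably they can be reloaded with the same cls constructor used above; this is an assumption about nlp_basictasks rather than documented behavior:
# Assumption: a model saved by fit() can be loaded the same way as the pretrained one.
cls_model=cls(model_path=output_path,label2id=label2id,max_seq_length=max_seq_len,device='cuda')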
Test the model
test_sentences=['这孩子真可爱','这人看起来像傻子似的']  # 'This kid is really cute', 'This person looks like an idiot'
predict_probs=cls_model.predict(is_pairs=False,dataloader=test_sentences)  # here the dataloader argument is simply a list of raw sentences
id2label={id_:label for label,id_ in label2id.items()}
predict_tags=[id2label[id_] for id_ in np.argmax(predict_probs,axis=1)]
print(predict_tags)
A label of 1 indicates positive sentiment and 0 indicates negative sentiment.
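To read the results sentence by sentence, the probabilities returned by predict can be paired with the decoded tags. This assumes predict_probs is an (n_sentences, n_labels) array, which is consistent with the np.argmax call above:
for sentence,probs,tag in zip(test_sentences,predict_probs,predict_tags):
    print(sentence,tag,np.max(probs))  # sentence, predicted label, probability of the predicted label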
The whole text classification task takes fewer than 80 lines of code. For more details, see the nlp-basictasks text classification tutorial; if you find the framework useful, please give it a star. Thanks!