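Preface
Text classification (Text Classification or Text Categorization, TC), also known as Automatic Text Categorization, is the process by which a computer maps a piece of text to one or more predefined categories or topics. The algorithmic model that carries out this mapping is called a classifier. Text classification is one of the classic problems in natural language processing.
There are many classic algorithms for text classification, such as Naive Bayes, SVM, KNN, and LightGBM. All of them convert the text into a text representation, extract features from it, and feed those features into a classifier.
Later, as neural networks developed, models such as LSTM, TextRCNN, and BERT were applied to text classification as well. I will not survey the algorithms of the field here; see the blog post "自然语言处理—文本分类综述/什么是文本分类", whose author put a lot of care into a very detailed write-up.
Straight to the code
Back to the topic of this post: starting from a pretrained BERT model, two fully connected layers are added after BERT's pooled [CLS] output, followed by a softmax for classification.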
Model definition
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModel, BertTokenizerFast

# specify GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class BERT_Arch(nn.Module):
    def __init__(self, bert, num_c):
        super(BERT_Arch, self).__init__()
        self.bert = bert
        # dropout layer
        self.dropout = nn.Dropout(p=0.1)
        # relu activation function
        self.relu = nn.ReLU()
        # dense layer 1 (768 is the hidden size of BERT-base)
        self.fc1 = nn.Linear(768, 512)
        # dense layer 2 (output layer)
        self.fc2 = nn.Linear(512, num_c)
        # log-softmax activation function (pairs with nn.NLLLoss)
        self.softmax = nn.LogSoftmax(dim=1)

    # define the forward pass
    def forward(self, sent_id, mask):
        # pass the inputs to the model; cls_hs is the pooled [CLS] output
        _, cls_hs = self.bert(sent_id, attention_mask=mask, return_dict=False)
        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)
        # output layer
        x = self.fc2(x)
        # apply log-softmax activation
        x = self.softmax(x)
        return x
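The classification model takes the pretrained BERT model and the number of classes as inputs. The definition is simple and clear, so there is not much more to say about it.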
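One detail worth making concrete: the model ends in nn.LogSoftmax, so it returns log-probabilities rather than probabilities, which is why the loss used later is nn.NLLLoss. The following is a minimal pure-Python sketch of that pairing for a single sample. The logits are made-up numbers standing in for the output of fc2, and the helper names are hypothetical, not part of the original code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the true class index."""
    return -log_probs[target]

# hypothetical fc2 output for one sample with 5 classes
logits = [2.0, 0.5, -1.0, 0.0, 1.0]

log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]   # exponentiate to recover probabilities
loss = nll_loss(log_probs, target=0)

# argmax is the same on logits, log-probabilities, and probabilities
predicted = max(range(len(log_probs)), key=log_probs.__getitem__)
```

Log-softmax followed by negative log-likelihood is the same quantity as the usual cross-entropy on raw logits, so the two-step form in this post is a matter of style rather than a different loss.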
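To initialize the model, first load the pretrained BERT model and the tokenizer class (BertTokenizerFast in the code below). Note that this training run, like the author's previous post on loading BERT with the transformers framework to extract Chinese word vectors, was developed with transformers and PyTorch, so the pretrained model and the way it is loaded are the same as in that post.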
# import BERT-base pretrained model
bert = AutoModel.from_pretrained('pycorrector\\datasets\\bert_models\\chinese_finetuned_lm')
# Load the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('pycorrector\\datasets\\bert_models\\chinese_finetuned_lm')
# pass the pre-trained BERT to our defined architecture (5 = number of classes)
model = BERT_Arch(bert, 5)
# push the model to GPU
model = model.to(device)
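Training data processing
The Chinese classification corpus used here is a news classification dataset released by Toutiao (今日头条), containing 382,688 samples in 15 categories. For this run the author used only 5 of the categories and took just 1,000 samples from each. A few rows are shown below: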
df = pd.read_csv("data/data_chinese.csv")
df.head()
label text
0 1 京城最值得你来场文化之旅的博物馆
1 1 发酵床的垫料种类有哪些?哪种更好?
2 1 上联:黄山黄河黄皮肤黄土高原。怎么对下联?
3 1 林徽因什么理由拒绝了徐志摩而选择梁思成为终身伴侣?
4 1 黄杨木是什么树?
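Before training, the 5,000 samples still need to be split into a training set, a validation set, and a test set (70% / 15% / 15%, stratified by label):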
train_text, temp_text, train_labels, temp_labels = train_test_split(df['text'], df['label'],
random_state=2018,
test_size=0.3,
stratify=df['label'])
# we will use temp_text and temp_labels to create validation and test set
val_text, test_text, val_labels, test_labels = train_test_split(temp_text, temp_labels,
random_state=2018,
test_size=0.5,
stratify=temp_labels)
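To make the arithmetic of the two-stage split explicit, here is a small sketch that computes the resulting set sizes. The function name and the use of round() are assumptions for illustration; with the numbers in this post nothing actually needs rounding.

```python
def split_sizes(n_total, first_test_size=0.3, second_test_size=0.5):
    """Sizes of train/val/test after holding out a share, then halving it."""
    n_temp = round(n_total * first_test_size)    # held out by the first split
    n_train = n_total - n_temp
    n_test = round(n_temp * second_test_size)    # second split of the held-out part
    n_val = n_temp - n_test
    return n_train, n_val, n_test

overall = split_sizes(5000)     # all 5 classes together
per_class = split_sizes(1000)   # stratified, so each class is split the same way
print(overall, per_class)
```

Because both splits are stratified, each class contributes 150 samples to the test set, which matches the support column of the classification report in the evaluation section.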
Next, the text is encoded into token ID sequences, the encoded data is converted to tensors, and the dataloaders are created:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
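max_seq_len = 30
# batch size for the dataloaders created below (must be defined before they are built)
batch_size = 32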
# tokenize and encode sequences in the training set
# (padding='max_length' replaces the deprecated pad_to_max_length=True)
tokens_train = tokenizer.batch_encode_plus(
    train_text.tolist(),
    max_length = max_seq_len,
    padding='max_length',
    truncation=True,
    return_token_type_ids=False
)
# tokenize and encode sequences in the validation set
tokens_val = tokenizer.batch_encode_plus(
    val_text.tolist(),
    max_length = max_seq_len,
    padding='max_length',
    truncation=True,
    return_token_type_ids=False
)
# tokenize and encode sequences in the test set
tokens_test = tokenizer.batch_encode_plus(
    test_text.tolist(),
    max_length = max_seq_len,
    padding='max_length',
    truncation=True,
    return_token_type_ids=False
)
# for train set
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())
# for validation set
val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())
# for test set
test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())
# wrap tensors
train_data = TensorDataset(train_seq, train_mask, train_y)
# sampler for sampling the data during training
train_sampler = RandomSampler(train_data)
# dataLoader for train set
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
# wrap tensors
val_data = TensorDataset(val_seq, val_mask, val_y)
# sampler for iterating over the validation data in order
val_sampler = SequentialSampler(val_data)
# dataLoader for validation set
val_dataloader = DataLoader(val_data, sampler = val_sampler, batch_size=batch_size)
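The encoding step above pads or truncates every sequence to max_seq_len and returns an attention mask that marks which positions hold real tokens. The sketch below imitates that behavior in plain Python. It is a simplification: the real tokenizer also inserts the special [CLS] and [SEP] tokens and keeps them when truncating, and pad_id=0 and the example IDs are assumptions for illustration.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = token_ids[:max_len]            # truncate sequences that are too long
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding

# hypothetical token IDs for a short and a long sentence
short_ids, short_mask = pad_or_truncate([101, 2769, 4263, 102], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The mask is what gets passed to BERT as attention_mask, so that padding positions are ignored by self-attention.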
Defining the training parameters
#define a batch size
batch_size = 32
# AdamW optimizer from PyTorch (the AdamW class in transformers is deprecated
# and no longer available in recent releases)
from torch.optim import AdamW
# define the optimizer
optimizer = AdamW(model.parameters(), lr = 1e-3)
from sklearn.utils.class_weight import compute_class_weight
#compute the class weights
class_wts = compute_class_weight(class_weight='balanced', classes=np.unique(train_labels), y=train_labels)
# convert class weights to tensor
weights= torch.tensor(class_wts,dtype=torch.float)
weights = weights.to(device)
# loss function
cross_entropy = nn.NLLLoss(weight=weights)
# number of training epochs
epochs = 10
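The 'balanced' setting used for the class weights above assigns each class the weight n_samples / (n_classes * class_count). Here is a minimal sketch of that formula in plain Python; the function name is hypothetical and the label lists are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# training split of this post: 700 samples for each of 5 classes
even = balanced_class_weights([c for c in range(5) for _ in range(700)])

# a made-up imbalanced example: rarer classes get larger weights
skewed = balanced_class_weights([0] * 6 + [1] * 3 + [2] * 1)

print(even)
print(skewed)
```

Since the training split here is perfectly balanced, every weight comes out as 1.0 and the weighting has no effect on the loss. It only starts to matter once the class counts differ.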
Defining the training and evaluation functions
# function to train the model
def train():
    model.train()
    total_loss = 0
    # empty list to save model predictions
    total_preds = []
    # iterate over batches
    for step, batch in enumerate(train_dataloader):
        # progress update after every 50 batches
        if step % 50 == 0 and not step == 0:
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))
        # push the batch to gpu
        batch = [r.to(device) for r in batch]
        sent_id, mask, labels = batch
        # clear previously calculated gradients
        model.zero_grad()
        # get model predictions for the current batch
        preds = model(sent_id, mask)
        # compute the loss between actual and predicted values
        loss = cross_entropy(preds, labels)
        # add on to the total loss
        total_loss = total_loss + loss.item()
        # backward pass to calculate the gradients
        loss.backward()
        # clip the gradients to 1.0; this helps prevent the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # update parameters
        optimizer.step()
        # model predictions are stored on GPU, so push them to CPU
        preds = preds.detach().cpu().numpy()
        # append the model predictions
        total_preds.append(preds)
    # compute the training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)
    # predictions are in the form of (no. of batches, size of batch, no. of classes);
    # reshape them to (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)
    # return the loss and predictions
    return avg_loss, total_preds
# function for evaluating the model
def evaluate():
    print("\nEvaluating...")
    # deactivate dropout layers
    model.eval()
    total_loss = 0
    # empty list to save the model predictions
    total_preds = []
    # iterate over batches
    for step, batch in enumerate(val_dataloader):
        # progress update every 50 batches
        # (the original elapsed-time line used undefined format_time/t0 and was removed)
        if step % 50 == 0 and not step == 0:
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(val_dataloader)))
        # push the batch to gpu
        batch = [t.to(device) for t in batch]
        sent_id, mask, labels = batch
        # deactivate autograd
        with torch.no_grad():
            # model predictions
            preds = model(sent_id, mask)
            # compute the validation loss between actual and predicted values
            loss = cross_entropy(preds, labels)
            total_loss = total_loss + loss.item()
            preds = preds.detach().cpu().numpy()
            total_preds.append(preds)
    # compute the validation loss of the epoch
    avg_loss = total_loss / len(val_dataloader)
    # reshape the predictions to (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)
    return avg_loss, total_preds
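The training function above calls torch.nn.utils.clip_grad_norm_ with a maximum norm of 1.0. As a rough illustration of what that does, the sketch below applies the same idea to a flat list of gradient values: compute the overall L2 norm and, only if it exceeds the limit, scale every gradient down by the same factor. The function name and the example gradients are made up, and this is a simplified model of the behavior rather than the library implementation.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale up
    return [g * scale for g in grads], total_norm

# gradients with norm 5.0 are shrunk to norm ~1.0, keeping their direction
big_clipped, big_norm = clip_by_global_norm([3.0, 4.0])

# gradients with norm 0.5 are already within the limit and stay unchanged
small_clipped, small_norm = clip_by_global_norm([0.3, 0.4])

print(big_norm, big_clipped)
print(small_norm, small_clipped)
```

Because every gradient is multiplied by the same factor, clipping changes the step size but not the direction of the update.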
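Model training
Before training, the parameters inside BERT need to be frozen. That way BERT itself is not trained; only the two fully connected layers attached after it are.
# freeze all the parameters
for param in bert.parameters():
    param.requires_grad = False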
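With BERT frozen, the only trainable parameters left are those of fc1 and fc2. A quick back-of-the-envelope count, using the layer sizes from the model definition and 5 classes, shows how small the trained part is. The helper name is hypothetical.

```python
def linear_params(in_features, out_features):
    """Parameter count of a fully connected layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features

fc1_params = linear_params(768, 512)   # BERT hidden size -> 512
fc2_params = linear_params(512, 5)     # 512 -> number of classes
trainable = fc1_params + fc2_params

print(fc1_params, fc2_params, trainable)
# In PyTorch the same number can be read off the model with:
#   sum(p.numel() for p in model.parameters() if p.requires_grad)
```

That is under 0.4 million trainable parameters, which is a large part of why training on a few thousand samples is feasible here and why a comparatively high learning rate such as 1e-3 can be used.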
Next comes the training loop, which also saves the best model:
# set initial loss to infinite
best_valid_loss = float('inf')
# empty lists to store training and validation loss of each epoch
train_losses = []
valid_losses = []
# for each epoch
for epoch in range(epochs):
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    # train model
    train_loss, _ = train()
    # evaluate model
    valid_loss, _ = evaluate()
    # save the best model (lowest validation loss so far)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')
# load weights of best model
path = 'saved_weights.pt'
model.load_state_dict(torch.load(path))
Model evaluation
# get predictions for test data
with torch.no_grad():
    preds = model(test_seq.to(device), test_mask.to(device))
    preds = preds.detach().cpu().numpy()
# model's performance
preds = np.argmax(preds, axis = 1)
print(classification_report(test_y, preds))
              precision    recall  f1-score   support

           0       0.85      0.75      0.79       150
           1       0.72      0.87      0.79       150
           2       0.87      0.68      0.76       150
           3       0.88      0.85      0.86       150
           4       0.81      0.96      0.88       150

    accuracy                           0.82       750
   macro avg       0.83      0.82      0.82       750
weighted avg       0.83      0.82      0.82       750
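As the report shows, after 10 epochs the precision of the 5 classes is roughly around 0.8, with the best class reaching 0.88 and the worst 0.72. This is probably due to the small amount of data: only 1,000 randomly sampled examples per class were used, which is far from enough for a real application. Readers are welcome to experiment with more data.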
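As a sanity check on how the columns of the report relate to each other, the sketch below recomputes F1, the macro averages, and the accuracy from the per-class precision and recall shown above. The inputs are the rounded values from the report, so the recomputed F1 can differ from the printed one by about 0.01. The variable names are my own.

```python
# (precision, recall, reported f1) per class, copied from the report
report = {
    0: (0.85, 0.75, 0.79),
    1: (0.72, 0.87, 0.79),
    2: (0.87, 0.68, 0.76),
    3: (0.88, 0.85, 0.86),
    4: (0.81, 0.96, 0.88),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

recomputed_f1 = {c: f1_score(p, r) for c, (p, r, _) in report.items()}
macro_precision = sum(p for p, _, _ in report.values()) / len(report)
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# every class has the same support (150), so accuracy equals the mean recall
accuracy_estimate = macro_recall

print(recomputed_f1)
print(round(macro_precision, 2), round(macro_recall, 2), round(accuracy_estimate, 2))
```

The equal support per class is also why the macro and weighted averages in the report are identical.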
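Summary
There is not much to summarize, really, other than to marvel at how far NLP has come. In the early days you had to process the text by hand, hand-pick features, and feed them into classical machine learning models such as SVM or KNN. Even with tens of thousands of samples per class the accuracy tended to be low, and reaching 70% to 80% was probably considered good. Now, with large pretrained models such as BERT and a range of easy-to-use open-source frameworks, model training has become much easier: data preprocessing mostly comes down to preparing the texts and their labels and feeding them to the training framework in the expected format. Even with only a thousand samples per class, a passable accuracy can be achieved.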