Recurrent Neural Network for Text Classification with Multi Task Learning

尧景

Article link: https://blog.csdn.net/Ying_M/article/details/118700381

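Preparing to Read the Paper

Prerequisite Knowledge

Learning Objectives

Paper Overview

Research Background, Results, and Significance

Research Background

Application scenarios for sentiment classification include review analysis, data-mining analysis, and public-opinion analysis.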

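Sentiment classification based on machine learning

Sentiment classification based on deep learning
Feedforward neural networks

No hand-crafted features are needed; the model learns the features itself.

Autoencoders
Convolutional neural networks

Recurrent neural networks

Composite structures built on recurrent neural networks

Datasets

Research Significance

Background on sentiment classification
In the machine-learning stage, features had to be constructed by hand.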

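Skimming the Paper

Paper Structure

Abstract

Key Points of the Abstract

1. A common way to cope with scarce labeled data is to use an unsupervised pre-trained model such as word vectors. This has worked well in experiments, but it improves the network only indirectly;
2. For multiple text classification tasks, the authors propose RNN-based multi-task training with shared model weights, together with three information-sharing mechanisms;
3. The authors model text with task-specific and shared layers and achieve good results on four benchmark text classification tasks.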

Close Reading of the Paper

Overview of the Models

Knowledge Tree
RNN structure

LSTM for variable-length sequences

Definition of the loss function

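Model Details

Detail 1

Weight sharing
Neural-network-based multi-task learning has already been applied successfully to many NLP tasks, and the authors argue that it can also be applied to sentiment classification. They adopt this kind of one-to-many shared word representation.
In the figure, the tasks share the hidden layer in the middle. Because different tasks may have different inputs and outputs, the input and output layers cannot be shared. Building on this idea, and varying how the middle layer is shared and how the inputs and outputs are handled, the authors propose three architectures: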
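Uniform-Layer
The middle layer is shared; the top and bottom of the figure are the inputs of two tasks. Both tasks feed the same shared layer, and each task has its own softmax output. Concretely, there are four datasets and therefore four tasks. Suppose the top is task 1 and the bottom is task 2: the two tasks are mixed and trained together. x^(s)_t denotes the shared representation and x^(m)_t the task-specific input; the actual model input is x^(m)_t with x^(s)_t appended. x^(s)_t is initialized randomly and then updated throughout training.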
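The uniform-layer idea can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' implementation: the class name `UniformLayerModel`, the toy vocabulary size, and the use of `nn.LSTM` with `batch_first=True` are all assumptions made for the example. Only the embedding size (64) and hidden size (50) come from the hyperparameters reported later in these notes.

```python
import torch
import torch.nn as nn


class UniformLayerModel(nn.Module):
    """Sketch of the uniform-layer idea: one LSTM shared by all tasks."""

    def __init__(self, vocab_size, num_classes_per_task, emb_dim=64, hidden_dim=50):
        super().__init__()
        # Shared embedding x^(s)_t: randomly initialised, trained by every task.
        self.shared_emb = nn.Embedding(vocab_size, emb_dim)
        # Task-specific embeddings x^(m)_t, one table per task.
        self.task_embs = nn.ModuleList(
            [nn.Embedding(vocab_size, emb_dim) for _ in num_classes_per_task]
        )
        # A single LSTM layer shared across tasks.
        self.lstm = nn.LSTM(2 * emb_dim, hidden_dim, batch_first=True)
        # One output layer per task (the softmax is applied inside the loss).
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, n) for n in num_classes_per_task]
        )

    def forward(self, tokens, task_id):
        # tokens: [batch, seq_len] integer word indices
        x = torch.cat(
            [self.task_embs[task_id](tokens), self.shared_emb(tokens)], dim=-1
        )
        _, (h_n, _) = self.lstm(x)            # h_n: [1, batch, hidden_dim]
        return self.heads[task_id](h_n[-1])   # logits: [batch, num_classes]


# Hypothetical setup: task 0 has 5 classes, task 1 has 2 classes.
model = UniformLayerModel(vocab_size=100, num_classes_per_task=[5, 2])
tokens = torch.randint(0, 100, (3, 7))
logits_task0 = model(tokens, task_id=0)
logits_task1 = model(tokens, task_id=1)
```

Both calls run through the same LSTM weights; only the task-specific embedding table and the output head change with `task_id`.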
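Coupled-Layer
Implemented with two LSTMs

Each task has its own LSTM layer, so it can focus on its own task without too much interference from the other. The two tasks run at the same time, and each LSTM also makes use of the other's output, so it can capture some information from the other task.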
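A rough sketch of the coupling between two task-specific LSTMs is given below. It is a deliberate simplification and not the paper's exact equations: here the other task's previous hidden state is passed through a learned gate and appended to the input of each `nn.LSTMCell`, whereas the paper defines the coupling inside the LSTM update itself. The class name `CoupledLayerSketch`, the gate layout, and all sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn


class CoupledLayerSketch(nn.Module):
    """Two task-specific LSTMs that read each other's previous hidden state."""

    def __init__(self, vocab_size, num_classes=(5, 2), emb_dim=64, hidden_dim=50):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Each cell sees the word embedding plus the gated state of the other task.
        self.cells = nn.ModuleList(
            [nn.LSTMCell(emb_dim + hidden_dim, hidden_dim) for _ in range(2)]
        )
        # Gate controlling how much of the other task's state is let through.
        self.gates = nn.ModuleList(
            [nn.Linear(emb_dim + hidden_dim, hidden_dim) for _ in range(2)]
        )
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, n) for n in num_classes])

    def forward(self, tokens):
        # tokens: [batch, seq_len]; the same sentence is read by both LSTMs.
        batch = tokens.size(0)
        x = self.emb(tokens)
        h = [x.new_zeros(batch, self.hidden_dim) for _ in range(2)]
        c = [x.new_zeros(batch, self.hidden_dim) for _ in range(2)]
        for t in range(x.size(1)):
            x_t = x[:, t]
            prev_h = list(h)  # both cells read the states from step t-1
            for m in range(2):
                other = prev_h[1 - m]
                g = torch.sigmoid(self.gates[m](torch.cat([x_t, other], dim=-1)))
                inp = torch.cat([x_t, g * other], dim=-1)
                h[m], c[m] = self.cells[m](inp, (prev_h[m], c[m]))
        # One set of logits per task, taken from the last hidden state.
        return [self.heads[m](h[m]) for m in range(2)]


coupled = CoupledLayerSketch(vocab_size=100)
coupled_logits = coupled(torch.randint(0, 100, (3, 7)))
```

Each task keeps its own parameters, and the only path for cross-task information is the gated hidden state, which matches the intuition described above.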
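Shared-Layer

This model combines the strengths of the previous two. The shared weights again sit in the middle, in a bidirectional LSTM. The key equations are given in the paper.

Information from the shared layer is mixed into the input, so h^(s)_t carries features of both tasks.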
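The shared-layer idea can be sketched as below, assuming the simplest way of mixing the shared information into the input: the hidden state of the shared bidirectional LSTM is concatenated with the word embedding at every time step before it enters the task-specific LSTM. The paper's own formulation may inject the shared state differently; the class name `SharedLayerSketch` and the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SharedLayerSketch(nn.Module):
    """A shared Bi-LSTM feeding one task-specific LSTM per task."""

    def __init__(self, vocab_size, num_classes_per_task, emb_dim=64, hidden_dim=50):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Shared bidirectional LSTM producing h^(s)_t for every time step.
        self.shared = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Task-specific LSTMs read the embedding plus both directions of h^(s)_t.
        self.task_lstms = nn.ModuleList(
            [
                nn.LSTM(emb_dim + 2 * hidden_dim, hidden_dim, batch_first=True)
                for _ in num_classes_per_task
            ]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, n) for n in num_classes_per_task]
        )

    def forward(self, tokens, task_id):
        x = self.emb(tokens)                      # [batch, seq_len, emb_dim]
        h_shared, _ = self.shared(x)              # [batch, seq_len, 2 * hidden_dim]
        inp = torch.cat([x, h_shared], dim=-1)
        _, (h_n, _) = self.task_lstms[task_id](inp)
        return self.heads[task_id](h_n[-1])       # logits: [batch, num_classes]


shared_model = SharedLayerSketch(vocab_size=100, num_classes_per_task=[5, 2])
batch_tokens = torch.randint(0, 100, (4, 9))
shared_logits0 = shared_model(batch_tokens, task_id=0)
shared_logits1 = shared_model(batch_tokens, task_id=1)
```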
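Model Comparison

Model 1 (Uniform-layer Architecture): for each classification task, a randomly initialized, trainable vector is concatenated to the embedding vector of every input word in that task. All tasks share one LSTM layer, and the hidden state at the last time step is fed to the softmax. The shared parts of this model are the randomly initialized trainable vectors in the input and the shared LSTM layer;
Model 2 (Coupled-layer Architecture): each task has its own LSTM layer, but at every time step the hidden states of all tasks are fed in together with the next word as input. The hidden state at the last time step is used for classification;
Model 3 (Shared-layer Architecture): in addition to a shared Bi-LSTM layer that captures the shared information, each task has its own LSTM layer, whose input consists of the word at each time step and the hidden state of the Bi-LSTM.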

Training Procedure

Training

Loss

λ_m should take into account factors such as the size of each dataset and the number of classes.
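As a small illustration of the role of λ_m, the sketch below assumes that the overall objective is a weighted sum of per-task cross-entropy losses, with one weight per task. The function name `multitask_loss` and the weight values are hypothetical; how the weights should actually be chosen is not specified in these notes.

```python
import math

import torch
import torch.nn.functional as F


def multitask_loss(logits_per_task, labels_per_task, task_weights):
    """Weighted sum of per-task cross-entropy losses (weights act as lambda_m)."""
    total = torch.zeros(())
    for logits, labels, lam in zip(logits_per_task, labels_per_task, task_weights):
        total = total + lam * F.cross_entropy(logits, labels)
    return total


# Toy check: all-zero logits give a uniform softmax, so each task's
# cross-entropy equals log(number of classes).
logits = [torch.zeros(4, 5), torch.zeros(4, 2)]   # a 5-class and a 2-class task
labels = [torch.tensor([0, 1, 2, 3]), torch.tensor([0, 1, 1, 0])]
loss = multitask_loss(logits, labels, task_weights=[0.5, 2.0])
expected = 0.5 * math.log(5) + 2.0 * math.log(2)
```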

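Data Selection

Training method:

1. Randomly select a task;
2. Randomly select a training example from that task;
3. Update the parameters with a gradient-based optimizer (the paper uses the Adagrad update rule);
4. Repeat steps 1-3.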
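The four steps above can be sketched as a short loop. Everything except the use of Adagrad is a toy stand-in: random feature vectors replace sentences, a linear layer replaces the shared LSTM, and the dataset sizes and step count are arbitrary assumptions.

```python
import random

import torch
import torch.nn as nn

random.seed(0)
torch.manual_seed(0)

# Toy stand-ins: two tasks with 5 and 2 classes, feature vectors instead of sentences.
shared = nn.Linear(8, 16)
heads = nn.ModuleList([nn.Linear(16, 5), nn.Linear(16, 2)])
datasets = [
    [(torch.randn(8), torch.tensor(random.randrange(n))) for _ in range(20)]
    for n in (5, 2)
]

params = list(shared.parameters()) + list(heads.parameters())
optimizer = torch.optim.Adagrad(params, lr=0.1)
criterion = nn.CrossEntropyLoss()

steps_per_task = [0, 0]
for step in range(100):
    task_id = random.randrange(len(datasets))            # 1. pick a task at random
    features, label = random.choice(datasets[task_id])   # 2. pick one training example
    logits = heads[task_id](torch.tanh(shared(features)))
    loss = criterion(logits.unsqueeze(0), label.unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()                                       # 3. gradient-based update
    optimizer.step()
    steps_per_task[task_id] += 1                          # 4. repeat
```

Because the task is re-sampled at every step, the shared parameters receive gradients from all tasks in an interleaved order, while each head is only updated by its own task.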

Fine-tuning

Pre-training

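Experimental Setup and Results

Datasets

SST-1: movie reviews with five sentiment classes, from the Stanford Sentiment Treebank
SST-2: movie reviews with binary labels, from the Stanford Sentiment Treebank
SUBJ: a subjectivity dataset; the task is to classify sentences as subjective or objective
IMDB: movie reviews with binary labels; most reviews are long sentences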

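Hyperparameter Analysis

Word vectors are obtained by running Word2vec on a Wikipedia corpus, with a vocabulary of about 500,000 words. The word embeddings are fine-tuned during training to improve performance. All other parameters are initialized by random sampling from [-0.1, 0.1], and the hyperparameters are chosen as the set that performs best on the validation set. Datasets without a validation set use 10-fold cross-validation.
The embedding size for both the task-specific and shared layers is 64; in Model 1 each word has two embeddings, both of size 64. The LSTM hidden size is 50, the initial learning rate is 0.1, and the regularization weight on the parameters is 10^-5.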
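The settings listed above could be wired up in PyTorch roughly as follows. The numbers come from the text; treating the regularization weight as Adagrad's `weight_decay` and applying the uniform initialization to every LSTM parameter are assumptions about how one might realize them, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

EMB_DIM = 64          # size of each embedding (task-specific and shared)
HIDDEN_DIM = 50       # LSTM hidden size
LEARNING_RATE = 0.1   # initial learning rate
L2_WEIGHT = 1e-5      # regularization weight
INIT_RANGE = 0.1      # parameters sampled uniformly from [-0.1, 0.1]

# Model 1 concatenates two 64-dimensional embeddings per word.
lstm = nn.LSTM(2 * EMB_DIM, HIDDEN_DIM)
for p in lstm.parameters():
    nn.init.uniform_(p, -INIT_RANGE, INIT_RANGE)

optimizer = torch.optim.Adagrad(
    lstm.parameters(), lr=LEARNING_RATE, weight_decay=L2_WEIGHT
)
```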

Experimental Results

Model 1

Model 2

Model 3
The experiments above compare the proposed models with one another; the comparison with other models follows.

Comparison with the State of the Art

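Visualization Analysis

The authors compare a single-layer LSTM with Model 3: a sentiment prediction is computed from the hidden state at every time step, which shows directly how much each word contributes to the model's output.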
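The per-time-step prediction behind this kind of plot can be sketched as follows: the classifier is applied to the hidden state after every word instead of only the last one. The model here is untrained and tiny, and every name and size is an assumption for illustration, so the resulting curve has no meaning beyond showing the mechanics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb = nn.Embedding(50, 16)
lstm = nn.LSTM(16, 8, batch_first=True)
classifier = nn.Linear(8, 2)   # index 1 is treated as the "positive" class here

tokens = torch.randint(0, 50, (1, 6))   # one sentence of six word ids
with torch.no_grad():
    outputs, _ = lstm(emb(tokens))                        # [1, 6, 8]: one hidden state per word
    probs = torch.softmax(classifier(outputs), dim=-1)    # [1, 6, 2]

# Probability of the positive class after reading each successive word.
positive_curve = probs[0, :, 1]
```

Plotting `positive_curve` against the words of the sentence gives the kind of curve where a sharp rise or drop marks a word that strongly shifts the prediction.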

Error Analysis

(1) Complex sentence structure (2) Sentences that depend on a specific context

Paper Summary

Key Points

Shared layer
Multi-task learning
Pre-training

Innovations

Shared weights
Mixed multi-task training
RNN architectures for sentiment classification

Takeaways

Multi-task training
Multi-task Learning
Shared weights across tasks
The differences among them are the mechanisms of sharing information among the several tasks.
Performance improvement
Experimental results show that our models can improve the performances of a group of related tasks by exploring common features.
Research on character-level sharing mechanisms
In future work, we would like to investigate the other sharing mechanisms of the different task.

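import torch
# Field/LabelField/BucketIterator belong to the legacy torchtext API.
# On torchtext 0.9-0.11 import them from torchtext.legacy instead; they were removed in 0.12.
from torchtext import data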

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)
'''
Another handy feature of torchtext is that it has support for common datasets used in natural language processing.
The following code automatically downloads the IMDB dataset and splits it into the canonical train/test splits as torchtext.datasets objects. It processes the data using the Fields we have previously defined. The IMDB dataset consists of 50,000 movie reviews, each marked as being a positive or negative review.
'''
from torchtext import datasets
train_data,test_data = datasets.IMDB.splits(TEXT, LABEL)
# We can see how many examples are in each split by checking their length.
print(f'number of training examples:{len(train_data)}')
print(f'number of testing examples:{len(test_data)}')
# We can also check an example
print(vars(train_data.examples[0]))
'''
The IMDB dataset only has train/test splits, so we need to create a validation set. We can do this with the .split() method. By default this splits 70/30, however by passing a split_ratio argument, we can change the ratio of the split, i.e. a split_ratio of 0.8 would mean 80% of the examples make up the training set and 20% make up the validation set.
We also pass our random seed to the random_state argument, ensuring that we get the same train/validation split each time.
'''
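import random
# random.seed() returns None, so seed first and pass the resulting state explicitly.
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())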
# Again, we'll view how many examples are in each split.
print(f'Number of training examples:{len(train_data)}') #17500
print(f'Number of validation examples:{len(valid_data)}') #7500
print(f'Number of testing examples:{len(test_data)}') #25000

# The following builds the vocabulary, only keeping the most common max_size tokens.
MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(train_data, max_size=MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

print(f'Unique tokens in TEXT vocabulary:{len(TEXT.vocab)}') #25002
print(f'Unique tokens in LABEL vocabulary:{len(LABEL.vocab)}') #2
# We can also view the most common words in the vocabulary and their frequencies.
print(TEXT.vocab.freqs.most_common(20))
#We can also see the vocabulary directly using either the stoi(string to int) or itos(int to string) method.
print(TEXT.vocab.itos[:10])
#We can also check the labels, ensuring 0 is for negative and 1 is for positive.
print(LABEL.vocab.stoi)
'''
The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.
We'll use a BucketIterator which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.
We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using torch.device, we then pass this device to the iterator.
'''
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator,test_iterator = data.BucketIterator.splits(
	(train_data, valid_data, test_data),
	batch_size=BATCH_SIZE,
	device=device)

#model
import torch.nn as nn
class RNN(nn.Module):
	def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
		super().__init__()
		self.embedding = nn.Embedding(input_dim, embedding_dim)
		self.rnn = nn.RNN(embedding_dim, hidden_dim)
		self.fc = nn.Linear(hidden_dim, output_dim)
	def forward(self, text):
		embedded = self.embedding(text) #text=[sent_len, batch_size]
		output, hidden = self.rnn(embedded)
		#output = [sent_len, batch_size, hid_dim]
		#hidden = [1, batch_size, hid_dim]
		assert torch.equal(output[-1,:,:],hidden.squeeze(0))
		return self.fc(hidden.squeeze(0))
'''
We now create an instance of our RNN class.
The input dimension is the dimension of the one-hot vectors, which is equal to the vocabulary size.
The embedding dimension is the size of the dense word vectors. This is usually around 50-250 dimensions, but depends on the size of the vocabulary.
The hidden dimension is the size of the hidden states. This is usually around 100-500 dimensions, but also depends on factors such as the vocabulary size, the size of the dense vectors and the complexity of the task.
The output dimension is usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.
'''
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
# Let's also create a function that will tell us how many trainable parameters our model has so we can compare the number of parameters across different models.
def count_parameters(model):
	return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {count_parameters(model):,} trainable parameters')

#create an optimizer
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=1e-3)
#define loss function
criterion = nn.BCEWithLogitsLoss()
# Using .to, we can place the model and the criterion on the GPU
model = model.to(device)
criterion = criterion.to(device)

def binary_accuracy(preds, y):
	'''
	Return accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, not 8
	'''
	# round predictions to the closest integer
	rounded_preds = torch.round(torch.sigmoid(preds))
	correct = (rounded_preds == y).float()#convert into float for division
	acc = correct.sum() / len(correct)
	return acc

def train(model, iterator, optimizer, criterion):
	epoch_loss = 0
	epoch_acc = 0
	model.train()
	for batch in iterator:
		optimizer.zero_grad()
		predictions = model(batch.text).squeeze(1)
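		loss = criterion(predictions, batch.label)
		# accuracy must be computed here; the original loop used acc without defining it
		acc = binary_accuracy(predictions, batch.label)
		loss.backward()
		optimizer.step()
		epoch_loss += loss.item()
		epoch_acc += acc.item()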
	return epoch_loss/len(iterator), epoch_acc/len(iterator)
def evaluate(model, iterator, criterion):
	epoch_loss = 0
	epoch_acc = 0
	model.eval()
	with torch.no_grad():
		for batch in iterator:
			predictions = model(batch.text).squeeze(1)
			loss = criterion(predictions, batch.label)
			acc = binary_accuracy(predictions, batch.label)
			epoch_loss += loss.item()
			epoch_acc += acc.item()
	return epoch_loss/len(iterator), epoch_acc/len(iterator)
import time
def epoch_time(start_time, end_time):
	elapsed_time = end_time-start_time
	elapsed_mins = int(elapsed_time / 60)
	elapsed_secs = int(elapsed_time - (elapsed_mins*60))	
	return elapsed_mins, elapsed_secs
'''
We then train the model through multiple epochs, an epoch being a complete pass through all examples in the training and validation sets.
At each epoch, if the validation loss is the best we have seen so far, we'll save the parameters of the model and then after training has finished we'll use that model on the test set.
'''
N_EPOCH = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCH):
	start_time = time.time()
	train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
	valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
	end_time = time.time()
	epoch_mins, epoch_secs = epoch_time(start_time, end_time)
	if valid_loss<best_valid_loss:
		best_valid_loss = valid_loss
		torch.save(model.state_dict(),'tuti-model.pt')
	print(f'Epoch:{epoch+1:02} | Epoch Time:{epoch_mins}m {epoch_secs}s')
	print(f'\tTrain Loss:{train_loss:.3f} | Train Acc:{train_acc*100:.2f}%')
	print(f'\tVal Loss:{valid_loss:.3f} | Val Acc:{valid_acc*100:.2f}%')
'''
You may have noticed the loss is not really decreasing and the accuracy is poor. This is due to several issues with the model which we'll improve in the next notebook.
Finally, the metric we actually care about: the test loss and accuracy, which we get from the parameters that gave us the best validation loss.
'''
model.load_state_dict(torch.load('tuti-model.pt'))
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss:{test_loss:.3f} | Test acc:{test_acc*100:.2f}%')
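
As a quick sanity check of the accuracy metric used in the loops above, the snippet below runs a standalone copy of binary_accuracy on four hand-picked logits. The tensors are made-up values chosen for illustration; the function body mirrors the one in the tutorial with the argument name corrected.

```python
import torch


def binary_accuracy(preds, y):
    # preds are raw logits; sigmoid maps them to (0, 1) before rounding to 0 or 1
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)


logits = torch.tensor([2.0, -1.0, 0.5, -3.0])   # rounds to 1, 0, 1, 0
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
acc = binary_accuracy(logits, labels)
print(f'{acc.item():.2f}')  # 0.75, since three of the four predictions match
```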