Data
The data is borrowed from this post on a 4-class text classification problem: https://blog.csdn.net/qq_28626909/article/details/80382029
Code references
The torchtext preprocessing follows the posts of nlpuser and dendi_hust: https://blog.csdn.net/nlpuser/article/details/88067167 https://blog.csdn.net/dendi_hust/article/details/101221922
The model code follows dendi_hust's post: https://blog.csdn.net/dendi_hust/article/details/94435919
The source post only contains the model code, so I found a dataset, ran the model on it, filled in the rest of the pipeline, and added some comments to make the code easier to follow. The dataset is small, so without pretrained word vectors the model is bound to underfit; the results are for reference only.
Preprocessing
First, concatenate the files into a single DataFrame.
The rest of the preprocessing is done with the torchtext package.
torchtext官方教程:https://torchtext.readthedocs.io/en/latest/index.html
Field: defining the data format
Key parameters:

sequential: whether the data is sequential (e.g. string data); default True
use_vocab: whether to use a Vocab object; if False, the field must already be numeric; default True
tokenize: a function (e.g. str.split, jieba.cut) used to tokenize strings
batch_first: if True, the first dimension of the returned Tensor is the batch size; default False
fix_length: fixed length for this field; if None, examples are padded to the longest one in the batch
stop_words: a list of stop words, removed during the tokenize step

Key functions:

build_vocab: builds the Vocab for this Field
CONTENT = Field(sequential=True, tokenize=jieba.cut, batch_first=True, fix_length=200, stop_words=stopwords)
LABEL = Field(sequential=False, use_vocab=False)
Dataset: building the dataset
Key parameters:

examples: a list of Example objects
fields: a list of (str, Field) tuples, where str is the name of the Field

Key functions:

split(): splits the dataset into train, test, and valid sets. Its split_ratio parameter is a float or a list: a float in [0, 1] gives the proportion of the data that goes into the train set, with the rest going to valid; a list (e.g. [0.7, 0.2, 0.1]) gives the train, test, and valid proportions. The default is 0.7.
# get_dataset builds and returns the examples and fields needed by Dataset
def get_dataset(csv_data, text_field, label_field, test=False):
    # the id column is not used during training, so None is passed as its field
    fields = [("id", None),  # we won't be needing the id, so we pass in None as the field
              ("content", text_field),
              ("label", label_field)]
    examples = []
    if test:
        # for the test set, do not load labels
        for text in tqdm(csv_data['content']):
            examples.append(Example.fromlist([None, text, None], fields))
    else:
        for text, label in tqdm(zip(csv_data['content'], csv_data['label'])):
            examples.append(Example.fromlist([None, text, label], fields))
    return examples, fields
train_examples, train_fields = get_dataset(train_df, CONTENT, LABEL)
test_examples, test_fields = get_dataset(test_df, CONTENT, None, test=True)
# build the Dataset objects
train_dataset = Dataset(train_examples, train_fields)
test_dataset = Dataset(test_examples, test_fields)
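The split() behavior described above can be approximated by hand. Below is a minimal pure-Python sketch of the split_ratio logic; ratio_split is a hypothetical helper written for illustration, not part of torchtext:

```python
import random

def ratio_split(examples, split_ratio=0.7, seed=0):
    """Split a list of examples the way Dataset.split() interprets split_ratio:
    a float gives the train share (the rest goes to valid); a list of three
    floats gives the train/test/valid shares."""
    examples = examples[:]                  # work on a copy
    random.Random(seed).shuffle(examples)   # shuffle before splitting
    n = len(examples)
    if isinstance(split_ratio, float):
        cut = round(n * split_ratio)
        return examples[:cut], examples[cut:]             # train, valid
    train_r, test_r, valid_r = split_ratio
    a = round(n * train_r)
    b = a + round(n * test_r)
    return examples[:a], examples[a:b], examples[b:]      # train, test, valid

# e.g. 100 examples split 0.7 / 0.2 / 0.1
train, test, valid = ratio_split(list(range(100)), [0.7, 0.2, 0.1])
```

With the real torchtext Dataset, calling train_dataset.split(split_ratio=[0.7, 0.2, 0.1]) performs the equivalent partition over Example objects.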
Iterator: building iterators
Generates an Iterator over a Dataset according to batch_size. The common classes are Iterator and BucketIterator. BucketIterator is a subclass of Iterator; unlike Iterator, it batches examples of equal or similar length together (sorted by the sort_key attribute), which minimizes the amount of padding.
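The padding that bucketing saves can be seen with a small pure-PyTorch sketch; the sequence lengths and the pairing into batches below are made up for illustration:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# four variable-length sequences (lengths 2, 9, 3, 10)
seqs = [torch.ones(n, dtype=torch.long) for n in (2, 9, 3, 10)]

def padding_cells(batch):
    # pad_sequence pads every sequence to the longest one in the batch;
    # count how many cells are pure padding
    padded = pad_sequence(batch, batch_first=True)
    return padded.numel() - sum(s.numel() for s in batch)

# arrival order, batched in pairs: short and long sequences share a batch
unsorted = padding_cells(seqs[:2]) + padding_cells(seqs[2:])
# sorted by length first, then batched: similar lengths share a batch
by_len = sorted(seqs, key=len)
bucketed = padding_cells(by_len[:2]) + padding_cells(by_len[2:])
# bucketed < unsorted: sorting by length wastes far fewer padded cells
```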
Key parameters:

dataset: the dataset to build the Iterator over
batch_size: the size of each batch
sort_key: the field used to sort the Examples; default None
shuffle: whether to shuffle between epochs

Key functions:

splits(): builds Iterators for several datasets at once. Its datasets parameter is a tuple of Datasets whose first element should be the training Dataset; batch_sizes is a tuple matching datasets one-to-one, giving the batch size for each dataset.
# build iterators for the training and validation sets at the same time
train_iter, val_iter = BucketIterator.splits(
    (train_dataset, test_dataset),  # the datasets to build iterators over
    batch_sizes=(8, 8),
    device=-1,  # if using a GPU, replace -1 with the GPU id
    sort_key=lambda x: len(x.content),  # the BucketIterator needs to be told what function it should use to group the data.
    sort_within_batch=False,
    repeat=False  # we pass repeat=False because we want to wrap this Iterator layer.
)
Vocab: building the vocabulary
Key parameters:

counter: a collections.Counter object holding the frequencies of the data (e.g. words)
vectors: pretrained word vectors, e.g. a torchtext.vocab.Vectors object, or another type
min_freq: the minimum frequency; words occurring fewer than min_freq times are not added to the vocabulary
CONTENT.build_vocab(train_dataset)
A look at a batch encoded with the built vocabulary:
e = list(train_iter)[0]
e.content
tensor([[ 317, 89, 569, ..., 1, 1, 1],
[15317, 673, 24260, ..., 1, 1, 1],
[ 157, 18588, 2441, ..., 1, 1, 1],
...,
[10065, 15698, 120, ..., 1, 1, 1],
[ 2691, 7237, 8676, ..., 1, 1, 1],
[ 390, 1667, 125, ..., 1, 1, 1]])
e.content.shape
torch.Size([8, 50])
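Conceptually, build_vocab counts tokens and assigns them indices, with the specials <unk> and <pad> at indices 0 and 1 (which is why the tensor above is padded with 1s). A rough stdlib sketch of that mapping; build_simple_vocab is a hypothetical stand-in, not torchtext's actual implementation:

```python
from collections import Counter

def build_simple_vocab(tokenized_texts, min_freq=1):
    # count token frequencies over the whole corpus
    counter = Counter(tok for text in tokenized_texts for tok in text)
    # specials first: <unk> -> 0, <pad> -> 1 (torchtext's defaults)
    itos = ['<unk>', '<pad>']
    # most frequent tokens first; drop tokens rarer than min_freq
    for tok, freq in counter.most_common():
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

# 'c' appears only once, so min_freq=2 drops it from the vocabulary
stoi, itos = build_simple_vocab([['a', 'b', 'a'], ['b', 'a', 'c']], min_freq=2)
```

The real CONTENT.vocab exposes the same two views as stoi (token to index) and itos (index to token).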
Running the model: BiLSTM + Attention
First, the bidirectional LSTM and the final fully connected layer:
def build_model(self):
    # initialize the embedding layer
    self.char_embeddings = nn.Embedding(self.char_size, self.char_embedding_size)
    # the embeddings are updated during training
    self.char_embeddings.weight.requires_grad = True
    # attention layer
    self.attention_layer = nn.Sequential(
        nn.Linear(self.hidden_dims, self.hidden_dims),
        nn.ReLU(inplace=True)
    )
    # self.attention_weights = self.attention_weights.view(self.hidden_dims, 1)
    # stacked bidirectional LSTM
    self.lstm_net = nn.LSTM(self.char_embedding_size, self.hidden_dims,
                            num_layers=self.rnn_layers, dropout=self.keep_dropout,
                            bidirectional=True)
    # fully connected output layer
    self.fc_out = nn.Sequential(
        # nn.Dropout(self.keep_dropout),
        # nn.Linear(self.hidden_dims, self.hidden_dims),
        # nn.ReLU(inplace=True),
        # nn.Dropout(self.keep_dropout),
        nn.Linear(self.hidden_dims, self.num_classes),
    )
The attention part
For background on attention, see: https://distill.pub/2016/augmented-rnns/
The attention layer is simplified here: the fully connected layer and ReLU are dropped to reduce the number of parameters.
def attention_net(self, lstm_out, lstm_hidden):
    '''
    :param lstm_out: [batch_size, len_seq, n_hidden * 2]
    :param lstm_hidden: [batch_size, num_layers * num_directions, n_hidden]
    :return: [batch_size, n_hidden]
    '''
    # torch.chunk splits a tensor into chunks along a dimension and returns a list of
    # tensors; if the dimension is not evenly divisible, the last chunk is smaller
    lstm_tmp_out = torch.chunk(lstm_out, 2, -1)
    # h: [batch_size, time_step, hidden_dims]
    # add the activations of the two LSTM directions
    h = lstm_tmp_out[0] + lstm_tmp_out[1]
    # sum the two directions of the last hidden state
    # [batch_size, n_hidden]
    lstm_hidden = torch.sum(lstm_hidden, dim=1)
    # [batch_size, 1, n_hidden]: insert a new dimension at position 1
    lstm_hidden = lstm_hidden.unsqueeze(1)
    # atten_score: [batch_size, 1, time_step], the attention scores
    atten_score = torch.bmm(lstm_hidden, h.transpose(1, 2))
    # atten_weight: [batch_size, 1, time_step], softmax-normalized (the attention distribution)
    atten_weight = F.softmax(atten_score, dim=-1)
    # context: [batch_size, 1, n_hidden], the weighted sum of h
    context = torch.bmm(atten_weight, h)
    result = context.squeeze(1)
    return result
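To show where attention_net fits into the overall flow, here is a self-contained toy version of the whole model with an illustrative forward pass. The class name, sizes, and batch_first layout are my assumptions for the sketch, not the author's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMAttention(nn.Module):
    # toy sizes chosen for illustration only
    def __init__(self, vocab_size=100, embed_size=16, hidden_dims=32,
                 rnn_layers=2, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_dims, num_layers=rnn_layers,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_dims, num_classes)

    def attention_net(self, lstm_out, lstm_hidden):
        # split the 2*n_hidden outputs into the two directions and add them
        fwd, bwd = torch.chunk(lstm_out, 2, -1)
        h = fwd + bwd                                    # [B, T, H]
        # sum hidden states over layers/directions to form the query
        q = torch.sum(lstm_hidden, dim=1).unsqueeze(1)   # [B, 1, H]
        score = torch.bmm(q, h.transpose(1, 2))          # [B, 1, T]
        weight = F.softmax(score, dim=-1)                # attention distribution
        return torch.bmm(weight, h).squeeze(1)           # [B, H] context vector

    def forward(self, x):                                # x: [B, T] token ids
        emb = self.embed(x)
        out, (h_n, _) = self.lstm(emb)                   # out: [B, T, 2H]
        # h_n: [num_layers*2, B, H] -> [B, num_layers*2, H], the layout
        # attention_net's docstring expects
        context = self.attention_net(out, h_n.permute(1, 0, 2))
        return self.fc(context)

# a batch shaped like the one shown earlier: 8 examples of 50 token ids
logits = BiLSTMAttention()(torch.randint(0, 100, (8, 50)))
```

The permute on h_n is the one step the snippets above leave implicit: PyTorch's LSTM returns hidden states layer-first, while attention_net expects them batch-first.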
Final results: pretty weak.
Epoch: 1 | time in 0 minutes, 47 seconds
Loss: 1.2363(train) | Acc: 49.8%(train)
Loss: 1.0654(valid) | Acc: 70.5%(valid)
Epoch: 2 | time in 0 minutes, 49 seconds
Loss: 0.9987(train) | Acc: 74.4%(train)
Loss: 0.9925(valid) | Acc: 74.0%(valid)
Epoch: 3 | time in 0 minutes, 48 seconds
Loss: 0.8978(train) | Acc: 84.4%(train)
Loss: 0.9435(valid) | Acc: 80.0%(valid)
Epoch: 4 | time in 0 minutes, 48 seconds
Loss: 0.8553(train) | Acc: 88.3%(train)
Loss: 0.9059(valid) | Acc: 84.5%(valid)
Epoch: 5 | time in 0 minutes, 47 seconds
Loss: 0.8406(train) | Acc: 89.9%(train)
Loss: 0.9185(valid) | Acc: 83.0%(valid)
Full code: https://github.com/DiegoSong/nlp_on_RM/blob/master/text_cat/BiLSTM_attention.ipynb