Notes on Chinese Text Classification in NLP (Part 2)

This post looks at using LSTM and Attention for Chinese text classification, covering the evolution from RNN to LSTM, the structure of the GRU, and how self-attention works. It builds a text classification model combining LSTM and self-attention, and describes the measures used to prevent overfitting. Experiments show that TextCNN and LSTM + self-attention each have advantages on different datasets.


Continuing from the previous note:

Deng Wentao: Notes on Chinese Text Classification in NLP (Part 1) (zhuanlan.zhihu.com)

This post covers the exploration of LSTM and Attention for text classification:

LSTM

Before getting to LSTM, let's first review the evolution from RNN to LSTM. The RNN update rule is:

$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$

where $h_{t-1}$ is the hidden state at time step $t-1$.
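As a concrete illustration, here is a minimal sketch of a single RNN step in PyTorch that matches the formula above; the dimensions and tensors are made up for the example and are not part of the models below.

import torch
import torch.nn as nn

# toy dimensions, for illustration only
input_dim, hidden_dim = 4, 3
W_x = nn.Linear(input_dim, hidden_dim, bias=True)    # W_x x_t + b
W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W_h h_{t-1}

x_t = torch.randn(1, input_dim)       # input at time step t
h_prev = torch.zeros(1, hidden_dim)   # hidden state at time step t-1

# h_t = tanh(W_h h_{t-1} + W_x x_t + b)
h_t = torch.tanh(W_h(h_prev) + W_x(x_t))
print(h_t.shape)  # torch.Size([1, 3])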

Because RNNs cannot cope well with exploding and vanishing gradients, the LSTM (long short-term memory) structure was introduced. It can be expressed with the following equations:

$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$
$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$
$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$

The forget gate selectively keeps information from the previous cell state $c_{t-1}$, while the input gate and output gate control how the incoming and outgoing information is integrated.
import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_dim, output_dim, batch_size, num_direction):
        super().__init__()
        self.hidden_size = output_dim
        self.num_direction = num_direction
        bidirectional = (num_direction == 2)
        self.hc_state = self.init_hidden(batch_size)
        self.lstm = nn.LSTM(input_dim, output_dim, batch_first=True, bidirectional=bidirectional)

    def init_hidden(self, batch_size):
        # (h_0, c_0), each of shape (num_layers * num_direction, batch_size, hidden_size)
        return (torch.zeros(1 * self.num_direction, batch_size, self.hidden_size),
                torch.zeros(1 * self.num_direction, batch_size, self.hidden_size))

    def forward(self, embeddings):
        # re-initialize the hidden/cell states to match the actual batch size
        self.hc_state = self.init_hidden(embeddings.size(0))
        outputs, (ht, ct) = self.lstm(embeddings, self.hc_state)
        return outputs, (ht, ct)

Compared with LSTM, another structure, the GRU (gated recurrent unit), was proposed in 2014. It achieves results similar to LSTM while being cheaper to compute. A GRU mainly consists of a reset gate and an update gate:

$r_t = \sigma(W_r [h_{t-1}, x_t])$
$z_t = \sigma(W_z [h_{t-1}, x_t])$
$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

reset gate: decides how much of the previous hidden state $h_{t-1}$ to reset;

update gate: combines the roles of the forget gate and the input gate.
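In PyTorch, nn.GRU can be used as an almost drop-in replacement for nn.LSTM; the main practical difference is that it carries only a hidden state h and no separate cell state c. Below is a minimal sketch of a GRU wrapper mirroring the LSTM class above; it is only an illustration, not code used in the experiments here.

import torch
import torch.nn as nn

class GRU(nn.Module):
    def __init__(self, input_dim, output_dim, num_direction=1):
        super().__init__()
        self.hidden_size = output_dim
        self.num_direction = num_direction
        self.gru = nn.GRU(input_dim, output_dim, batch_first=True,
                          bidirectional=(num_direction == 2))

    def forward(self, embeddings):
        # GRU keeps only a hidden state h_0, no cell state
        h0 = torch.zeros(1 * self.num_direction, embeddings.size(0), self.hidden_size)
        outputs, ht = self.gru(embeddings, h0)
        return outputs, ht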

Attention (mainly self-attention)

Self-attention: given $Q$ (query), $K$ (key) and $V$ (value), self-attention can be defined as

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

which is the same self-attention formula used in Transformers. Note that in the code below I use the LSTM outputs as all three of Q, K and V, and learn the attention weights from them.

import torch
import torch.nn as nn
import numpy as np

class ScaledDotProductAttention(nn.Module):

    def __init__(self, d_model, attn_dropout=0.1):
        super(ScaledDotProductAttention, self).__init__()
        self.temper = np.power(d_model, 0.5)   # scaling factor sqrt(d_model)
        self.dropout = nn.Dropout(attn_dropout)
        self.softmax = nn.Softmax(dim=2)       # normalize over the key dimension

    def forward(self, q, k, v, attn_mask=None):
        # attn: (batch, len_q, len_k), scaled dot product of queries and keys
        attn = torch.bmm(q, k.transpose(1, 2)) / self.temper
        if attn_mask is not None:
            assert attn_mask.size() == attn.size(), \
                'Attention mask shape {} mismatch ' \
                'with Attention logit tensor shape ' \
                '{}.'.format(attn_mask.size(), attn.size())
            # masked positions get -inf so they receive zero attention weight
            attn = attn.masked_fill(attn_mask, -float('inf'))

        attn = self.softmax(attn)
        attn = self.dropout(attn)
        output = torch.bmm(attn, v)
        return output, attn
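For reference, here is a quick usage sketch of the module with the LSTM outputs playing all three roles of q, k and v; the batch size, sequence length and hidden size are made up for illustration.

# toy shapes: batch=2, seq_len=5, hidden=8 (illustration only)
outputs = torch.randn(2, 5, 8)                  # e.g. LSTM outputs
attention = ScaledDotProductAttention(d_model=8, attn_dropout=0.1)
context, attn_weights = attention(outputs, outputs, outputs)
print(context.shape)        # torch.Size([2, 5, 8])
print(attn_weights.shape)   # torch.Size([2, 5, 5])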

Combining LSTM and self-attention

Based on LSTM and self-attention, we can build a text classification model as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiClassCLS(nn.Module):
    '''
    The class is an implementation of the paper A Structured Self-Attentive Sentence Embedding including regularization
    and without pruning. Slight modifications have been done for speedup
    '''
    def __init__(self, args):
        '''
        :param batch_size: {int} batch size used for training
        :param hidden_size: {int} hidden dimension of the LSTM
        :param d_a: {int} hidden dimension of the dense layer (only for the structured attention variant, commented out below)
        :param r: {int} attention hops / attention heads (only for the structured attention variant)
        :param max_len: {int} number of LSTM time steps
        :param emb_dim: {int} embedding dimension for each time step
        :param num_labels: {int} number of classes
        :param type: {0, 1, 2} 0 --> binary classification, 1 --> multiclass classification, 2 --> multilabel classification
        '''
        super(MultiClassCLS, self).__init__()
        self.num_labels = args['num_labels']
        self.batch_size = args['batch_size']
        self.hidden_size = args['hidden_size']
        self.max_len = args['max_len']

        self.type =args['type']
        self.vocab_size = args["vocab_size"]
        self.emb_dim = args["emb_dim"]
        self.att_dropout = args["att_dropout"]

        self.bidirection = args["bidirection"]
        if self.bidirection:
            self.num_direction = 2
        else:
            self.num_direction = 1

        self.embeddings = nn.Embedding(self.vocab_size, self.emb_dim)
        # optionally initialize from pre-trained word vectors passed in args (frozen during training)
        vectors = args.get("vectors")
        if vectors is not None:
            self.embeddings = nn.Embedding.from_pretrained(torch.tensor(vectors).type(torch.float32), freeze=True)

        self.lstm = LSTM(self.emb_dim, self.hidden_size, self.batch_size, self.num_direction)
        self.bn = LayerNorm(self.num_direction*self.hidden_size)
        # self.attention = Attention(self.num_direction*self.hidden_size, self.d_a, self.r)
        self.attention = ScaledDotProductAttention(self.num_direction*self.hidden_size, attn_dropout=self.att_dropout)

        self.label_layer = nn.Linear(self.num_direction*self.hidden_size, self.num_labels)

    def forward(self, sentence, reduction='mean'):
        sentence_emb = self.embeddings(sentence)

        outputs, _ = self.lstm(sentence_emb)
        outputs = self.bn(outputs)   # LayerNorm over the LSTM outputs

        # self-attention with Q = K = V = LSTM outputs
        attention_output, _ = self.attention(outputs, outputs, outputs)

        if reduction == 'mean':
            # mean pooling over the time dimension
            fc_output = torch.sum(attention_output, 1) / self.max_len
        else:
            # max pooling over the time dimension
            fc_output = torch.max(attention_output, 1)[0]

        label_logits = self.label_layer(fc_output)

        if self.type in [0, 2]:
            # binary / multilabel: independent sigmoid per label
            return torch.sigmoid(label_logits)
        if self.type == 1:
            # multiclass: log-probabilities over labels
            return F.log_softmax(label_logits, dim=1)
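For completeness, here is a minimal sketch of how the model might be instantiated and trained for one step; the hyperparameter values in args are made up for illustration and are not the settings used in the experiments below.

# hypothetical hyperparameters, for illustration only
args = {
    'num_labels': 35, 'batch_size': 32, 'hidden_size': 128, 'max_len': 50,
    'type': 1, 'vocab_size': 20000, 'emb_dim': 300, 'att_dropout': 0.5,
    'bidirection': True,
}
model = MultiClassCLS(args)

# a fake batch of token-id sequences, shape (batch_size, max_len)
sentence = torch.randint(0, args['vocab_size'], (args['batch_size'], args['max_len']))
log_probs = model(sentence)    # type == 1 -> log_softmax over labels
print(log_probs.shape)         # torch.Size([32, 35])

# with log_softmax outputs, NLLLoss is the matching training loss
labels = torch.randint(0, args['num_labels'], (args['batch_size'],))
loss = F.nll_loss(log_probs, labels)
loss.backward()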

To guard against overfitting, two measures are applied:

  • Apply LayerNorm to the LSTM outputs:
self.bn = LayerNorm(self.num_direction*self.hidden_size)
outputs = self.bn(outputs)

class LayerNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))   # learnable gain
        self.bias = nn.Parameter(torch.zeros(hidden_size))    # learnable bias

    def forward(self, input):
        # normalize over the last (feature) dimension
        mu = torch.mean(input, dim=-1, keepdim=True)
        sigma = torch.std(input, dim=-1, keepdim=True).clamp(min=self.eps)
        output = (input - mu) / sigma
        return output * self.weight.expand_as(output) + self.bias.expand_as(output)
  • Add dropout inside the self-attention:
self.dropout = nn.Dropout(attn_dropout)

I usually set attn_dropout to 0.5.

Some conclusions from comparing TextCNN and LSTM + self-attention:

Dataset 1: label size = 35

             accuracy   time (10 epochs)   cpu   memory
self-attn    0.776      37s                6-7   340M
text-cnn     0.761      93s                4-5   320M

Dataset 2: label size = 83

             accuracy   time (10 epochs)   cpu   memory
self-attn    0.926      50s                6-7   370M
text-cnn     0.938      128s               4-5   320M

On these two datasets the two approaches each have their strengths: TextCNN is slower to train but needs fewer resources, while LSTM + self-attention trains roughly 2-3x faster but uses more CPU and memory.
