A core part of text classification is obtaining an accurate semantic representation of the text. Previously, I simply ran an LSTM or GRU over the text to get its vector representation. While reading papers and GitHub projects, I noticed that top-conference papers often apply an attention mechanism when computing a text's semantic vector. Below I introduce several uses of attention for building text-level semantic representations in text classification (this post will be updated as I read more papers).
The attention mechanism in adversarialLSTM
def _attention(self, H):
    """
    Use an attention mechanism to obtain the sentence-level vector representation.
    """
    # Number of hidden units in the last LSTM layer
    hiddenSize = config.model.hiddenSizes[-1]
    # Initialize a weight vector as a trainable parameter
    W = tf.Variable(tf.random_normal([hiddenSize], stddev=0.1))
    # Apply a nonlinear activation to the Bi-LSTM output
    M = tf.tanh(H)
    # Multiply W and M. M is [batch_size, time_step, hidden_size]; reshape it to
    # [batch_size * time_step, hidden_size] before the matmul.
    # newM = [batch_size * time_step, 1]: each time step's output vector is reduced to a scalar
    newM = tf.matmul(tf.reshape(M, [-1, hiddenSize]), tf.reshape(W, [-1, 1]))
    # Reshape newM to [batch_size, time_step]
    restoreM = tf.reshape(newM, [-1, config.sequenceLength])
    # Normalize with softmax: [batch_size, time_step]
    self.alpha = tf.nn.softmax(restoreM)
    # Weighted sum of H with the learned alpha, as one batched matmul
    r = tf.matmul(tf.transpose(H, [0, 2, 1]),
                  tf.reshape(self.alpha, [-1, config.sequenceLength, 1]))
    # Squeeze the 3-D tensor down to 2-D: squeezeR = [batch_size, hidden_size]
    squeezeR = tf.squeeze(r, axis=-1)
    sentenceRepren = tf.tanh(squeezeR)
    # Optionally apply dropout to the attention output
    output = tf.nn.dropout(sentenceRepren, self.dropoutKeepProb)
    return output
- H is the Bi-LSTM output. From H to M the shape is unchanged: [batch_size, time_step, hiddenSize]. W has shape [hiddenSize].
- newM = tf.matmul(tf.reshape(M, [-1, hiddenSize]), tf.reshape(W, [-1, 1])) has shape [batch_size * time_step, 1]; restoreM then has shape [batch_size, time_step], where time_step equals config.sequenceLength.
- After the softmax layer the shape of restoreM is unchanged. In r = tf.matmul(tf.transpose(H, [0, 2, 1]), tf.reshape(self.alpha, [-1, config.sequenceLength, 1])), a [batch_size, hiddenSize, time_step] tensor is multiplied by a [batch_size, time_step, 1] tensor, giving shape [batch_size, hiddenSize, 1].
- Finally, [batch_size, hiddenSize, 1] is squeezed down to two dimensions; after tanh and dropout we obtain the attended sentence vector of shape [batch_size, hiddenSize].
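The shape bookkeeping above can be checked end-to-end with a small NumPy sketch (toy sizes batch_size=2, time_step=4, hiddenSize=3 are assumed for illustration; softmax and the batched matmul are written out by hand):

```python
import numpy as np

batch_size, time_step, hidden_size = 2, 4, 3
rng = np.random.default_rng(0)
H = rng.standard_normal((batch_size, time_step, hidden_size))
W = rng.standard_normal(hidden_size)

M = np.tanh(H)                                        # [batch_size, time_step, hidden_size]
newM = M.reshape(-1, hidden_size) @ W.reshape(-1, 1)  # [batch_size * time_step, 1]
restoreM = newM.reshape(batch_size, time_step)        # [batch_size, time_step]
# Softmax over time steps
expM = np.exp(restoreM - restoreM.max(axis=1, keepdims=True))
alpha = expM / expM.sum(axis=1, keepdims=True)
# Weighted sum: [batch, hidden, time] x [batch, time] -> [batch, hidden]
r = np.einsum('bht,bt->bh', H.transpose(0, 2, 1), alpha)
sentence = np.tanh(r)                                 # [batch_size, hidden_size]
```

Each row of `alpha` sums to 1, and `sentence` comes out with shape `[batch_size, hidden_size]`, matching the shapes derived above.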
The attention mechanism in HiGRU
# Dot-product attention
def get_attention(q, k, v, attn_mask=None):
    """
    :param q: (batch, len_q, d)
    :param k: (batch, len_k, d)
    :param v: (batch, len_k, d)
    :param attn_mask: (batch, len_q, len_k), True at positions to mask out
    :return: output (batch, len_q, d), attn (batch, len_q, len_k)
    """
    attn = torch.matmul(q, k.transpose(1, 2))
    if attn_mask is not None:
        # Fill masked positions with a large negative value so softmax zeroes them out
        attn.masked_fill_(attn_mask, -1e10)
    attn = F.softmax(attn, dim=-1)
    output = torch.matmul(attn, v)
    return output, attn
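As a quick illustration, the same dot-product attention can be written out in NumPy (a sketch, not the HiGRU code; the `-1e10` fill reproduces the masking step above, and `np.where` stands in for `masked_fill_`):

```python
import numpy as np

def np_attention(q, k, v, attn_mask=None):
    # Scores: (batch, len_q, len_k)
    attn = q @ k.transpose(0, 2, 1)
    if attn_mask is not None:
        attn = np.where(attn_mask, -1e10, attn)
    # Softmax over the key axis
    expA = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn = expA / expA.sum(axis=-1, keepdims=True)
    return attn @ v, attn

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 5, 8))
k = rng.standard_normal((2, 5, 8))
v = rng.standard_normal((2, 5, 8))
out, attn = np_attention(q, k, v)  # out: (2, 5, 8), attn: (2, 5, 5)
```

Each row of `attn` is a distribution over the keys, so `out` is a convex combination of the value vectors.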
# Get padding mask for attention
def get_attn_pad_mask(seq_q, seq_k):
    assert seq_q.dim() == 2 and seq_k.dim() == 2
    # Outer product of the two id sequences: any entry involving a padding
    # id (0) is 0, so eq(0) marks the positions to mask
    pad_attn_mask = torch.matmul(seq_q.unsqueeze(2).float(), seq_k.unsqueeze(1).float())
    pad_attn_mask = pad_attn_mask.eq(0)  # (batch, len_q, len_k)
    return pad_attn_mask
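Assuming id 0 marks padding and the mask is the outer product of the two id sequences (as the torch.matmul above suggests), the construction can be sketched in NumPy; any (i, j) entry where either token is padding becomes True, i.e. masked:

```python
import numpy as np

# Toy id sequences; 0 marks padding (an assumption for illustration)
seq = np.array([[5, 7, 0, 0],
                [3, 0, 0, 0]])
# Outer product: entry (b, i, j) is seq[b, i] * seq[b, j]
outer = seq[:, :, None].astype(float) @ seq[:, None, :].astype(float)
mask = (outer == 0)  # True wherever the query or key position is padding
```

Positions where both ids are nonzero stay False and therefore survive the softmax; the mask is then passed as `attn_mask` to the attention function.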