Python Deep Learning 12: Chinese Text Sentiment Classification with an Attention Mechanism (self-attention) in Keras (with Detailed Comments)

阡之尘埃

Last modified 2024-02-04 17:51:51


First published 2022-07-11 12:29:20

Article link: https://blog.csdn.net/weixin_46277779/article/details/125718106


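This article uses Keras to walk through Chinese text preprocessing on a takeout-review dataset, including word segmentation and stop-word removal, and builds 12 text classification models such as MLP, LSTM, CNN+LSTM and BiLSTM+Attention. The author also defines a custom attention layer, trains and evaluates the models, and shows training loss and accuracy curves and confusion matrices. The attention-based models take longer to train and appear more resistant to overfitting, although their test accuracy (for example 0.8607 for BiGRU+Attention) stays below that of the single networks.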


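Keras is a fairly high-level, heavily encapsulated library, and attention mechanisms are nowadays more often implemented in PyTorch. They can still be built in Keras with the functional API, and Keras also makes it convenient to process text and turn it into word vectors.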

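This article uses a takeout-review dataset with labels 0 and 1, where 1 is a positive review and 0 is a negative one. Twelve models are built: MLP, 1DCNN, RNN, GRU, LSTM, CNN+LSTM, TextCNN, BiLSTM, Attention, BiLSTM+Attention, BiGRU+Attention, and Attention*3 (three stacked attention layers).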

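You can use this as a starting point and combine the pieces into better models. (Leave a comment if you need the dataset and the stop-word list.)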

The code comments here are probably the most detailed of any post I have written.

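Chinese data preprocessing

Unlike English, Chinese has no spaces between words, so words cannot be split directly. We rely on the jieba library for segmentation, then drop useless punctuation and function words such as “了”, “的”, “也”, “就” and “很”. This requires a stop-word list. You can find common stop-word files online or leave a comment to ask me for one. I have a fairly complete list and a simplified one; this article uses the simplified one.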

First, here is what the data looks like

Readers who need the full code for this demo data can refer to: Takeout data

Import the packages and data, read the stop words, and use jieba to segment and clean the text

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['KaiTi']  # default font with Chinese glyphs (KaiTi here; SimHei also works)
plt.rcParams['axes.unicode_minus'] = False   # render the minus sign correctly in saved figures
import jieba
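stop_list  = pd.read_csv("stopwords_简略版.txt",index_col=False,quoting=3,
                         sep="\t",names=['stopword'], encoding='utf-8')
stop_words = set(stop_list['stopword'])  # a set gives fast membership tests


# Jieba segmentation: cut a sentence, drop stop words, re-join with spaces
def txt_cut(juzi):
    lis=[w for w in jieba.lcut(juzi) if w not in stop_words]
    return " ".join(lis)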

df=pd.read_excel('外卖.xlsx')
data=pd.DataFrame()
data['label']=df['label']
data['cutword']=df['review'].astype('str').apply(txt_cut)
data

The text is now segmented, giving the following result

Check the distribution of the label y

data['label'].value_counts().plot(kind='bar')

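There are nearly 8000 negative reviews (0) and about 4000 positive reviews (1). The classes are imbalanced, so the train/test split should use stratified sampling.

Next, the text is turned into arrays with Keras's Tokenizer class, which first maps every word to an integer index. The parameter num_words=6000 matters here: only the most frequent words, up to an index limit of 6000, are kept when texts are converted to sequences, so the model sees a vocabulary of at most about 6000 words.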

from os import listdir
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
# Fit the tokenizer on the segmented text to build the word-index dictionary
tok = Tokenizer(num_words=6000)
tok.fit_on_texts(data['cutword'].values)
print("Number of samples: ", tok.document_count)

print({k: tok.word_index[k] for k in list(tok.word_index)[:10]})
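To make the effect of num_words concrete, here is a minimal pure-Python sketch of the idea: rank words by frequency, give the most frequent word index 1, and keep only indices below num_words when converting text to sequences. The helper names build_word_index and texts_to_sequences and the toy sentences are made up for illustration; this only mimics the behavior described above and is not the Keras implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (hypothetical helper)."""
    counts = Counter(w for t in texts for w in t.split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {w: i + 1 for i, (w, _) in enumerate(ranked)}

def texts_to_sequences(texts, word_index, num_words):
    """Keep only words whose index is below num_words; silently drop the rest."""
    return [[word_index[w] for w in t.split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

# Toy, already-segmented reviews (assumed data, not from the real dataset)
toy_texts = ["好吃 好吃 快", "好吃 慢", "快 难吃"]
toy_index = build_word_index(toy_texts)
toy_seqs = texts_to_sequences(toy_texts, toy_index, num_words=3)
print(toy_index)  # {'好吃': 1, '快': 2, '慢': 3, '难吃': 4}
print(toy_seqs)   # [[1, 1, 2], [1], [2]]
```

With num_words=3 only the two most frequent words survive, which is why rare words vanish from short reviews when the cap is set too low.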

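Reviews differ in length, but training needs tensors of equal length (truncate the long, pad the short), so we must choose a maximum length, the max_words parameter. It is the number of timesteps for the recurrent networks and also the step_dim of the attention layer. If max_words is too small, many sentences lose information; if it is too large, the data matrix becomes overly sparse and the computation grows. Let's look at the distribution of sequence lengths in X: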

# Convert each review into a sequence of word indices
X= tok.texts_to_sequences(data['cutword'].values)
# Look at the distribution of sequence lengths
length=[len(i) for i in X]
v_c=pd.Series(length).value_counts()
print(v_c[v_c>20])   # only show lengths that occur more than 20 times
v_c[v_c>20].plot(kind='bar',figsize=(12,5))

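Most sentences contain no more than 10 words, and length 5 is the most common. Here max_words=20 is chosen, so every sentence is padded or truncated to a vector of length 20. The labels y are extracted as well.

# Pad or truncate the sequences to the same length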
X= sequence.pad_sequences(X, maxlen=20)
Y=data['label'].values
print("X.shape: ", X.shape)
print("Y.shape: ", Y.shape)
#X=np.array(X)
#Y=np.array(Y)
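As a rough illustration of "truncate the long, pad the short", the sketch below reproduces what pad_sequences does with its default settings (padding='pre' and truncating='pre'): zeros are added at the front and overlong sequences lose their earliest tokens. The function name pad_pre is hypothetical and only covers this default case.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items (hypothetical helper)."""
    seq = list(seq)[-maxlen:]                   # truncating='pre': drop tokens from the front
    return [value] * (maxlen - len(seq)) + seq  # padding='pre': zeros go in front

print(pad_pre([5, 8, 2], maxlen=5))              # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Index 0 is never assigned to a word by the Tokenizer, so it is free to act as the padding value.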

Then split into training and test sets and check the shapes:

X_train, X_test, Y_train, Y_test =  train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=0)
X_train.shape,X_test.shape,Y_train.shape, Y_test.shape
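The stratify=Y argument is what keeps the 2:1 class ratio the same in both splits. A minimal sketch of the idea, assuming a simple per-class shuffle and cut (scikit-learn's own algorithm differs in detail; stratified_split and the toy labels are made up for illustration):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), splitting every class at the same ratio."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # shuffle within the class only
        n_test = round(len(idx) * test_size)   # same fraction taken from every class
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

toy_labels = [0] * 80 + [1] * 40   # toy data with the same 2:1 imbalance
train_idx, test_idx = stratified_split(toy_labels)
test_pos = sum(toy_labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_pos)  # 96 24 8
```

The test split holds 8 positives out of 24, the same one-third share as the full toy set.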

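One-hot encode y, keeping a copy of the original test labels (Y_test_original) for evaluation later. Then look at the first three rows of X and y.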

Y_test_original=Y_test.copy()
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)

print(X_train[:3])
print(Y_test[:3])
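For reference, one-hot encoding and its inverse can be sketched in a few lines of plain Python. The helpers one_hot and argmax_rows are hypothetical stand-ins for to_categorical and np.argmax(..., axis=1); they show why the integer labels in Y_test_original are kept: the evaluation code later compares integer class ids, not one-hot rows.

```python
def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows (stand-in for to_categorical)."""
    return [[1.0 if k == y else 0.0 for k in range(num_classes)] for y in labels]

def argmax_rows(rows):
    """Recover the class id from each row (stand-in for np.argmax(rows, axis=1))."""
    return [max(range(len(r)), key=r.__getitem__) for r in rows]

encoded = one_hot([0, 1, 1], num_classes=2)
print(encoded)               # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(argmax_rows(encoded))  # [0, 1, 1]
```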

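Building the neural networks

Keras has no built-in layer for this simple weighted-sum (pooling) attention over RNN timesteps (its built-in Attention, AdditiveAttention and MultiHeadAttention layers work on query/value inputs), so we define one ourselves:

# Custom attention layer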
from keras import initializers, constraints,activations,regularizers
from keras import backend as K
from keras.layers import Layer
class Attention(Layer):
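    # Input: a 3D tensor (batch_size, timesteps, units), e.g. an RNN run with return_sequences=True;
    #        step_dim is the number of timesteps, i.e. the padded sequence length.
    # Output: not the attention weights, but the weighted sum of the timestep vectors.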
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[-1],),initializer=self.init,name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),initializer='zero', name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,constraint=self.b_constraint)
        else:
            self.b = None
        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None     # later layers do not need the mask, so simply return None

    def call(self, x, mask=None):
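        features_dim = self.features_dim    # size of the last axis (units), recorded in build()
        step_dim = self.step_dim            # supplied by the caller; must equal input_shape[1], the number of timesteps
        
        # Flatten x to (batch_size*timesteps, units) and multiply by W to get one score per timestep,
        # then reshape back to (batch_size, timesteps) so that each sample is normalized on its own.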

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),K.reshape(self.W, (features_dim, 1))), (-1, step_dim))
        if self.bias:
            eij += self.b        
        eij = K.tanh(eij)    # squash the scores with tanh (the usual RNN activation) before the softmax below
        a = K.exp(eij)
        if mask is not None:    # timesteps masked by an earlier layer must not contribute, so zero their weights
            a *= K.cast(mask, K.floatx())   # cast the boolean mask to the float dtype so it can be multiplied

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)      # axis defaults to -1, appending a trailing axis, e.g. shape (3,) becomes (3, 1)
        # now a.shape = (batch_size, timesteps, 1) and x.shape = (batch_size, timesteps, units)
        weighted_input = x * a    
        # weighted_input has shape (batch_size, timesteps, units); each timestep vector is scaled by its weight
        # summing over axis=1 gives a tensor of shape (batch_size, units)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):    # the returned context vector has shape (batch_size, units)
        return input_shape[0],  self.features_dim

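Don't worry about how complicated this class looks; there is no need to read it closely. Later it is simply used like a function.

Next, import the commonly used Keras layers and define some parameters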
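For readers who do want to see what the layer computes, the following is a small pure-Python sketch of the same arithmetic for a single sample: a score e_t = tanh(x_t · W + b_t) per timestep, a softmax over timesteps, and a weighted sum of the timestep vectors. The function name attention_pool and the toy numbers are assumptions for illustration; masking and the K.epsilon() term in the denominator are left out.

```python
import math

def attention_pool(x, W, b):
    """x: timesteps x units (one sample), W: length units, b: length timesteps."""
    # one scalar score per timestep: e_t = tanh(x_t . W + b_t)
    e = [math.tanh(sum(xi * wi for xi, wi in zip(x_t, W)) + b_t)
         for x_t, b_t in zip(x, b)]
    # softmax across timesteps turns the scores into weights that sum to 1
    exp_e = [math.exp(v) for v in e]
    total = sum(exp_e)
    a = [v / total for v in exp_e]
    # weighted sum of the timestep vectors gives one vector of length units
    units = len(W)
    context = [sum(a_t * x_t[j] for a_t, x_t in zip(a, x)) for j in range(units)]
    return a, context

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy input: 3 timesteps, 2 units
W = [0.5, -0.5]                            # toy weights, not trained values
b = [0.0, 0.0, 0.0]
weights, context = attention_pool(x, W, b)
print(weights)   # three weights that sum to 1, largest for the first timestep
print(context)   # a single vector of length 2
```

The output has no time axis any more, which is why the layer can feed a Dense layer directly in the BiLSTM+Attention model.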

from keras.preprocessing import sequence
from keras.models import Sequential,Model
from keras.layers import Dense,Input, Dropout, Embedding, Flatten,MaxPooling1D,Conv1D,SimpleRNN,LSTM,GRU,Multiply
from keras.layers import Bidirectional,Activation,BatchNormalization
from keras.layers import concatenate
seed = 10
np.random.seed(seed)  # fix the NumPy random seed
# vocabulary capped at 6000 word indices; each sentence padded or truncated to 20 tokens
top_words=6000  
max_words=20
num_labels=2  # binary classification

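Next comes the model-building function. It is fairly long because all 12 models are defined together for code reuse, but the block for each model is clearly separated: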

def build_model(top_words=top_words,max_words=max_words,num_labels=num_labels,mode='LSTM',hidden_dim=[32]):
    if mode=='RNN':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(SimpleRNN(32))  
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='MLP':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(Flatten())
        model.add(Dense(256, activation="relu"))  
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='LSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(LSTM(32))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='GRU':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(GRU(32))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='CNN':        # 1D convolution
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(Conv1D(filters=32, kernel_size=3, padding="same",activation="relu"))
        model.add(MaxPooling1D(pool_size=2))
        model.add(Flatten())
        model.add(Dense(256, activation="relu"))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='CNN+LSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))    
        model.add(Conv1D(filters=32, kernel_size=3, padding="same",activation="relu"))
        model.add(MaxPooling1D(pool_size=2))
        model.add(LSTM(64))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='BiLSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Bidirectional(LSTM(64)))
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.25))
        model.add(Dense(num_labels, activation='softmax'))
    # The networks below are built with the Functional API
    elif mode=='TextCNN':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
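        ## Embedding frozen via trainable=False; note that no pretrained weights are passed in here, so it keeps its random initialization
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        ## convolution window sizes of 3, 4 and 5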
        cnn1 = Conv1D(32, 3, padding='same', strides = 1, activation='relu')(layer)
        cnn1 = MaxPooling1D(pool_size=2)(cnn1)
        cnn2 = Conv1D(32, 4, padding='same', strides = 1, activation='relu')(layer)
        cnn2 = MaxPooling1D(pool_size=2)(cnn2)
        cnn3 = Conv1D(32, 5, padding='same', strides = 1, activation='relu')(layer)
        cnn3 = MaxPooling1D(pool_size=2)(cnn3)
        # concatenate the outputs of the three branches
        cnn = concatenate([cnn1,cnn2,cnn3], axis=-1)
        flat = Flatten()(cnn) 
        drop = Dropout(0.2)(flat)
        main_output = Dense(num_labels, activation='softmax')(drop)
        model = Model(inputs=inputs, outputs=main_output)
        
    elif mode=='Attention':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul =  Multiply()([layer, attention_probs])
        mlp = Dense(64)(attention_mul) # plain fully connected layer
        fla=Flatten()(mlp)
        output = Dense(num_labels, activation='softmax')(fla)
        model = Model(inputs=[inputs], outputs=output)  
    elif mode=='Attention*3':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul =  Multiply()([layer, attention_probs])
        mlp = Dense(32,activation='relu')(attention_mul) 
        attention_probs = Dense(32, activation='softmax', name='attention_vec1')(mlp)
        attention_mul =  Multiply()([mlp, attention_probs])
        mlp2 = Dense(32,activation='relu')(attention_mul) 
        attention_probs = Dense(32, activation='softmax', name='attention_vec2')(mlp2)
        attention_mul =  Multiply()([mlp2, attention_probs])
        mlp3 = Dense(32,activation='relu')(attention_mul)           
        fla=Flatten()(mlp3)
        output = Dense(num_labels, activation='softmax')(fla)
        model = Model(inputs=[inputs], outputs=output)      
        
    elif mode=='BiLSTM+Attention':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        bilstm = Bidirectional(LSTM(64, return_sequences=True))(layer)  # return_sequences=True keeps the 3D (batch, timesteps, units) output
        bilstm = Bidirectional(LSTM(64, return_sequences=True))(bilstm)
        layer = Dense(256, activation='relu')(bilstm)
        layer = Dropout(0.2)(layer)
        ## attention layer
        attention = Attention(step_dim=max_words)(layer)
        layer = Dense(128, activation='relu')(attention)
        output = Dense(num_labels, activation='softmax')(layer)
        model = Model(inputs=inputs, outputs=output)  
        
    elif mode=='BiGRU+Attention':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul =  Multiply()([layer, attention_probs])
        mlp = Dense(64,activation='relu')(attention_mul) # plain fully connected layer
        #bat=BatchNormalization()(mlp)
        #act=Activation('relu')
        gru=Bidirectional(GRU(32))(mlp)
        mlp = Dense(16,activation='relu')(gru)
        output = Dense(num_labels, activation='softmax')(mlp)
        model = Model(inputs=[inputs], outputs=output) 
        
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

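The first few simple single models use the simplest building-block style (Sequential). The more complex ones further down all use the Functional API.

Next, define functions for the loss and accuracy plots, the confusion matrix, and the other evaluation metrics

# Loss/accuracy plots, confusion matrix and related metrics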
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score
def plot_loss(history):
    # plot training and validation loss (left) and accuracy (right)
    plt.subplots(1,2,figsize=(10,3))
    plt.subplot(121)
    loss = history.history["loss"]
    epochs = range(1, len(loss)+1)
    val_loss = history.history["val_loss"]
    plt.plot(epochs, loss, "bo", label="Training Loss")
    plt.plot(epochs, val_loss, "r", label="Validation Loss")
    plt.title("Training and Validation Loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.legend()  
    plt.subplot(122)
    acc = history.history["accuracy"]
    val_acc = history.history["val_accuracy"]
    plt.plot(epochs, acc, "b-", label="Training Acc")
    plt.plot(epochs, val_acc, "r--", label="Validation Acc")
    plt.title("Training and Validation Accuracy")
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.tight_layout()
    plt.show()
def plot_confusion_matrix(model,X_test,Y_test_original):
    # predicted probabilities
    prob=model.predict(X_test) 
    # predicted classes
    pred=np.argmax(prob,axis=1)
    # cross-tabulate actual vs predicted labels to get the confusion matrix
    table = pd.crosstab(Y_test_original, pred, rownames=['Actual'], colnames=['Predicted'])
    #print(table)
    sns.heatmap(table,cmap='Blues',fmt='.20g', annot=True)
    plt.tight_layout()
    plt.show()
    # per-class precision, recall and F1
    print(classification_report(Y_test_original, pred))
    # Cohen's kappa
    print("Cohen's kappa: "+str(cohen_kappa_score(Y_test_original, pred)))
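Cohen's kappa compares the observed agreement between predictions and labels with the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e), which makes it more informative than raw accuracy on an imbalanced dataset like this one. A minimal sketch of the calculation, with a hypothetical helper cohen_kappa and made-up toy labels (the real code above uses scikit-learn's cohen_kappa_score):

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length lists of class ids (illustrative helper)."""
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    # observed agreement: fraction of samples where prediction equals label
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # chance agreement: product of the two marginal frequencies, summed over classes
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)   # undefined when p_e == 1

y_true = [0, 0, 0, 0, 1, 1, 1, 1]   # toy labels, not from the real dataset
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
print(cohen_kappa(y_true, y_pred))  # 0.5
```

Here accuracy is 0.75, but half of that could be reached by chance, so kappa is only 0.5.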

Define the training function

# Training function
def train_fuc(max_words=max_words,mode='BiLSTM+Attention',batch_size=32,epochs=10,hidden_dim=[32],show_loss=True,show_confusion_matrix=True):
    # build the model
    model=build_model(max_words=max_words,mode=mode)
    model.summary()
    history=model.fit(X_train, Y_train,batch_size=batch_size,epochs=epochs,validation_split=0.2, verbose=1)
    print('------------ training finished ------------')
    # evaluate on the test set
    loss, accuracy = model.evaluate(X_test, Y_test)
    print("Test accuracy = {:.4f}".format(accuracy))
    
    if show_loss:
        plot_loss(history)
    if show_confusion_matrix:
        plot_confusion_matrix(model=model,X_test=X_test,Y_test_original=Y_test_original)

Set some parameters

top_words=6000
max_words=20
batch_size=32
epochs=4
show_confusion_matrix=True
show_loss=True
mode='MLP'

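Only 4 epochs are used, which is few, because the dataset is small and simple and each sentence is short, so the single models above overfit easily. Training for just 4 epochs also saves time.

Now train and evaluate the models one by one: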

MLP

train_fuc(mode='MLP',batch_size=batch_size,epochs=epochs)

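As shown, the output lists the training loss and accuracy and the validation loss and accuracy for every epoch and plots them, then reports the test-set accuracy, draws the confusion matrix, and prints the metrics derived from it along with Cohen's kappa. The MLP reaches a test accuracy of 0.8795.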

1DCNN

# Embedding expects 2D input of shape (samples, max_words) and itself outputs the 3D tensor that Conv1D and the RNN layers need, so X is passed as-is without reshaping
train_fuc(mode='CNN',batch_size=batch_size,epochs=epochs)

Much the same: the test accuracy is 0.8882.

RNN

model='RNN' 
train_fuc(mode=model,epochs=epochs)

The results are similar, so they are not shown in full. Test accuracy is 0.8912.

LSTM

train_fuc(mode='LSTM',epochs=epochs)

The results are similar and not shown. Test accuracy is 0.8966 (the highest so far).

GRU

train_fuc(mode='GRU',epochs=epochs)

Test accuracy is 0.8912.

CNN+LSTM

train_fuc(mode='CNN+LSTM',epochs=epochs)

Test accuracy is 0.8916.

BiLSTM

train_fuc(mode='BiLSTM',epochs=epochs)

Test accuracy is 0.8816.

TextCNN

train_fuc(mode='TextCNN',epochs=30)

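The number of epochs is increased here because the models from this point on are more complex, overfit less readily, and need more epochs to train.

Test accuracy is 0.8474.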

Attention

train_fuc(mode='Attention',epochs=100)

Test accuracy is 0.8207.

BiLSTM+Attention

train_fuc(mode='BiLSTM+Attention',epochs=30)

Test accuracy is 0.8236.

BiGRU+Attention

train_fuc(mode='BiGRU+Attention',epochs=100)

Test accuracy is 0.8607.

Attention*3

train_fuc(mode='Attention*3',epochs=50)

Test accuracy is 0.8057.
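To compare the runs at a glance, the test accuracies reported above can be collected and ranked with a few lines of Python. The numbers are copied from this article; keep in mind that the models were trained for different numbers of epochs (4 for the single networks, 30 to 100 for the later ones), so this is a summary rather than a controlled comparison.

```python
# Test accuracies as reported in this article
reported = {
    "MLP": 0.8795, "1DCNN": 0.8882, "RNN": 0.8912, "LSTM": 0.8966,
    "GRU": 0.8912, "CNN+LSTM": 0.8916, "BiLSTM": 0.8816, "TextCNN": 0.8474,
    "Attention": 0.8207, "BiLSTM+Attention": 0.8236,
    "BiGRU+Attention": 0.8607, "Attention*3": 0.8057,
}

# Sort from best to worst and print an aligned two-column listing
ranked = sorted(reported.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<18}{acc:.4f}")
```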

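Clearly, the models with attention are less prone to overfitting. The single recurrent networks start overfitting after only four epochs, whereas the attention models need many more epochs: their validation accuracy keeps rising and their validation loss keeps falling.

Although their final test accuracy is lower than that of the single networks above, my guess is that this comes from too few training epochs and too little data. The reviews in this takeout dataset are very short and simple, and the sample size is small. Another possible factor is that the Functional API models above freeze their randomly initialized Embedding layer (trainable=False), so their word vectors are never learned.

Also, compared with the Transformer, the attention here uses no residual connections, layer normalization or similar techniques, no encoder-decoder structure, and no deep stacking (the original Transformer has 18 attention sublayers across its 6 encoder and 6 decoder layers).

In the future, attention mechanisms can be trained and tested on larger and more complex datasets, with bigger and deeper networks and more hyperparameter tuning. That of course requires more computing resources (time to buy a better computer).....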