Bi-LSTM + Attention Model Study Notes (Part 1)

LLC74

Last modified: 2024-09-12 14:10:15

First published: 2023-11-23 09:28:28

Article link: https://blog.csdn.net/weixin_73989383/article/details/134565296

This week I studied the Bi-LSTM + attention model, mainly by working through the following paper:

https://aclanthology.org/P16-2034.pdf

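1. Introduction

This week I studied Att-BLSTM, a neural network for relation classification. The model combines a neural attention mechanism with a bidirectional long short-term memory network (BLSTM) to capture the most important semantic information in a sentence. Its advantage is that it does not rely on any features derived from lexical resources or NLP systems.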

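2. Model

The model consists of the following five parts:
(1) Input layer: feeds the sentence into the model;

(2) Embedding layer: maps each word to a low-dimensional vector;
(3) LSTM layer: uses the output of step (2) to obtain high-level features;
(4) Attention layer: produces a weight vector and merges the word-level features of all time steps into a sentence-level feature vector by multiplying them by that weight vector;
(5) Output layer: uses the sentence-level feature vector for relation classification.

After getting to know the model, I looked into each of these five parts.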
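To make step (4) concrete, here is a minimal NumPy sketch of the attention pooling as I understand it from the paper: M = tanh(H), alpha = softmax(w^T M), r = H alpha^T, h* = tanh(r). The function name `att_blstm_attention`, the sizes, and the random values of H and w are assumptions for illustration only. In the real model H comes from the BLSTM and w is learned.

```python
import numpy as np

def att_blstm_attention(H, w):
    """Attention pooling over BLSTM outputs.

    H: (d, T) matrix whose columns are the word-level features h_1..h_T.
    w: (d,) attention vector (trainable in the real model, random here).
    Returns the sentence vector h* of shape (d,) and the weights alpha of shape (T,).
    """
    M = np.tanh(H)                                  # (d, T)
    scores = w @ M                                  # (T,) one score per time step
    scores = scores - scores.max()                  # subtract max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over time steps
    r = H @ alpha                                   # (d,) weighted sum of the columns of H
    return np.tanh(r), alpha

rng = np.random.default_rng(0)
d, T = 6, 4                      # hypothetical feature size and sentence length
H = rng.normal(size=(d, T))      # stand-in for BLSTM outputs
w = rng.normal(size=d)           # stand-in for the learned attention vector
h_star, alpha = att_blstm_attention(H, w)
print(alpha.shape, h_star.shape)
```

The weights in `alpha` sum to 1, so `r` is a weighted average of the word-level features.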

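3. Partial Analysis and Personal Understanding

3.1. Word Embedding

(The following is based on the Bilibili video "什么是词嵌入，Word Embedding算法", i.e. "What is word embedding: the Word Embedding algorithm".)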

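a. Overview:

Word embedding is a technique that maps the words or phrases of a vocabulary to fixed-length vectors, turning high-dimensional sparse vectors into low-dimensional dense ones. To show the relationships between words more clearly, a dimensionality-reduction algorithm can project the embedding vectors down to two dimensions so that the words can be plotted on a plane. When this is done, words with similar meanings turn out to lie closer together.

In addition, word embedding vectors can describe semantic relations between words through arithmetic on the vectors, for example:

vec("king") - vec("man") ≈ vec("queen") - vec("woman")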
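The relation above can be checked numerically. The sketch below uses hand-made 2-D vectors (not trained embeddings, purely an assumption for illustration) in which one axis loosely encodes "royalty" and the other "maleness", and compares the two difference vectors with cosine similarity.

```python
import numpy as np

# Hand-made 2-D vectors, purely illustrative: axis 0 ~ "royalty", axis 1 ~ "maleness".
emb = {
    "king":  np.array([0.9, 0.9]),
    "man":   np.array([0.1, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

diff_male = emb["king"] - emb["man"]        # what turns "man" into "king"
diff_female = emb["queen"] - emb["woman"]   # what turns "woman" into "queen"
analogy_score = cosine(diff_male, diff_female)
print(analogy_score)  # close to 1 when the two differences point the same way
```

With real embeddings the two differences are only approximately parallel, so the score would be high but below 1.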

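A word embedding algorithm such as word2vec, fastText, or GloVe builds a general-purpose embedding matrix. The number of rows equals the number of words in the vocabulary, and the number of columns equals the dimension of each word vector.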

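b. The embedding process

Suppose the vocabulary contains 5000 words and each word is represented by a 128-dimensional vector. Together they form a word embedding matrix E of size 5000*128.

Take the sentence s = "我喜欢学习数学" ("I like studying mathematics"). Word embedding represents every word in s as a 128-dimensional vector.

The steps are:

1. Split the sentence into words: "我" (I), "喜欢" (like), "学习" (study), "数学" (mathematics).

2. Encode each word as a one-hot vector, which turns the sentence into a 4*5000 one-hot matrix V.

(Note: 5000 is the vocabulary size and 128 is the feature dimension of each word.)

3. Multiply the one-hot matrix by the embedding matrix:

V (4*5000) * E (5000*128)

The result is a 4*128 matrix, which is the embedded representation of the sentence "我喜欢学习数学".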
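The multiplication above simply selects rows of E. The following NumPy sketch shows that multiplying the one-hot matrix by E gives the same result as looking up the corresponding rows directly. The random matrix E and the four word indices are assumptions standing in for a trained embedding matrix and a real vocabulary.

```python
import numpy as np

vocab_size, embed_dim = 5000, 128
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))   # stand-in for a trained embedding matrix

# Hypothetical vocabulary indices for the four words of the sentence.
word_ids = np.array([17, 256, 1024, 4999])

# One-hot matrix V: one row per word, a single 1 in the column of that word's index.
V = np.zeros((len(word_ids), vocab_size))
V[np.arange(len(word_ids)), word_ids] = 1.0

sentence_by_matmul = V @ E          # (4, 5000) @ (5000, 128) -> (4, 128)
sentence_by_lookup = E[word_ids]    # direct row lookup, no multiplication needed

print(sentence_by_matmul.shape)
print(np.allclose(sentence_by_matmul, sentence_by_lookup))
```

This is why embedding layers are usually implemented as a table lookup rather than an explicit matrix product.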

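Summary: the embedding matrix is the key to word embedding. One-hot encodings are not transferable, because different corpora produce different one-hot encodings, whereas an embedding matrix is general-purpose and can be reused across different NLP tasks.

Open question: how is the embedding matrix obtained? (I have not studied this yet.)

The code implementation follows: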

# author: 渣渣的夏天
# Source: https://blog.csdn.net/qq_39564555/article/details/105882001
# 1.1 Word-level one-hot encoding
import numpy as np
# A small hand-made dataset
samples = ['The cat sat on the mat.','The dog ate my homework.']

# Build an index of all tokens in the data, stored in a dictionary
token_index = {}
for sample in samples:
    # Tokenize each sample with split()
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word
            # (index 0 is not assigned to any word)
            token_index[word] = len(token_index) + 1

# Vectorize the samples,
# considering only the first max_length words of each sample
max_length = 10

# Store the result in results
results = np.zeros((len(samples),max_length,max(token_index.values()) + 1))
for i,sample in enumerate(samples):
    for j,word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        # Set the single matching element to 1
        results[i,j,index] = 1.
# Inspect the index dictionary
print(token_index)
print(results[1,1])  # encoding of the second word of the second sample


# 1.2 Character-level one-hot encoding
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable  # all printable ASCII characters

# Build an index dictionary covering every printable character
token_index = dict(zip(characters, range(1, len(characters) + 1)))
max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
print(token_index)  # inspect the index dictionary
print(results[1,1])  # encoding of the second character of the second sample


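# 1.3 Word-level one-hot encoding with Keras
# Note: keras.preprocessing.text.Tokenizer belongs to Keras 2.x and is deprecated in newer releases.
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Create a tokenizer
# that only considers the 1000 most common words
tokenizer = Tokenizer(num_words=1000)
# Build the word index
tokenizer.fit_on_texts(samples)

# Turn the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)

# The binary one-hot representation can also be obtained directly;
# the tokenizer supports other vectorization modes besides one-hot encoding
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# Recover the word index
word_index = tokenizer.word_index
print(word_index)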


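# 2.1 Word embedding model

# 2.2 Learning embeddings with an Embedding layer

# 2.3 Instantiating an Embedding layer
from keras.layers import Embedding

# The Embedding layer takes at least two arguments:
# the number of tokens (here 1000, i.e. the largest word index + 1) and the embedding dimension (here 64)
embedding_layer = Embedding(1000, 64)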

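# 2.4 Loading the IMDB data
from keras.datasets import imdb
from keras import preprocessing
# Number of words used as features, i.e. keep only the most frequent words
max_features = 10000
# Cut each review off after this many words
# (all of them among the max_features most common words)
maxlen = 20
# Load the dataset as lists of integers
# (raw string so the backslashes in the Windows path are not read as escape sequences)
(x_train, y_train), (x_test, y_test) = imdb.load_data(path=r'F:\Desktop\data\imdb.npz', num_words=max_features)
# Turn the lists of integers into a 2-D integer tensor of shape (samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)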


# 2.5 Building, training, and fitting the Keras model
# Steps:
# 1) Import Sequential, Flatten, Dense, and Embedding
# 2) Define a sequential model
# 3) Add an Embedding layer with 10000 tokens, dimension 8, and input length maxlen
# 4) Add a Flatten layer
# 5) Add a fully connected layer with output dimension 1 and 'sigmoid' activation
# 6) Compile the model with the 'rmsprop' optimizer, 'binary_crossentropy' loss, and 'acc' metric
# 7) Inspect the architecture with .summary()
# 8) Fit the model with 10 epochs, batch_size 32, and validation_split 0.2
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
# Specify the maximum input length of the Embedding layer so that its output can be flattened later.
model.add(Embedding(10000, 8, input_length=maxlen))
# Flatten the 3-D embedding tensor into a 2-D tensor of shape (samples, maxlen * 8)
model.add(Flatten())

# Add the classifier
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

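3.2. LSTM (Long Short-Term Memory Network)

(The following is based on the Bilibili video "3.结合例子理解LSTM", i.e. "Understanding LSTM through an example".)

Advantage of the LSTM: a plain RNN tries to remember everything, whether important or not. An LSTM has a memory cell that selectively keeps important information and filters out what is useless.

a. Forward propagation of the LSTM

(Notation: C is the memory cell, h the hidden state, σ a gate unit (sigmoid), f the forget gate, i the update (input) gate, o the output gate, and W the corresponding weights.)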

The corresponding formulas:

(Correction: g(t) in the formulas should read C(t).)
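For reference, the standard LSTM forward equations can be written with the notation listed above. This is the common textbook formulation and may differ in detail from the formulas in the video. The input $x_t$, the biases $b$, the candidate memory $\tilde{C}_t$, and the element-wise product $\odot$ are symbols I am assuming here.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```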

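Intuition, with an example: suppose the previous exam was advanced mathematics and the current exam is linear algebra.

The premise is that brain capacity is limited.

1. Forget gate: forget part of what was memorized for advanced mathematics.

The memory is passed through a sigmoid, which maps every value into the range 0 to 1. Memory whose gate value is close to 0 is forgotten, and memory whose gate value is close to 1 is kept, for example general computational skills.

2. Update gate: filter out the material in the textbook that is irrelevant to the exam.

So the new memory = computational skills + linear algebra exam knowledge.

3. Output gate: output the memory needed to answer the questions.

Summary: the first gate decides what to forget, the second gate decides which new knowledge to remember, and the third gate decides what to output.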
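The three gates can be sketched as a single LSTM time step in NumPy. This is a minimal illustration using the standard formulation, not the code from the video. The function name `lstm_step`, the sizes, and the random weights are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step; W and b hold the parameters of the four blocks f, i, C, o."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate: what to drop from C_prev
    i = sigmoid(W["i"] @ z + b["i"])          # update gate: how much new content to write
    C_tilde = np.tanh(W["C"] @ z + b["C"])    # candidate memory
    o = sigmoid(W["o"] @ z + b["o"])          # output gate: what to expose as h_t
    C_t = f * C_prev + i * C_tilde            # new memory cell
    h_t = o * np.tanh(C_t)                    # new hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                          # hypothetical sizes
W = {k: rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in)) for k in "fiCo"}
b = {k: np.zeros(n_hidden) for k in "fiCo"}

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):         # run over a sequence of 5 random inputs
    h, C = lstm_step(x_t, h, C, W, b)
print(h.shape, C.shape)
```

Because every gate is a sigmoid, each entry of f, i, and o lies strictly between 0 and 1, which is what makes "partly forget" and "partly keep" possible.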

Model implementation

# author: 噜噜啦啦咯
# Source: https://blog.csdn.net/weixin_52910499/article/details/124693212
# Import the required libraries
import matplotlib.pyplot as plt
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,Dense,Dropout
from numpy import concatenate
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from math import sqrt

# Set the random seed
import tensorflow as tf
tf.random.set_seed(2)

# Load the dataset
qy_data=read_csv(r'C:\Users\HUAWEI\Desktop\abc.csv',parse_dates=['num'],index_col='num')
qy_data.index.name='num'  # name the index column

Print the first five rows to inspect the data

# author: 噜噜啦啦咯
# Source: https://blog.csdn.net/weixin_52910499/article/details/124693212
# Data processing
# Get the data of the DataFrame as an array
values = qy_data.values
# Make sure all values are floats
values = values.astype('float32')

# Normalization
# Use MinMaxScaler to scale all values into [0,1], which speeds up convergence.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)

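Inspect the normalized data

Convert the time series into a supervised learning problem

The time-series data is converted into a supervised-learning dataset. For example, [[10],[11],[12],[13],[14]] becomes [[0,10],[10,11],[11,12],[12,13],[13,14]]: the previous value serves as the input and the next value as the corresponding output. (The first sample has no previous value. It is shown as 0 here, whereas the code below produces NaN for it and drops that row.)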

# author: 噜噜啦啦咯
# Source: https://blog.csdn.net/weixin_52910499/article/details/124693212
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
 
reframed = series_to_supervised(scaled, 2, 1)
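To see the framing on the toy series from the text, here is a minimal NumPy sketch of the same idea. The helper `make_supervised` is a hypothetical name of mine, not part of the original code, and like `dropnan=True` it skips rows that lack a full history.

```python
import numpy as np

def make_supervised(series, n_in=1):
    """Pair each value with its n_in predecessors; rows without a full history are skipped."""
    series = np.asarray(series, dtype=float)
    # Row for time t holds [x(t-n_in), ..., x(t-1), x(t)]: inputs first, target last.
    rows = [series[t - n_in:t + 1] for t in range(n_in, len(series))]
    return np.array(rows)

pairs = make_supervised([10, 11, 12, 13, 14], n_in=1)
print(pairs)
```

Each row reads "previous value, next value", so the last column is the target and the columns before it are the inputs.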

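Print the first five rows

Split into training and test sets

# Split into training and test sets
values = reframed.values
trainNum = int(len(values) * 0.7)
train = values[:trainNum,:]
test = values[trainNum:, :]
# Assumption (this step is not shown in the post): the last column is the target.
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))  # (samples, timesteps=1, features)
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))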

Check the dimensions after the split

print(train_X.shape, train_y.shape)
print(test_X.shape, test_y.shape)

Build the LSTM model

# Initialize the LSTM model and set the number of units, the number of epochs, the optimizer, and so on
model = Sequential()
model.add(LSTM(27, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dropout(0.5))
model.add(Dense(15,activation='relu'))  # activation function
model.compile(loss='mae', optimizer='adam')
history = model.fit(train_X, train_y, epochs=95, batch_size=2, validation_data=(test_X, test_y), verbose=2,shuffle=False)

This produces the loss curve

Model prediction

y_predict = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))

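Plot the results

# inv_y and inv_y_predict (test targets and predictions mapped back to the original scale) are not defined in this excerpt; see the source article.
plt.figure(figsize=(10,8),dpi=150)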
plt.plot(inv_y,color='red',label='Original')
plt.plot(inv_y_predict,color='green',label='Predict')
plt.xlabel('the number of test data')
plt.ylabel('Soil moisture')
plt.legend()
plt.show()

This produces the prediction plot

Regression evaluation metrics

# author: 噜噜啦啦咯
# Source: https://blog.csdn.net/weixin_52910499/article/details/124693212
# calculate MSE (mean squared error)
mse=mean_squared_error(inv_y,inv_y_predict)
# calculate RMSE (root mean squared error)
rmse = sqrt(mean_squared_error(inv_y, inv_y_predict))
# calculate MAE (mean absolute error)
mae=mean_absolute_error(inv_y,inv_y_predict)
# calculate R square
r_square=r2_score(inv_y,inv_y_predict)
print('MSE: %.6f' % mse)
print('RMSE: %.6f' % rmse)
print('MAE: %.6f' % mae)
print('R_square: %.6f' % r_square)
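To make clear what these four metrics measure, the sketch below computes them by hand with NumPy. The values of `y_true` and `y_pred` are made up for illustration and have nothing to do with the dataset above.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # made-up targets
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # made-up predictions

err = y_true - y_pred
mse = np.mean(err ** 2)                   # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error, same unit as y
mae = np.mean(np.abs(err))                # mean absolute error
# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print('MSE: %.6f' % mse)
print('RMSE: %.6f' % rmse)
print('MAE: %.6f' % mae)
print('R_square: %.6f' % r2)
```

RMSE penalizes large errors more heavily than MAE, and an R square close to 1 means the predictions explain most of the variance of the targets.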

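3.3. Attention Mechanism

(References: the article "原创 | Attention is all you need 论文解析（附代码）", an analysis of the paper "Attention Is All You Need" with code,

and the Bilibili video "09 Transformer 之什么是注意力机制（Attention）", on what the attention mechanism is.)

Motivation: CNNs and LSTMs find it hard to decide what is important and what is not, which led to the attention mechanism.

The core idea of attention: selectively pick out a small amount of important information from a large amount of information, focus on it, and ignore the rest.

How is attention computed?

Suppose you are shown a picture: I am the query (Q) and the picture is what is being queried (K).

Computing the importance of information amounts to computing similarity.

As follows: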

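Given Q and K = k1, k2, ..., kn, the similarity between Q and every item in K is usually computed with the dot product (inner product). This gives the similarity a1 between Q and k1, a2 between Q and k2, and so on up to an between Q and kn.

Applying softmax(a1, a2, ..., an) yields the probabilities (s1, s2, ..., sn), which show which items matter more to Q.

V = (v1, v2, ..., vn)

V' = (v1, v2, ..., vn) * (s1, s2, ..., sn) = (v1*s1 + v2*s2 + ... + vn*sn)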
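The three steps above (similarity, softmax, weighted sum) fit in a few lines of NumPy. The vectors q, K, and V below are made-up numbers chosen only so that the first key is the one most similar to the query.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))   # subtract the max for numerical stability
    return e / e.sum()

q = np.array([1.0, 0.0])                 # query Q
K = np.array([[1.0, 0.0],                # k1: same direction as q
              [0.0, 1.0],                # k2: orthogonal to q
              [-1.0, 0.0]])              # k3: opposite to q
V = np.array([[10.0, 0.0],               # v1
              [0.0, 10.0],               # v2
              [5.0, 5.0]])               # v3

a = K @ q            # similarities a1..an via dot products
s = softmax(a)       # probabilities s1..sn, summing to 1
v_out = s @ V        # V' = v1*s1 + v2*s2 + ... + vn*sn

print(s)
print(v_out)
```

The largest weight goes to k1, so the output V' is pulled mostly toward v1.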

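In a Transformer, K and V are usually the same (K = V), although K != V is also possible. Either way, K and V must be related in some way, so that the dot product of Q and K can indicate which parts of V are important and which are not.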

Analyzing the formula:
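The formula in question is presumably the scaled dot-product attention from "Attention Is All You Need", since that is the paper the references discuss. Here $d_k$ denotes the dimension of the key vectors.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```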

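1. Inner product

A matrix can be viewed as a collection of vectors. Multiplying a matrix by its transpose therefore computes the inner product of each of these vectors with every vector in the collection, itself included.

(Matrix multiplication takes a row of the first matrix times a column of the second and sums the products. Transposing turns row vectors into column vectors, so the first entry of the product is the inner product of the first row vector with itself.)

Geometric meaning of the inner product: it reflects the angle between two vectors, and equivalently the projection of one vector onto the other.

A large projection means the two vectors are strongly related. If the angle between them is 90°, the two vectors are orthogonal (their inner product is zero), that is, they are unrelated. Extending this to word vectors, which are numerical representations of words in a high-dimensional space: the larger the projection, the more it suggests that when attending to word A we should also pay some attention to word B. In essence, relatedness is measured by the inner product of the vectors.

2. The meaning of the scaling factor sqrt(d_k): dividing the scores by sqrt(d_k) keeps the gradients stable while the Transformer is trained. Without it, the dot products grow with the dimension d_k and push the softmax into regions where its gradient is very small.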
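The effect of the scaling can be observed numerically. The sketch below assumes the components of q and k are independent with mean 0 and variance 1, which is the setting used to motivate the factor in the paper. It shows that the variance of the raw dot product grows roughly like d_k, while the scaled version stays near 1. The sample size and the values of d_k are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 5000                      # number of random (q, k) pairs per setting
variances = {}
for d_k in (4, 64, 512):
    q = rng.normal(size=(n_pairs, d_k))
    k = rng.normal(size=(n_pairs, d_k))
    dots = np.sum(q * k, axis=1)                    # one dot product per pair
    scaled = dots / np.sqrt(d_k)                    # scaled as in the attention formula
    variances[d_k] = (float(dots.var()), float(scaled.var()))
    print(d_k, variances[d_k])
```

Large unscaled scores make the softmax output nearly one-hot, which is where its gradient becomes very small.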

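Code implementation of the attention mechanism

(1) PyTorch implementation

Suppose the input sequence consists of n tokens and the output sequence consists of m tokens. First we define a layer containing two linear transformations, which project the input sequence and the output sequence into a space of the same dimension. The code is:

# author: Chaos_Wang_
# Source: https://blog.csdn.net/qq_41667743/article/details/128986978
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    def __init__(self, input_size, output_size):
        super(AttentionLayer, self).__init__()
        self.input_proj = nn.Linear(input_size, output_size, bias=False)
        self.output_proj = nn.Linear(output_size, output_size, bias=False)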

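Next comes the attention computation itself: compute the similarity between every input token and every output token, then normalize the similarities, which gives n normalized weights for each output token. The code is: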

    def forward(self, inputs, outputs):
        inputs = self.input_proj(inputs) # (batch_size, n, input_size) -> (batch_size, n, output_size)
        outputs = self.output_proj(outputs) # (batch_size, m, output_size) -> (batch_size, m, output_size)
        scores = torch.bmm(inputs, outputs.transpose(1, 2)) # (batch_size, n, output_size) * (batch_size, output_size, m) -> (batch_size, n, m)
        weights = F.softmax(scores, dim=1) # (batch_size, n, m)
        return weights

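In this code, the input and output sequences are first linearly transformed and their similarity is computed. The softmax function then normalizes the similarities, giving an n × m matrix of normalized weights.

Finally, the attention weights are multiplied with the input sequence, which yields m weighted input vectors. The code is: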

class AttentionLayer(nn.Module):
    def __init__(self, input_size, output_size):
        super(AttentionLayer, self).__init__()
        self.input_proj = nn.Linear(input_size, output_size, bias=False)
        self.output_proj = nn.Linear(output_size, output_size, bias=False)
    
    def forward(self, inputs, outputs):
        inputs = self.input_proj(inputs) # (batch_size, n, input_size) -> (batch_size, n, output_size)
        outputs = self.output_proj(outputs) # (batch_size, m, output_size) -> (batch_size, m, output_size)
        scores = torch.bmm(inputs, outputs.transpose(1, 2)) # (batch_size, n, output_size) * (batch_size, output_size, m) -> (batch_size, n, m)
        weights = F.softmax(scores, dim=1) # (batch_size, n, m)
        context = torch.bmm(weights.transpose(1, 2), inputs) # (batch_size, m, n) * (batch_size, n, output_size) -> (batch_size, m, output_size)
        return context

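Multiplying the normalized weight matrix with the (projected) input sequence yields m weighted input vectors. This is the output of the attention model.
Note: some details need care when using an attention model. For example, the input and output sequences do not necessarily have the same length, so the attention model has to be adapted and designed for the task at hand.

(2) TensorFlow implementation

In TensorFlow, the attention mechanism can be implemented with the tf.keras.layers.Attention layer.

Example: we use the IMDB movie review sentiment dataset. This is a binary classification task in which each review must be classified as positive or negative.

The code is: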

# author: Chaos_Wang_
# Source: https://blog.csdn.net/qq_41667743/article/details/128986978
# Import the required libraries and the dataset
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Dense, Embedding, Flatten, LSTM, Bidirectional, Attention
from tensorflow.keras.models import Model
import numpy as np

# Load the IMDB dataset
max_features = 20000
maxlen = 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = pad_sequences(x_train, padding='post', maxlen=maxlen)
x_test = pad_sequences(x_test, padding='post', maxlen=maxlen)

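Next, using the Keras functional API, we build a Bi-LSTM model and add an attention layer on top of it.

The steps are:

a. Use an Embedding layer to turn the input sequence into vectors, then feed them into a bidirectional LSTM layer.

b. Use an Attention layer to compute attention between the LSTM output and itself, which gives an attention-weighted representation for every time step.

c. Flatten the weighted output and pass it through a fully connected layer to get the binary classification output.

# Build the model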
input_layer = Input(shape=(maxlen,))
embedding_layer = Embedding(max_features, 128)(input_layer)
lstm_layer = Bidirectional(LSTM(64, return_sequences=True))(embedding_layer)
attention_layer = Attention()([lstm_layer, lstm_layer])
flatten_layer = Flatten()(attention_layer)
output_layer = Dense(1, activation='sigmoid')(flatten_layer)
model = Model(inputs=input_layer, outputs=output_layer)

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Finally, train and evaluate the model.

# Train the model
model.fit(x_train, y_train, batch_size=128, epochs=5, validation_data=(x_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print('Test accuracy:', accuracy)

That is what I studied this week.

This post draws on three articles:

Word embedding model: https://blog.csdn.net/qq_39564555/article/details/105882001

LSTM model: https://blog.csdn.net/weixin_52910499/article/details/124693212

Attention model: https://blog.csdn.net/qq_41667743/article/details/128986978