NLP Text Summarization NO.4: The seq2seq Model (Detailed Dimension Walkthrough)

Article link: https://blog.csdn.net/xd0dx/article/details/124596862

To keep the code decoupled and clearly structured, the model is implemented as several separate functions:
[Figure: list of the functions to implement]

From the earlier data-processing step, the arrays converted to NumPy have the following shapes:
train_X shape: (82871, 314)
train_Y shape: (82871, 40)
test_X shape: (20000, 314)

The code that saved these arrays earlier:

np.save(train_x_path, train_X)
np.save(train_y_path, train_Y)
np.save(test_x_path, test_X)

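Functions that load the prepared training and test sets:

import numpy as np

# Load the prepared training samples and labels from .npy files (only usable after build_dataset has run).
# train_x_path, train_y_path and test_x_path are the paths used by the np.save calls above.
def load_train_dataset(max_enc_len=300, max_dec_len=50):
    # max_enc_len: maximum sample length; anything beyond it is truncated
    # max_dec_len: maximum label length; anything beyond it is truncated
    train_X = np.load(train_x_path)
    train_Y = np.load(train_y_path)

    train_X = train_X[:, :max_enc_len]
    train_Y = train_Y[:, :max_dec_len]

    return train_X, train_Y

# Load the prepared test samples from the .npy file (only usable after build_dataset has run).
def load_test_dataset(max_enc_len=300):
    # max_enc_len: maximum sample length; anything beyond it is truncated
    test_X = np.load(test_x_path)
    test_X = test_X[:, :max_enc_len]
    return test_X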
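
To make the truncation concrete, here is a small sketch using zero-filled stand-in arrays with the column widths reported above (only 4 rows, and the contents are made up, not the real dataset). It also shows that slicing past the end of an axis is harmless: the labels are only 40 columns wide, so max_dec_len=50 leaves them unchanged.

```python
import numpy as np

# Stand-in arrays with the same column widths as the saved data (hypothetical contents).
train_X = np.zeros((4, 314), dtype=np.int64)
train_Y = np.zeros((4, 40), dtype=np.int64)

max_enc_len, max_dec_len = 300, 50

truncated_X = train_X[:, :max_enc_len]  # keeps the first 300 columns
truncated_Y = train_Y[:, :max_dec_len]  # only 40 columns exist, so nothing is cut

print(truncated_X.shape)  # (4, 300)
print(truncated_Y.shape)  # (4, 40)
```

So with the default arguments the encoder sees sequences of length 300, while the label length stays at 40.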

① batcher.py: loading the data in batches

First load the training data and labels with load_train_dataset,
then convert the NumPy arrays to tensors with torch.from_numpy so that they can be wrapped in a TensorDataset.

x_data = torch.from_numpy(train_X)
y_data = torch.from_numpy(train_Y)

Then wrap them:

dataset = TensorDataset(x_data, y_data)

[Figure: TensorDataset example]

Then build an iterator over dataset with DataLoader:

dataset = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True, num_workers=4, pin_memory=True)

Then compute how many iterations each epoch needs:

steps_per_epoch = len(train_X) // batch_size

Finally, return dataset (the wrapped data loader) and steps_per_epoch (the number of iterations per epoch).
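
The steps above can be put together roughly as follows. This is a minimal sketch, not the author's actual batcher.py: the function name train_batch_generator is hypothetical, the data is random stand-in data, and num_workers=0 with no pin_memory is used so the snippet runs anywhere (the post itself uses num_workers=4 and pin_memory=True).

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def train_batch_generator(train_X, train_Y, batch_size, num_workers=0):
    """Wrap NumPy arrays in a shuffled DataLoader and report steps per epoch."""
    x_data = torch.from_numpy(train_X)
    y_data = torch.from_numpy(train_Y)
    dataset = TensorDataset(x_data, y_data)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        drop_last=True, num_workers=num_workers)
    # drop_last=True discards the final short batch, so this equals len(loader).
    steps_per_epoch = len(train_X) // batch_size
    return loader, steps_per_epoch


# Synthetic stand-in data: 10 samples, 300 input tokens, 40 label tokens.
train_X = np.random.randint(0, 100, size=(10, 300)).astype(np.int64)
train_Y = np.random.randint(0, 100, size=(10, 40)).astype(np.int64)

loader, steps_per_epoch = train_batch_generator(train_X, train_Y, batch_size=4)
batch_x, batch_y = next(iter(loader))
print(steps_per_epoch, batch_x.shape, batch_y.shape)
```

With the real data (82871 samples) and batch_size=64, the same integer division gives 1294 steps per epoch, and the last 55 samples of each shuffled epoch are dropped.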

② layers.py: the model's sub-layers

Relevant parameters:

vocab_size is the total size of word_to_id, i.e. len(word_to_id).

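Encoder layer:

It takes the parameters vocab_size, embedding_dim, enc_units and batch_size.

The first layer is the embedding layer:
The vocabulary has vocab_size (32217) words, and each word vector has embedding_dim (500) dimensions.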

self.embedding = nn.Embedding(vocab_size, embedding_dim)

The second layer is the GRU layer:

self.gru = nn.GRU(input_size=embedding_dim, hidden_size=enc_units, num_layers=1, batch_first=True)

About torch.nn.GRU:
[Figure: torch.nn.GRU parameter reference]

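Effect of batch_first=True versus False:
[Figure: example tensor layouts for batch_first=True and batch_first=False]
With batch_first=True the GRU's input and output tensors are laid out as (batch, seq, feature); with batch_first=False they are (seq, batch, feature). The hidden state keeps the layout (num_layers * num_directions, batch, hidden_size) in both cases.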
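
A quick way to see the difference is to run the same data through two GRUs that differ only in batch_first. The sizes below are small made-up values chosen for illustration, not the ones used in this post.

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 4, 7, 5, 3

gru_bf = nn.GRU(input_size, hidden_size, num_layers=1, batch_first=True)
gru_sf = nn.GRU(input_size, hidden_size, num_layers=1, batch_first=False)

x_bf = torch.randn(batch_size, seq_len, input_size)  # (batch, seq, feature)
x_sf = x_bf.transpose(0, 1).contiguous()             # (seq, batch, feature)

out_bf, hn_bf = gru_bf(x_bf)
out_sf, hn_sf = gru_sf(x_sf)

print(out_bf.shape)              # torch.Size([4, 7, 3])
print(out_sf.shape)              # torch.Size([7, 4, 3])
print(hn_bf.shape, hn_sf.shape)  # both torch.Size([1, 4, 3])
```

Only the output layout changes; the hidden state always has the batch on axis 1, which is why the encoder later transposes it.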

Next is the forward part:

def forward(self, x, h0):
    # x.shape: (batch_size, sequence_length)
    # h0.shape: (num_layers, batch_size, enc_units)
    x = self.embedding(x)
    output, hn = self.gru(x, h0)
    return output, hn.transpose(1, 0)

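About batch_size:
[Figure: illustration of batch_size]
In forward, x.shape is (batch_size, sequence_length) = (64, 300).
This means the x passed to nn.Embedding holds 64 sentences (64 rows), each with at most 300 tokens.
nn.Embedding appends the word-vector dimension as the last axis, so its output has shape [batch_size, padded sentence length, word-vector dimension].

After the embedding, x therefore has shape (64, 300, 500), which is the layout a GRU with batch_first=True expects.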
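
A quick shape check: nn.Embedding keeps the leading axes of its input and appends the embedding dimension, so a (batch_size, seq_len) input of token ids yields (batch_size, seq_len, embedding_dim). The sizes below are small stand-ins for 32217, 500, 64 and 300.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 50, 8  # stand-ins for 32217 and 500
batch_size, seq_len = 4, 6         # stand-ins for 64 and 300

embedding = nn.Embedding(vocab_size, embedding_dim)
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # integer ids (dtype long)

embedded = embedding(token_ids)
print(embedded.shape)  # torch.Size([4, 6, 8])
```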

The GRU here is built with hidden_size=enc_units, which is 512.

h0 has shape (num_layers, batch_size, enc_units) = (1, 64, 512).
With batch_first=True, the GRU's output has shape [batch_size, seq_len, hidden_size].
With batch_first=False, the output would have shape [seq_len, batch_size, hidden_size].

Here output is (64, 300, 512).

hn returned by the GRU has shape [num_layers * num_directions, batch_size, hidden_size].

Here hn is (1 * 1, 64, 512).

After .transpose(1, 0), hn ends up with shape (64, 1, 512).
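
Putting the two layers and forward together gives a class along these lines. This is a sketch under assumptions: the post calls initialize_hidden_state() later but does not show its body, so the all-zero version here is reconstructed from the description of h0, and the sizes are small stand-ins rather than 32217 / 500 / 512 / 64.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_size):
        super().__init__()
        self.batch_size = batch_size
        self.enc_units = enc_units
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(input_size=embedding_dim, hidden_size=enc_units,
                          num_layers=1, batch_first=True)

    def forward(self, x, h0):
        x = self.embedding(x)              # (batch, seq_len, embedding_dim)
        output, hn = self.gru(x, h0)       # output: (batch, seq_len, enc_units)
        return output, hn.transpose(1, 0)  # hn: (batch, 1, enc_units)

    def initialize_hidden_state(self):
        # Assumed implementation: an all-zero initial hidden state.
        return torch.zeros(1, self.batch_size, self.enc_units)


encoder = Encoder(vocab_size=50, embedding_dim=8, enc_units=16, batch_size=4)
tokens = torch.ones((4, 6), dtype=torch.long)
output, hn = encoder(tokens, encoder.initialize_hidden_state())
print(output.shape, hn.shape)  # torch.Size([4, 6, 16]) torch.Size([4, 1, 16])
```

For a single-layer, unidirectional GRU the returned hn is simply the output at the last time step, so hn[:, 0, :] matches output[:, -1, :].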

Attention layer:

It takes the parameters enc_units (512), dec_units (512) and attn_units (20).

In __init__:

# The attention score uses three matrix multiplications, one per fully connected layer.
self.w1 = nn.Linear(enc_units, attn_units)  # maps 512 -> 20
self.w2 = nn.Linear(dec_units, attn_units)  # maps 512 -> 20
self.v = nn.Linear(attn_units, 1)           # maps 20 -> 1

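In forward:

query = the decoder hidden state (in the test code further down, the hidden state returned by the Encoder): (batch_size, 1, dec_units) = (64, 1, 512). The length-1 middle axis comes from hn.transpose(1, 0) and is what allows the query to broadcast against the encoder output.
value = enc_output (the Encoder's output): (batch_size, enc_seq_len, enc_units) = (64, 300, 512)
Computation:
self.v(torch.tanh(self.w1(value) + self.w2(query)))
w1(value): (64, 300, 512) × (512, 20) = (64, 300, 20)
w2(query): (64, 1, 512) × (512, 20) = (64, 1, 20)
torch.tanh(self.w1(value) + self.w2(query)): the (64, 1, 20) term is broadcast over the 300 positions, giving (64, 300, 20)
self.v(torch.tanh(self.w1(value) + self.w2(query))): (64, 300, 20) × (20, 1) gives score with shape (64, 300, 1)

F.softmax is then applied to score with dim=1 (axis 1, the seq_len axis) to obtain attention_weights, whose shape is still (64, 300, 1).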
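Next, context_vector is computed as attention_weights * value. Broadcasting multiplies every encoder unit at a given position by that position's weight: (64, 300, 1) × (64, 300, 512) = (64, 300, 512)
context_vector is then summed over the enc_seq_len axis with torch.sum(context_vector, dim=1). keepdim is False here, so the summed axis, which would otherwise remain with size 1, is dropped.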
After the sum, context_vector has shape (64, 512), i.e. (batch_size, enc_units).
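
The broadcast-then-sum step is a weighted sum over positions, which can also be written as a batched matrix product. The sketch below checks that equivalence and shows what keepdim changes. The tensors are random stand-ins with small made-up sizes.

```python
import torch

batch_size, seq_len, enc_units = 4, 6, 16

value = torch.randn(batch_size, seq_len, enc_units)
score = torch.randn(batch_size, seq_len, 1)
attention_weights = torch.softmax(score, dim=1)  # normalise over the seq_len axis

# Broadcast multiply, then sum over seq_len (keepdim=False drops that axis).
context_vector = torch.sum(attention_weights * value, dim=1)

# Same result as a batched matrix product: (B, 1, L) @ (B, L, H) -> (B, 1, H).
context_bmm = torch.bmm(attention_weights.transpose(1, 2), value).squeeze(1)

# With keepdim=True the summed axis is kept with size 1.
context_keepdim = torch.sum(attention_weights * value, dim=1, keepdim=True)

print(context_vector.shape)   # torch.Size([4, 16])
print(context_keepdim.shape)  # torch.Size([4, 1, 16])
print(torch.allclose(context_vector, context_bmm, atol=1e-5))  # True
```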

    def forward(self, query, value):
        # query is the previous decoder hidden state, shape: (batch_size, 1, dec_units) (64, 1, 512)
        # value is the encoder output enc_output, shape: (batch_size, enc_seq_len, enc_units) (64, 300, 512)
        # Before self.v is applied, the tensor has shape (batch_size, enc_seq_len, attn_units) (64, 300, 20)
        # score has shape: (batch_size, enc_seq_len, 1) (64, 300, 1)
        score = self.v(torch.tanh(self.w1(value) + self.w2(query)))

        # The attention weights are score passed through softmax along axis 1 (the seq_len axis)
        attention_weights = F.softmax(score, dim=1)

        # (batch_size, enc_seq_len, 1) * (batch_size, enc_seq_len, enc_units)
        # Broadcasting: every encoder unit at a position is multiplied by that position's weight
        context_vector = attention_weights * value
        # Sum over the enc_seq_len axis
        context_vector = torch.sum(context_vector, dim=1)
        # After the sum, context_vector has shape: (batch_size, enc_units)

        return context_vector, attention_weights
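
For reference, here is the attention layer as one self-contained class, assembled from the __init__ and forward fragments above, with a check that the weights sum to 1 along the sequence axis. The sizes are small stand-ins. The last lines illustrate why the query needs its length-1 middle axis: a 2-D query of shape (batch, dec_units) does not broadcast against the (batch, seq_len, attn_units) term when batch and seq_len differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    def __init__(self, enc_units, dec_units, attn_units):
        super().__init__()
        self.w1 = nn.Linear(enc_units, attn_units)
        self.w2 = nn.Linear(dec_units, attn_units)
        self.v = nn.Linear(attn_units, 1)

    def forward(self, query, value):
        # query: (batch, 1, dec_units); value: (batch, seq_len, enc_units)
        score = self.v(torch.tanh(self.w1(value) + self.w2(query)))
        attention_weights = F.softmax(score, dim=1)
        context_vector = torch.sum(attention_weights * value, dim=1)
        return context_vector, attention_weights


attention = Attention(enc_units=16, dec_units=16, attn_units=5)
query = torch.randn(4, 1, 16)
value = torch.randn(4, 6, 16)
context_vector, attention_weights = attention(query, value)

print(context_vector.shape)     # torch.Size([4, 16])
print(attention_weights.shape)  # torch.Size([4, 6, 1])
weight_sums = attention_weights.sum(dim=1).squeeze(-1)  # one sum per sample, each ~1.0

# A 2-D query fails here because (4, 6, 5) + (4, 5) cannot be broadcast.
two_d_query_failed = False
try:
    attention(torch.randn(4, 16), value)
except RuntimeError:
    two_d_query_failed = True
print(two_d_query_failed)  # True
```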

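Usage:
input0 is created with torch.ones and is a (64, 300) tensor of ones.

h0 is an all-zero tensor. A general form is torch.zeros(size=(self.num_layers, batch_size, self.num_hiddens), device=device); here it is torch.zeros(1, self.batch_size, self.enc_units), which gives h0 the shape (1, 64, 512).

Finally, attention returns context_vector (64, 512) and attention_weights (64, 300, 1).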

if __name__ == '__main__':
    word_to_id, id_to_word = get_vocab_from_model(vocab_path, reverse_vocab_path)
    vocab_size = len(word_to_id)

    # Test parameters
    EXAMPLE_INPUT_SEQUENCE_LEN = 300
    BATCH_SIZE = 64
    EMBEDDING_DIM = 500
    GRU_UNITS = 512
    ATTENTION_UNITS = 20

    encoder = Encoder(vocab_size, EMBEDDING_DIM, GRU_UNITS, BATCH_SIZE)

    input0 = torch.ones((BATCH_SIZE, EXAMPLE_INPUT_SEQUENCE_LEN), dtype=torch.long)
    h0 = encoder.initialize_hidden_state()
    output, hn = encoder(input0, h0)
    # output shape: (64, 300, 512), because the GRU uses batch_first=True
    # hn shape: (64, 1, 512)

    attention = Attention(GRU_UNITS, GRU_UNITS, ATTENTION_UNITS)
    context_vector, attention_weights = attention(hn, output)
    print(context_vector.shape)
    print(attention_weights.shape)

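Decoder layer:

It takes the parameters vocab_size (32217), embedding_dim (500), dec_units (512) and batch_size (64); its forward pass additionally receives context_vector from the attention layer.

__init__ again needs an embedding and a GRU. Compared with the Encoder, the GRU's input_size changes from
embedding_dim to embedding_dim + dec_units (500 + 512 = 1012), because the context vector is concatenated onto the embedded token (the context vector has enc_units dimensions, which equals dec_units here). hidden_size changes from enc_units (512) to dec_units (512).

A fully connected layer is also added: nn.Linear(dec_units, vocab_size), mapping 512 to 32217.
As follows: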

self.embedding = nn.Embedding(vocab_size, embedding_dim)

self.gru = nn.GRU(input_size=embedding_dim + dec_units,
                  hidden_size=dec_units,
                  num_layers=1,
                  batch_first=True)
                          
self.fc = nn.Linear(dec_units, vocab_size)

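forward receives x and context_vector.
x still goes through the embedding, but the decoder decodes one token at a time, so each sample contributes a single token rather than a whole sentence as in the encoder.
x.shape after passing through the embedding: (batch_size, 1 (one token), embedding_dim) = (64, 1, 500)

In the encoder, by comparison, x after the embedding is [batch_size, padded sentence length, word-vector dimension] = (64, 300, 500)

After the embedding, torch.cat joins the previous step's prediction (x) with the attention context (context_vector) to form this step's GRU input, which has shape (64, 1, 1012).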

x = torch.cat([torch.unsqueeze(context_vector, 1), x], dim=-1)
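
The shape bookkeeping of this concatenation can be checked in isolation. The sketch uses random stand-in tensors with small made-up sizes (8 and 16 in place of 500 and 512).

```python
import torch

batch_size, embedding_dim, enc_units = 4, 8, 16

embedded_token = torch.randn(batch_size, 1, embedding_dim)  # one embedded token per sample
context_vector = torch.randn(batch_size, enc_units)         # as returned by the attention layer

# unsqueeze adds the length-1 time axis so both tensors are 3-D,
# then the feature axes are joined: 16 + 8 = 24.
gru_input = torch.cat([torch.unsqueeze(context_vector, 1), embedded_token], dim=-1)
print(gru_input.shape)  # torch.Size([4, 1, 24])
```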

output and hn come from the GRU:

output, hn = self.gru(x)

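Because the GRU uses batch_first=True, output here has shape (64, 1, 512), where the 1 stands for the single token.
squeeze(1) then removes that axis, giving (64, 512).
This output is passed to the fully connected layer fc to obtain prediction.
As follows: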

output = output.squeeze(1)
prediction = self.fc(output)

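Finally, forward returns prediction and hn.transpose(1, 0).

The Decoder's final prediction has shape (64, 32217), and hn has shape (64, 1, 512).
The computation is: output (64, 512) goes through the fully connected layer (512 -> 32217) and yields prediction (64, 32217).
By comparison, the Encoder's output has shape (64, 300, 512).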
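
Assembled from the fragments above, a single decoding step looks roughly like this. It is a sketch with small stand-in sizes; as in the post's code, the GRU is called without an explicit initial state, so it starts from zeros on each call.

```python
import torch
import torch.nn as nn


class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_size):
        super().__init__()
        self.batch_size = batch_size
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(input_size=embedding_dim + dec_units, hidden_size=dec_units,
                          num_layers=1, batch_first=True)
        self.fc = nn.Linear(dec_units, vocab_size)

    def forward(self, x, context_vector):
        x = self.embedding(x)                    # (batch, 1, embedding_dim)
        x = torch.cat([torch.unsqueeze(context_vector, 1), x], dim=-1)
        output, hn = self.gru(x)                 # output: (batch, 1, dec_units)
        prediction = self.fc(output.squeeze(1))  # (batch, vocab_size)
        return prediction, hn.transpose(1, 0)    # hn: (batch, 1, dec_units)


decoder = Decoder(vocab_size=50, embedding_dim=8, dec_units=16, batch_size=4)
dec_input = torch.ones((4, 1), dtype=torch.long)  # one token id per sample
context_vector = torch.randn(4, 16)               # stand-in for the attention output
prediction, hn = decoder(dec_input, context_vector)
print(prediction.shape, hn.shape)  # torch.Size([4, 50]) torch.Size([4, 1, 16])
```

The prediction holds unnormalised scores over the vocabulary; the next token id can be read off with prediction.argmax(dim=-1).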

Overall model:

if __name__ == '__main__':
    word_to_id, id_to_word = get_vocab_from_model(vocab_path, reverse_vocab_path)

    vocab_size = len(word_to_id)
    batch_size = 64
    input_seq_len = 300

    # Simulated test parameters
    params = {"vocab_size": vocab_size, "embed_size": 500, "enc_units": 512,
              "attn_units": 20, "dec_units": 512, "batch_size": batch_size}

    # Instantiate the model
    model = Seq2Seq(params)

    # Initialise the test input data
    sample_input_batch = torch.ones((batch_size, input_seq_len), dtype=torch.long)
    sample_hidden = model.encoder.initialize_hidden_state()

    # Call the Encoder to encode the input
    sample_output, sample_hidden = model.encoder(sample_input_batch, sample_hidden)

    # Print the output tensor shapes
    print('Encoder output shape: (batch_size, enc_seq_len, enc_units) {}'.format(sample_output.shape))
    print('Encoder Hidden state shape: (batch_size, 1, enc_units) {}'.format(sample_hidden.shape))

    # Call Attention to compute the attention tensors
    context_vector, attention_weights = model.attention(sample_hidden, sample_output)

    print("Attention context_vector shape: (batch_size, enc_units) {}".format(context_vector.shape))
    print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

    # Call the Decoder to decode
    dec_input = torch.ones((batch_size, 1), dtype=torch.long)
    sample_decoder_output, _ = model.decoder(dec_input, context_vector)

    print('Decoder output shape: (batch_size, vocab_size) {}'.format(sample_decoder_output.shape))
    # Only a single step is tested here, so dec_seq_len is not used

Result:

Encoder output shape: (batch_size, enc_seq_len, enc_units) torch.Size([64, 300, 512])
Encoder Hidden state shape: (batch_size, 1, enc_units) torch.Size([64, 1, 512])
Attention context_vector shape: (batch_size, enc_units) torch.Size([64, 512])
Attention weights shape: (batch_size, sequence_length, 1) torch.Size([64, 300, 1])
Decoder output shape: (batch_size, vocab_size) torch.Size([64, 32217])
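
The Seq2Seq class used above is not shown in this post. One plausible way to wire the three layers together is sketched below. Everything in it is an assumption rather than the author's implementation: the compact re-definitions of the three layers, the params keys (taken from the test code), and in particular the forward method, which runs a teacher-forcing loop over several decoding steps instead of the single step tested above. Sizes are small stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, enc_units, batch_size):
        super().__init__()
        self.batch_size, self.enc_units = batch_size, enc_units
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.gru = nn.GRU(embed_size, enc_units, num_layers=1, batch_first=True)

    def initialize_hidden_state(self):
        return torch.zeros(1, self.batch_size, self.enc_units)

    def forward(self, x, h0):
        output, hn = self.gru(self.embedding(x), h0)
        return output, hn.transpose(1, 0)


class Attention(nn.Module):
    def __init__(self, enc_units, dec_units, attn_units):
        super().__init__()
        self.w1 = nn.Linear(enc_units, attn_units)
        self.w2 = nn.Linear(dec_units, attn_units)
        self.v = nn.Linear(attn_units, 1)

    def forward(self, query, value):
        score = self.v(torch.tanh(self.w1(value) + self.w2(query)))
        attention_weights = F.softmax(score, dim=1)
        return torch.sum(attention_weights * value, dim=1), attention_weights


class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, dec_units, batch_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.gru = nn.GRU(embed_size + dec_units, dec_units, num_layers=1, batch_first=True)
        self.fc = nn.Linear(dec_units, vocab_size)

    def forward(self, x, context_vector):
        x = torch.cat([context_vector.unsqueeze(1), self.embedding(x)], dim=-1)
        output, hn = self.gru(x)
        return self.fc(output.squeeze(1)), hn.transpose(1, 0)


class Seq2Seq(nn.Module):
    def __init__(self, params):
        super().__init__()
        self.encoder = Encoder(params["vocab_size"], params["embed_size"],
                               params["enc_units"], params["batch_size"])
        self.attention = Attention(params["enc_units"], params["dec_units"],
                                   params["attn_units"])
        self.decoder = Decoder(params["vocab_size"], params["embed_size"],
                               params["dec_units"], params["batch_size"])

    def forward(self, enc_input, dec_target):
        # Assumed training loop (teacher forcing): token t is fed in to predict token t + 1.
        h0 = self.encoder.initialize_hidden_state()
        enc_output, hidden = self.encoder(enc_input, h0)
        predictions = []
        for t in range(dec_target.shape[1] - 1):
            context_vector, _ = self.attention(hidden, enc_output)
            prediction, hidden = self.decoder(dec_target[:, t:t + 1], context_vector)
            predictions.append(prediction)
        return torch.stack(predictions, dim=1)  # (batch, dec_seq_len - 1, vocab_size)


params = {"vocab_size": 50, "embed_size": 8, "enc_units": 16,
          "attn_units": 5, "dec_units": 16, "batch_size": 4}
model = Seq2Seq(params)
enc_input = torch.ones((4, 12), dtype=torch.long)
dec_target = torch.ones((4, 6), dtype=torch.long)
logits = model(enc_input, dec_target)
print(logits.shape)  # torch.Size([4, 5, 50])
```

Note that this wiring only works because enc_units equals dec_units: the encoder's hidden state serves as the first attention query, and the context vector is sized by enc_units while the decoder GRU expects embed_size + dec_units input features.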