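To keep the code decoupled and clearly structured, the model implementation is split into several functions, described below.
The earlier data processing step yields the following data.
Converted to NumPy arrays, the shapes are:
train_X shape: (82871, 314)
train_Y shape: (82871, 40)
test_X shape: (20000, 314)
The code used earlier to save the data: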
np.save(train_x_path, train_X)
np.save(train_y_path, train_Y)
np.save(test_x_path, test_X)
Functions for loading the prepared training and test sets:
import numpy as np

# Load the processed training samples and labels from .npy files (usable only after build_dataset has run)
def load_train_dataset(max_enc_len=300, max_dec_len=50):
    # max_enc_len: maximum sample length; longer samples are truncated
    # max_dec_len: maximum label length; longer labels are truncated
    train_X = np.load(train_x_path)
    train_Y = np.load(train_y_path)
    train_X = train_X[:, :max_enc_len]
    train_Y = train_Y[:, :max_dec_len]
    return train_X, train_Y

# Load the processed test samples from the .npy file (usable only after build_dataset has run)
def load_test_dataset(max_enc_len=300):
    # max_enc_len: maximum sample length; longer samples are truncated
    test_X = np.load(test_x_path)
    test_X = test_X[:, :max_enc_len]
    return test_X
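As a quick check of the truncation used by both loaders, the sketch below applies the same column slicing to small stand-in arrays. The zero-filled arrays and the row count of 8 are assumptions for illustration; only the widths 314 and 40 come from the shapes listed earlier. Slicing with a bound larger than the array width does not pad, so labels of width 40 stay at width 40 under max_dec_len=50.

```python
import numpy as np

# Stand-in arrays (assumption): 8 rows, with the column widths reported in the text
train_X = np.zeros((8, 314), dtype=np.int64)
train_Y = np.zeros((8, 40), dtype=np.int64)

max_enc_len, max_dec_len = 300, 50

# Same slicing as in load_train_dataset
truncated_X = train_X[:, :max_enc_len]
truncated_Y = train_Y[:, :max_dec_len]

print(truncated_X.shape)  # (8, 300): columns 300..313 are dropped
print(truncated_Y.shape)  # (8, 40): already shorter than 50, so unchanged
```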
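① batcher.py: the function that loads data in batches
First load the training samples and labels with load_train_dataset,
then convert the NumPy arrays to tensors with torch.from_numpy so that they can be passed to TensorDataset.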
x_data = torch.from_numpy(train_X)
y_data = torch.from_numpy(train_Y)
Then wrap them:
dataset = TensorDataset(x_data, y_data)
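About TensorDataset: it wraps tensors that share the same first dimension, and indexing it returns a tuple holding one sample from each tensor.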
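A minimal TensorDataset illustration follows. The toy tensors are assumed values chosen so the indexing result is easy to read; they are not the project's data.

```python
import torch
from torch.utils.data import TensorDataset

# Toy tensors (assumed values): 5 samples with 3 features each, plus 5 labels
x_data = torch.arange(15).reshape(5, 3)
y_data = torch.arange(5)

dataset = TensorDataset(x_data, y_data)

print(len(dataset))              # 5 samples
sample_x, sample_y = dataset[2]  # indexing returns one (sample, label) pair
print(sample_x)                  # tensor([6, 7, 8])
print(sample_y)                  # tensor(2)
```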
Next, use DataLoader to build an iterator over dataset:
dataset = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True, num_workers=4, pin_memory=True)
Then compute how many iterations each epoch needs:
steps_per_epoch = len(train_X) // batch_size
Finally, return dataset (the wrapped data set) and steps_per_epoch (the number of iterations per epoch).
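The batching steps above can be put together roughly as follows. This is a sketch under assumptions: the function name train_batch_generator is hypothetical, random arrays stand in for the output of load_train_dataset, and num_workers=0 replaces the num_workers=4 and pin_memory=True settings from the text so that the example runs anywhere.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def train_batch_generator(train_X, train_Y, batch_size):
    # Hypothetical helper name; the steps follow the description in the text.
    x_data = torch.from_numpy(train_X)
    y_data = torch.from_numpy(train_Y)
    dataset = TensorDataset(x_data, y_data)
    # drop_last=True discards the final incomplete batch
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        drop_last=True, num_workers=0)
    steps_per_epoch = len(train_X) // batch_size
    return loader, steps_per_epoch


# Random stand-in data (assumption): 200 samples, 300 input ids, 40 label ids
train_X = np.random.randint(0, 100, size=(200, 300)).astype(np.int64)
train_Y = np.random.randint(0, 100, size=(200, 40)).astype(np.int64)

loader, steps_per_epoch = train_batch_generator(train_X, train_Y, batch_size=64)
batch_x, batch_y = next(iter(loader))

print(steps_per_epoch, len(loader))  # both equal 200 // 64 = 3 because drop_last=True
print(batch_x.shape, batch_y.shape)  # torch.Size([64, 300]) torch.Size([64, 40])
```

With drop_last=True, len(loader) and len(train_X) // batch_size agree, which is why the integer division is a valid step count.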
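② layers.py: the functions that implement the model's sub-layers
Relevant parameter:
vocab_size is the total size of word_to_id, i.e. len(word_to_id).
Encoder layer:
The constructor takes vocab_size, embedding_dim, enc_units, and batch_size.
The first layer is the embedding layer:
The vocabulary has vocab_size (32217) words, and each word vector has embedding_dim (500) dimensions.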
self.embedding = nn.Embedding(vocab_size, embedding_dim)
The second layer is the GRU layer:
self.gru = nn.GRU(input_size=embedding_dim, hidden_size=enc_units, num_layers=1, batch_first=True)
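About torch.nn.GRU and the effect of batch_first:
With batch_first=True, the input and output tensors are laid out as (batch, seq_len, feature).
With batch_first=False (the default), they are laid out as (seq_len, batch, feature).
The hidden state hn is (num_layers * num_directions, batch, hidden_size) in both cases.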
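To make the batch_first layouts concrete, here is a small sketch that runs the same kind of GRU under both settings. The sizes (batch 4, sequence length 7, input size 5, hidden size 6) are arbitrary assumptions, kept small so it runs instantly.

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 4, 7, 5, 6

# batch_first=True: input and output are (batch, seq_len, feature)
gru_bf = nn.GRU(input_size, hidden_size, num_layers=1, batch_first=True)
out_bf, hn_bf = gru_bf(torch.randn(batch_size, seq_len, input_size))

# batch_first=False (the default): input and output are (seq_len, batch, feature)
gru_sf = nn.GRU(input_size, hidden_size, num_layers=1, batch_first=False)
out_sf, hn_sf = gru_sf(torch.randn(seq_len, batch_size, input_size))

print(out_bf.shape, hn_bf.shape)  # torch.Size([4, 7, 6]) torch.Size([1, 4, 6])
print(out_sf.shape, hn_sf.shape)  # torch.Size([7, 4, 6]) torch.Size([1, 4, 6])
```

Only the output layout changes; the hidden state keeps the batch in its second dimension either way, which is why the encoder transposes hn afterwards.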
Next is the forward method:
def forward(self, x, h0):
    # x.shape: (batch_size, sequence_length)
    # h0.shape: (num_layers, batch_size, enc_units)
    x = self.embedding(x)
    output, hn = self.gru(x, h0)
    return output, hn.transpose(1, 0)
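About batch_size:
In forward, x.shape is (batch_size, sequence_length), here (64, 300).
This means the x passed to nn.Embedding holds 64 sentences (64 rows), each with at most 300 words.
nn.Embedding appends the word-vector dimension, so its output has shape [batch_size, normalized sentence length, embedding_dim].
After embedding, x is therefore (64, 300, 500), which is the layout a GRU with batch_first=True expects.
The GRU is created with hidden_size=enc_units, which is 512 here.
h0 has shape (num_layers, batch_size, enc_units), here (1, 64, 512).
With batch_first=True, the GRU output has shape [batch_size, seq_len, hidden_size].
With batch_first=False, the output would be [seq_len, batch_size, hidden_size].
Here output is (64, 300, 512).
After the GRU, hn has shape [num_layers * num_directions, batch_size, hidden_size].
Here hn is (1 * 1, 64, 512).
After .transpose(1, 0), hn ends up as (64, 1, 512).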
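The shape bookkeeping above can be traced step by step with a scaled-down sketch. The sizes are assumptions chosen for speed (vocabulary 100, embedding 16, hidden 32, batch 4, sequence length 10); with the values from the text the same code would give (64, 300, 500), (64, 300, 512) and (64, 1, 512).

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, enc_units = 100, 16, 32
batch_size, seq_len = 4, 10

embedding = nn.Embedding(vocab_size, embedding_dim)
gru = nn.GRU(input_size=embedding_dim, hidden_size=enc_units,
             num_layers=1, batch_first=True)

x = torch.ones((batch_size, seq_len), dtype=torch.long)  # token ids
h0 = torch.zeros(1, batch_size, enc_units)               # initial hidden state

embedded = embedding(x)            # (batch_size, seq_len, embedding_dim)
output, hn = gru(embedded, h0)     # output: (batch_size, seq_len, enc_units)
hn_t = hn.transpose(1, 0)          # (1, batch, units) -> (batch, 1, units)

print(embedded.shape)  # torch.Size([4, 10, 16])
print(output.shape)    # torch.Size([4, 10, 32])
print(hn.shape)        # torch.Size([1, 4, 32])
print(hn_t.shape)      # torch.Size([4, 1, 32])

# For a single-layer, unidirectional GRU the last time step of output is hn
print(torch.allclose(output[:, -1, :], hn[0]))  # True
```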
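**Attention layer:**
The constructor takes enc_units (512), dec_units (512), and attn_units (20).
In init:
# The three matrix multiplications used to compute attention correspond to 3 fully connected layers.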
self.w1 = nn.Linear(enc_units, attn_units) #(512,20)
self.w2 = nn.Linear(dec_units, attn_units) #(512,20)
self.v = nn.Linear(attn_units, 1) #(20,1)
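In forward:
query is the previous decoder hidden state, with shape (batch_size, 1, dec_units), here (64, 1, 512). The middle dimension of size 1 is what allows it to broadcast over the 300 encoder positions; a (64, 512) query could not be added to a (64, 300, 20) tensor.
value is enc_output (the Encoder's output), with shape (batch_size, enc_seq_len, enc_units), here (64, 300, 512).
Computation:
self.v(torch.tanh(self.w1(value) + self.w2(query)))
w1(value) = (64, 300, 512) × (512, 20) = (64, 300, 20)
w2(query) = (64, 1, 512) × (512, 20) = (64, 1, 20)
torch.tanh(self.w1(value) + self.w2(query)): the sum broadcasts to (64, 300, 20)
v(torch.tanh(self.w1(value) + self.w2(query))): (64, 300, 20) × (20, 1) gives score with shape (64, 300, 1)
Then F.softmax is applied to score with dim=1 (the seq_len axis), which gives attention_weights, still with shape (64, 300, 1).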
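The score and softmax steps can be checked in isolation. The sketch below uses small assumed sizes and random inputs, and it assumes the query keeps a middle dimension of size 1, as the hidden state returned by the encoder's forward does after transpose(1, 0).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, enc_seq_len, enc_units, dec_units, attn_units = 4, 6, 8, 8, 5

w1 = nn.Linear(enc_units, attn_units)
w2 = nn.Linear(dec_units, attn_units)
v = nn.Linear(attn_units, 1)

value = torch.randn(batch_size, enc_seq_len, enc_units)
query = torch.randn(batch_size, 1, dec_units)  # size-1 middle dimension broadcasts over enc_seq_len

projected_value = w1(value)   # (4, 6, 5)
projected_query = w2(query)   # (4, 1, 5)
score = v(torch.tanh(projected_value + projected_query))  # (4, 6, 1)

# Normalize over the sequence axis so each sample's weights sum to 1
attention_weights = F.softmax(score, dim=1)

print(score.shape)                           # torch.Size([4, 6, 1])
print(attention_weights.sum(dim=1).shape)    # torch.Size([4, 1]); every entry is close to 1
```

Choosing dim=1 matters: with dim=-1 the softmax would run over a dimension of size 1 and every weight would become exactly 1.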
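Next, context_vector is computed by broadcasting, so that every position of the encoder output is multiplied by its weight: attention_weights * value: (64, 300, 1) × (64, 300, 512) = (64, 300, 512)
Then context_vector is summed over the enc_seq_len dimension with torch.sum(context_vector, dim=1). Because keepdim=False here, the summed dimension, which would otherwise remain with size 1, is removed.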
After the sum, context_vector has shape (batch_size, enc_units), here (64, 512).
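A tiny worked example of the broadcast multiply and the sum, using hand-picked numbers (assumed values, one sample with 3 positions and 2 features) so the arithmetic can be followed by eye:

```python
import torch

attention_weights = torch.tensor([[[0.2], [0.3], [0.5]]])     # (1, 3, 1)
value = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]])  # (1, 3, 2)

# Broadcasting repeats each weight across the feature dimension
weighted = attention_weights * value                 # (1, 3, 2)

# keepdim=False (the default) removes the summed dimension
context_vector = torch.sum(weighted, dim=1)          # (1, 2)
# keepdim=True would keep it with size 1
kept = torch.sum(weighted, dim=1, keepdim=True)      # (1, 1, 2)

print(weighted.shape, context_vector.shape, kept.shape)
print(context_vector)  # approximately [[3.6, 4.6]]: 0.2*1 + 0.3*3 + 0.5*5 and 0.2*2 + 0.3*4 + 0.5*6
```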
def forward(self, query, value):
    # query: previous decoder hidden state, shape (batch_size, 1, dec_units), here (64, 1, 512)
    # value: the encoder result enc_output, shape (batch_size, enc_seq_len, enc_units), here (64, 300, 512)
    # Before self.v is applied, the tensor has shape (batch_size, enc_seq_len, attn_units), here (64, 300, 20)
    # score has shape (batch_size, enc_seq_len, 1), here (64, 300, 1)
    score = self.v(torch.tanh(self.w1(value) + self.w2(query)))
    # The attention weights are the softmax of score, applied over dim=1 (the seq_len axis)
    attention_weights = F.softmax(score, dim=1)
    # (batch_size, enc_seq_len, 1) * (batch_size, enc_seq_len, enc_units)
    # Broadcasting: every position of the encoder output is multiplied by its weight
    context_vector = attention_weights * value
    # Sum over the enc_seq_len dimension
    context_vector = torch.sum(context_vector, dim=1)
    # After the sum, context_vector has shape (batch_size, enc_units)
    return context_vector, attention_weights
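Usage:
input0 is created with torch.ones and is a (64, 300) tensor of ones.
h0 is a zero tensor created with torch.zeros(1, self.batch_size, self.enc_units) (the general form is torch.zeros(size=(self.num_layers, batch_size, self.num_hiddens), device=device)), so h0 has shape (1, 64, 512).
The call finally returns context_vector (64, 512) and attention_weights (64, 300, 1).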
if __name__ == '__main__':
    word_to_id, id_to_word = get_vocab_from_model(vocab_path, reverse_vocab_path)
    vocab_size = len(word_to_id)
    # Parameters used for testing
    EXAMPLE_INPUT_SEQUENCE_LEN = 300
    BATCH_SIZE = 64
    EMBEDDING_DIM = 500
    GRU_UNITS = 512
    ATTENTION_UNITS = 20
    encoder = Encoder(vocab_size, EMBEDDING_DIM, GRU_UNITS, BATCH_SIZE)
    input0 = torch.ones((BATCH_SIZE, EXAMPLE_INPUT_SEQUENCE_LEN), dtype=torch.long)
    h0 = encoder.initialize_hidden_state()
    output, hn = encoder(input0, h0)
    # output shape: (64, 300, 512)
    # hn shape: (64, 1, 512)
    attention = Attention(GRU_UNITS, GRU_UNITS, ATTENTION_UNITS)
    context_vector, attention_weights = attention(hn, output)
    print(context_vector.shape)
    print(attention_weights.shape)
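**Decoder layer:**
The constructor takes vocab_size (32217), embedding_dim (500), dec_units (512), and batch_size (64); forward additionally receives context_vector, which comes from the attention layer.
In init, an embedding layer and a GRU are again needed. Compared with the Encoder, the GRU's input_size changes from
embedding_dim to embedding_dim + dec_units (500 + 512), and hidden_size changes from enc_units (512) to dec_units (512).
A fully connected layer is also added: nn.Linear(dec_units, vocab_size), i.e. (512, 32217).
As follows: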
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.gru = nn.GRU(input_size=embedding_dim + dec_units,
                  hidden_size=dec_units,
                  num_layers=1,
                  batch_first=True)
self.fc = nn.Linear(dec_units, vocab_size)
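forward takes x and context_vector.
x still goes through the embedding layer, but the decoder decodes one word at a time, so each sample contributes a single token rather than a whole sentence as in the encoder.
Shape of x after embedding: (batch_size, 1, embedding_dim), here (64, 1, 500), where 1 means one word.
In the encoder, by comparison, x after embedding is [batch_size, normalized sentence length, embedding_dim], here (64, 300, 500).
After embedding, torch.cat joins **the previous step's prediction (x) with the attention context vector (context_vector)** to form the GRU input for this step. unsqueeze turns context_vector into (64, 1, 512), so the concatenated input is (64, 1, 1012).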
x = torch.cat([torch.unsqueeze(context_vector, 1), x], dim=-1)
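The concatenation step can be verified on its own. The sketch below uses small assumed sizes (batch 4, embedding 16, context 32) and random tensors in place of the real embedding output and context vector.

```python
import torch

batch_size, embedding_dim, enc_units = 4, 16, 32

x = torch.randn(batch_size, 1, embedding_dim)        # embedded decoder input, one word per sample
context_vector = torch.randn(batch_size, enc_units)  # output of the attention layer

expanded = torch.unsqueeze(context_vector, 1)        # (4, 32) -> (4, 1, 32)
gru_input = torch.cat([expanded, x], dim=-1)         # concatenate along the feature axis

print(expanded.shape)   # torch.Size([4, 1, 32])
print(gru_input.shape)  # torch.Size([4, 1, 48]), i.e. enc_units + embedding_dim
```

The context vector occupies the first enc_units features of the result and the word embedding the rest, which is why the decoder GRU's input_size is the sum of the two widths.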
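output and hn come from the GRU:
output, hn = self.gru(x)
Because batch_first=True, output here has shape (64, 1, 512), where 1 means one word.
squeeze(1) then removes that dimension, giving **(64, 512)**.
This output is passed to the fully connected layer fc to obtain prediction.
As follows: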
output = output.squeeze(1)
prediction = self.fc(output)
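Finally, forward returns prediction and hn.transpose(1, 0).
The Decoder's final prediction has shape (64, 32217), and hn has shape (64, 1, 512).
The computation is: output (64, 512) enters the fully connected layer (512, 32217) and yields prediction (64, 32217).
For comparison, the Encoder's output has shape (64, 300, 512).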
Overall model implementation:
if __name__ == '__main__':
    word_to_id, id_to_word = get_vocab_from_model(vocab_path, reverse_vocab_path)
    vocab_size = len(word_to_id)
    batch_size = 64
    input_seq_len = 300
    # Parameters for a mock test
    params = {"vocab_size": vocab_size, "embed_size": 500, "enc_units": 512,
              "attn_units": 20, "dec_units": 512, "batch_size": batch_size}
    # Instantiate the model
    model = Seq2Seq(params)
    # Initialize the test input data
    sample_input_batch = torch.ones((batch_size, input_seq_len), dtype=torch.long)
    sample_hidden = model.encoder.initialize_hidden_state()
    # Call the Encoder to encode the input
    sample_output, sample_hidden = model.encoder(sample_input_batch, sample_hidden)
    # Print the tensor shapes
    print('Encoder output shape: (batch_size, enc_seq_len, enc_units) {}'.format(sample_output.shape))
    # Note: the label below says (batch_size, enc_units), but the tensor is (batch_size, 1, enc_units)
    print('Encoder Hidden state shape: (batch_size, enc_units) {}'.format(sample_hidden.shape))
    # Call Attention to compute the attention tensors
    context_vector, attention_weights = model.attention(sample_hidden, sample_output)
    print("Attention context_vector shape: (batch_size, enc_units) {}".format(context_vector.shape))
    print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
    # Call the Decoder to decode
    dec_input = torch.ones((batch_size, 1), dtype=torch.long)
    sample_decoder_output, _ = model.decoder(dec_input, context_vector)
    print('Decoder output shape: (batch_size, vocab_size) {}'.format(sample_decoder_output.shape))
    # Only a single decoding step is tested here, so dec_seq_len is not used
Result:
Encoder output shape: (batch_size, enc_seq_len, enc_units) torch.Size([64, 300, 512])
Encoder Hidden state shape: (batch_size, enc_units) torch.Size([64, 1, 512])
Attention context_vector shape: (batch_size, enc_units) torch.Size([64, 512])
Attention weights shape: (batch_size, sequence_length, 1) torch.Size([64, 300, 1])
Decoder output shape: (batch_size, vocab_size) torch.Size([64, 32217])
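The Seq2Seq class used by the test code is not shown in this section. The following is one possible sketch of how the three sub-layers might be assembled, assuming Seq2Seq is a plain container that exposes them as encoder, attention, and decoder and reads the keys of the params dictionary used above. The initialize_hidden_state body is also an assumption based on the zero tensor described earlier. Sizes are scaled down so the check runs quickly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_size):
        super().__init__()
        self.batch_size, self.enc_units = batch_size, enc_units
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, enc_units, num_layers=1, batch_first=True)

    def forward(self, x, h0):
        output, hn = self.gru(self.embedding(x), h0)
        return output, hn.transpose(1, 0)

    def initialize_hidden_state(self):
        # Assumed implementation: an all-zero initial hidden state
        return torch.zeros(1, self.batch_size, self.enc_units)


class Attention(nn.Module):
    def __init__(self, enc_units, dec_units, attn_units):
        super().__init__()
        self.w1 = nn.Linear(enc_units, attn_units)
        self.w2 = nn.Linear(dec_units, attn_units)
        self.v = nn.Linear(attn_units, 1)

    def forward(self, query, value):
        score = self.v(torch.tanh(self.w1(value) + self.w2(query)))
        attention_weights = F.softmax(score, dim=1)
        context_vector = torch.sum(attention_weights * value, dim=1)
        return context_vector, attention_weights


class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim + dec_units, dec_units, num_layers=1, batch_first=True)
        self.fc = nn.Linear(dec_units, vocab_size)

    def forward(self, x, context_vector):
        x = torch.cat([torch.unsqueeze(context_vector, 1), self.embedding(x)], dim=-1)
        output, hn = self.gru(x)
        return self.fc(output.squeeze(1)), hn.transpose(1, 0)


class Seq2Seq(nn.Module):
    # Assumed container: it only groups the three sub-layers under the
    # attribute names that the test code accesses.
    def __init__(self, params):
        super().__init__()
        self.encoder = Encoder(params["vocab_size"], params["embed_size"],
                               params["enc_units"], params["batch_size"])
        self.attention = Attention(params["enc_units"], params["dec_units"],
                                   params["attn_units"])
        self.decoder = Decoder(params["vocab_size"], params["embed_size"],
                               params["dec_units"], params["batch_size"])


# Scaled-down parameters (assumption) instead of 32217 / 500 / 512 / 20 / 64
params = {"vocab_size": 50, "embed_size": 8, "enc_units": 16,
          "attn_units": 4, "dec_units": 16, "batch_size": 2}
model = Seq2Seq(params)

enc_input = torch.ones((2, 5), dtype=torch.long)
enc_output, enc_hidden = model.encoder(enc_input, model.encoder.initialize_hidden_state())
context_vector, attention_weights = model.attention(enc_hidden, enc_output)
prediction, dec_hidden = model.decoder(torch.ones((2, 1), dtype=torch.long), context_vector)

print(enc_output.shape, enc_hidden.shape)          # (2, 5, 16) and (2, 1, 16)
print(context_vector.shape, attention_weights.shape)  # (2, 16) and (2, 5, 1)
print(prediction.shape, dec_hidden.shape)          # (2, 50) and (2, 1, 16)
```

The shapes follow the same pattern as the printed result above, with 64, 300, 512 and 32217 replaced by the smaller test sizes.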