序列到序列(seq2seq)-学习笔记

75 篇文章 2 订阅

1. PPT讲解部分

1.1 应用

  • 机器翻译
    (1)给定一个源语言的句子,自动翻译成目标语言
    (2)这两个句子可以有不同的长度

1.2 seq2seq

  • Encoder 编码器是一个RNN,读取输入句子,可以是双向的
  • Decoder 解码器是另外一个RNN,用来输出句子
  • Decoder 接受两个数据
    (1)Encoder 解码器输出的隐状态值
    (2)输出序列已经看见的或者生成的词元来预测下一个词元。
    在这里插入图片描述

1.3 解码器和编码器的细节

  • Encoder编码器是没有输出的RNN
  • Encoder编码器最后时间步的隐状态用作Decoder解码器的初始隐状态
    在这里插入图片描述

1.4 训练

  • 训练时Encoder解码器使用目标句子作为输入;在训练的时候,我们是知道目标句子的
    在这里插入图片描述
  • 推理:
    推理是没有目标句子的。
    在这里插入图片描述

1.5 衡量生成序列的好坏的BLEU

  • P n P_n Pn是预测中所有 n-gram 的精度
  • 标签序列 ABCDEF和预测序列 ABBCD,有 p 1 = 4 / 5 , p 2 = 3 / 4 , p 3 = 1 / 3 , p 4 = 0 p_1=4/5,p_2=3/4,p_3=1/3,p_4=0 p1=4/5,p2=3/4,p3=1/3,p4=0
    p 1 : ( A , B , C , D ) / ( A , B , B , C , D ) = 4 / 5 p_1:(A,B,C,D)/(A,B,B,C,D)=4/5 p1:(A,B,C,D)/(A,B,B,C,D)=4/5
    p 2 : ( A B , B C , C D ) / ( A B , B B , B C , C D ) = 3 / 4 p_2:(AB,BC,CD)/(AB,BB,BC,CD)=3/4 p2:(AB,BC,CD)/(AB,BB,BC,CD)=3/4
    p 3 : ( B C D ) / ( A B B , B B C , B C D ) = 1 / 3 p_3:(BCD)/(ABB,BBC,BCD)=1/3 p3:(BCD)/(ABB,BBC,BCD)=1/3
    p 4 : ( 0 ) / ( A B B C , B B C D ) = 0 p_4:(0)/(ABBC,BBCD)=0 p4:(0)/(ABBC,BBCD)=0
  • BLEU定义
    B L E U = exp ⁡ { min ⁡ ( 0 , 1 − l e n ( l a b e l ) l e n ( p r e d ) ) } ∏ n = 1 k p n 1 / 2 n (1) BLEU=\exp\{\min(0,1-\frac{len(label)}{len(pred)})\}\prod_{n=1}^kp_n^{1/2^n}\tag1 BLEU=exp{min(0,1len(pred)len(label))}n=1kpn1/2n(1)
    惩 罚 过 短 的 预 测 : exp ⁡ { min ⁡ ( 0 , 1 − l e n ( l a b e l ) l e n ( p r e d ) ) } (2) 惩罚过短的预测:\exp\{\min(0,1-\frac{len(label)}{len(pred)})\}\tag2 exp{min(0,1len(pred)len(label))}(2)
    长 匹 配 有 高 权 重 : ∏ n = 1 k p n 1 / 2 n (3) 长匹配有高权重:\prod_{n=1}^kp_n^{1/2^n}\tag3 n=1kpn1/2n(3)
  • BLEU=0表示预测最差,BLUE=1表示预测最好

1.6 小结

  • Seq2seq从一个句子生成另一个句子
  • Encoder编码器和Decoder解码器都是RNN
  • 将编码器最后时间隐状态来初始解码器隐状态来完成信息传递
  • 常用BLEU来衡量生成序列的好坏

2. seq2seq代码

  • 代码
# -*- coding: utf-8 -*-
# @Project: zc
# @Author: zc
# @File name: seq2seq_test
# @Create time: 2022/2/9 17:49


import collections
import math
import torch
from torch import nn
from d2l import torch as d2l
import matplotlib.pyplot as plt


class Seq2SeqEncoder(d2l.Encoder):
	def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
				 dropout=0, **kwargs):
		super(Seq2SeqEncoder, self).__init__(**kwargs)
		self.embedding = nn.Embedding(vocab_size, embed_size)
		self.rnn = nn.GRU(embed_size, num_hiddens, num_layers,
						  dropout=dropout)

	def forward(self, X, *args):
		# 输出'X'的形状:(batch_size,num_steps,embed_size)
		# 因为我们是将X从embedding层里面抽取等长的embed_size词向量,这里是8
		# 假设 x.shape =(4,7),那么最后抽出来的矩阵大小为 (4,7,8)
		# 故输出为(batch_size,num_steps,embed_size)
		X = self.embedding(X)
		# permute负责调换顺序,得到结果为:(num_steps,batch_size,embed_size)
		# 调换后X.shape = (7,4,8)
		X = X.permute(1, 0, 2)
		# 通过 GRU模块得到output,state
		# self.rnn(embed_size=8,hiddens=16,num_layers=2)
		# 因为X.shape=(7,4,8) ->output = (7,4,16);state.shape =(2,batch_size=4,hiddens_out=16)
		output, state = self.rnn(X)
		# output.shape=(7,4,16);state.shape=(2,4,16)
		return output, state


encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16,
						 num_layers=2)

encoder.eval()
X = torch.zeros((4, 7), dtype=torch.long)
output, state = encoder(X)


class Seq2SeqDecoder(d2l.Decoder):
	def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
				 dropout=0, **kwargs):
		super(Seq2SeqDecoder, self).__init__(**kwargs)
		self.embedding = nn.Embedding(vocab_size, embed_size)
		self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers,
						  dropout=dropout)
		self.dense = nn.Linear(num_hiddens, vocab_size)

	def init_state(self, enc_outputs, *args):
		return enc_outputs[1]

	def forward(self, X, state):
		# 输出'X'的形状:(batch_size,num_steps,embed_size)
		X = self.embedding(X).permute(1, 0, 2)
		# 广播context,使其具有与X相同的num_steps
		context = state[-1].repeat(X.shape[0], 1, 1)
		X_and_context = torch.cat((X, context), 2)
		output, state = self.rnn(X_and_context, state)
		output = self.dense(output).permute(1, 0, 2)
		# output的形状:(batch_size,num_steps,vocab_size)
		# state[0]的形状:(num_layers,batch_size,num_hiddens)
		return output, state


decoder = Seq2SeqDecoder(vocab_size=10, embed_size=8, num_hiddens=16,
						 num_layers=2)
decoder.eval()
state_decoder = decoder.init_state(encoder(X))
output_decoder, output_state = decoder(X, state_decoder)
print(f"output_decoder.shape={output_decoder.shape}")
print(f"output_state.shape={output_state.shape}")


def sequence_mask(X, valid_len, value=0):
	maxlen = X.size(1)
	mask = torch.arange((maxlen), dtype=torch.float32,
						device=X.device)[None, :] < valid_len[:, None]
	X[~mask] = value
	return X


class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
	def forward(self, pred, label, valid_len):
		# pred的形状:(batch_size,num_steps,vocab_size)=(3,4,10)
		# label的形状:(batch_size,num_steps)=(3,4)
		# valid_len的形状:(batch_size,)=tensor([4,2,0])
		# weights=(3,4)
		weights = torch.ones_like(label)
		# weights=tensor([[1,1,1,1],[1,1,0,0],[0,0,0,0]])
		weights = sequence_mask(weights, valid_len)
		self.reduction = 'none'
		unweight_loss = super(MaskedSoftmaxCELoss, self).forward(
			pred.permute(0, 2, 1), label)
		# pred.permute(0,2,1)=(3,10,4) ;label=(3,4)-->unweight_loss=(3,4)
		weighted_loss = (unweight_loss * weights).mean(dim=1)
		return weighted_loss


loss = MaskedSoftmaxCELoss()
y_loss = loss(torch.ones(3, 4, 10), torch.ones((3, 4), dtype=torch.long), torch.tensor([4, 2, 0]))
print(f"y_loss ={y_loss}")


def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
	def xavier_init_weights(m):
		if type(m) == nn.Linear:
			nn.init.xavier_uniform_(m.weight)
		if type(m) == nn.GRU:
			for param in m._flat_weights_names:
				if "weight" in param:
					nn.init.xavier_uniform_(m._parameters[param])

	net.apply(xavier_init_weights)
	net.to(device)
	optimizer = torch.optim.Adam(net.parameters(), lr=lr)
	loss = MaskedSoftmaxCELoss()
	net.train()
	animator = d2l.Animator(xlabel='epoch', ylabel='loss',
							xlim=[10, num_epochs])
	for epoch in range(num_epochs):
		timer = d2l.Timer()
		metric = d2l.Accumulator(2)
		for batch in data_iter:
			optimizer.zero_grad()
			X, X_vali_len, Y, Y_valid_len = [x.to(device) for x in batch]
			bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
							   device=device).reshape(-1, 1)
			dec_input = torch.cat([bos, Y[:, : -1]], 1)
			Y_hat, _ = net(X, dec_input, X_vali_len)
			l = loss(Y_hat, Y, Y_valid_len)
			l.sum().backward()
			d2l.grad_clipping(net, 1)
			num_tokens = Y_valid_len.sum()
			optimizer.step()
			with torch.no_grad():
				metric.add(l.sum(), num_tokens)
		if (epoch + 1) % 10 == 0:
			animator.add(epoch + 1, (metric[0] / metric[1],))
	print(f'loss{metric[0] / metric[1]:.3f},{metric[1] / timer.stop():.1f}'
		  f'tokens/sec on {str(device)}')


embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1
batch_size, num_steps = 64, 10
lr, num_epochs, device = 0.005, 300, d2l.try_gpu()

train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)
encoder = Seq2SeqEncoder(len(src_vocab), embed_size, num_hiddens, num_layers,
						 dropout)
decoder = Seq2SeqDecoder(len(tgt_vocab), embed_size, num_hiddens, num_layers,
						 dropout)

net = d2l.EncoderDecoder(encoder, decoder)
train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)
plt.show()
  • 结果
loss0.020,14236.7tokens/sec on cuda:0

在这里插入图片描述

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值