文章目录
前言
依然是跟随官方文档利用插件扩展(添加一个新的 FairseqEncoderDecoderModel )将简单的LSTM作为encoder和decoder进行机器翻译任务:使用 LSTM 对源句子进行编码,然后将最终的隐藏状态传递给第二个 LSTM 来解码目标句子(不使用注意力机制)。
在自然语言处理中,注意力机制主要用于解决机器翻译和文本生成等任务。其基本思想是,在生成每个输出单元(例如,翻译的单词)时,模型不仅关注输入序列的整体信息,还关注输入序列中与当前输出单元最相关的部分。这种关注机制有助于模型捕捉语义关联性,更好地处理长距离依赖关系和翻译任务中的对齐问题,提高翻译质量。
请注意如果是需要构建自己的模型进行训练,也就是当你需要使用插件时,不能单纯使用pip install fairseq的方式安装
一、Build Encoder
Encoder针对source sentence,encoder都应该实现FairseqEncoder
接口。接口本身扩展torch.nn.Module
,也就是说fairseq本身就是PyTorch的上层接口和扩展,因此 FairseqEncoders
和 decoder对应的FairseqDecoders
可以用与普通 PyTorch 模块相同的方式编写和使用。
Encoder 会将源语句中的进行嵌入,馈送到torch.nn.LSTM
中,并返回最终的隐藏状态。
创建 encoder 的代码保存在fairseq/models/simple_lstm.py
中:
请注意,实际上fairseq文件夹中还有一个fairseq文件夹,models是本来就存在的,也就是说实际上是存在于
fairseq/fairseq/models/simple_lstm.py
中
import torch.nn as nn
from fairseq import utils
from fairseq.models import FairseqEncoder
class SimpleLSTMEncoder(FairseqEncoder):
def __init__(
self, args, dictionary, embed_dim=128, hidden_dim=128, dropout=0.1,
):
super().__init__(dictionary)
self.args = args
# Our encoder will embed the inputs before feeding them to the LSTM.
self.embed_tokens = nn.Embedding(
num_embeddings=len(dictionary),
embedding_dim=embed_dim,
padding_idx=dictionary.pad(),
)
self.dropout = nn.Dropout(p=dropout)
# We'll use a single-layer, unidirectional LSTM for simplicity.
self.lstm = nn.LSTM(
input_size=embed_dim,
hidden_size=hidden_dim,
num_layers=1,
bidirectional=False,
batch_first=True,
)
def forward(self, src_tokens, src_lengths):
# The inputs to the ``forward()`` function are determined by the
# Task, and in particular the ``'net_input'`` key in each
# mini-batch. We discuss Tasks in the next tutorial, but for now just
# know that *src_tokens* has shape `(batch, src_len)` and *src_lengths*
# has shape `(batch)`.
# Note that the source is typically padded on the left. This can be
# configured by adding the `--left-pad-source "False"` command-line
# argument, but here we'll make the Encoder handle either kind of
# padding by converting everything to be right-padded.
if self.args.left_pad_source:
# Convert left-padding to right-padding.
src_tokens = utils.convert_padding_direction(
src_tokens,
padding_idx=self.dictionary.pad(),
left_to_right=True
)
# Embed the source.
x = self.embed_tokens(src_tokens)
# Apply dropout.
x = self.dropout(x)
# Pack the sequence into a PackedSequence object to feed to the LSTM.
x = nn.utils.rnn.pack_padded_sequence(x, src_lengths, batch_first=True)
# Get the output from the LSTM.
_outputs, (final_hidden, _final_cell) = self.lstm(x)
# Return the Encoder's output. This can be any object and will be
# passed directly to the Decoder.
return {
# this will have shape `(bsz, hidden_dim)`
'final_hidden': final_hidden.squeeze(0),
}
# Encoders are required to implement this method so that we can rearrange
# the order of the batch elements during inference (e.g., beam search).
def reorder_encoder_out(self, encoder_out, new_order):
"""
Reorder encoder output according to `new_order`.
Args:
encoder_out: output from the ``forward()`` method
new_order (LongTensor): desired order
Returns:
`encoder_out` rearranged according to `new_order`
"""
final_hidden = encoder_out['final_hidden']
return {
'final_hidden': final_hidden.index_select(0, new_order),
}
二、Build Decoder
Decoder针对target sentence,应该实现FairseqDecoder
接口,Decoder将根据Encoder的最终隐藏状态和前一个目标单词的嵌入表示(有时称为教师强制(teacher forcing
))来预测下一个单词。更具体地说,我们将使用 torch.nn.LSTM
生成一系列隐藏状态,然后将其投影到输出词汇表的大小,以预测每个目标单词。
import torch
from fairseq.models import FairseqDecoder
class SimpleLSTMDecoder(FairseqDecoder):
def __init__(
self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
dropout=0.1,
):
super().__init__(dictionary)
# Our decoder will embed the inputs before feeding them to the LSTM.
self.embed_tokens = nn.Embedding(
num_embeddings=len(dictionary),
embedding_dim=embed_dim,
padding_idx=dictionary.pad(),
)
self.dropout = nn.Dropout(p=dropout)
# We'll use a single-layer, unidirectional LSTM for simplicity.
self.lstm = nn.LSTM(
# For the first layer we'll concatenate the Encoder's final hidden
# state with the embedded target tokens.
input_size=encoder_hidden_dim + embed_dim,
hidden_size=hidden_dim,
num_layers=1,
bidirectional=False,
)
# Define the output projection.
self.output_projection = nn.Linear(hidden_dim, len(dictionary))
# During training Decoders are expected to take the entire target sequence
# (shifted right by one position) and produce logits over the vocabulary.
# The *prev_output_tokens* tensor begins with the end-of-sentence symbol,
# ``dictionary.eos()``, followed by the target sequence.
def forward(self, prev_output_tokens, encoder_out):
"""
Args:
prev_output_tokens (LongTensor): previous decoder outputs of shape
`(batch, tgt_len)`, for teacher forcing
encoder_out (Tensor, optional): output from the encoder, used for
encoder-side attention
Returns:
tuple:
- the last decoder layer's output of shape
`(batch, tgt_len, vocab)`
- the last decoder layer's attention weights of shape
`(batch, tgt_len, src_len)`
"""
bsz, tgt_len = prev_output_tokens.size()
# Extract the final hidden state from the Encoder.
final_encoder_hidden = encoder_out['final_hidden']
# Embed the target sequence, which has been shifted right by one
# position and now starts with the end-of-sentence symbol.
x = self.embed_tokens(prev_output_tokens)
# Apply dropout.
x = self.dropout(x)
# Concatenate the Encoder's final hidden state to *every* embedded
# target token.
x = torch.cat(
[x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
dim=2,
)
# Using PackedSequence objects in the Decoder is harder than in the
# Encoder, since the targets are not sorted in descending length order,
# which is a requirement of ``pack_padded_sequence()``. Instead we'll
# feed nn.LSTM directly.
initial_state = (
final_encoder_hidden.unsqueeze(0), # hidden
torch.zeros_like(final_encoder_hidden).unsqueeze(0), # cell
)
output, _ = self.lstm(
x.transpose(0, 1), # convert to shape `(tgt_len, bsz, dim)`
initial_state,
)
x = output.transpose(0, 1) # convert to shape `(bsz, tgt_len, hidden)`
# Project the outputs to the size of the vocabulary.
x = self.output_projection(x)
# Return the logits and ``None`` for the attention weights
return x, None
三、注册模型
必须使用register_model()
函数装饰器向 fairseq注册我们的模型。模型注册后,我们将能够将其与现有的命令行工具一起使用。
所有注册的模型都必须实现 BaseFairseqModel
接口。对于序列到序列模型(即具有单个编码器和解码器的任何模型),我们可以改为实现FairseqEncoderDecoderModel
接口。
在同一文件中创建一个小包装类,并将其注册到 fairseq 中,名称为simple_lstm
:
from fairseq.models import FairseqEncoderDecoderModel, register_model
# Note: the register_model "decorator" should immediately precede the
# definition of the Model class.
@register_model('simple_lstm')
class SimpleLSTMModel(FairseqEncoderDecoderModel):
@staticmethod
def add_args(parser):
# Models can override this method to add new command-line arguments.
# Here we'll add some new command-line arguments to configure dropout
# and the dimensionality of the embeddings and hidden states.
parser.add_argument(
'--encoder-embed-dim', type=int, metavar='N',
help='dimensionality of the encoder embeddings',
)
parser.add_argument(
'--encoder-hidden-dim', type=int, metavar='N',
help='dimensionality of the encoder hidden state',
)
parser.add_argument(
'--encoder-dropout', type=float, default=0.1,
help='encoder dropout probability',
)
parser.add_argument(
'--decoder-embed-dim', type=int, metavar='N',
help='dimensionality of the decoder embeddings',
)
parser.add_argument(
'--decoder-hidden-dim', type=int, metavar='N',
help='dimensionality of the decoder hidden state',
)
parser.add_argument(
'--decoder-dropout', type=float, default=0.1,
help='decoder dropout probability',
)
@classmethod
def build_model(cls, args, task):
# Fairseq initializes models by calling the ``build_model()``
# function. This provides more flexibility, since the returned model
# instance can be of a different type than the one that was called.
# In this case we'll just return a SimpleLSTMModel instance.
# Initialize our Encoder and Decoder.
encoder = SimpleLSTMEncoder(
args=args,
dictionary=task.source_dictionary,
embed_dim=args.encoder_embed_dim,
hidden_dim=args.encoder_hidden_dim,
dropout=args.encoder_dropout,
)
decoder = SimpleLSTMDecoder(
dictionary=task.target_dictionary,
encoder_hidden_dim=args.encoder_hidden_dim,
embed_dim=args.decoder_embed_dim,
hidden_dim=args.decoder_hidden_dim,
dropout=args.decoder_dropout,
)
model = SimpleLSTMModel(encoder, decoder)
# Print the model architecture.
print(model)
return model
# We could override the ``forward()`` if we wanted more control over how
# the encoder and decoder interact, but it's not necessary for this
# tutorial since we can inherit the default implementation provided by
# the FairseqEncoderDecoderModel base class, which looks like:
#
# def forward(self, src_tokens, src_lengths, prev_output_tokens):
# encoder_out = self.encoder(src_tokens, src_lengths)
# decoder_out = self.decoder(prev_output_tokens, encoder_out)
# return decoder_out
最后,让我们使用register_model_architecture()
函数装饰器来定义一个具有我们模型配置的命名架构。之后,这个命名架构可以通过–arch命令行参数来使用,例如:–arch tutorial_simple_lstm:
from fairseq.models import register_model_architecture
# The first argument to ``register_model_architecture()`` should be the name
# of the model we registered above (i.e., 'simple_lstm'). The function we
# register here should take a single argument *args* and modify it in-place
# to match the desired architecture.
@register_model_architecture('simple_lstm', 'tutorial_simple_lstm')
def tutorial_simple_lstm(args):
# We use ``getattr()`` to prioritize arguments that are explicitly given
# on the command-line, so that the defaults defined below are only used
# when no other value has been specified.
args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
args.encoder_hidden_dim = getattr(args, 'encoder_hidden_dim', 256)
args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 256)
args.decoder_hidden_dim = getattr(args, 'decoder_hidden_dim', 256)
四、训练模型
可以使用现有的fairseq-train 命令行工具来实现训练的目的,并确保指定我们的新模型架构 --arch tutorial_simple_lstm
。
训练前需要先确保已预处理examples/translation/目录中 IWSLT 示例中的数据 。见fairseq入门:Getting Started:
fairseq-train data-bin/iwslt14.tokenized.de-en \
--arch tutorial_simple_lstm \
--encoder-dropout 0.2 --decoder-dropout 0.2 \
--optimizer adam --lr 0.005 --lr-shrink 0.5 \
--max-tokens 12000
模型文件应出现在checkpoints/
目录中。虽然这个模型架构不是很好,但我们可以使用fairseq-generate
脚本来生成翻译并计算测试集上的 BLEU 分数:
fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/checkpoint_best.pt \
--beam 5 \
--remove-bpe
我的结果:
四、加快生成速度
尽管从sequence-to-sequence的模型进行自回归生成本质上较慢,但上述实现尤其慢,因为它为每个输出标记重新计算整个解码器隐藏状态的序列(即其时间复杂度为O(n^2))。通过缓存先前的隐藏状态,我们可以显著提高性能。
在fairseq中,这被称为增量解码。增量解码是一种特殊的推理模式,在这种模式下,模型只接收与前一个输出标记相对应的单个时间步的输入(用于teacher forcing),并且必须逐步产生下一个输出。因此,模型必须缓存有关序列的任何需要的长期状态,例如隐藏状态、卷积状态等等。
为了实现增量解码,需要修改模型以实现FairseqIncrementalDecoder
接口。与标准的FairseqDecoder
接口相比,增量解码接口允许forward()方法接受一个额外的关键字参数(incremental_state
),该参数可用于在时间步之间缓存状态。
用以下代码替换SimpleLSTMDecoder
:
import torch
from fairseq.models import FairseqIncrementalDecoder
class SimpleLSTMDecoder(FairseqIncrementalDecoder):
def __init__(
self, dictionary, encoder_hidden_dim=128, embed_dim=128, hidden_dim=128,
dropout=0.1,
):
# This remains the same as before.
super().__init__(dictionary)
self.embed_tokens = nn.Embedding(
num_embeddings=len(dictionary),
embedding_dim=embed_dim,
padding_idx=dictionary.pad(),
)
self.dropout = nn.Dropout(p=dropout)
self.lstm = nn.LSTM(
input_size=encoder_hidden_dim + embed_dim,
hidden_size=hidden_dim,
num_layers=1,
bidirectional=False,
)
self.output_projection = nn.Linear(hidden_dim, len(dictionary))
# We now take an additional kwarg (*incremental_state*) for caching the
# previous hidden and cell states.
def forward(self, prev_output_tokens, encoder_out, incremental_state=None):
if incremental_state is not None:
# If the *incremental_state* argument is not ``None`` then we are
# in incremental inference mode. While *prev_output_tokens* will
# still contain the entire decoded prefix, we will only use the
# last step and assume that the rest of the state is cached.
prev_output_tokens = prev_output_tokens[:, -1:]
# This remains the same as before.
bsz, tgt_len = prev_output_tokens.size()
final_encoder_hidden = encoder_out['final_hidden']
x = self.embed_tokens(prev_output_tokens)
x = self.dropout(x)
x = torch.cat(
[x, final_encoder_hidden.unsqueeze(1).expand(bsz, tgt_len, -1)],
dim=2,
)
# We will now check the cache and load the cached previous hidden and
# cell states, if they exist, otherwise we will initialize them to
# zeros (as before). We will use the ``utils.get_incremental_state()``
# and ``utils.set_incremental_state()`` helpers.
initial_state = utils.get_incremental_state(
self, incremental_state, 'prev_state',
)
if initial_state is None:
# first time initialization, same as the original version
initial_state = (
final_encoder_hidden.unsqueeze(0), # hidden
torch.zeros_like(final_encoder_hidden).unsqueeze(0), # cell
)
# Run one step of our LSTM.
output, latest_state = self.lstm(x.transpose(0, 1), initial_state)
# Update the cache with the latest hidden and cell states.
utils.set_incremental_state(
self, incremental_state, 'prev_state', latest_state,
)
# This remains the same as before
x = output.transpose(0, 1)
x = self.output_projection(x)
return x, None
# The ``FairseqIncrementalDecoder`` interface also requires implementing a
# ``reorder_incremental_state()`` method, which is used during beam search
# to select and reorder the incremental state.
def reorder_incremental_state(self, incremental_state, new_order):
# Load the cached state.
prev_state = utils.get_incremental_state(
self, incremental_state, 'prev_state',
)
# Reorder batches according to *new_order*.
reordered_state = (
prev_state[0].index_select(1, new_order), # hidden
prev_state[1].index_select(1, new_order), # cell
)
# Update the cached state.
utils.set_incremental_state(
self, incremental_state, 'prev_state', reordered_state,
)
速度对比示例如下:
# Before
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/checkpoint_best.pt \
--beam 5 \
--remove-bpe
(...)
| Translated 6750 sentences (153132 tokens) in 17.3s (389.12 sentences/s, 8827.68 tokens/s)
| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
# After
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/checkpoint_best.pt \
--beam 5 \
--remove-bpe
(...)
| Translated 6750 sentences (153132 tokens) in 5.5s (1225.54 sentences/s, 27802.94 tokens/s)
| Generate test with beam=5: BLEU4 = 8.18, 38.8/12.1/4.7/2.0 (BP=1.000, ratio=1.066, syslen=139865, reflen=131146)
我的结果:(190.9s v.s. 187.8s)
感觉在我的设备上差别不大…