NLP Experiment 14: Machine Translation with a Transformer (Japanese to Chinese)

Wu、、

Last modified: 2024-06-30 11:01:55

Tags: machine translation, transformer, pytorch

First published: 2024-06-20 22:05:35

Article link: https://blog.csdn.net/m0_68492036/article/details/139843660

Japanese-Chinese Machine Translation Model with Transformer & PyTorch

A tutorial using Jupyter Notebook, PyTorch, Torchtext, and SentencePiece

Cloud platform and GPU type

Instance monitoring

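Preface

Machine translation (MT) is one of the major research directions in natural language processing; its goal is to have a computer system automatically convert text in one language into another. In recent years, with the development of deep learning, Transformer-based models have performed very well on machine translation tasks. By introducing self-attention, the Transformer overcomes the limitations of traditional sequence models in handling long-range dependencies, so a translation system can capture the relationships between the words of a sentence more accurately.

This post describes how to use a Transformer model to translate Japanese into Chinese. It walks through the encoder and decoder structure of the Transformer and the training procedure to help readers understand and apply Transformer-based machine translation, and it provides a complete code example for reference and practice.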

Transformer

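The Transformer model is based entirely on attention mechanisms, with no convolutional or recurrent layers (Vaswani et al., 2017). Although the Transformer was originally applied to sequence-to-sequence learning on text data, it has since spread to a wide range of modern deep learning applications, including language, vision, speech, and reinforcement learning.

Model

The Transformer's encoder and decoder are each built by stacking self-attention-based modules. The embeddings of the source (input) and target (output) sequences have positional encoding added to them before being fed into the encoder and the decoder, respectively.

At a high level, the Transformer encoder is a stack of identical layers, each with two sublayers. The first is multi-head self-attention pooling; the second is a positionwise feed-forward network. Specifically, when computing encoder self-attention, the queries, keys, and values all come from the output of the previous encoder layer. Inspired by the residual networks of Section 7.6, each sublayer uses a residual connection.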

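In the Transformer, for any input $x \in \mathbb{R}^d$ at any position of the sequence, we require $\mathrm{sublayer}(x) \in \mathbb{R}^d$ so that the residual connection $x + \mathrm{sublayer}(x) \in \mathbb{R}^d$ is well defined. Layer normalization (Ba et al., 2016) is applied immediately after the residual addition. As a result, the Transformer encoder outputs a $d$-dimensional vector representation for each position of the input sequence.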
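As a rough illustration of the residual-plus-normalization step, the following plain-Python sketch applies x + sublayer(x) followed by layer normalization to a single d-dimensional vector. The names `layer_norm` and `add_norm` and the toy sublayer are assumptions made for this example; learnable scale and shift parameters are omitted, and the model built later in this post relies on PyTorch's built-in layers instead.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable scale/shift)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_norm(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    assert len(y) == len(x), "sublayer must preserve the dimension d"
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical sublayer used only for illustration: halves every component.
x = [1.0, 2.0, 3.0, 4.0]
out = add_norm(x, lambda v: [0.5 * a for a in v])
print(len(out))  # same dimension d as the input
```

Because the sublayer must return a vector of the same size, layers of this kind can be stacked without any reshaping in between.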

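The Transformer decoder is also a stack of identical layers that use residual connections and layer normalization. In addition to the two sublayers described for the encoder, the decoder inserts a third sublayer between them, called the encoder-decoder attention layer. In encoder-decoder attention, the queries come from the output of the previous decoder layer, while the keys and values come from the output of the entire encoder. In decoder self-attention, the queries, keys, and values all come from the output of the previous decoder layer. However, each position in the decoder may attend only to positions up to that position. This masked attention preserves the auto-regressive property, ensuring that a prediction depends only on output tokens that have already been generated.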

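Multi-head attention

In practice, given the same set of queries, keys, and values, we would like the model to learn different behaviors from the same attention mechanism and then combine those behaviors as knowledge, capturing dependencies of various ranges within a sequence (for example, both short-range and long-range dependencies).

It can therefore be beneficial to let the attention mechanism jointly use different representation subspaces of the queries, keys, and values. To do so, instead of performing a single attention pooling, we transform the queries, keys, and values with h independently learned linear projections. These h projected sets of queries, keys, and values are then fed into attention pooling in parallel. Finally, the h attention pooling outputs are concatenated and transformed by another learned linear projection to produce the final output.

This design is called multi-head attention (Vaswani et al., 2017). Each of the h attention pooling outputs is called a head. Figure 10.5.1 shows multi-head attention in which fully connected layers implement the learnable linear transformations.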
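To make the shapes concrete, here is a minimal sketch that runs PyTorch's `nn.MultiheadAttention` as self-attention on random data. The toy sizes (sequence length 5, batch size 2, embedding size 16, 4 heads) are assumptions chosen for this example, not the settings used later in this post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch_size, embed_dim, num_heads = 5, 2, 16, 4  # toy sizes chosen for this example

# nn.MultiheadAttention projects queries, keys, and values, splits them into
# num_heads heads, attends in parallel, concatenates, and applies an output projection.
mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)

x = torch.randn(seq_len, batch_size, embed_dim)  # default layout: (sequence, batch, embedding)
attn_output, attn_weights = mha(x, x, x)         # self-attention: query = key = value

print(attn_output.shape)   # (seq_len, batch_size, embed_dim)
print(attn_weights.shape)  # (batch_size, seq_len, seq_len), averaged over heads by default
```

Note that `embed_dim` has to be divisible by `num_heads`, since each head works on an equal slice of the embedding.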

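Self-attention and positional encoding

In deep learning, sequences are often encoded with convolutional neural networks (CNNs) or recurrent neural networks (RNNs). With attention mechanisms available, imagine feeding a sequence of tokens into attention pooling so that the same set of tokens acts as queries, keys, and values at once. Specifically, each query attends to all the key-value pairs and produces one attention output. Because the queries, keys, and values come from the same input, this is called self-attention (Lin et al., 2017, Vaswani et al., 2017), also known as intra-attention (Cheng et al., 2016, Parikh et al., 2016, Paulus et al., 2017).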
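The idea that one set of tokens plays all three roles can be sketched in plain Python. This is a simplified illustration under stated assumptions: the function names are made up for this example, the learned projection matrices of a real Transformer are left out, and scaled dot-product scoring is used.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention where each token vector is its own query, key, and value."""
    d = len(tokens[0])
    outputs, all_weights = [], []
    for q in tokens:
        # Score the query against every key, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # The output is the attention-weighted average of the values
        out = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = self_attention(tokens)
print(weights[0])  # how strongly the first token attends to each of the three tokens
```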

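When processing a sequence of tokens, a recurrent neural network handles the tokens one at a time, whereas self-attention gives up sequential operations in favor of parallel computation. To make use of the order of the sequence, absolute or relative positional information is injected by adding a positional encoding to the input representation. Positional encodings can be either learned or fixed.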
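One common fixed choice is the sinusoidal encoding from Vaswani et al. (2017), which is also what the `PositionalEncoding` class later in this post computes. The plain-Python sketch below builds a small table; the function name and the tiny sizes are assumptions for illustration only.

```python
import math

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table: sine on even columns, cosine on odd columns."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table

pe = sinusoidal_positional_encoding(max_len=4, d_model=4)
print(pe[0])  # position 0: sin(0) = 0 on even columns, cos(0) = 1 on odd columns
```

Each row is simply added to the token embedding at that position, so two identical tokens at different positions end up with different input vectors.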

Import required packages

First, make sure the packages below are installed on your system; install any that are missing.

import math  # math helpers (log, sqrt)
import torchtext  # text utilities
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn.utils.rnn import pad_sequence  # pads variable-length sequences to a common length
from torch.utils.data import DataLoader
from collections import Counter  # token frequency counts
from torchtext.vocab import Vocab
from torch.nn import TransformerEncoder, TransformerDecoder, TransformerEncoderLayer, TransformerDecoderLayer
import io
import time
import pandas as pd
import numpy as np
import pickle
import tqdm
import sentencepiece as spm

torch.manual_seed(0)  # fix the random seed for reproducibility
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # use the GPU if available, otherwise the CPU
# print(torch.cuda.get_device_name(0))  ## if you have a GPU, try running this code on your own machine

Experiment environment configuration

Get the parallel dataset

In this tutorial, we will use the Japanese-English parallel dataset downloaded from JParaCrawl (http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl), which is described as the “largest publicly available English-Japanese parallel corpus created by NTT. It was created by largely crawling the web and automatically aligning parallel sentences.” You can also see the paper here.

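# Read 'zh-ja.bicleaner05.txt' as tab-separated text with no header row, using the Python parsing engine
# Note: the file name suggests a Chinese-Japanese corpus, so the en-named variables below presumably hold the Chinese side
df = pd.read_csv('zh-ja.bicleaner05.txt', sep='\\t', engine='python', header=None)
trainen = df[2].values.tolist()  # third column as a list
trainja = df[3].values.tolist()  # fourth column as a list
# trainen.pop(5972)  # remove the element at index 5972 from trainen
# trainja.pop(5972)  # remove the element at index 5972 from trainja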

After importing all the Japanese sentences and their English counterparts, I deleted the last entry in the dataset because it has a missing value. In total, trainen and trainja each contain 5,973,071 sentences. For learning purposes, however, it is usually better to sample the data first and make sure everything works as intended before using all of it at once, to save time.

Here is an example of a sentence contained in the dataset.

print(trainen[500])  # sentence at index 500 of trainen
print(trainja[500])  # sentence at index 500 of trainja

We can also follow along with a different parallel dataset; just make sure the data can be processed into two lists of strings like the ones above, containing the Japanese and English sentences.
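Since sampling the data before a full run is recommended above, here is one possible way to draw an aligned subset from two parallel lists. The helper name `sample_parallel` and the toy lists are hypothetical; with the real data you would pass `trainja` and `trainen` instead.

```python
import random

def sample_parallel(src_sentences, tgt_sentences, k, seed=0):
    """Draw k aligned sentence pairs without replacement, keeping each pair together."""
    assert len(src_sentences) == len(tgt_sentences), "the two lists must be aligned"
    rng = random.Random(seed)
    indices = rng.sample(range(len(src_sentences)), k)
    return [src_sentences[i] for i in indices], [tgt_sentences[i] for i in indices]

# Toy stand-ins for trainja / trainen (made up for this example).
toy_ja = [f"ja-{i}" for i in range(100)]
toy_en = [f"en-{i}" for i in range(100)]
small_ja, small_en = sample_parallel(toy_ja, toy_en, k=10)
print(list(zip(small_ja, small_en))[:3])
```

Sampling indices once and applying them to both lists keeps every source sentence paired with its own translation.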

Prepare the tokenizers

Unlike English or other alphabetical languages, a Japanese sentence does not contain whitespace to separate the words. We can use the tokenizers provided by JParaCrawl, which were created with SentencePiece for both Japanese and English; you can download them from the JParaCrawl website.

# Load the English SentencePiece model
en_tokenizer = spm.SentencePieceProcessor(model_file='spm.en.nopretok.model')
# Load the Japanese SentencePiece model
ja_tokenizer = spm.SentencePieceProcessor(model_file='spm.ja.nopretok.model')

After the tokenizers are loaded, you can test them, for example by running the code below.

# Encode an English sentence; out_type='str' returns the pieces as strings rather than ids
en_tokenizer.encode("All residents aged 20 to 59 years who live in Japan must enroll in public pension system.", out_type='str')

# Encode a Japanese sentence in the same way
ja_tokenizer.encode("年金日本に住んでいる20歳~60歳の全ての人は、公的年金制度に加入しなければなりません。", out_type='str')

Build the TorchText Vocab objects and convert the sentences into Torch tensors

Using the tokenizers and raw sentences, we then build the Vocab objects imported from TorchText. This can take a few seconds or a few minutes, depending on the size of the dataset and the available computing power. Different tokenizers can also affect the time needed to build the vocab; I tried several other tokenizers for Japanese, but SentencePiece seems to work well and fast enough for me.

def build_vocab(sentences, tokenizer):
    # Count how often each token appears across all sentences
    counter = Counter()
    for sentence in sentences:
        # Tokenize the sentence into string pieces and update the counts
        counter.update(tokenizer.encode(sentence, out_type=str))
    # Build a Vocab from the counts plus the special tokens
    # (this Counter-based constructor is the legacy torchtext API; newer releases may not accept it)
    return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

# Vocabulary for the Japanese training sentences
ja_vocab = build_vocab(trainja, ja_tokenizer)
# Vocabulary for the English training sentences
en_vocab = build_vocab(trainen, en_tokenizer)

Once we have the vocabulary objects, we can use them together with the tokenizers to build the tensors for our training data.

def data_process(ja, en):
    data = []
    # Walk through the Japanese and English sentences in parallel
    for (raw_ja, raw_en) in zip(ja, en):
        # Strip the trailing newline, tokenize, and map each token to its vocabulary id
        ja_tensor_ = torch.tensor([ja_vocab[token] for token in ja_tokenizer.encode(raw_ja.rstrip("\n"), out_type=str)],
                                  dtype=torch.long)
        en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer.encode(raw_en.rstrip("\n"), out_type=str)],
                                  dtype=torch.long)
        # Store the pair of id tensors
        data.append((ja_tensor_, en_tensor_))
    return data

# Convert the whole training set into pairs of id tensors
train_data = data_process(trainja, trainen)
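The token-to-id lookup performed here can be pictured with a small stand-alone sketch. It does not use torchtext: `build_toy_vocab` and `numericalize` are hypothetical helpers written for this example, and the toy token lists stand in for SentencePiece output.

```python
from collections import Counter

SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

def build_toy_vocab(tokenized_sentences):
    """Map each token to an integer id; special tokens take the first ids."""
    counter = Counter()
    for tokens in tokenized_sentences:
        counter.update(tokens)
    itos = SPECIALS + [tok for tok, _ in counter.most_common()]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi):
    """Convert tokens to ids, falling back to <unk> for unseen tokens."""
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

# Toy token lists standing in for SentencePiece output.
stoi, itos = build_toy_vocab([["a", "b", "a"], ["a", "c"]])
ids = numericalize(["a", "z"], stoi)
print(ids)  # "a" is the most frequent token; "z" was never seen and maps to <unk>
```

In this sketch the special tokens occupy ids 0 to 3, so every real token starts at id 4 and anything unseen falls back to id 0.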

Create the DataLoader object to be iterated during training

Here, I set BATCH_SIZE to 8 to prevent “CUDA out of memory” errors, but the right value depends on things such as your machine's memory capacity and the size of the data, so feel free to change the batch size according to your needs (note: the PyTorch tutorial uses a batch size of 128 with the Multi30k German-English dataset).

# Batch size used by the DataLoader below
BATCH_SIZE = 8
# Indices of the padding, beginning-of-sentence, and end-of-sentence tokens
PAD_IDX = ja_vocab['<pad>']
BOS_IDX = ja_vocab['<bos>']
EOS_IDX = ja_vocab['<eos>']

def generate_batch(data_batch):
    ja_batch, en_batch = [], []
    for (ja_item, en_item) in data_batch:
        # Wrap each sentence with <bos> and <eos>
        ja_batch.append(torch.cat([torch.tensor([BOS_IDX]), ja_item, torch.tensor([EOS_IDX])], dim=0))
        en_batch.append(torch.cat([torch.tensor([BOS_IDX]), en_item, torch.tensor([EOS_IDX])], dim=0))
    # Pad every sentence in the batch to the length of the longest one
    ja_batch = pad_sequence(ja_batch, padding_value=PAD_IDX)
    en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
    return ja_batch, en_batch

# Iterate over train_data in shuffled batches, collated by generate_batch
train_iter = DataLoader(train_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
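To see what the collate step produces, the sketch below pads two toy sentences of different lengths. The ids 1, 2, and 3 for the padding, start, and end tokens are assumptions for this example, and the other token ids are arbitrary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX, BOS_IDX, EOS_IDX = 1, 2, 3  # assumed ids for this example

def add_bos_eos(ids):
    """Wrap a 1-D tensor of token ids with <bos> and <eos>."""
    return torch.cat([torch.tensor([BOS_IDX]), ids, torch.tensor([EOS_IDX])], dim=0)

# Two toy sentences of different lengths (token ids are arbitrary).
batch = [add_bos_eos(torch.tensor([10, 11, 12])), add_bos_eos(torch.tensor([20]))]
padded = pad_sequence(batch, padding_value=PAD_IDX)

print(padded.shape)  # (longest sequence, batch size)
print(padded[:, 1])  # the shorter sentence, padded at the end
```

By default `pad_sequence` puts the sequence dimension first, which is why the masks later in this post transpose the padding masks to (batch, sequence).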

Sequence-to-sequence Transformer

The next few code blocks and text explanations (written in italic) are taken from the original PyTorch tutorial [https://pytorch.org/tutorials/beginner/translation_transformer.html]. I did not change anything except BATCH_SIZE and the name de_vocab, which is changed to ja_vocab.

Transformer is a Seq2Seq model introduced in the “Attention is all you need” paper for solving machine translation tasks. The Transformer model consists of an encoder block and a decoder block, each containing a fixed number of layers.

The encoder processes the input sequence by propagating it through a series of multi-head attention and feed-forward network layers. The output of the encoder, referred to as memory, is fed to the decoder along with the target tensors. The encoder and decoder are trained end to end using the teacher forcing technique.

# Transformer encoder/decoder stacks and their layer classes from torch.nn
from torch.nn import (TransformerEncoder, TransformerDecoder,
                      TransformerEncoderLayer, TransformerDecoderLayer)

class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_encoder_layers: int, num_decoder_layers: int,
                 emb_size: int, src_vocab_size: int, tgt_vocab_size: int,
                 dim_feedforward: int = 512, dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        # One encoder layer, stacked num_encoder_layers times
        # (NHEAD is a module-level constant defined with the other hyperparameters)
        encoder_layer = TransformerEncoderLayer(d_model=emb_size, nhead=NHEAD,
                                                dim_feedforward=dim_feedforward)
        self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)
        # One decoder layer, stacked num_decoder_layers times
        decoder_layer = TransformerDecoderLayer(d_model=emb_size, nhead=NHEAD,
                                                dim_feedforward=dim_feedforward)
        self.transformer_decoder = TransformerDecoder(decoder_layer, num_layers=num_decoder_layers)

        # Linear layer that maps decoder outputs to target-vocabulary logits
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        # Token embeddings for the source and target languages
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        # Positional encoding shared by both sides
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)

    def forward(self, src: Tensor, trg: Tensor, src_mask: Tensor,
                tgt_mask: Tensor, src_padding_mask: Tensor,
                tgt_padding_mask: Tensor, memory_key_padding_mask: Tensor):
        # Embed the source and target tokens and add positional encoding
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        # Encode the source sequence into memory
        memory = self.transformer_encoder(src_emb, src_mask, src_padding_mask)
        # Decode the target sequence while attending to memory
        outs = self.transformer_decoder(tgt_emb, memory, tgt_mask, None,
                                        tgt_padding_mask, memory_key_padding_mask)
        # Project to vocabulary logits
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        # Embed the source, add positional encoding, and run the encoder
        return self.transformer_encoder(self.positional_encoding(
            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        # Embed the target, add positional encoding, and run the decoder
        return self.transformer_decoder(self.positional_encoding(
            self.tgt_tok_emb(tgt)), memory,
            tgt_mask)

Text tokens are represented using token embeddings. Positional encoding is added to the token embeddings to introduce a notion of word order.

class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal positional encodings to token embeddings."""
    def __init__(self, emb_size: int, dropout, maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        # Frequency term 1 / 10000^(2i / emb_size) for each pair of dimensions
        den = torch.exp(- torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        # Position indices as a column vector
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        # Sine on even dimensions, cosine on odd dimensions
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        # Insert a batch dimension: (maxlen, 1, emb_size)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        # Register as a buffer: saved and loaded with the model but not trained
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        # Add the encodings for the first seq_len positions, then apply dropout
        return self.dropout(token_embedding +
                            self.pos_embedding[:token_embedding.size(0), :])

class TokenEmbedding(nn.Module):
    """Embeds token ids and scales the result by sqrt(emb_size)."""
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

We create a subsequent word mask to stop a target word from attending to the words that follow it. We also create masks for the source and target padding tokens.

def generate_square_subsequent_mask(sz):
    # Boolean lower-triangular matrix: True where position i may attend to position j <= i
    mask = (torch.triu(torch.ones((sz, sz), device=device)) == 1).transpose(0, 1)
    # Blocked positions become -inf, allowed positions become 0.0
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    # Causal mask for the target sequence
    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    # The source may attend everywhere: an all-False boolean mask
    src_mask = torch.zeros((src_seq_len, src_seq_len), device=device).type(torch.bool)

    # Padding masks of shape (batch, seq_len), True at padded positions
    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
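The two kinds of masks are easier to read on a tiny example. The sketch below rebuilds them on the CPU; `causal_mask` is a hypothetical equivalent of the subsequent-word mask, and the padding id of 1 and the toy token ids are assumptions for this example.

```python
import torch

def causal_mask(size):
    """Float mask with 0.0 where attention is allowed and -inf where it is blocked."""
    allowed = torch.tril(torch.ones(size, size)).bool()  # lower triangle: current and earlier positions
    return torch.zeros(size, size).masked_fill(~allowed, float('-inf'))

PAD_IDX = 1  # assumed padding id
# Shape (seq_len=4, batch=2); the first sentence ends with one padding token.
tgt = torch.tensor([[2, 2], [7, 8], [3, 9], [1, 3]])

mask = causal_mask(tgt.shape[0])
padding_mask = (tgt == PAD_IDX).transpose(0, 1)  # shape (batch, seq_len), True at padded positions

print(mask)
print(padding_mask)
```

Row i of the causal mask has -inf in every column after i, so position i cannot look at later tokens, while the padding mask removes padded positions for every query.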

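Define the model parameters and instantiate the model. The compute available on our server is quite limited; the configuration below can be trained there, but the results are unlikely to be good. To see what the model can really do, run this code on your own machine with a GPU.

When using your own GPU, set NUM_ENCODER_LAYERS and NUM_DECODER_LAYERS to 3 or higher, NHEAD to 8, and EMB_SIZE to 512.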

SRC_VOCAB_SIZE = len(ja_vocab)  # source vocabulary size
TGT_VOCAB_SIZE = len(en_vocab)  # target vocabulary size
EMB_SIZE = 512                  # embedding dimension
NHEAD = 8                       # number of attention heads
FFN_HID_DIM = 512               # hidden size of the feed-forward network
# Batch size (train_iter was already built with BATCH_SIZE = 8, so this reassignment does not change it)
BATCH_SIZE = 16
NUM_ENCODER_LAYERS = 3          # number of encoder layers
NUM_DECODER_LAYERS = 3          # number of decoder layers
NUM_EPOCHS = 16                 # number of training epochs

# Instantiate the model
transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS,
                                 EMB_SIZE, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE,
                                 FFN_HID_DIM)

# Initialize every weight matrix with Xavier uniform initialization
for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# Move the model to the selected device (GPU if available)
transformer = transformer.to(device)

# Cross-entropy loss that ignores padding positions
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Adam optimizer
optimizer = torch.optim.Adam(
    transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9
)

def train_epoch(model, train_iter, optimizer):
    """Train for one epoch and return the average loss per batch."""
    model.train()
    losses = 0
    for idx, (src, tgt) in enumerate(train_iter):
        src = src.to(device)
        tgt = tgt.to(device)

        # Decoder input: the target without its last time step
        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        # Forward pass
        logits = model(src, tgt_input, src_mask, tgt_mask,
                       src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        # Expected output: the target without its first time step
        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()
    return losses / len(train_iter)

def evaluate(model, val_iter):
    """Return the average loss per batch on the validation data."""
    model.eval()
    losses = 0
    # No gradients are needed for evaluation
    with torch.no_grad():
        # Iterate over the val_iter argument (the original looped over an undefined name, valid_iter)
        for idx, (src, tgt) in enumerate(val_iter):
            src = src.to(device)
            tgt = tgt.to(device)

            # Decoder input: the target without its last time step
            tgt_input = tgt[:-1, :]

            src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

            logits = model(src, tgt_input, src_mask, tgt_mask,
                           src_padding_mask, tgt_padding_mask, src_padding_mask)

            # Expected output: the target without its first time step
            tgt_out = tgt[1:, :]
            loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
            losses += loss.item()
    return losses / len(val_iter)
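Both functions rely on the same one-step shift between the decoder input and the expected output, which is how teacher forcing is set up. The plain-Python sketch below shows that shift on a toy sentence; the helper name and the tokens are made up for this example.

```python
BOS, EOS = "<bos>", "<eos>"

def teacher_forcing_pair(target_tokens):
    """Split a target sequence into decoder input and prediction targets, shifted by one step."""
    decoder_input = target_tokens[:-1]   # everything except the last token
    expected_output = target_tokens[1:]  # everything except the first token
    return decoder_input, expected_output

tgt = [BOS, "I", "like", "tea", EOS]
decoder_input, expected_output = teacher_forcing_pair(tgt)
for seen, predict in zip(decoder_input, expected_output):
    print(f"after {seen!r} the model should predict {predict!r}")
```

At every step the decoder is given the true previous token rather than its own earlier prediction, and the causal mask keeps it from looking ahead.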

Start training

Finally, after preparing the necessary classes and functions, we are ready to train our model. The time needed to finish training can vary greatly depending on factors such as computing power, parameters, and the size of the dataset.

When I trained the model using the complete list of sentences from JParaCrawl, which has around 5.9 million sentences for each language, it took around 5 hours per epoch on a single NVIDIA GeForce RTX 3070 GPU.

Here is the code:

# Loop over the epochs with a tqdm progress bar
for epoch in tqdm.tqdm(range(1, NUM_EPOCHS + 1)):
    start_time = time.time()
    # Train for one epoch and get the average training loss
    train_loss = train_epoch(transformer, train_iter, optimizer)
    end_time = time.time()
    # Report the epoch number, training loss, and elapsed time
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, "
           f"Epoch time = {(end_time - start_time):.3f}s"))

Try translating a Japanese sentence using the trained model

First, we create the functions needed to translate a new sentence. The steps are: take the Japanese sentence, tokenize it, convert it to a tensor, run inference, and decode the result back into a sentence, this time in English.

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    """Generate a translation one token at a time, always taking the most likely next token."""
    src = src.to(device)
    src_mask = src_mask.to(device)
    # Encode the source sequence once
    memory = model.encode(src, src_mask)
    # Start the target sequence with the start symbol
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)
    for i in range(max_len - 1):
        memory = memory.to(device)
        # All-False memory mask (not passed to the decoder below)
        memory_mask = torch.zeros(ys.shape[0], memory.shape[0]).to(device).type(torch.bool)
        # Causal mask for the tokens generated so far
        tgt_mask = (generate_square_subsequent_mask(ys.size(0)).type(torch.bool)).to(device)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        # Logits for the last position
        prob = model.generator(out[:, -1])
        # Pick the highest-scoring token
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()
        # Append it to the target sequence
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        # Stop once the end-of-sentence token is produced
        if next_word == EOS_IDX:
            break
    return ys

def translate(model, src, src_vocab, tgt_vocab, src_tokenizer):
    """Translate one source sentence with greedy decoding."""
    model.eval()
    # Tokenize the source sentence and wrap the ids with <bos> and <eos>
    tokens = [BOS_IDX] + [src_vocab.stoi[tok] for tok in src_tokenizer.encode(src, out_type=str)] + [EOS_IDX]
    num_tokens = len(tokens)
    # Shape (num_tokens, 1): a batch containing one sentence
    src = torch.LongTensor(tokens).reshape(num_tokens, 1)
    # All-False source mask: every position may attend everywhere
    src_mask = torch.zeros(num_tokens, num_tokens).type(torch.bool)
    # Allow the output to be up to 5 tokens longer than the input
    tgt_tokens = greedy_decode(model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    # Map ids back to tokens and drop the <bos> and <eos> markers
    return " ".join([tgt_vocab.itos[tok] for tok in tgt_tokens]).replace("<bos>", "").replace("<eos>", "")
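The string returned above is a space-joined list of subword pieces. SentencePiece marks word boundaries with the character "▁" (U+2581), so one optional post-processing step is to glue the pieces back together and turn that marker into spaces. The sketch below is an assumption about how this could be done, not part of the original tutorial; `pieces_to_text` is a hypothetical helper and the pieces are written by hand.

```python
def pieces_to_text(pieces, specials=("<bos>", "<eos>", "<pad>")):
    """Join SentencePiece-style pieces and turn the word-boundary marker (U+2581) back into spaces."""
    kept = [p for p in pieces if p not in specials]
    return "".join(kept).replace("\u2581", " ").strip()

# Hypothetical pieces, written by hand for this example rather than produced by a real model.
pieces = ["<bos>", "\u2581All", "\u2581resident", "s", "\u2581must", "\u2581en", "roll", ".", "<eos>"]
text = pieces_to_text(pieces)
print(text)
```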

Then we can simply call the translate function and pass the required parameters.

# Translate a Japanese sentence with the trained transformer
translate(transformer, "HSコード 8515 はんだ付け用、ろう付け用又は溶接用の機器(電気式(電気加熱ガス式を含む。)", ja_vocab, en_vocab, ja_tokenizer)

# pop(5) returns the sentence at index 5 and also removes it from the list; used here to look at one training pair
trainen.pop(5)
trainja.pop(5)

Save the Vocab objects and trained model

Finally, after training has finished, we first save the Vocab objects (en_vocab and ja_vocab) using Pickle.

import pickle  # object serialization and deserialization

# Serialize the English vocabulary to 'en_vocab.pkl' (the with block closes the file automatically)
with open('en_vocab.pkl', 'wb') as file:
    pickle.dump(en_vocab, file)

# Serialize the Japanese vocabulary to 'ja_vocab.pkl'
with open('ja_vocab.pkl', 'wb') as file:
    pickle.dump(ja_vocab, file)

Lastly, we can also save the model for later use with PyTorch's save and load functions. Generally, there are two ways to save the model, depending on what we want to use it for later. The first is for inference only: we can load the model later and use it to translate from Japanese to English.

# Save the model weights for inference
torch.save(transformer.state_dict(), 'inference_model')

The second is also usable for inference, but it is meant for when we want to load the model later and resume training.

# Save the model together with a checkpoint so that training can be resumed later
torch.save({
    'epoch': NUM_EPOCHS,  # number of epochs trained so far
    'model_state_dict': transformer.state_dict(),  # model parameters
    'optimizer_state_dict': optimizer.state_dict(),  # optimizer state
    'loss': train_loss,  # most recent training loss
}, 'model_checkpoint.tar')  # written to the file 'model_checkpoint.tar'
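Loading such a checkpoint back might look like the sketch below. To keep it self-contained, a tiny `nn.Linear` stands in for the transformer, the file is written to a temporary directory, and the saved loss is a placeholder; all of these are assumptions for this example. With the real model you would rebuild `Seq2SeqTransformer` with the same hyperparameters before calling `load_state_dict`.

```python
import os
import tempfile

import torch
import torch.nn as nn

# A tiny stand-in model and optimizer; in the tutorial these would be `transformer` and `optimizer`.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

path = os.path.join(tempfile.mkdtemp(), 'model_checkpoint.tar')
torch.save({
    'epoch': 16,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': 0.0,  # placeholder value
}, path)

# Rebuild objects with the same architecture, then restore their saved state.
restored_model = nn.Linear(4, 2)
restored_optimizer = torch.optim.Adam(restored_model.parameters(), lr=0.0001)

checkpoint = torch.load(path, map_location='cpu')
restored_model.load_state_dict(checkpoint['model_state_dict'])
restored_optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1  # resume from the next epoch

restored_model.train()  # call .eval() instead when loading for inference only
```

Restoring the optimizer state as well as the weights matters for Adam, because its running moment estimates would otherwise start again from zero.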