Japanese-Chinese Machine Translation Model with Transformer & PyTorch

Training platform: AutoDL AI Compute Cloud

A tutorial using Jupyter Notebook, PyTorch, Torchtext, and SentencePiece

Import required packages

First, let's make sure the packages below are installed on our system; if any are missing, install them before proceeding.

# pip install torchtext
# import torchtext
# print(torchtext.__version__)
# !pip install torchtext
# !pip install sentencepiece
!pip install pandas


Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: pandas in ./miniconda3/lib/python3.8/site-packages (2.0.3)
Requirement already satisfied: python-dateutil>=2.8.2 in ./miniconda3/lib/python3.8/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in ./miniconda3/lib/python3.8/site-packages (from pandas) (2022.7.1)
Requirement already satisfied: tzdata>=2022.1 in ./miniconda3/lib/python3.8/site-packages (from pandas) (2024.1)
Requirement already satisfied: numpy>=1.20.3 in ./miniconda3/lib/python3.8/site-packages (from pandas) (1.24.2)
Requirement already satisfied: six>=1.5 in ./miniconda3/lib/python3.8/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
import math
import torchtext
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from collections import Counter
from torchtext.vocab import Vocab
from torch.nn import TransformerEncoder, TransformerDecoder, TransformerEncoderLayer, TransformerDecoderLayer
import io
import time
import pandas as pd
import numpy as np
import pickle
import tqdm
import sentencepiece as spm
torch.manual_seed(0)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# print(torch.cuda.get_device_name(0)) ## If you have a GPU, try running this code on your own machine
device
device(type='cuda')

Get the parallel dataset

This tutorial follows the original, which uses the Japanese-English parallel dataset downloaded from JParaCrawl [http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl], described as the “largest publicly available English-Japanese parallel corpus created by NTT. It was created by largely crawling the web and automatically aligning parallel sentences.” You can also see the paper here. In this adaptation we instead read a Japanese-Chinese parallel corpus in the same tab-separated format (zh-ja.bicleaner05.txt); note that the variable names trainen, en_tokenizer, and en_vocab keep the original tutorial's "en" naming even though the second language here is Chinese.

import pandas as pd

# Read the tab-separated text file into a DataFrame
df = pd.read_csv('./zh-ja.bicleaner05.txt', sep='\\t', engine='python', header=None)

# Extract columns 2 and 3 of the DataFrame and convert them to lists
trainen = df[2].values.tolist()  # trainen holds column 2 (the Chinese sentences)
trainja = df[3].values.tolist()  # trainja holds column 3 (the Japanese sentences)

# The commented-out lines below can be used to limit the amount of data or remove specific rows
# trainen = trainen[:10000]  # keep only the first 10,000 elements of trainen
# trainja = trainja[:10000]  # keep only the first 10,000 elements of trainja
# trainen.pop(5972)  # remove the element at index 5972 from trainen
# trainja.pop(5972)  # remove the element at index 5972 from trainja

After importing all the Japanese sentences and their counterparts, I deleted the last entry in the dataset because it has a missing value. In total, the number of sentences in both trainen and trainja is 5,973,071. However, for learning purposes it is often recommended to sample the data and make sure everything works as intended before using all the data at once, in order to save time.
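
If you prefer to guard against such missing values programmatically, a simple option (a sketch, not part of the original notebook) is to drop incomplete rows with pandas before converting the columns into lists:

# Drop rows where either sentence column is missing, then rebuild the lists
df = df.dropna(subset=[2, 3]).reset_index(drop=True)
trainen = df[2].values.tolist()
trainja = df[3].values.tolist()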

Here is an example of a sentence pair contained in the dataset.

print(trainen[500])
print(trainja[500])
Chinese HS Code Harmonized Code System < HS编码 2905 无环醇及其卤化、磺化、硝化或亚硝化衍生物 HS Code List (Harmonized System Code) for US, UK, EU, China, India, France, Japan, Russia, Germany, Korea, Canada ...
Japanese HS Code Harmonized Code System < HSコード 2905 非環式アルコール並びにそのハロゲン化誘導体、スルホン化誘導体、ニトロ化誘導体及びニトロソ化誘導体 HS Code List (Harmonized System Code) for US, UK, EU, China, India, France, Japan, Russia, Germany, Korea, Canada ...

We can also use a different parallel dataset to follow along with this article; just make sure we can process the data into two lists of strings as shown above, one for each language.

Prepare the tokenizers

Unlike English and other alphabetic languages, a Japanese sentence does not contain whitespace to separate words. We can use the tokenizers provided by JParaCrawl, which were created with SentencePiece for both Japanese and English; you can visit the JParaCrawl website to download them, or click here.

en_tokenizer = spm.SentencePieceProcessor(model_file='./spm.en.nopretok.model')
ja_tokenizer = spm.SentencePieceProcessor(model_file='./spm.ja.nopretok.model')

After the tokenizers are loaded, you can test them, for example, by executing the code below.

# Encode an English sentence with en_tokenizer
encoded_text = en_tokenizer.encode("All residents aged 20 to 59 years who live in Japan must enroll in public pension system.")

# Encode a Japanese sentence with ja_tokenizer
encoded_text = ja_tokenizer.encode("年金 日本に住んでいる20歳~60歳の全ての人は、公的年金制度に加入しなければなりません。")
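
As an optional sanity check (not part of the original notebook), you can print the subword pieces and decode the ids back into text; SentencePiece's encode accepts an out_type argument, and decode reverses the encoding:

# Show the subword pieces for the English sentence and round-trip the Japanese ids
print(en_tokenizer.encode("All residents aged 20 to 59 years who live in Japan must enroll in public pension system.", out_type=str))
print(ja_tokenizer.decode(encoded_text))  # decode the Japanese ids back into text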

Build the TorchText Vocab objects and convert the sentences into Torch tensors

Using the tokenizers and raw sentences, we then build the Vocab object imported from TorchText. This process can take a few seconds or minutes depending on the size of our dataset and computing power. The choice of tokenizer can also affect the time needed to build the vocab; I tried several other tokenizers for Japanese, but SentencePiece seems to work well and fast enough for me.

def build_vocab(sentences, tokenizer):
    """
    Build a vocabulary from a list of sentences and a tokenizer.

    Args:
    - sentences (list): list of sentences.
    - tokenizer: tokenizer object used to encode each sentence.

    Returns:
    - Vocab: the built vocabulary, including the special tokens '<unk>', '<pad>', '<bos>', '<eos>'.
    """
    counter = Counter()
    for sentence in sentences:
        counter.update(tokenizer.encode(sentence))
    return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

# Build the Japanese vocabulary with the Japanese tokenizer
ja_vocab = build_vocab(trainja, ja_tokenizer)

# Build the English vocabulary with the English tokenizer
en_vocab = build_vocab(trainen, en_tokenizer)
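
As a quick check (not part of the original notebook), you can inspect the vocabulary sizes and the indices assigned to the special tokens:

# Vocabulary sizes and special-token indices
print(len(ja_vocab), len(en_vocab))
print(ja_vocab['<unk>'], ja_vocab['<pad>'], ja_vocab['<bos>'], ja_vocab['<eos>'])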

After we have the vocabulary objects, we can then use the vocab and the tokenizer objects to build the tensors for our training data.

def data_process(ja, en):
    data = []
    for (raw_ja, raw_en) in zip(ja, en):
        # Encode the Japanese sentence with the Japanese tokenizer and convert it to a tensor
        ja_tensor_ = torch.tensor([ja_vocab[token] for token in ja_tokenizer.encode(raw_ja.rstrip("\n"), out_type=str)],
                                  dtype=torch.long)
        # Encode the English sentence with the English tokenizer and convert it to a tensor
        en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer.encode(raw_en.rstrip("\n"), out_type=str)],
                                  dtype=torch.long)
        # Append the pair of encoded tensors to the data list
        data.append((ja_tensor_, en_tensor_))
    return data

# Process the training data into a list of tensor pairs
train_data = data_process(trainja, trainen)

Create the DataLoader object to be iterated during training

Here, I set BATCH_SIZE to 8 to prevent “cuda out of memory”, but this depends on various things such as your machine's memory capacity, the size of the data, etc., so feel free to change the batch size according to your needs (note: the PyTorch tutorial sets the batch size to 128 using the Multi30k German-English dataset).

BATCH_SIZE = 8  # batch size
PAD_IDX = ja_vocab['<pad>']  # index of <pad>
BOS_IDX = ja_vocab['<bos>']  # index of <bos>
EOS_IDX = ja_vocab['<eos>']  # index of <eos>

def generate_batch(data_batch):

    ja_batch, en_batch = [], []
    for (ja_item, en_item) in data_batch:
        # Add <bos> and <eos> markers around the Japanese sentence and concatenate into one tensor
        ja_batch.append(torch.cat([torch.tensor([BOS_IDX]), ja_item, torch.tensor([EOS_IDX])], dim=0))
        # Add <bos> and <eos> markers around the English sentence and concatenate into one tensor
        en_batch.append(torch.cat([torch.tensor([BOS_IDX]), en_item, torch.tensor([EOS_IDX])], dim=0))

    # Pad the Japanese batch so that all sequences have the same length
    ja_batch = pad_sequence(ja_batch, padding_value=PAD_IDX)
    # Pad the English batch so that all sequences have the same length
    en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)

    return ja_batch, en_batch

# Create the training DataLoader, which yields the data batch by batch
train_iter = DataLoader(train_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
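
As an optional sanity check (not part of the original notebook), you can pull one batch from the loader; because pad_sequence defaults to batch_first=False, each batch tensor has shape (sequence_length, batch_size):

# Fetch a single batch and inspect its shape: (seq_len, BATCH_SIZE) for each language
sample_ja, sample_en = next(iter(train_iter))
print(sample_ja.shape, sample_en.shape)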

Sequence-to-sequence Transformer

The next few code blocks and text explanations (written in italic) are taken from the original PyTorch tutorial [https://pytorch.org/tutorials/beginner/translation_transformer.html]. I did not make any changes except for the BATCH_SIZE and the word de_vocab, which is changed to ja_vocab.

Transformer is a Seq2Seq model introduced in the “Attention is all you need” paper for solving machine translation tasks. The Transformer model consists of an encoder and a decoder block, each containing a fixed number of layers.

The encoder processes the input sequence by propagating it through a series of multi-head attention and feed-forward network layers. The output of the encoder, referred to as memory, is fed to the decoder along with the target tensors. The encoder and decoder are trained in an end-to-end fashion using the teacher forcing technique.

from torch.nn import (TransformerEncoder, TransformerDecoder,
                      TransformerEncoderLayer, TransformerDecoderLayer)

class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_encoder_layers: int, num_decoder_layers: int,
                 emb_size: int, src_vocab_size: int, tgt_vocab_size: int,
                 dim_feedforward:int = 512, dropout:float = 0.1):
        super(Seq2SeqTransformer, self).__init__()

        # Initialize the Transformer encoder and decoder layers
        encoder_layer = TransformerEncoderLayer(d_model=emb_size, nhead=NHEAD,
                                                dim_feedforward=dim_feedforward)
        self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)
        
        decoder_layer = TransformerDecoderLayer(d_model=emb_size, nhead=NHEAD,
                                                dim_feedforward=dim_feedforward)
        self.transformer_decoder = TransformerDecoder(decoder_layer, num_layers=num_decoder_layers)

        # Linear layer that projects to a distribution over the target vocabulary
        self.generator = nn.Linear(emb_size, tgt_vocab_size)

        # Token embedding layers for the source and target languages
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)

        # Positional encoding layer that injects position information into the inputs
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)

    def forward(self, src: Tensor, trg: Tensor, src_mask: Tensor,
                tgt_mask: Tensor, src_padding_mask: Tensor,
                tgt_padding_mask: Tensor, memory_key_padding_mask: Tensor):
        # Embed the source and target tokens and add positional encoding
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        
        # The encoder turns the source sequence into the memory
        memory = self.transformer_encoder(src_emb, src_mask, src_padding_mask)
        
        # The decoder generates the output sequence from the target inputs and the encoder memory
        outs = self.transformer_decoder(tgt_emb, memory, tgt_mask, None,
                                        tgt_padding_mask, memory_key_padding_mask)
        
        # The final linear layer produces a distribution over the target vocabulary
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        # Encode the source sequence alone and return the memory
        return self.transformer_encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        # Decode the target sequence given the memory
        return self.transformer_decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)

Text tokens are represented by using token embeddings. Positional encoding is added to the token embedding to introduce a notion of word order.

class PositionalEncoding(nn.Module):
    def __init__(self, emb_size: int, dropout, maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        # Initialize the positional encoding matrix
        den = torch.exp(- torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        # Dropout for regularization
        self.dropout = nn.Dropout(dropout)
        # Register the positional encoding matrix as a buffer of the module
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        # Add the positional encoding to the input token embeddings and apply dropout
        return self.dropout(token_embedding +
                            self.pos_embedding[:token_embedding.size(0), :])
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        # Create the embedding layer; vocab_size is the vocabulary size, emb_size is the embedding dimension
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        # Look up the embeddings for the input tokens and scale them by sqrt(emb_size)
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

We create a subsequent word mask to stop a target word from attending to its subsequent words. We also create masks, for masking source and target padding tokens

def generate_square_subsequent_mask(sz):
    # Create a lower-triangular boolean matrix
    mask = (torch.triu(torch.ones((sz, sz), device=device)) == 1).transpose(0, 1)
    # Convert to float and fill blocked positions with -inf and visible positions with 0.0
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask
def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    # Subsequent-word mask for the target sequence
    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    # The source mask is all False (no source positions are masked)
    src_mask = torch.zeros((src_seq_len, src_seq_len), device=device).type(torch.bool)

    # Padding masks for the source and target sequences
    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)

    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
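
To see what the subsequent-word mask looks like, you can print a small one (an illustration, not part of the original notebook); positions on and below the diagonal are 0.0 and positions above it are -inf, so each target position can only attend to itself and earlier positions:

# A 4x4 subsequent-word mask: 0.0 on and below the diagonal, -inf above it
print(generate_square_subsequent_mask(4))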

Define the model parameters and instantiate the model. Our server's compute power is quite limited here; with the configuration below the model will train, but the results will probably not be good. If you want to see proper training results, please run this code on your own machine with a GPU.

When you use your own GPU, set NUM_ENCODER_LAYERS and NUM_DECODER_LAYERS to 3 or higher, NHEAD to 8, and EMB_SIZE to 512.

SRC_VOCAB_SIZE = len(ja_vocab)  # source-language vocabulary size
TGT_VOCAB_SIZE = len(en_vocab)  # target-language vocabulary size
EMB_SIZE = 512  # embedding dimension
NHEAD = 8  # number of attention heads
FFN_HID_DIM = 512  # hidden dimension of the feed-forward network
BATCH_SIZE = 16  # batch size
NUM_ENCODER_LAYERS = 3  # number of encoder layers
NUM_DECODER_LAYERS = 3  # number of decoder layers
NUM_EPOCHS = 16  # number of training epochs

# Instantiate the Transformer model
transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS,
                                 EMB_SIZE, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE,
                                 FFN_HID_DIM)

# Initialize the weights with Xavier uniform initialization
for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# Move the model to the target device (e.g. GPU)
transformer = transformer.to(device)

# Loss function: cross-entropy, ignoring the padding positions
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Optimizer: Adam
optimizer = torch.optim.Adam(
    transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9
)

# Training function
def train_epoch(model, train_iter, optimizer):
    model.train()  # set the model to training mode
    losses = 0
    for idx, (src, tgt) in enumerate(train_iter):
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:-1, :]  # drop the last position of the target sequence to form the decoder input

        # Create the source and target masks
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        # Forward pass
        logits = model(src, tgt_input, src_mask, tgt_mask,
                       src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()  # reset the gradients

        tgt_out = tgt[1:, :]  # drop the first position of the target sequence to form the expected output
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))  # compute the loss
        loss.backward()  # backpropagate
        optimizer.step()  # update the parameters
        losses += loss.item()
    return losses / len(train_iter)  # return the average loss

# Evaluation function
def evaluate(model, val_iter):
    model.eval()  # set the model to evaluation mode
    losses = 0
    for idx, (src, tgt) in enumerate(val_iter):
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:-1, :]  # drop the last position of the target sequence to form the decoder input

        # Create the source and target masks
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        # Forward pass
        logits = model(src, tgt_input, src_mask, tgt_mask,
                       src_padding_mask, tgt_padding_mask, src_padding_mask)

        tgt_out = tgt[1:, :]  # drop the first position of the target sequence to form the expected output
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))  # compute the loss
        losses += loss.item()
    return losses / len(val_iter)  # return the average loss

Start training

Finally, after preparing the necessary classes and functions, we are ready to train our model. It goes without saying that the time needed to finish training can vary greatly depending on many things, such as computing power, hyperparameters, and the size of the dataset.

When I trained the model using the complete list of sentences from JParaCrawl which has around 5.9 million sentences for each language, it took around 5 hours per epoch using a single NVIDIA GeForce RTX 3070 GPU.

Here is the code:


# Loop over the training epochs
for epoch in tqdm.tqdm(range(1, NUM_EPOCHS+1)):
    start_time = time.time()  # record the start time of the current epoch
    train_loss = train_epoch(transformer, train_iter, optimizer)  # train for one epoch
    end_time = time.time()  # record the end time of the current epoch

    # Print the training loss and elapsed time for the current epoch
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, "
           f"Epoch time = {(end_time - start_time):.3f}s"))

/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py:4999: UserWarning: Support for mismatched src_key_padding_mask and src_mask is deprecated. Use same type for both instead.
  warnings.warn(
/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py:4999: UserWarning: Support for mismatched key_padding_mask and attn_mask is deprecated. Use same type for both instead.
  warnings.warn(

Epoch: 1, Train loss: 5.078, Epoch time = 386.195s
Epoch: 2, Train loss: 5.065, Epoch time = 390.053s
Epoch: 3, Train loss: 5.068, Epoch time = 390.090s
Epoch: 4, Train loss: 4.868, Epoch time = 388.397s
Epoch: 5, Train loss: 4.268, Epoch time = 386.505s
Epoch: 6, Train loss: 3.934, Epoch time = 384.463s
Epoch: 7, Train loss: 4.025, Epoch time = 387.606s
Epoch: 8, Train loss: 3.793, Epoch time = 387.375s
Epoch: 9, Train loss: 3.768, Epoch time = 388.239s
Epoch: 10, Train loss: 3.729, Epoch time = 388.387s
Epoch: 11, Train loss: 3.662, Epoch time = 246.728s
Epoch: 12, Train loss: 3.436, Epoch time = 213.606s
Epoch: 13, Train loss: 3.225, Epoch time = 212.219s
Epoch: 14, Train loss: 3.198, Epoch time = 213.063s
Epoch: 15, Train loss: 3.247, Epoch time = 212.429s
Epoch: 16, Train loss: 3.239, Epoch time = 213.948s

100%|██████████| 16/16 [1:26:29<00:00, 324.33s/it]

Try translating a Japanese sentence using the trained model

First, we create the functions needed to translate a new sentence. This includes steps such as getting the Japanese sentence, tokenizing it, converting it into tensors, running inference, and then decoding the result back into a sentence, but this time in the target language.

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    """
    Generate a translation with greedy decoding.

    Args:
    - model (Seq2SeqTransformer): the trained Seq2Seq Transformer model.
    - src (Tensor): source-language tensor.
    - src_mask (Tensor): source-language mask.
    - max_len (int): maximum length of the generated sequence.
    - start_symbol (int): start symbol of the target language.

    Returns:
    - Tensor: the generated target-language tensor.
    """
    src = src.to(device)
    src_mask = src_mask.to(device)
    memory = model.encode(src, src_mask)  # encode the source sentence
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)  # start the target sequence with the start symbol

    for i in range(max_len-1):
        memory = memory.to(device)
        memory_mask = torch.zeros(ys.shape[0], memory.shape[0]).to(device).type(torch.bool)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                                    .type(torch.bool)).to(device)
        out = model.decode(ys, memory, tgt_mask)  # decode to predict the next token
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])  # project to a distribution over the vocabulary
        _, next_word = torch.max(prob, dim=1)  # pick the most probable token as the next word
        next_word = next_word.item()
        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)  # append the predicted token to the target sequence
        if next_word == EOS_IDX:  # stop once the end-of-sequence symbol is generated
            break
    return ys

def translate(model, src, src_vocab, tgt_vocab, src_tokenizer):
    """
    Translate a source-language sentence.

    Args:
    - model (Seq2SeqTransformer): the trained Seq2Seq Transformer model.
    - src (str): source-language sentence.
    - src_vocab (Vocab): source-language vocabulary.
    - tgt_vocab (Vocab): target-language vocabulary.
    - src_tokenizer (Tokenizer): source-language tokenizer.

    Returns:
    - str: the translated target-language sentence.
    """
    model.eval()  # set the model to evaluation mode
    tokens = [BOS_IDX] + [src_vocab.stoi[tok] for tok in src_tokenizer.encode(src, out_type=str)] + [EOS_IDX]  # convert the source sentence into a sequence of indices
    num_tokens = len(tokens)
    src = (torch.LongTensor(tokens).reshape(num_tokens, 1))  # convert to a tensor of shape (num_tokens, 1)
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)  # create the source mask
    tgt_tokens = greedy_decode(model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()  # greedily decode the target tensor
    return " ".join([tgt_vocab.itos[tok] for tok in tgt_tokens]).replace("<bos>", "").replace("<eos>", "")  # convert the generated tokens back into a sentence

Then, we can just call the translate function and pass the required parameters.

translate(transformer, "HSコード 8515 はんだ付け用、ろう付け用又は溶接用の機器(電気式(電気加熱ガス式を含む。)", ja_vocab, en_vocab, ja_tokenizer)

'H S 用 于 _85 15_ 焊 、焊 接 设 备 、焊 接 电 气 式 ( 包 括 电 气 加 热 ) 。'
trainen.pop(5)
'美国 设施: 停车场, 24小时前台, 健身中心, 报纸, 露台, 禁烟客房, 干洗, 无障碍设施, 免费停车, 上网服务, 电梯, 快速办理入住/退房手续, 保险箱, 暖气, 传真/复印, 行李寄存, 无线网络, 免费无线网络连接, 酒店各处禁烟, 空调, 阳光露台, 自动售货机(饮品), 自动售货机(零食), 每日清洁服务, 内部停车场, 私人停车场, WiFi(覆盖酒店各处), 停车库, 无障碍停车场, 简短描述Gateway Hotel Santa Monica酒店距离海滩2英里(3.2公里),提供24小时健身房。每间客房均提供免费WiFi,客人可以使用酒店的免费地下停车场。'
trainja.pop(5)
'アメリカ合衆国 施設・設備: 駐車場, 24時間対応フロント, フィットネスセンター, 新聞, テラス, 禁煙ルーム, ドライクリーニング, バリアフリー, 無料駐車場, インターネット, エレベーター, エクスプレス・チェックイン / チェックアウト, セーフティボックス, 暖房, FAX / コピー, 荷物預かり, Wi-Fi, 無料Wi-Fi, 全館禁煙, エアコン, サンテラス, 自販機(ドリンク類), 自販機(スナック類), 客室清掃サービス(毎日), 敷地内駐車場, 専用駐車場, Wi-Fi(館内全域), 立体駐車場, 障害者用駐車場, 短い説明Gateway Hotel Santa Monicaはビーチから3.2kmの場所に位置し、24時間利用可能なジム、無料Wi-Fi付きのお部屋、無料の地下駐車場を提供しています。'

Save the Vocab objects and trained model

Finally, after the training has finished, we will save the Vocab objects (en_vocab and ja_vocab) first, using Pickle.

import pickle
# open a file, where you want to store the data
file = open('en_vocab.pkl', 'wb')
# dump information to that file
pickle.dump(en_vocab, file)
file.close()
file = open('ja_vocab.pkl', 'wb')
pickle.dump(ja_vocab, file)
file.close()
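
To load the vocabularies back later, for example in a separate inference script, they can be read with pickle.load (a minimal sketch):

# Load the pickled Vocab objects back into memory
with open('en_vocab.pkl', 'rb') as f:
    en_vocab = pickle.load(f)
with open('ja_vocab.pkl', 'rb') as f:
    ja_vocab = pickle.load(f)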

Lastly, we can also save the model for later use with PyTorch's save and load functions. Generally, there are two ways to save the model, depending on what we want to use it for later. The first one is for inference only: we can load the model later and use it to translate from Japanese to Chinese.

# save model for inference
torch.save(transformer.state_dict(), 'inference_model')
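
Loading the weights back for inference could then look like this (a minimal sketch, assuming the model is re-created with the same hyperparameters as above):

# Re-create the architecture and load the saved weights for inference
model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS,
                           EMB_SIZE, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE,
                           FFN_HID_DIM)
model.load_state_dict(torch.load('inference_model', map_location=device))
model = model.to(device)
model.eval()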

The second one is also suitable for inference, but it additionally lets us load the model later and resume training.

# save model + checkpoint to resume training later
torch.save({
  'epoch': NUM_EPOCHS,
  'model_state_dict': transformer.state_dict(),
  'optimizer_state_dict': optimizer.state_dict(),
  'loss': train_loss,
  }, 'model_checkpoint.tar')
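
Resuming training from this checkpoint could then look like this (a minimal sketch, assuming transformer and optimizer are constructed with the same settings as above):

# Restore model and optimizer state from the checkpoint and continue training
checkpoint = torch.load('model_checkpoint.tar', map_location=device)
transformer.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1
last_loss = checkpoint['loss']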

Conclusion

That’s it!

