Implementing Machine Translation with a Transformer (Japanese to Chinese)

1. Project Preparation

  • Requirements analysis: clarify the translation system's target use cases, performance requirements, and expected quality.
  • Resource collection: gather a large Japanese-Chinese parallel corpus, making sure it is of good quality and diverse.

2. Data Preprocessing

  • Data cleaning: remove noise from the corpus, such as HTML tags and characters that are neither Japanese nor Chinese (see the sketch after this list).
  • Tokenization: segment the Japanese and Chinese text, for example with tools such as MeCab and Jieba.
  • Sentence alignment: make sure the Japanese and Chinese sentences are aligned one to one.
  • Vocabulary construction: build separate vocabularies for Japanese and Chinese, covering words, punctuation, and special symbols.
  • Text encoding: convert the text into the numeric IDs the model can process.
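
As a concrete illustration of the cleaning step, here is a minimal sketch that strips HTML tags and drops pairs containing characters outside the Japanese/Chinese/ASCII ranges. The regular expressions and the clean_pairs helper are illustrative assumptions, not part of the tutorial code further below.

import re

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML-tag remover
# keep CJK ideographs, kana, full-width forms, ASCII, and common CJK punctuation (illustrative ranges)
ALLOWED_RE = re.compile(r"^[\u3000-\u30ff\u4e00-\u9fff\uff00-\uffef\u0020-\u007e]+$")

def clean_pairs(ja_sentences, zh_sentences):
    """Return only the (ja, zh) pairs that survive basic cleaning."""
    cleaned = []
    for ja, zh in zip(ja_sentences, zh_sentences):
        ja, zh = TAG_RE.sub("", ja).strip(), TAG_RE.sub("", zh).strip()
        if ja and zh and ALLOWED_RE.match(ja) and ALLOWED_RE.match(zh):
            cleaned.append((ja, zh))
    return cleaned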

3. Model Design

  • Architecture selection: choose a Transformer-based architecture, such as the original Transformer or one of its variants (e.g. BERT, GPT).
  • Encoder and decoder: design multi-layer encoder and decoder stacks built from multi-head self-attention and feed-forward networks.
  • Positional encoding: inject positional information describing where each token sits in the sentence (the standard sinusoidal formula is given after this list).
  • Output layer: design the output layer, typically a softmax that produces a probability distribution over the target vocabulary.
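
For reference, the sinusoidal positional encoding used by the original Transformer (and by the PositionalEncoding module later in this post) is

PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)

where pos is the token position and i indexes the embedding dimensions.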

4. Model Training

  • Loss function: choose an appropriate loss, such as cross-entropy.
  • Optimizer: choose a suitable optimizer, such as Adam.
  • Hyperparameter tuning: adjust the learning rate, batch size, number of epochs, and so on.
  • Training loop: train the model while monitoring the training loss and the performance on a validation set.

5. Model Evaluation

  • Metrics: choose evaluation metrics such as BLEU, NIST, or METEOR (a minimal BLEU example follows this list).
  • Test-set evaluation: evaluate the model on an independent test set.
  • Error analysis: inspect the model's mistakes to see which kinds of sentences or vocabulary it handles poorly.
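
As an example of the metric step, here is a minimal BLEU computation. It assumes the third-party sacrebleu package is installed and that hypotheses and references are lists of detokenized sentence strings; both names and the example sentences are placeholders.

import sacrebleu

hypotheses = ["猫坐在垫子上。"]       # model outputs (placeholder)
references = ["那只猫坐在垫子上。"]   # gold translations (placeholder)
bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh")  # "zh" tokenization for Chinese output
print(bleu.score)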

6. Model Optimization

  • Architecture tuning: adjust the model structure based on the evaluation results, for example adding layers or attention heads.
  • Data augmentation: increase the diversity of the training data through back-translation, random insertion, and similar techniques.
  • Knowledge distillation: distill a larger model into the deployed one to improve its quality.

7. Model Deployment

  • Model export: export the trained model into a format suitable for production use.
  • API development: expose an API so users can reach the translation service from a web page or application (a minimal sketch follows this list).
  • Performance optimization: optimize inference speed and resource consumption so the deployed model runs efficiently.
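
As an illustration of the API step, below is a minimal Flask endpoint wrapping a translate(...) helper like the one defined later in this tutorial. Flask itself, the route name, and the assumption that the model, vocabularies, and tokenizer are already loaded as globals are all illustrative choices, not part of the original code.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/translate", methods=["POST"])
def translate_endpoint():
    # expects JSON like {"text": "日本語の文"}; transformer / ja_vocab / en_vocab / ja_tokenizer are assumed loaded
    text = request.get_json().get("text", "")
    result = translate(transformer, text, ja_vocab, en_vocab, ja_tokenizer)
    return jsonify({"translation": result.strip()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)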

A tutorial using Jupyter Notebook, PyTorch, Torchtext, and SentencePiece

Import required packages

First, let’s make sure the packages below are installed on our system; if any of them are missing, install them before continuing.

import math
import torchtext
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from collections import Counter
from torchtext.vocab import Vocab
from torch.nn import TransformerEncoder, TransformerDecoder, TransformerEncoderLayer, TransformerDecoderLayer
import io
import time
import pandas as pd
import numpy as np
import pickle
import tqdm
import sentencepiece as spm
torch.manual_seed(0)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # if you have a GPU, try running this notebook on your own machine

Get the parallel dataset

In this tutorial, we will use a Japanese-Chinese parallel dataset (zh-ja.bicleaner05.txt) prepared in the style of JParaCrawl (http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl), the project described as the “largest publicly available English-Japanese parallel corpus created by NTT. It was created by largely crawling the web and automatically aligning parallel sentences.” You can also see the paper here. Note that the variable names below (trainen, trainja) keep the en prefix from the original English-Japanese tutorial, even though the non-Japanese side here is Chinese.

df = pd.read_csv('zh-ja.bicleaner05.txt', sep='\\t', engine='python', header=None)
trainen = df[2].values.tolist()#[:10000]
trainja = df[3].values.tolist()#[:10000]
# trainen.pop(5972)
# trainja.pop(5972)

After importing all the Japanese sentences and their counterparts, I deleted the last entry in the dataset because it had a missing value. In total, trainen and trainja each contain 5,973,071 sentences. For learning purposes, however, it is often recommended to sample the data and make sure everything works as intended before using the full dataset at once, to save time.
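
If you only want to verify that the pipeline runs end to end, a small slice is enough, for example (the sample size here is arbitrary):

SAMPLE_SIZE = 10000  # arbitrary; increase once everything works
trainen = trainen[:SAMPLE_SIZE]
trainja = trainja[:SAMPLE_SIZE]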

Here is an example of a sentence pair contained in the dataset.

print(trainen[500])
print(trainja[500])

We can also use a different parallel dataset to follow along with this article; just make sure it can be processed into two lists of strings as shown above, containing the source and target sentences.

Prepare the tokenizers

Unlike English and other languages written with an alphabet, Japanese sentences contain no whitespace to separate the words. We can use the tokenizers provided by JParaCrawl, which were created with SentencePiece for both Japanese and English; you can visit the JParaCrawl website to download them, or click here.

en_tokenizer = spm.SentencePieceProcessor(model_file='spm.en.nopretok.model')
ja_tokenizer = spm.SentencePieceProcessor(model_file='spm.ja.nopretok.model')

After the tokenizers are loaded, you can test them, for example, by executing the code below.

en_tokenizer.encode("All residents aged 20 to 59 years who live in Japan must enroll in public pension system.", out_type='str')

ja_tokenizer.encode("年金 日本に住んでいる20歳~60歳の全ての人は、公的年金制度に加入しなければなりません。", out_type='str')

Build the TorchText Vocab objects and convert the sentences into Torch tensors

Using the tokenizers and the raw sentences, we then build the Vocab object imported from TorchText. This process can take anywhere from a few seconds to a few minutes depending on the size of our dataset and our computing power. The choice of tokenizer can also affect how long it takes to build the vocab; I tried several other tokenizers for Japanese, but SentencePiece worked well and fast enough for me.

def build_vocab(sentences, tokenizer):
  counter = Counter()
  for sentence in sentences:
    counter.update(tokenizer.encode(sentence, out_type=str))
  return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])
ja_vocab = build_vocab(trainja, ja_tokenizer)
en_vocab = build_vocab(trainen, en_tokenizer)

After we have the vocabulary objects, we can then use the vocab and the tokenizer objects to build the tensors for our training data.

def data_process(ja, en):
  data = []
  for (raw_ja, raw_en) in zip(ja, en):
    ja_tensor_ = torch.tensor([ja_vocab[token] for token in ja_tokenizer.encode(raw_ja.rstrip("\n"), out_type=str)],
                            dtype=torch.long)
    en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer.encode(raw_en.rstrip("\n"), out_type=str)],
                            dtype=torch.long)
    data.append((ja_tensor_, en_tensor_))
  return data
train_data = data_process(trainja, trainen)

Create the DataLoader object to be iterated during training

Here, I set BATCH_SIZE to 8 to prevent “cuda out of memory” errors, but the right value depends on various things such as your machine's memory capacity and the size of the data, so feel free to change the batch size according to your needs (note: the PyTorch tutorial sets the batch size to 128 when using the Multi30k German-English dataset).

BATCH_SIZE = 8
PAD_IDX = ja_vocab['<pad>']
BOS_IDX = ja_vocab['<bos>']
EOS_IDX = ja_vocab['<eos>']
def generate_batch(data_batch):
  ja_batch, en_batch = [], []
  for (ja_item, en_item) in data_batch:
    ja_batch.append(torch.cat([torch.tensor([BOS_IDX]), ja_item, torch.tensor([EOS_IDX])], dim=0))
    en_batch.append(torch.cat([torch.tensor([BOS_IDX]), en_item, torch.tensor([EOS_IDX])], dim=0))
  ja_batch = pad_sequence(ja_batch, padding_value=PAD_IDX)
  en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
  return ja_batch, en_batch
train_iter = DataLoader(train_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
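
To confirm that batching works as expected, you can pull a single batch and inspect its shapes; pad_sequence defaults to (sequence_length, batch_size) ordering:

src_batch, tgt_batch = next(iter(train_iter))
print(src_batch.shape, tgt_batch.shape)  # both are (max_sequence_length_in_batch, BATCH_SIZE)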

Sequence-to-sequence Transformer

The next couple of code blocks and text explanations (written in italics) are taken from the original PyTorch tutorial [https://pytorch.org/tutorials/beginner/translation_transformer.html]. I did not make any changes except for BATCH_SIZE and the word de_vocab, which is changed to ja_vocab.

Transformer is a Seq2Seq model introduced in the “Attention is all you need” paper for solving machine translation tasks. The Transformer model consists of an encoder and a decoder block, each containing a fixed number of layers.

The encoder processes the input sequence by propagating it through a series of multi-head attention and feed-forward network layers. The output of the encoder, referred to as memory, is fed to the decoder along with the target tensors. The encoder and decoder are trained end to end using the teacher forcing technique.

from torch.nn import (TransformerEncoder, TransformerDecoder,
                      TransformerEncoderLayer, TransformerDecoderLayer)

# Import the Transformer-related modules from PyTorch; they are used to build the encoder, the decoder, and their layer structures.

class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_encoder_layers: int, num_decoder_layers: int,
                 emb_size: int, src_vocab_size: int, tgt_vocab_size: int,
                 dim_feedforward:int = 512, dropout:float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        # Initialize the Seq2SeqTransformer (a subclass of nn.Module) with the number of encoder/decoder
        # layers, the embedding size, and the source/target vocabulary sizes.

        encoder_layer = TransformerEncoderLayer(d_model=emb_size, nhead=NHEAD,
                                                dim_feedforward=dim_feedforward)
        # Build one encoder layer: d_model is the embedding dimension, nhead the number of attention heads
        # (taken from the global NHEAD defined below), and dim_feedforward the hidden size of the feed-forward sublayer.

        self.transformer_encoder = TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)
        # Stack num_encoder_layers copies of the encoder layer into the encoder.

        decoder_layer = TransformerDecoderLayer(d_model=emb_size, nhead=NHEAD,
                                                dim_feedforward=dim_feedforward)
        # Build one decoder layer with the same parameters as the encoder layer.

        self.transformer_decoder = TransformerDecoder(decoder_layer, num_layers=num_decoder_layers)
        # Stack num_decoder_layers copies of the decoder layer into the decoder.

        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        # Linear layer that maps decoder outputs to scores over the target vocabulary.

        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        # Token embedding layers for the source and target languages.

        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)
        # Positional encoding layer that adds position information to the embeddings.

    def forward(self, src: Tensor, trg: Tensor, src_mask: Tensor,
                tgt_mask: Tensor, src_padding_mask: Tensor,
                tgt_padding_mask: Tensor, memory_key_padding_mask: Tensor):
        # Forward pass.
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        # Embed the source tokens and add positional encoding.

        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        # Embed the target tokens and add positional encoding.

        memory = self.transformer_encoder(src_emb, src_mask, src_padding_mask)
        # Run the source embeddings through the encoder.

        outs = self.transformer_decoder(tgt_emb, memory, tgt_mask, None,
                                        tgt_padding_mask, memory_key_padding_mask)
        # Run the target embeddings and the encoder memory through the decoder.

        return self.generator(outs)
        # Project the decoder output onto the target vocabulary.

    def encode(self, src: Tensor, src_mask: Tensor):
        # Encoding only: run just the encoder.
        return self.transformer_encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        # Decoding only: run just the decoder.
        return self.transformer_decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)

Text tokens are represented using token embeddings. Positional encoding is added to the token embedding to introduce a notion of word order.

class PositionalEncoding(nn.Module):
    def __init__(self, emb_size: int, dropout, maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding +
                            self.pos_embedding[:token_embedding.size(0),:])

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size
    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

We create a subsequent-word mask to stop a target word from attending to the words that follow it. We also create masks for hiding the source and target padding tokens.

def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=device)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

def create_mask(src, tgt):
  src_seq_len = src.shape[0]
  tgt_seq_len = tgt.shape[0]

  tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
  src_mask = torch.zeros((src_seq_len, src_seq_len), device=device).type(torch.bool)

  src_padding_mask = (src == PAD_IDX).transpose(0, 1)
  tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
  return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
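
A quick sanity check shows what the subsequent-word mask looks like for a length-3 sequence: each position may attend to itself and to earlier positions (0.0) but not to later ones (-inf). On a CPU-only machine the output is:

print(generate_square_subsequent_mask(3))
# tensor([[0., -inf, -inf],
#         [0., 0., -inf],
#         [0., 0., 0.]])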

Define the model parameters and instantiate the model. Our server here has very limited compute, so the configuration below will train but is unlikely to produce good results; if you want to see the training pay off, please run this code on your own machine with a GPU.

When you use your own GPU, set NUM_ENCODER_LAYERS and NUM_DECODER_LAYERS to 3 or higher, NHEAD to 8, and EMB_SIZE to 512.

# Define the model hyperparameters
SRC_VOCAB_SIZE = len(ja_vocab)  # size of the source-language (Japanese) vocabulary
TGT_VOCAB_SIZE = len(en_vocab)  # size of the target-language vocabulary
EMB_SIZE = 512  # embedding size
NHEAD = 8  # number of attention heads
FFN_HID_DIM = 512  # hidden size of the feed-forward network
BATCH_SIZE = 16  # training batch size (note: train_iter above was already built with BATCH_SIZE = 8, so this value is not used by the DataLoader)
NUM_ENCODER_LAYERS = 3  # number of encoder layers
NUM_DECODER_LAYERS = 3  # number of decoder layers
NUM_EPOCHS = 16  # number of training epochs

# Instantiate the Seq2SeqTransformer model
transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS,
                                 EMB_SIZE, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE,
                                 FFN_HID_DIM)

# Initialize the model parameters
for p in transformer.parameters():
    if p.dim() > 1:  # use Xavier initialization for parameters with more than one dimension
        nn.init.xavier_uniform_(p)

# Move the model to the selected device (CPU or GPU)
transformer = transformer.to(device)

# Define the loss function, ignoring padding tokens
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Define the optimizer
optimizer = torch.optim.Adam(
    transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9
)

# Function that trains the model for one epoch
def train_epoch(model, train_iter, optimizer):
    model.train()  # put the model in training mode
    losses = 0  # accumulated loss
    for idx, (src, tgt) in enumerate(train_iter):  # iterate over the training batches
        src = src.to(device)  # move the source batch to the device
        tgt = tgt.to(device)  # move the target batch to the device

        tgt_input = tgt[:-1, :]  # decoder input: the target sequence without its last token

        # Build the attention and padding masks
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        # Forward pass through the model
        logits = model(src, tgt_input, src_mask, tgt_mask,
                       src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()  # clear the gradients

        tgt_out = tgt[1:,:]  # expected output: the target sequence without its first token
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))  # compute the loss
        loss.backward()  # backpropagate the gradients

        optimizer.step()  # update the model parameters
        losses += loss.item()  # accumulate the loss
    return losses / len(train_iter)  # return the average loss

# Function that evaluates the model on a validation set
def evaluate(model, val_iter):
    model.eval()  # put the model in evaluation mode
    losses = 0  # accumulated loss
    for idx, (src, tgt) in enumerate(val_iter):  # iterate over the validation batches
        src = src.to(device)  # move the source batch to the device
        tgt = tgt.to(device)  # move the target batch to the device

        tgt_input = tgt[:-1, :]  # decoder input: the target sequence without its last token

        # Build the attention and padding masks
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        # Forward pass through the model
        logits = model(src, tgt_input, src_mask, tgt_mask,
                       src_padding_mask, tgt_padding_mask, src_padding_mask)
        tgt_out = tgt[1:,:]  # expected output: the target sequence without its first token
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))  # compute the loss
        losses += loss.item()  # accumulate the loss
    return losses / len(val_iter)  # return the average loss
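
Note that evaluate expects a validation DataLoader, which this tutorial never builds. A minimal way to get one is to hold out a slice of train_data before train_iter is created; the 1% split below is an arbitrary choice.

val_size = max(1, len(train_data) // 100)  # hold out roughly 1% of the pairs (arbitrary)
val_data = train_data[-val_size:]
train_data = train_data[:-val_size]        # do this before building train_iter so the held-out pairs stay unseen
val_iter = DataLoader(val_data, batch_size=BATCH_SIZE,
                      shuffle=False, collate_fn=generate_batch)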

Start training

Finally, after preparing the necessary classes and functions, we are ready to train our model. It goes without saying that the time needed to finish training can vary greatly depending on many things, such as computing power, hyperparameters, and the size of the dataset.

When I trained the model using the complete list of sentences from JParaCrawl which has around 5.9 million sentences for each language, it took around 5 hours per epoch using a single NVIDIA GeForce RTX 3070 GPU.

Here is the code:

for epoch in tqdm.tqdm(range(1, NUM_EPOCHS+1)):
  start_time = time.time()
  train_loss = train_epoch(transformer, train_iter, optimizer)
  end_time = time.time()
  print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, "
          f"Epoch time = {(end_time - start_time):.3f}s"))

Try translating a Japanese sentence using the trained model

First, we create the functions needed to translate a new sentence: take the Japanese sentence, tokenize it, convert it to tensors, run inference, and then decode the result back into a sentence, this time in the target language.

# greedy_decode generates the translated sequence using a greedy decoding strategy
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    # move the source sequence and source mask to the selected device (GPU or CPU)
    src = src.to(device)
    src_mask = src_mask.to(device)
    # encode the source sequence to obtain the memory representation
    memory = model.encode(src, src_mask)
    # initialize the output sequence as a 1x1 tensor holding the start symbol, as long integers
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)
    # iterate at most max_len-1 times, adding one new token per step
    for i in range(max_len-1):
        # make sure the memory is on the right device
        memory = memory.to(device)
        # memory mask for the attention mechanism (created here but not actually passed to model.decode)
        memory_mask = torch.zeros(ys.shape[0], memory.shape[0]).to(device).type(torch.bool)
        # target mask that prevents the decoder from attending to future positions
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                                    .type(torch.bool)).to(device)
        # decode the current output sequence
        out = model.decode(ys, memory, tgt_mask)
        # transpose the output for easier handling
        out = out.transpose(0, 1)
        # project the last position through the generator to get a probability distribution over the vocabulary
        prob = model.generator(out[:, -1])
        # pick the most probable token as the next word
        _, next_word = torch.max(prob, dim=1)
        # convert the next token to a Python scalar
        next_word = next_word.item()
        # append the next token to the output sequence
        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        # stop as soon as the end-of-sequence symbol is produced
        if next_word == EOS_IDX:
            break
    # return the decoded output sequence
    return ys

# translate converts a source-language sentence into a target-language sentence
def translate(model, src, src_vocab, tgt_vocab, src_tokenizer):
    # put the model in evaluation mode
    model.eval()
    # convert the source sentence into a list of vocabulary indices, adding the begin and end symbols
    tokens = [BOS_IDX] + [src_vocab.stoi[tok] for tok in src_tokenizer.encode(src, out_type=str)] + [EOS_IDX]
    # number of tokens in the source sentence
    num_tokens = len(tokens)
    # turn the index list into a (num_tokens, 1) tensor
    src = (torch.LongTensor(tokens).reshape(num_tokens, 1))
    # source mask for the attention mechanism
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    # generate the translated token indices with greedy decoding
    tgt_tokens = greedy_decode(model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    # convert the token indices back into a sentence and strip the begin and end symbols
    return " ".join([tgt_vocab.itos[tok] for tok in tgt_tokens]).replace("<bos>", "").replace("<eos>", "")
 

Then, we can just call the translate function and pass the required parameters.
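
For example (the Japanese sentence below is just a placeholder; any sentence will do):

print(translate(transformer, "私は毎朝コーヒーを飲みます。", ja_vocab, en_vocab, ja_tokenizer))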

Save the Vocab objects and trained model

Finally, after the training has finished, we will save the Vocab objects (en_vocab and ja_vocab) first, using Pickle.

import pickle
# open a file, where you want to store the data
file = open('en_vocab.pkl', 'wb')
# dump information to that file
pickle.dump(en_vocab, file)
file.close()
file = open('ja_vocab.pkl', 'wb')
pickle.dump(ja_vocab, file)
file.close()
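
To load them back later, open the files in binary read mode:

with open('en_vocab.pkl', 'rb') as f:
    en_vocab = pickle.load(f)
with open('ja_vocab.pkl', 'rb') as f:
    ja_vocab = pickle.load(f)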

Lastly, we can also save the model for later use with PyTorch's save and load functions. In general, there are two ways to save the model, depending on what we want to use it for later. The first is for inference only: we can load the model later and use it to translate new Japanese sentences.

# save model for inference
torch.save(transformer.state_dict(), 'inference_model')
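
To load it back for inference later, re-create the model with the same hyperparameters and restore the saved weights (a sketch, assuming the definitions above are available):

model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS,
                           EMB_SIZE, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
model.load_state_dict(torch.load('inference_model', map_location=device))
model = model.to(device)
model.eval()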

The second also supports inference, but is intended for when we want to load the model later and resume training.

# save model + checkpoint to resume training later
torch.save({
  'epoch': NUM_EPOCHS,
  'model_state_dict': transformer.state_dict(),
  'optimizer_state_dict': optimizer.state_dict(),
  'loss': train_loss,
  }, 'model_checkpoint.tar')
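
To resume training from this checkpoint later, restore both the model and the optimizer state (again a sketch, assuming the same model and optimizer definitions):

checkpoint = torch.load('model_checkpoint.tar', map_location=device)
transformer.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1  # continue from the next epoch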

Conclusion

That’s it!
