Datawhale 2024 AI Summer Camp, Session 2, NLP Track: Task02

Article link: https://blog.csdn.net/weixin_62528564/article/details/140506618

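Machine Translation with a Seq2Seq Model

Abstract

This experiment uses a sequence-to-sequence (Seq2Seq) neural network to translate English into Chinese. It is built on the PyTorch framework, together with libraries such as torchtext, jieba, and sacrebleu for data processing, model training, and evaluation. Over several training epochs the model's loss on the development set decreased, and the trained model was then used to translate the test set.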

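1. Introduction

Machine translation is an important research area in natural language processing that aims to automatically translate text from one language into another. With the development of deep learning, neural machine translation models have made significant progress. This experiment uses a Seq2Seq model, a classic neural architecture widely used for machine translation.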

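2. Environment Setup

Before starting the experiment, install and configure the required software and libraries. The setup steps are as follows:

Create the model and results directories: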
```
!mkdir -p ../model
!mkdir -p ../results
```

Install the required Python libraries:

!pip install torchtext
!pip install jieba
!pip install sacrebleu

Install and configure the spaCy English tokenizer:

!pip install -U pip setuptools wheel -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install -U 'spacy[cuda12x]' -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install ../dataset/en_core_web_trf-3.7.3-py3-none-any.whl

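3. Data Preprocessing

Data preprocessing is one of the key steps in a machine translation task. The steps are as follows:

Define the English and Chinese tokenizers: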

import jieba
from torchtext.data.utils import get_tokenizer
en_tokenizer = get_tokenizer('spacy', language='en_core_web_trf')
zh_tokenizer = lambda x: list(jieba.cut(x))  # tokenize Chinese text with jieba

Read and preprocess the data:

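from typing import List, Tuple

# MAX_LENGTH (the maximum number of tokens kept per sentence) must be defined before preprocess_data is called.

def read_data(file_path: str) -> List[str]:
    # Read one sentence per line, stripping surrounding whitespace.
    with open(file_path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

def preprocess_data(en_data: List[str], zh_data: List[str]) -> List[Tuple[List[str], List[str]]]:
    processed_data = []
    for en, zh in zip(en_data, zh_data):
        # Lowercase the English side, tokenize both sides, and truncate to MAX_LENGTH tokens.
        en_tokens = en_tokenizer(en.lower())[:MAX_LENGTH]
        zh_tokens = zh_tokenizer(zh)[:MAX_LENGTH]
        if en_tokens and zh_tokens:  # keep the pair only if both sequences are non-empty
            processed_data.append((en_tokens, zh_tokens))
    return processed_data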
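
To make the truncation and filtering in `preprocess_data` concrete, here is a minimal sketch that swaps in a whitespace tokenizer and a per-character tokenizer as stand-ins for spaCy and jieba, so it runs without any model downloads. The names `toy_en_tokenizer`, `toy_zh_tokenizer`, and `preprocess_pairs`, as well as the value `MAX_LENGTH = 4`, are assumptions made for illustration only and are not taken from the experiment.

```python
from typing import Callable, List, Tuple

MAX_LENGTH = 4  # assumed toy value; the experiment's real limit is not shown

def toy_en_tokenizer(text: str) -> List[str]:
    # Stand-in for the spaCy tokenizer: split on whitespace.
    return text.split()

def toy_zh_tokenizer(text: str) -> List[str]:
    # Stand-in for jieba: treat every non-space character as one token.
    return [ch for ch in text if not ch.isspace()]

def preprocess_pairs(
    en_data: List[str],
    zh_data: List[str],
    en_tok: Callable[[str], List[str]],
    zh_tok: Callable[[str], List[str]],
    max_length: int,
) -> List[Tuple[List[str], List[str]]]:
    pairs = []
    for en, zh in zip(en_data, zh_data):
        en_tokens = en_tok(en.lower())[:max_length]  # lowercase, tokenize, truncate
        zh_tokens = zh_tok(zh)[:max_length]
        if en_tokens and zh_tokens:  # drop pairs where either side is empty
            pairs.append((en_tokens, zh_tokens))
    return pairs

en_lines = ["Hello world", "", "This sentence is longer than four tokens"]
zh_lines = ["你好世界", "空行", "这句话比较长"]
pairs = preprocess_pairs(en_lines, zh_lines, toy_en_tokenizer, toy_zh_tokenizer, MAX_LENGTH)
print(pairs)
```

The second pair is dropped because its English side is empty, and the third pair is cut to four tokens on each side.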

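4. Model Construction

This experiment uses a Seq2Seq model, which consists of two main parts: an encoder (Encoder) and a decoder (Decoder). The model is built as follows:

Define the encoder, the attention-based decoder, and the Seq2Seq wrapper that ties them together: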

import random

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_layers = n_layers

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.gru(embedded)
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.attention = attention

        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.gru = nn.GRU(hid_dim + emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)
        self.fc_out = nn.Linear(hid_dim * 2 + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, encoder_outputs):
        input = input.unsqueeze(1)
        embedded = self.dropout(self.embedding(input))
        a = self.attention(hidden[-1:], encoder_outputs)
        a = a.unsqueeze(1)
        weighted = torch.bmm(a, encoder_outputs)
        rnn_input = torch.cat((embedded, weighted), dim=2)
        output, hidden = self.gru(rnn_input, hidden)
        embedded = embedded.squeeze(1)
        output = output.squeeze(1)
        weighted = weighted.squeeze(1)
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim=1))
        return prediction, hidden

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.output_dim

        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)
        encoder_outputs, hidden = self.encoder(src)

        input = trg[:, 0]
        for t in range(1, trg_len):
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            outputs[:, t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[:, t] if teacher_force else top1

        return outputs
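
The `Decoder` above calls `self.attention(hidden[-1:], encoder_outputs)`, but the attention module itself is not listed. A minimal sketch of an additive (Bahdanau-style) attention layer that matches the shapes the decoder expects could look like the following. The class name `Attention`, its `hid_dim` argument, and the internal layers `attn` and `v` are assumptions, not the author's confirmed code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self, hid_dim):
        super().__init__()
        # Scores each (decoder state, encoder state) pair.
        self.attn = nn.Linear(hid_dim * 2, hid_dim)
        self.v = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden:          [1, batch, hid_dim]  (top-layer decoder state)
        # encoder_outputs: [batch, src_len, hid_dim]
        src_len = encoder_outputs.shape[1]
        # Repeat the decoder state along the source axis: [batch, src_len, hid_dim]
        hidden = hidden.transpose(0, 1).expand(-1, src_len, -1)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        scores = self.v(energy).squeeze(2)  # [batch, src_len]
        return F.softmax(scores, dim=1)     # weights over source positions

torch.manual_seed(0)
attention = Attention(hid_dim=8)
hidden = torch.randn(1, 3, 8)            # batch of 3
encoder_outputs = torch.randn(3, 5, 8)   # 5 source positions
weights = attention(hidden, encoder_outputs)
# Same weighted sum the decoder computes with torch.bmm.
context = torch.bmm(weights.unsqueeze(1), encoder_outputs)
print(weights.shape, context.shape)
```

Each row of `weights` sums to 1, so `context` is a convex combination of the encoder outputs, which is what the decoder concatenates with the embedded input token.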

5. Training

Training involves defining an optimizer, a training function, and an evaluation function. The steps are as follows:

Define the optimizer:

import torch.optim as optim

def initialize_optimizer(model, learning_rate=0.001):
    return optim.Adam(model.parameters(), lr=learning_rate)

Training function:

from torch.nn.utils import clip_grad_norm_

def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0

    for i, batch in enumerate(iterator):
        src, trg = batch
        if src.numel() == 0 or trg.numel() == 0:
            continue  # skip empty batches

        src, trg = src.to(DEVICE), trg.to(DEVICE)

        optimizer.zero_grad()
        output = model(src, trg)
        output_dim = output.shape[-1]
        output = output[:, 1:].contiguous().view(-1, output_dim)
        trg = trg[:, 1:].contiguous().view(-1)

        loss = criterion(output, trg)
        loss.backward()

        clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        epoch_loss += loss.item()

    print(f"Average loss for this epoch: {epoch_loss / len(iterator)}")
    return epoch_loss / len(iterator)
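
The training loop drops the first time step and flattens both tensors before computing the loss, because `outputs[:, 0]` is never filled by `Seq2Seq.forward` (decoding starts at `t = 1` from the first target token). The sketch below shows the shape bookkeeping on random tensors. `PAD_IDX = 0` and the use of `nn.CrossEntropyLoss(ignore_index=PAD_IDX)` as `criterion` are assumptions, since the post does not show how the criterion is built.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed padding index; not shown in the post
batch_size, trg_len, vocab_size = 2, 5, 7

torch.manual_seed(0)
output = torch.randn(batch_size, trg_len, vocab_size)       # stand-in for model logits
trg = torch.randint(1, vocab_size, (batch_size, trg_len))   # stand-in for target ids
trg[1, 3:] = PAD_IDX  # pretend the second sentence is shorter and padded

# Skip position 0, then merge the batch and time axes.
flat_output = output[:, 1:].contiguous().view(-1, vocab_size)  # [batch * (trg_len - 1), vocab]
flat_trg = trg[:, 1:].contiguous().view(-1)                    # [batch * (trg_len - 1)]

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)  # padded positions add no loss
loss = criterion(flat_output, flat_trg)
print(flat_output.shape, flat_trg.shape, loss.item())
```

With `ignore_index` set, the loss is averaged only over the non-padding target positions, so short sentences are not penalized for their padding.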

Evaluation function:

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src, trg = batch
            if src.numel() == 0 or trg.numel() == 0:
                continue  # skip empty batches

            src, trg = src.to(DEVICE), trg.to(DEVICE)

            output = model(src, trg, 0)  # turn off teacher forcing
            output_dim = output.shape[-1]
            output = output[:, 1:].contiguous().view(-1, output_dim)
            trg = trg[:, 1:].contiguous().view(-1)

            loss = criterion(output, trg)
            epoch_loss += loss.item()

    return epoch_loss / len(iterator)
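
`evaluate` passes `0` as the third argument so the decoder always feeds back its own prediction, while `train` relies on the default ratio of 0.5 in `Seq2Seq.forward`. The per-step choice can be isolated as a small function. `next_decoder_input` is a hypothetical helper introduced here for illustration; it is not part of the original code.

```python
import random

def next_decoder_input(gold_token, predicted_token, teacher_forcing_ratio):
    """Choose what the decoder sees at the next time step."""
    # random.random() is in [0.0, 1.0), so a ratio of 0 never picks the gold
    # token and a ratio of 1.0 always picks it.
    use_gold = random.random() < teacher_forcing_ratio
    return gold_token if use_gold else predicted_token

# Evaluation setting: always feed back the model's own prediction.
eval_inputs = [next_decoder_input("gold", "pred", 0.0) for _ in range(100)]
# Full teacher forcing: always feed the reference token.
forced_inputs = [next_decoder_input("gold", "pred", 1.0) for _ in range(100)]
# Training default: a random mix of the two.
mixed_inputs = [next_decoder_input("gold", "pred", 0.5) for _ in range(100)]
print(set(eval_inputs), set(forced_inputs))
```

Turning teacher forcing off during evaluation measures the loss under the same conditions the model faces at inference time, where no reference translation is available.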

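6. Results and Analysis

During training, the model's performance on the development set gradually improved: over the two logged epochs, validation loss fell from 10.523 to 8.691. Some key results from training are shown below:

Training and validation loss: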

Average loss for this epoch: 11.16251309712728
Epoch: 01 | Time: 0m 9s
    Train Loss: 11.163 | Train PPL: 70439.756
     Val. Loss: 10.523 |  Val. PPL: 37165.378
Average loss for this epoch: 9.531808853149414
Epoch: 02 | Time: 0m 8s
    Train Loss: 9.532 | Train PPL: 13791.515
     Val. Loss: 8.691 |  Val. PPL: 5948.359
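
The PPL columns in this log are consistent with perplexity being the exponential of the average cross-entropy loss. A short sketch, assuming that is how the training script derives them, with `perplexity` and `epoch_time` as hypothetical helper names:

```python
import math

def perplexity(avg_loss: float) -> float:
    # Perplexity is exp(average cross-entropy loss in nats).
    return math.exp(avg_loss)

def epoch_time(start: float, end: float):
    # Split an elapsed wall-clock interval into whole minutes and seconds.
    elapsed = end - start
    minutes = int(elapsed // 60)
    seconds = int(elapsed - minutes * 60)
    return minutes, seconds

train_loss = 11.16251309712728  # first-epoch average loss from the log
mins, secs = epoch_time(0.0, 9.4)  # 9.4 s is an assumed elapsed time
line = f"Epoch: {1:02} | Time: {mins}m {secs}s"
summary = f"Train Loss: {train_loss:.3f} | Train PPL: {perplexity(train_loss):.3f}"
print(line)
print(summary)
```

Because perplexity grows exponentially with loss, the drop in validation loss from 10.523 to 8.691 corresponds to a roughly sixfold drop in validation perplexity.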