Datawhale 2024 AI Summer Camp, Phase 2, NLP Track: Task02
Machine Translation with a Seq2Seq Model
Abstract
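This experiment uses a sequence-to-sequence (Seq2Seq) neural network to translate English into Chinese. It is implemented in PyTorch, with torchtext, jieba, and sacrebleu used for data processing, model training, and evaluation. Over several training epochs the model's results on the development set improved, and the trained model was then used to translate the test set.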
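1. Introduction
Machine translation is an important research area in natural language processing: it aims to translate text from one language into another automatically. With the development of deep learning, neural machine translation models have made significant progress. This experiment uses a Seq2Seq model, a classic neural network architecture that is widely used for machine translation.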
2. Environment Setup
Before the experiment starts, the required software and libraries must be installed and configured. The steps are:
- Create the model and results directories:
    !mkdir ../model
    !mkdir ../results
- Install the required Python libraries:
    !pip install torchtext
    !pip install jieba
    !pip install sacrebleu
- Install and configure spaCy, the English tokenizer:
    !pip install -U pip setuptools wheel -i https://pypi.tuna.tsinghua.edu.cn/simple
    !pip install -U 'spacy[cuda12x]' -i https://pypi.tuna.tsinghua.edu.cn/simple
    !pip install ../dataset/en_core_web_trf-3.7.3-py3-none-any.whl
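Re-running `!mkdir` on a directory that already exists prints an error in the notebook. As a minimal sketch of an idempotent alternative using only the standard library (the helper name `ensure_dirs` is hypothetical and not part of the original notes):

```python
from pathlib import Path


def ensure_dirs(*paths):
    """Create each directory (and any missing parents); do nothing if it exists."""
    created = []
    for p in paths:
        path = Path(p)
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Usage in the notebook would look like:
# ensure_dirs("../model", "../results")
```

Because `exist_ok=True` is set, the cell can be run repeatedly without raising an error.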
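3. Data Preprocessing
Data preprocessing is one of the key steps in a machine translation task. The steps are:
- Define the English and Chinese tokenizers:
    from typing import List, Tuple

    import jieba
    from torchtext.data.utils import get_tokenizer

    # English: spaCy tokenizer; Chinese: jieba word segmentation
    en_tokenizer = get_tokenizer('spacy', language='en_core_web_trf')
    zh_tokenizer = lambda x: list(jieba.cut(x))
- Read and preprocess the data:
    # MAX_LENGTH (maximum number of tokens kept per sentence) is assumed
    # to be defined earlier in the notebook.

    def read_data(file_path: str) -> List[str]:
        # One sentence per line, with surrounding whitespace removed
        with open(file_path, 'r', encoding='utf-8') as f:
            return [line.strip() for line in f]

    def preprocess_data(en_data: List[str], zh_data: List[str]) -> List[Tuple[List[str], List[str]]]:
        processed_data = []
        for en, zh in zip(en_data, zh_data):
            # Lowercase the English side, tokenize, and truncate to MAX_LENGTH tokens
            en_tokens = en_tokenizer(en.lower())[:MAX_LENGTH]
            zh_tokens = zh_tokenizer(zh)[:MAX_LENGTH]
            if en_tokens and zh_tokens:  # keep the pair only if both sequences are non-empty
                processed_data.append((en_tokens, zh_tokens))
        return processed_data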
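The token lists produced above still have to be mapped to integer indices before they can be fed to an embedding layer, a step these notes do not show. The following is a minimal sketch of one way to do it with the standard library; the special-token names and the helpers `build_vocab` and `encode` are assumptions for illustration, not the original implementation:

```python
from collections import Counter
from typing import Dict, List

# Special tokens; these names are assumptions, not taken from the original notes.
SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"]


def build_vocab(token_lists: List[List[str]], min_freq: int = 1) -> Dict[str, int]:
    """Map every token seen at least min_freq times to an integer index."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {tok: idx for idx, tok in enumerate(SPECIALS)}
    for tok, freq in counts.most_common():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Wrap a token list in <bos>/<eos> and replace unknown tokens with <unk>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokens]
    return [vocab["<bos>"]] + ids + [vocab["<eos>"]]


# Toy data in the same (en_tokens, zh_tokens) shape that preprocess_data returns
pairs = [(["i", "like", "tea"], ["我", "喜欢", "茶"]),
         (["i", "like", "rice"], ["我", "喜欢", "米饭"])]
en_vocab = build_vocab([en for en, _ in pairs])
print(encode(["i", "like", "coffee"], en_vocab))  # [2, 4, 5, 1, 3]
```

"coffee" never appears in the toy data, so it is encoded as the `<unk>` index 1.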
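4. Model Construction
The experiment uses a Seq2Seq model with two main parts: an encoder and a decoder. The steps are:
- Define the encoder, the decoder, and the Seq2Seq wrapper. The decoder receives an attention module as a constructor argument; the attention class itself is not included in this listing: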
    import random

    import torch
    import torch.nn as nn


    class Encoder(nn.Module):
        def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
            super().__init__()
            self.hid_dim = hid_dim
            self.n_layers = n_layers
            self.embedding = nn.Embedding(input_dim, emb_dim)
            self.gru = nn.GRU(emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)
            self.dropout = nn.Dropout(dropout)

        def forward(self, src):
            # src: [batch, src_len]
            embedded = self.dropout(self.embedding(src))
            # outputs: [batch, src_len, hid_dim]; hidden: [n_layers, batch, hid_dim]
            outputs, hidden = self.gru(embedded)
            return outputs, hidden


    class Decoder(nn.Module):
        def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout, attention):
            super().__init__()
            self.output_dim = output_dim
            self.hid_dim = hid_dim
            self.n_layers = n_layers
            self.attention = attention
            self.embedding = nn.Embedding(output_dim, emb_dim)
            # The GRU input is the embedded token concatenated with the context vector
            self.gru = nn.GRU(hid_dim + emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)
            self.fc_out = nn.Linear(hid_dim * 2 + emb_dim, output_dim)
            self.dropout = nn.Dropout(dropout)

        def forward(self, input, hidden, encoder_outputs):
            # input: [batch] -> [batch, 1], one decoding step at a time
            input = input.unsqueeze(1)
            embedded = self.dropout(self.embedding(input))
            # Attention weights from the top-layer hidden state: [batch, src_len]
            a = self.attention(hidden[-1:], encoder_outputs)
            a = a.unsqueeze(1)
            # Weighted sum of encoder outputs (context vector): [batch, 1, hid_dim]
            weighted = torch.bmm(a, encoder_outputs)
            rnn_input = torch.cat((embedded, weighted), dim=2)
            output, hidden = self.gru(rnn_input, hidden)
            embedded = embedded.squeeze(1)
            output = output.squeeze(1)
            weighted = weighted.squeeze(1)
            prediction = self.fc_out(torch.cat((output, weighted, embedded), dim=1))
            return prediction, hidden


    class Seq2Seq(nn.Module):
        def __init__(self, encoder, decoder, device):
            super().__init__()
            self.encoder = encoder
            self.decoder = decoder
            self.device = device

        def forward(self, src, trg, teacher_forcing_ratio=0.5):
            batch_size = src.shape[0]
            trg_len = trg.shape[1]
            trg_vocab_size = self.decoder.output_dim
            # Position 0 is never predicted and stays zero
            outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)
            encoder_outputs, hidden = self.encoder(src)
            input = trg[:, 0]  # first target token of each sequence
            for t in range(1, trg_len):
                output, hidden = self.decoder(input, hidden, encoder_outputs)
                outputs[:, t] = output
                # Teacher forcing: feed the ground-truth token with the given probability,
                # otherwise feed the model's own most likely prediction
                teacher_force = random.random() < teacher_forcing_ratio
                top1 = output.argmax(1)
                input = trg[:, t] if teacher_force else top1
            return outputs
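The decoder above calls `self.attention(hidden[-1:], encoder_outputs)` and expects a tensor of shape `[batch, src_len]` back, but no attention module is defined in these notes. A minimal sketch of an additive attention layer that satisfies that interface is shown below; the layer names and internal sizes are assumptions, not the original code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    """Additive attention over encoder outputs (an illustrative sketch)."""

    def __init__(self, hid_dim: int):
        super().__init__()
        self.attn = nn.Linear(hid_dim * 2, hid_dim)
        self.v = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden: [1, batch, hid_dim]; encoder_outputs: [batch, src_len, hid_dim]
        src_len = encoder_outputs.shape[1]
        # Repeat the decoder state once per source position: [batch, src_len, hid_dim]
        hidden = hidden.repeat(src_len, 1, 1).transpose(0, 1)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        scores = self.v(energy).squeeze(2)  # [batch, src_len]
        return F.softmax(scores, dim=1)     # weights sum to 1 for each sentence


attention = Attention(hid_dim=8)
weights = attention(torch.zeros(1, 2, 8), torch.randn(2, 5, 8))
print(weights.shape)  # torch.Size([2, 5])
```

The softmax over the source dimension is what lets `torch.bmm(a, encoder_outputs)` in the decoder act as a weighted average of encoder states.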
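5. Training
Training consists of defining an optimizer, a training function, and an evaluation function. The steps are:
- Define the optimizer:
    import torch.optim as optim

    def initialize_optimizer(model, learning_rate=0.001):
        return optim.Adam(model.parameters(), lr=learning_rate)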
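- Training function:
    from torch.nn.utils import clip_grad_norm_

    # DEVICE is assumed to be defined earlier in the notebook,
    # e.g. torch.device('cuda' if torch.cuda.is_available() else 'cpu').

    def train(model, iterator, optimizer, criterion, clip):
        model.train()
        epoch_loss = 0
        n_batches = 0  # count only batches that were actually processed
        for i, batch in enumerate(iterator):
            src, trg = batch
            if src.numel() == 0 or trg.numel() == 0:
                continue  # skip empty batches
            src, trg = src.to(DEVICE), trg.to(DEVICE)
            optimizer.zero_grad()
            output = model(src, trg)
            output_dim = output.shape[-1]
            # Drop position 0 (the start token) and flatten for the loss
            output = output[:, 1:].contiguous().view(-1, output_dim)
            trg = trg[:, 1:].contiguous().view(-1)
            loss = criterion(output, trg)
            loss.backward()
            clip_grad_norm_(model.parameters(), clip)  # gradient clipping
            optimizer.step()
            epoch_loss += loss.item()
            n_batches += 1
        # Average over processed batches so skipped ones do not deflate the loss
        avg_loss = epoch_loss / max(n_batches, 1)
        print(f"Average loss for this epoch: {avg_loss}")
        return avg_loss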
- Evaluation function:
    import torch

    def evaluate(model, iterator, criterion):
        model.eval()
        epoch_loss = 0
        n_batches = 0  # count only batches that were actually processed
        with torch.no_grad():
            for i, batch in enumerate(iterator):
                src, trg = batch
                if src.numel() == 0 or trg.numel() == 0:
                    continue  # skip empty batches
                src, trg = src.to(DEVICE), trg.to(DEVICE)
                output = model(src, trg, 0)  # turn off teacher forcing
                output_dim = output.shape[-1]
                output = output[:, 1:].contiguous().view(-1, output_dim)
                trg = trg[:, 1:].contiguous().view(-1)
                loss = criterion(output, trg)
                epoch_loss += loss.item()
                n_batches += 1
        return epoch_loss / max(n_batches, 1)
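Both functions take a `criterion` that these notes never define. A common choice for this kind of model is cross-entropy that skips padded target positions; the sketch below assumes that choice, and `PAD_IDX` is a hypothetical padding index rather than a value taken from the original code:

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # hypothetical index of the padding token in the target vocabulary

# ignore_index excludes padded target positions from the averaged loss
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Shapes match the flattened tensors in train/evaluate:
# logits [batch * (trg_len - 1), output_dim], targets [batch * (trg_len - 1)]
logits = torch.randn(4, 10)
targets = torch.tensor([3, 7, PAD_IDX, PAD_IDX])
loss = criterion(logits, targets)

# The same value computed from the two non-padded positions only
reference = nn.functional.cross_entropy(logits[:2], targets[:2])
print(torch.isclose(loss, reference).item())  # True
```

Without `ignore_index`, padded positions would be scored like real tokens and would distort the reported loss for batches with many short sentences.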
6. Results and Analysis
During training, the model's loss on the development set decreased steadily. Over the first two epochs the training loss fell from 11.163 to 9.532 and the validation loss from 10.523 to 8.691:
- Training and validation loss:
    Average loss for this epoch: 11.16251309712728
    Epoch: 01 | Time: 0m 9s
        Train Loss: 11.163 | Train PPL: 70439.756
        Val. Loss: 10.523 | Val. PPL: 37165.378
    Average loss for this epoch: 9.531808853149414
    Epoch: 02 | Time: 0m 8s
        Train Loss: 9.532 | Train PPL: 13791.515
        Val. Loss: 8.691 | Val. PPL: 5948.359
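The PPL columns in this log are consistent with perplexity being the exponential of the average cross-entropy loss. A minimal sketch of how such log fields can be derived follows; the helper names `perplexity` and `epoch_time` are illustrative and not taken from the original code:

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is exp(average cross-entropy loss)."""
    return math.exp(loss)


def epoch_time(start: float, end: float):
    """Split an elapsed time in seconds into whole minutes and seconds."""
    elapsed = end - start
    mins = int(elapsed // 60)
    secs = int(elapsed - mins * 60)
    return mins, secs


train_loss = 11.16251309712728  # first-epoch training loss reported in the log
mins, secs = epoch_time(0.0, 9.4)
print(f"Time: {mins}m {secs}s | Train Loss: {train_loss:.3f} | Train PPL: {perplexity(train_loss):.3f}")
```

Because perplexity grows exponentially with loss, the drop from about 70,440 to about 13,792 corresponds to a loss reduction of only about 1.6.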
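7. Conclusion
In this experiment we trained a machine translation system based on a Seq2Seq model. The model's results on the development set improved as the number of training epochs increased, and the trained model was then used to translate the test set. Future work could include refining the model architecture, tuning hyperparameters, or training on a larger dataset to improve translation quality.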
These are the study and experiment notes for "Datawhale 2024 AI Summer Camp, Phase 2, NLP Track, Task02".