PyTorch Text: 05. Translation with a Sequence2Sequence Network and Attention

智云研

Article link: https://blog.csdn.net/aizhushou/article/details/108376779

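This tutorial walks through implementing a sequence-to-sequence (Seq2Seq) network with an attention mechanism in PyTorch for text translation. It covers data preprocessing, model construction, training, and attention visualization, and aims to help readers understand how a neural network translates French into English.

Summary generated automatically by CSDN.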

In this project we will teach a neural network to translate from French to English.

[KEY: > input, = target, < output]
> il est en train de peindre un tableau .
= he is painting a picture .
< he is painting a picture .
> pourquoi ne pas essayer ce vin delicieux ?
= why not try that delicious wine ?
< why not try that delicious wine ?
> elle n est pas poete mais romanciere .
= she is not a poet but a novelist .
< she not not a poet but a novelist .
> vous etes trop maigre .
= you re too skinny .
< you re all alone .

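… to varying degrees of success.

This is made possible by the sequence-to-sequence network, in which two recurrent neural networks work together to transform one sequence into another. An encoder network condenses the input sequence into a vector, and a decoder network unfolds that vector into a new sequence.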
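The encoder/decoder split described above can be sketched in a few lines. This is only a shape-level illustration under assumed toy settings: the sizes vocab_in, vocab_out, and hidden_size are made up, the weights are untrained, and greedy argmax decoding is used purely for brevity. It is not the model built later in the tutorial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_in, vocab_out, hidden_size = 10, 12, 8  # hypothetical toy sizes

# Encoder: embeds source tokens and runs them through a GRU.
embed_in = nn.Embedding(vocab_in, hidden_size)
encoder = nn.GRU(hidden_size, hidden_size)

# Decoder: embeds the previous output token, updates its hidden state,
# and projects that state onto the target vocabulary.
embed_out = nn.Embedding(vocab_out, hidden_size)
decoder = nn.GRU(hidden_size, hidden_size)
project = nn.Linear(hidden_size, vocab_out)

# A source "sentence" of 4 token indices, shaped (seq_len, batch=1).
src = torch.tensor([[4], [5], [6], [1]])

# The encoder's final hidden state is the single vector that
# summarizes the whole input sequence.
_, context = encoder(embed_in(src))  # shape (1, 1, hidden_size)

# The decoder starts from SOS (index 0) and unfolds the context
# vector one token at a time.
token = torch.tensor([[0]])
hidden = context
generated = []
for _ in range(5):
    out, hidden = decoder(embed_out(token), hidden)
    token = project(out[0]).argmax(dim=1, keepdim=True)
    generated.append(token.item())

print(context.shape, generated)
```

Because the weights are random, the generated indices are meaningless; the point is only that the whole input passes through one fixed-size vector before any output is produced.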

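Recommended reading

This tutorial assumes you have already installed PyTorch, are comfortable with Python, and understand tensors:

https://pytorch.org/ for PyTorch installation instructions

Deep Learning with PyTorch: A 60 Minute Blitz, an introductory PyTorch tutorial

Learning PyTorch with Examples, for a wide and deep overview

PyTorch for Former Torch Users, if you are a former Lua Torch user

It also helps to know how sequence-to-sequence networks work before going through this example: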

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Sequence to Sequence Learning with Neural Networks

Neural Machine Translation by Jointly Learning to Align and Translate

A Neural Conversational Model

You may also find the earlier tutorials Classifying Names with a Character-Level RNN and Generating Names with a Character-Level RNN helpful, since their concepts closely resemble the encoder and decoder models, respectively.

1. Import the required packages

from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

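2. Load the data files

The data for this project is a set of many thousands of English-to-French translation pairs.

A question on Open Data Stack Exchange pointed me to the open translation site https://tatoeba.org/, which offers downloads at https://tatoeba.org/eng/downloads

Better yet, someone did the extra work of splitting the language pairs into individual text files: https://www.manythings.org/anki/

The English-to-French pairs are too big to include in the repo, so download them to data/eng-fra.txt before continuing. The file is a tab-separated list of translation pairs: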

I am cold.	J'ai froid.

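Note: download the data and extract it into the current directory.

Similar to the character encoding used in the character-level RNN tutorials, we will represent each word in a language as a one-hot vector: a giant vector of zeros except for a single one (at the index of the word). Compared to the dozens of characters that might exist in a language, there are many more words, so the encoding vector is much larger. We will however cheat a bit and trim the data to use only a few thousand words per language.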
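To make the one-hot idea concrete, here is a minimal sketch in plain Python. The five-word vocabulary and the one_hot helper are made up for illustration and do not appear in the tutorial's code.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at the given index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

# Hypothetical toy vocabulary: each word maps to a unique index.
word2index = {"SOS": 0, "EOS": 1, "i": 2, "am": 3, "cold": 4}

vec = one_hot(word2index["cold"], len(word2index))
print(vec)  # [0, 0, 0, 0, 1]
```

With a vocabulary of a few thousand words, each vector would have a few thousand entries and only one of them set, which is why the tutorial stores word indices rather than the vectors themselves.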

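We will need a unique index per word to use as the inputs and targets of the networks later. To keep track of all this we will use a helper class called Lang, which has word → index (word2index) and index → word (index2word) dictionaries, as well as a count of each word (word2count) that is used later to replace rare words.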

SOS_token = 0  # start-of-sentence marker
EOS_token = 1  # end-of-sentence marker

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}  # word -> unique index
        self.word2count = {}  # word -> number of occurrences
        self.index2word = {0: "SOS", 1: "EOS"}  # index -> word
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        # Register every space-separated word in the sentence
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            # First occurrence: assign the next free index
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
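The word counts are what make rare-word replacement possible. The sketch below shows one way that could look; it uses collections.Counter in place of the Lang class so that it runs on its own, and the UNK placeholder, the min_count threshold, and the replace_rare helper are all assumptions for illustration rather than part of the tutorial.

```python
from collections import Counter

sentences = ["i am cold .", "i am tired .", "he is cold ."]

# Same tally that Lang.word2count would hold after addSentence calls.
counts = Counter(word for s in sentences for word in s.split(' '))

def replace_rare(sentence, counts, min_count=2, unk="UNK"):
    """Swap every word seen fewer than min_count times for a placeholder."""
    return ' '.join(
        w if counts[w] >= min_count else unk
        for w in sentence.split(' ')
    )

result = replace_rare("he is cold .", counts)
print(result)  # UNK UNK cold .
```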

The files are all in Unicode. To simplify, we will convert Unicode characters to ASCII, make everything lowercase, and trim most punctuation.

# Turn a Unicode string into plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )
# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
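A step-by-step trace may help show what the two functions above do to a sentence. The snippet inlines the same operations so that it runs on its own; the sample sentence is made up for illustration.

```python
import re
import unicodedata

raw = "Elle n'est pas poète, mais romancière !"

# 1. Lowercase and strip surrounding whitespace.
lowered = raw.lower().strip()

# 2. Decompose accented letters (NFD) and drop the combining marks,
#    which have Unicode category 'Mn'.
decomposed = unicodedata.normalize('NFD', lowered)
ascii_only = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

# 3. Put a space before sentence-ending punctuation so it becomes its own token.
spaced = re.sub(r"([.!?])", r" \1", ascii_only)

# 4. Collapse every run of other characters (apostrophes, commas, spaces)
#    into a single space.
cleaned = re.sub(r"[^a-zA-Z.!?]+", r" ", spaced)

print(cleaned)  # elle n est pas poete mais romanciere !
```

This explains why the sample translations at the top of the article contain tokens such as "n est" and "you re": apostrophes are replaced by spaces.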

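2.1 Read the data file

To read the data file we will split the file into lines, and then split the lines into pairs. The files are all English → Other Language, so to translate from Other Language → English I added the reverse flag to reverse the pairs.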
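The splitting and reversing steps can be tried on an in-memory string before touching the real file. This is a minimal sketch with made-up sample data; it skips normalization and is not the tutorial's readLangs function.

```python
# Two tab-separated pairs, one per line, mimicking the file layout.
sample = "I am cold.\tJ'ai froid.\nGo.\tVa !"

# Split the text into lines, then each line into an [English, French] pair.
lines = sample.strip().split('\n')
pairs = [line.split('\t') for line in lines]

# Reversing each pair turns English -> French into French -> English.
reversed_pairs = [list(reversed(p)) for p in pairs]

print(pairs[0])           # ['I am cold.', "J'ai froid."]
print(reversed_pairs[0])  # ["J'ai froid.", 'I am cold.']
```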

def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")
    # Read the file and split it into lines
    with open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [lis