NLP From Scratch: Neural Machine Translation with an Attention-Based seq2seq Network

Latest recommended article published on 2022-04-26 21:57:18

光英的记忆

Category: PyTorch official tutorials

Article link: https://blog.csdn.net/qq_29678299/article/details/103052829

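This article works through a natural language processing (NLP) task by building an attention-based seq2seq model from scratch, step by step, to perform neural machine translation. It covers the encoder-decoder architecture, the concept of attention, and how attention is applied to translation, with the aim of helping readers understand how this key technique works.

(Summary generated automatically by CSDN.)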

from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

"""
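Loading data files

The data for this project is a set of many thousands of English to French translation pairs.

A question on Open Data Stack Exchange pointed to the open translation site https://tatoeba.org/,
which offers downloads at https://tatoeba.org/eng/downloads. Better yet,
someone has split the language pairs into individual files: https://www.manythings.org/anki/

The translation file is too large to include in the repo, so before reading on,
download the data to data/eng-fra.txt. The file is a tab-separated list of translation pairs:

I am cold.    J'ai froid.

Note: download the data and extract it to the corresponding path.

Similar to the character encoding used in the character-level RNN tutorial, each word in a language
is represented as a one-hot vector: a large vector of zeros with a single one at the index of the word.
A language has only a few dozen characters but a great many words, so the word-level
encoding vector is much larger. We will cheat a little and trim the data so that each language uses only a few thousand words.

Later we need a unique index for every word, to serve as the inputs and targets of the network.
To keep track of these indices we use a helper class, Lang, which holds
a word → index dictionary (word2index), an index → word dictionary (index2word), and a count of each word (word2count) that is used later to replace rare words.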
"""
# Vocabulary helper: special tokens and the Lang class

SOS_token = 0  # start-of-sentence marker
EOS_token = 1  # end-of-sentence marker
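
With indices 0 and 1 reserved for SOS and EOS, the one-hot representation described above can be made concrete with plain Python lists. This is only a minimal sketch: the toy vocabulary and the helper name `one_hot` are assumptions for illustration and are not part of the tutorial code.

```python
# Toy vocabulary (an assumption for illustration). Indices 0 and 1 are
# reserved for the SOS and EOS markers, mirroring the constants above.
word2index = {"SOS": 0, "EOS": 1, "i": 2, "am": 3, "cold": 4}


def one_hot(word, word2index):
    """Return a list of zeros with a single 1 at the word's index."""
    vec = [0] * len(word2index)
    vec[word2index[word]] = 1
    return vec


print(one_hot("am", word2index))  # [0, 0, 0, 1, 0]
```

The vector length equals the vocabulary size, which is why trimming each language to a few thousand words keeps the representation manageable.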
class Lang:
    def __init__(self,name):
        self.name=name
        self.word2index={}
        self.word2count={}
        self.index2word={0:"SOS",1:"EOS"}
        self.n_words=2
        
    def addSentence(self,sentence):
        for word in sentence.split(" "):
            self.addWord(word)
            
    def addWord(self,word):
        if word not in self.word2index:
            self.word2index[word]=self.n_words
            self.word2count[word]=1
            self.index2word[self.n_words]=word
            self.n_words+=1
        else:
            self.word2count[word]+=1
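
The text above says that word2count is kept so that rare words can be replaced later. One way this might look is sketched below. The helper `replace_rare`, the `UNK` placeholder and the `min_count` threshold are hypothetical and do not appear in the original code; a `Counter` stands in for the same bookkeeping that `Lang.word2count` performs.

```python
from collections import Counter


def replace_rare(sentences, min_count=2, unk="UNK"):
    """Replace words seen fewer than min_count times with a placeholder."""
    # Count every word across all sentences (same idea as Lang.word2count).
    word2count = Counter(w for s in sentences for w in s.split(" "))
    return [
        " ".join(w if word2count[w] >= min_count else unk for w in s.split(" "))
        for s in sentences
    ]


sentences = ["i am cold .", "i am tired .", "i am here ."]
print(replace_rare(sentences))
# ['i am UNK .', 'i am UNK .', 'i am UNK .']
```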
            

"""
The files are all in Unicode. To keep things simple,
we convert Unicode characters to ASCII, lowercase everything, and trim most punctuation.
"""
# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
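
To see what each step of the normalization does, the sketch below runs the same operations one at a time on a sample string. The sample text and variable names are chosen here for illustration; the Unicode escapes are used only so that the accented letters are guaranteed to start out in precomposed form.

```python
import re
import unicodedata

# "Ça va très bien!" written with escapes for the two accented letters.
text = "\u00c7a va tr\u00e8s bien!"
lowered = text.lower().strip()

# NFD splits each accented letter into a base letter plus a combining
# mark; combining marks have Unicode category 'Mn' and are dropped.
decomposed = unicodedata.normalize("NFD", lowered)
stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

# Put a space before sentence punctuation, then collapse every run of
# characters that are not letters or . ! ? into a single space.
spaced = re.sub(r"([.!?])", r" \1", stripped)
cleaned = re.sub(r"[^a-zA-Z.!?]+", " ", spaced)

print(len(text), len(decomposed))  # 16 18
print(cleaned)                     # ca va tres bien !

# Apostrophes are not in the allowed set, so they turn into spaces.
apostrophe = re.sub(r"[^a-zA-Z.!?]+", " ", "i'm cold")
print(apostrophe)                  # i m cold
```

The last line shows why contracted forms end up as separate tokens such as `i m` after normalization.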



"""
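To read the data file we split it into lines, and then split each line into a pair.
The files are all English → other language, so to translate from the other language → English, pass the reverse flag to flip the pairs.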
"""
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split it into lines. The path follows the download
    # location given earlier (data/eng-fra.txt).
    with open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse the pairs if requested, then make the Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs
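
The line splitting and the effect of `reverse` can be tried without the data file. The sketch below applies the same two steps to an in-memory string. `parse_pairs` and the sample lines are illustrative assumptions, and normalization is left out to keep the focus on the pairing.

```python
def parse_pairs(raw, reverse=False):
    """Split tab-separated text into [source, target] pairs."""
    pairs = [line.split("\t") for line in raw.strip().split("\n")]
    if reverse:
        # Swap the columns so the second language becomes the input.
        pairs = [list(reversed(p)) for p in pairs]
    return pairs


raw = "I am cold.\tJ'ai froid.\nGo.\tVa !\n"

print(parse_pairs(raw))
# [['I am cold.', "J'ai froid."], ['Go.', 'Va !']]
print(parse_pairs(raw, reverse=True))
# [["J'ai froid.", 'I am cold.'], ['Va !', 'Go.']]
```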


"""
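There are a lot of example sentences and we want to train quickly, so we trim the data set to relatively short and simple sentences.
Here the maximum length is 10 words (including the ending punctuation),
and we keep only sentences that translate to the form "I am", "He is" and so on (accounting for the apostrophes that were replaced earlier, which is why forms such as "i m" appear).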
"""
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "