Implementing Machine Translation with a Seq2Seq Model

This article shows how to implement machine translation with an RNN-based Seq2Seq model; the final model translates English into French. The dataset can be downloaded from https://download.pytorch.org/tutorial/data.zip. Training took about one hour on a single RTX 2080 Super GPU, and the resulting model translates English to French quite well.

1. Data Preprocessing

The dataset is a plain-text (txt) file. Each line contains one sentence pair: an English sentence, a tab character, and the corresponding French sentence. The first few lines of the file look like this:

Go.	Va !
Run!	Cours !
Run!	Courez !
Wow!	Ça alors !
Fire!	Au feu !
Help!	À l'aide !
Jump.	Saute.
Stop!	Ça suffit !
Stop!	Stop !
Stop!	Arrête-toi !
Wait!	Attends !

As discussed in earlier posts, text preprocessing generally consists of tokenizing, building a vocabulary, and encoding each sentence with that vocabulary. So we first build the dictionaries and use them to "translate" each sentence into the word indices it corresponds to. `utils.py` is the first preprocessing script: it reads the raw data from the file and turns it into a list of (input sentence, output sentence) pairs, normalizing each sentence and converting it from Unicode to ASCII along the way. Its parameters control which language is the input and which is the output; setting the `reverse` flag to True makes it trivial to translate French into English instead.

# utils.py
from lang import LanguageModel
import unicodedata
import re


# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicode2ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


# Lowercase, trim, and remove non-letter characters
def normalize_string(s):
    s = unicode2ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s


def read_sentences(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    lines = open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8'). \
        read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalize_string(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = LanguageModel(lang2)
        output_lang = LanguageModel(lang1)
    else:
        input_lang = LanguageModel(lang1)
        output_lang = LanguageModel(lang2)

    return input_lang, output_lang, pairs
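
As a quick sanity check (a hypothetical snippet, not part of the project files), normalize_string lower-cases the text, strips accents, and pads punctuation with spaces:

# check_normalize.py (illustrative only)
from utils import normalize_string

print(normalize_string("Ça alors !"))   # -> "ca alors !"
print(normalize_string("À l'aide !"))   # -> "a l aide !"  (the apostrophe becomes a space)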

The script lang.py defines a dictionary class that builds a vocabulary from sentences. The source is:

# lang.py
SOS_token = 0
EOS_token = 1


class LanguageModel:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def add_sentence(self, sentence):
        for word in sentence.split(' '):
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
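
To illustrate how the vocabulary grows, here is a small hypothetical usage example (not part of the project files):

# check_lang.py (illustrative only)
from lang import LanguageModel

lang = LanguageModel('eng')
lang.add_sentence('i am happy .')
lang.add_sentence('i am tired .')
print(lang.n_words)           # 7: SOS, EOS, i, am, happy, ., tired
print(lang.word2index['am'])  # 3
print(lang.word2count['i'])   # 2 -- "i" occurs in both sentences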

The third script, data_preprocessing.py, calls the reader from utils.py and then populates the dictionary object for each language. To speed up training, we deliberately keep only a subset of the sentence pairs, restricting both the sentence length and the first couple of words of the English sentence.

# data_preprocessing.py
from utils import read_sentences

MAX_LENGTH = 15

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)


def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[0].startswith(eng_prefixes)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]


def prepare_data(lang1, lang2):
    sen_in, sen_out, sen_pairs = read_sentences(lang1, lang2)
    print("Read %s sentence pairs" % len(sen_pairs))
    sen_pairs = filterPairs(sen_pairs)
    print("Trimmed to %s sentence pairs" % len(sen_pairs))
    print("Counting words...")
    for pair in sen_pairs:
        sen_in.add_sentence(pair[0])
        sen_out.add_sentence(pair[1])
    print(sen_in.name, sen_in.n_words)
    print(sen_out.name, sen_out.n_words)
    return sen_in, sen_out, sen_pairs


input_lang, output_lang, pairs = prepare_data('eng', 'fra')
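
To see what the filter keeps, here is a small hypothetical check (importing data_preprocessing also runs prepare_data, so the data file must be in place):

# check_filter.py (illustrative only)
from data_preprocessing import filterPair

# Kept: short enough and starts with one of the English prefixes
print(filterPair(["i am happy .", "je suis content ."]))    # True
# Dropped: does not start with any of the allowed prefixes
print(filterPair(["tom is happy .", "tom est content ."]))  # False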

The fourth script, prepare_data.py, is the last preprocessing step; it uses the dictionaries to turn sentences into tensors.

# prepare_data.py
import torch
from lang import EOS_token
from models import device
from data_preprocessing import input_lang, output_lang


def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return input_tensor, target_tensor
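
For example (a hypothetical snippet; every word must already be in the vocabulary built above), a sentence becomes a column tensor of word indices with EOS_token appended:

# check_tensor.py (illustrative only)
from prepare_data import tensorFromSentence
from data_preprocessing import input_lang

t = tensorFromSentence(input_lang, "i am happy .")
print(t.shape)  # torch.Size([5, 1]) -- 4 words plus EOS
print(t[-1])    # the last index is EOS_token (1)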

Next we build the RNN models. Using GRUs and an attention mechanism, we construct an RNN encoder and an attention decoder for the translation task. The source code is:

# models.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from data_preprocessing import MAX_LENGTH

device = torch.device("cuda")


class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
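
Before training, it can help to verify the tensor shapes with dummy data. The following is a hypothetical shape check (not part of the project); importing models also triggers the preprocessing step, so the data file must be in place:

# shape_check.py (illustrative only)
import torch
from models import EncoderRNN, AttnDecoderRNN, device

enc = EncoderRNN(input_size=10, hidden_size=8).to(device)
dec = AttnDecoderRNN(hidden_size=8, output_size=12).to(device)

hidden = enc.init_hidden()
encoder_outputs = torch.zeros(15, 8, device=device)  # MAX_LENGTH rows

# Feed three dummy token indices through the encoder, one step at a time
for i, tok in enumerate([3, 5, 1]):
    out, hidden = enc(torch.tensor([[tok]], device=device), hidden)
    encoder_outputs[i] = out[0, 0]

# One decoder step: the SOS token (index 0) plus the final encoder hidden state
out, hidden, attn = dec(torch.tensor([[0]], device=device), hidden, encoder_outputs)
print(out.shape)   # torch.Size([1, 12]) -- log-probabilities over the output vocabulary
print(attn.shape)  # torch.Size([1, 15]) -- one attention weight per encoder position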

Next comes the training routine, train.py. For time reasons we only train for one epoch here; training for more epochs should improve the model further.

# train.py
import torch
import random
from lang import SOS_token, EOS_token
from data_preprocessing import MAX_LENGTH
from models import device

teacher_forcing_ratio = 0.5


def train_sen(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion,
              max_length=MAX_LENGTH):
    encoder_hidden = encoder.init_hidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing
    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
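
A single call to train_sen performs one forward/backward pass over one sentence pair. A hypothetical sketch of such a call (the real training loop lives in main.py below):

# single_step.py (illustrative only)
import random
import torch
import torch.nn as nn
from train import train_sen
from models import EncoderRNN, AttnDecoderRNN, device
from prepare_data import tensorsFromPair
from data_preprocessing import input_lang, output_lang, pairs

encoder = EncoderRNN(input_lang.n_words, 256).to(device)
decoder = AttnDecoderRNN(256, output_lang.n_words).to(device)
enc_opt = torch.optim.SGD(encoder.parameters(), lr=0.002)
dec_opt = torch.optim.SGD(decoder.parameters(), lr=0.002)

inp, tgt = tensorsFromPair(random.choice(pairs))
loss = train_sen(inp, tgt, encoder, decoder, enc_opt, dec_opt, nn.NLLLoss())
print(loss)  # average per-token loss for this pair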

Next, the script that evaluates the trained network on a single sentence:

# evaluation.py
import torch
import prepare_data
from data_preprocessing import MAX_LENGTH
from lang import EOS_token, SOS_token
from models import device
from data_preprocessing import input_lang, output_lang


def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = prepare_data.tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.init_hidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]

Then some helper utilities for plotting, random evaluation, and timing:

# assistance.py
import matplotlib.ticker as ticker
import random
from data_preprocessing import pairs
from evaluation import evaluate
import matplotlib.pyplot as plt
import time
import math


def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)


def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / percent
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

Finally, the main program ties everything together: it trains the models, saves their parameters, and visualizes translations and attention weights.

# main.py
import torch
import torch.nn as nn
import time
import random
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from train import train_sen
from assistance import timeSince, showPlot, evaluateRandomly
from evaluation import evaluate
from prepare_data import tensorsFromPair
from models import AttnDecoderRNN, EncoderRNN, device
from data_preprocessing import input_lang, output_lang, pairs


def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=1000, learning_rate=0.002):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = torch.optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = torch.optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for epoch in range(1):
        for iter in range(1, n_iters + 1):
            training_pair = training_pairs[iter - 1]
            input_tensor = training_pair[0]
            target_tensor = training_pair[1]

            loss = train_sen(input_tensor, target_tensor, encoder,
                             decoder, encoder_optimizer, decoder_optimizer, criterion)
            print_loss_total += loss
            plot_loss_total += loss

            if iter % print_every == 0:
                print_loss_avg = print_loss_total / print_every
                print_loss_total = 0
                print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                             iter, iter / n_iters * 100, print_loss_avg))

            if iter % plot_every == 0:
                plot_loss_avg = plot_loss_total / plot_every
                plot_losses.append(plot_loss_avg)
                plot_loss_total = 0

    showPlot(plot_losses)


def evaluateAndShowAttention(input_sentence):
    output_words, attentions = evaluate(
        encoder1, attn_decoder1, input_sentence)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions)


def showAttention(input_sentence, output_words, attentions):
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()


hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.2).to(device)

trainIters(encoder1, attn_decoder1, n_iters=120000)

torch.save(encoder1.state_dict(), 'rnn_encoder.parameters.pt')
torch.save(attn_decoder1.state_dict(), 'rnn_decoder.parameters.pt')

evaluateRandomly(encoder1, attn_decoder1)

output_words, attentions = evaluate(
    encoder1, attn_decoder1, "both of them seem suspicious .")
plt.matshow(attentions.numpy())

evaluateAndShowAttention("can you believe what he said ?")
evaluateAndShowAttention("i ve got as much money as she has .")
evaluateAndShowAttention("i ve never hit anyone in my life .")
evaluateAndShowAttention("i couldn t put up with that noise any longer")
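
Because main.py saves the trained parameters, a later session can reload them instead of retraining. A sketch of that, assuming the preprocessing scripts rebuild the same vocabularies (prepare_data is deterministic, so the word indices match):

# load_and_translate.py (illustrative only)
import torch
from models import EncoderRNN, AttnDecoderRNN, device
from data_preprocessing import input_lang, output_lang
from evaluation import evaluate

hidden_size = 256
encoder = EncoderRNN(input_lang.n_words, hidden_size).to(device)
decoder = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.2).to(device)

encoder.load_state_dict(torch.load('rnn_encoder.parameters.pt'))
decoder.load_state_dict(torch.load('rnn_decoder.parameters.pt'))
encoder.eval()  # disable dropout for inference
decoder.eval()

words, _ = evaluate(encoder, decoder, "i m not so sure .")
print(' '.join(words))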

The output of running the program is as follows:

Reading lines...
Read 135842 sentence pairs
Trimmed to 12823 sentence pairs
Counting words...
Counted words:
eng 3308
fra 5034
['i m glad you didn t call tom .', 'je suis content que tu n aies pas appele tom .']
........................
47m 58s (- 0m 18s) (79500 99%) 1.4043
48m 17s (- 0m 0s) (80000 100%) 1.3170
> i m not so sure .
= je n en suis pas si sure !
< je n en suis pas si sur !

> he s a good liar .
= il est bon menteur .
< c est un menteur menteur .

> you re big .
= vous etes grandes .
< vous etes grande .

> he s a law abiding citizen .
= c est un citoyen respectueux des lois .
< c est un citoyen des des . .

> i m sorry i don t speak french .
= je suis desolee je ne parle pas le francais .
< je suis desole je ne parle pas francais .

> you re bossy .
= tu fais le chef .
< tu es fou .

> i m thinking about visiting my friend next year .
= j envisage de rendre visite a mon ami l annee prochaine .
< je suis en a rendre d la .

> you re through .
= vous en avez fini .
< vous en avez .

> i m not very good at chess .
= je ne suis pas tres bon aux echecs .
< je ne suis pas fort bon . . .

> i m not even sure if this is my key .
= je ne suis meme pas sur que ce soit ma cle .
< je ne suis pas du meme que mon ma ma ma ma ma ma ma

input = he s a good boy .
output = c est un garcon garcon .

[Figure: training loss curve]

The figure above shows how the loss function evolves during training. The figures below visualize the attention weights: the lighter the color, the more attention the model pays to the corresponding word.

[Figures: attention heatmaps for the four example sentences passed to evaluateAndShowAttention]
