Language Translation

In this project we use a neural network for machine translation. Using a dataset of English and French sentence pairs, we train a sequence-to-sequence model that can translate new English sentences into French.

Get the Data

import helper
import problem_unittests as tests

source_path = './data/small_vocab_en'
target_path = './data/small_vocab_fr'
source_text = helper.load_data(source_path)
target_text = helper.load_data(target_path)

Explore the Data

view_sentence_range = (0, 10)

import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in source_text.split()})))

sentences = source_text.split('\n')
word_counts = [len(sentence.split()) for sentence in sentences]
print('Number of sentences: {}'.format(len(sentences)))
print('Average number of words in a sentence: {}'.format(np.average(word_counts)))

print()
print('English sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(source_text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))
print()
print('French sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(target_text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))

Implement Preprocessing Functions

The first step is to convert the text to numbers. In the text_to_ids() function, convert the words in source_text and target_text to their ids. Note: you need to append the <EOS> word id to the end of each sentence in target_text, so the model can predict where a sentence should end.

You can get the <EOS> word id with the following code:

target_vocab_to_int['<EOS>']

Use source_vocab_to_int and target_vocab_to_int to get the ids of the other words.

def text_to_ids(source_text, target_text, source_vocab_to_int, target_vocab_to_int):
    """
    Convert source and target text to proper word ids
    :param source_text: String that contains all the source text.
    :param target_text: String that contains all the target text.
    :param source_vocab_to_int: Dictionary to go from the source words to an id
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :return: A tuple of lists (source_id_text, target_id_text)
    """
    # TODO: Implement
    source_id_text = [[source_vocab_to_int.get(word, source_vocab_to_int['<UNK>'])
                       for word in line.split()]
                      for line in source_text.split('\n')]
    target_id_text = [[target_vocab_to_int.get(word, target_vocab_to_int['<UNK>'])
                       for word in line.split()] + [target_vocab_to_int['<EOS>']]
                      for line in target_text.split('\n')]

    return (source_id_text, target_id_text)


tests.test_text_to_ids(text_to_ids)
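
A quick sanity check with toy vocabularies (the ids below are made up purely for illustration, they are not the project's real mappings) makes the trailing <EOS> id visible:

toy_source_vocab = {'<UNK>': 0, 'new': 1, 'jersey': 2, 'is': 3, 'dry': 4, '.': 5}
toy_target_vocab = {'<UNK>': 0, '<EOS>': 1, 'new': 2, 'jersey': 3, 'est': 4, 'sec': 5, '.': 6}
src_ids, tgt_ids = text_to_ids('new jersey is dry .', 'new jersey est sec .',
                               toy_source_vocab, toy_target_vocab)
print(src_ids)  # [[1, 2, 3, 4, 5]]
print(tgt_ids)  # [[2, 3, 4, 5, 6, 1]]  (note the <EOS> id, 1, appended at the end)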

Preprocess All the Data and Save It

Run the code cell below to preprocess all the data and save it to a file.

helper.preprocess_and_save_data(source_path, target_path, text_to_ids)

Checkpoint

This is the first checkpoint. If you restart the notebook, you can continue from here; the preprocessed data has been saved to disk.

import numpy as np
import helper

(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = helper.load_preprocess()

Check the TensorFlow Version and Access to a GPU

This check makes sure you are using the correct version of TensorFlow and can access a GPU.

from distutils.version import LooseVersion
import warnings
import tensorflow as tf

# Check TensorFlow version
assert LooseVersion(tf.__version__) in [LooseVersion('1.0.0'), LooseVersion('1.0.1')], 'This project requires TensorFlow version 1.0. You are using {}'.format(tf.__version__)
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found. Please use a GPU to train your neural network.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

Build the Neural Network

Input

Implement the model_inputs() function to create TF placeholders for the neural network:
a. An input text placeholder named "input", using the TF Placeholder name parameter (rank 2)
b. A targets placeholder (rank 2)
c. A learning rate placeholder (rank 0)
d. A keep probability placeholder named "keep_prob", using the TF Placeholder name parameter (rank 0)
Return the placeholders in the following tuple: (input, targets, learning rate, keep probability)

def model_inputs():
    """
    Create TF Placeholders for input, targets, and learning rate.
    :return: Tuple (input, targets, learning rate, keep probability)
    """
    # TODO: Implement
    inputs = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None])
    learning_rate = tf.placeholder(tf.float32)
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    return inputs, targets, learning_rate, keep_prob

tests.test_model_inputs(model_inputs)

Process Decoding Input

Implement process_decoding_input using TensorFlow to remove the last word id from each batch in target_data and prepend the <GO> id to the beginning of each batch.

def process_decoding_input(target_data, target_vocab_to_int, batch_size):
    """
    Preprocess target data for decoding
    :param target_data: Target Placeholder
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :param batch_size: Batch Size
    :return: Preprocessed target data
    """
    # TODO: Implement
    # tf.strided_slice drops the last word id of every sequence in the batch
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    # tf.fill builds a [batch_size, 1] column of <GO> ids; tf.concat prepends it along axis 1
    dec_input = tf.concat([tf.fill([batch_size, 1], target_vocab_to_int['<GO>']), ending], 1)
    return dec_input

tests.test_process_decoding_input(process_decoding_input)
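
A toy run (hypothetical ids, assuming <GO>=1 and <EOS>=2 just for illustration) shows the slice-and-concat behavior: the last id of each sequence is dropped and a <GO> id is prepended.

demo_graph = tf.Graph()
with demo_graph.as_default(), tf.Session(graph=demo_graph) as demo_sess:
    toy_targets = tf.constant([[4, 5, 6, 2],
                               [7, 8, 2, 0]], dtype=tf.int32)
    toy_dec_input = process_decoding_input(toy_targets, {'<GO>': 1}, batch_size=2)
    print(demo_sess.run(toy_dec_input))
    # [[1 4 5 6]
    #  [1 7 8 2]]  (last id dropped, <GO> id prepended)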

Encoding

Implement encoding_layer() to create an encoder RNN layer using tf.nn.dynamic_rnn().

def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob):
    """
    Create encoding layer
    :param rnn_inputs: Inputs for the RNN
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param keep_prob: Dropout keep probability
    :return: RNN state
    """
    # TODO: Implement
    def make_cell(rnn_size, keep_prob):
        # tf.contrib.rnn.LSTMCell builds an LSTM cell; initializer sets the weight/projection initializer
        enc_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                           initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
        drop = tf.contrib.rnn.DropoutWrapper(enc_cell, output_keep_prob=keep_prob)
        return drop
    # tf.contrib.rnn.MultiRNNCell stacks an ordered list of RNN cells
    cell = tf.contrib.rnn.MultiRNNCell([make_cell(rnn_size, keep_prob) for _ in range(num_layers)])
    _, enc_state = tf.nn.dynamic_rnn(cell, rnn_inputs, dtype=tf.float32)
    return enc_state

tests.test_encoding_layer(encoding_layer)

Decoding - Training

Create a training decoder with tf.contrib.seq2seq.simple_decoder_fn_train() and tf.contrib.seq2seq.dynamic_rnn_decoder(), then apply output_fn to its outputs to obtain the training logits.

def decoding_layer_train(encoder_state, dec_cell, dec_embed_input, sequence_length, 
                         decoding_scope,output_fn, keep_prob):
    """
    Create a decoding layer for training
    :param encoder_state: Encoder State
    :param dec_cell: Decoder RNN Cell
    :param dec_embed_input: Decoder embedded input
    :param sequence_length: Sequence Length
    :param decoding_scope: TensorFlow Variable Scope for decoding
    :param output_fn: Function to apply the output layer
    :param keep_prob: Dropout keep probability
    :return: Train Logits
    """
    # TODO: Implement
    # simple_decoder_fn_train feeds the ground-truth embeddings to the decoder at every step
    train_decoder = tf.contrib.seq2seq.simple_decoder_fn_train(encoder_state)
    # Apply dropout to the decoder cell's outputs during training
    drop = tf.contrib.rnn.DropoutWrapper(dec_cell, output_keep_prob=keep_prob)
    train_dec_output = tf.contrib.seq2seq.dynamic_rnn_decoder(cell=drop,
                                                              decoder_fn=train_decoder,
                                                              inputs=dec_embed_input,
                                                              sequence_length=sequence_length,
                                                              scope=decoding_scope)[0]
    # Project the decoder outputs onto the vocabulary to get the training logits
    train_logits = output_fn(train_dec_output)
    return train_logits

tests.test_decoding_layer_train(decoding_layer_train)

Decoding - Inference

Create an inference decoder with tf.contrib.seq2seq.simple_decoder_fn_inference() and tf.contrib.seq2seq.dynamic_rnn_decoder(); at inference time the decoder feeds its own predictions back in, so no dec_embed_input is needed.

def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id,
                         end_of_sequence_id,maximum_length, vocab_size, 
                         decoding_scope, output_fn, keep_prob):
    """
    Create a decoding layer for inference
    :param encoder_state: Encoder state
    :param dec_cell: Decoder RNN Cell
    :param dec_embeddings: Decoder embeddings
    :param start_of_sequence_id: GO ID
    :param end_of_sequence_id: EOS Id
    :param maximum_length: The maximum allowed time steps to decode
    :param vocab_size: Size of vocabulary
    :param decoding_scope: TensorFlow Variable Scope for decoding
    :param output_fn: Function to apply the output layer
    :param keep_prob: Dropout keep probability
    :return: Inference Logits
    """
    # TODO: Implement
    # simple_decoder_fn_inference feeds each predicted word's embedding back in at the
    # next step, starting from <GO> and stopping at <EOS> or after maximum_length steps
    inference_decoder = tf.contrib.seq2seq.simple_decoder_fn_inference(output_fn,
                                                                       encoder_state,
                                                                       dec_embeddings,
                                                                       start_of_sequence_id,
                                                                       end_of_sequence_id,
                                                                       maximum_length,
                                                                       vocab_size)
    # No dropout is applied at inference time, so the cell is used as-is
    inference_decoder_output = tf.contrib.seq2seq.dynamic_rnn_decoder(cell=dec_cell,
                                                                      decoder_fn=inference_decoder,
                                                                      scope=decoding_scope)[0]
    
    return inference_decoder_output


tests.test_decoding_layer_infer(decoding_layer_infer)

Build the Decoding Layer

Implement decoding_layer() to create a decoder RNN layer:
a. Create a decoder RNN cell using rnn_size and num_layers.
b. Create an output function using a lambda that projects the cell outputs onto class logits, e.g. with tf.variable_scope("decoder") as varscope: output_fn = lambda x: layers.linear(x, num_decoder_symbols, scope=varscope)
c. Use the decoding_layer_train(encoder_state, dec_cell, dec_embed_input, sequence_length, decoding_scope, output_fn, keep_prob) function to get the training logits.
d. Use the decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id, end_of_sequence_id, maximum_length, vocab_size, decoding_scope, output_fn, keep_prob) function to get the inference logits.

def decoding_layer(dec_embed_input, dec_embeddings, encoder_state, vocab_size, 
                   sequence_length, rnn_size,num_layers,target_vocab_to_int, keep_prob):
    """
    Create decoding layer
    :param dec_embed_input: Decoder embedded input
    :param dec_embeddings: Decoder embeddings
    :param encoder_state: The encoded state
    :param vocab_size: Size of vocabulary
    :param sequence_length: Sequence Length
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :param keep_prob: Dropout keep probability
    :return: Tuple of (Training Logits, Inference Logits)
    """
    # TODO: Implement
    def make_cell(rnn_size):
        dec_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                           initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
        return dec_cell
    dec_cell = tf.contrib.rnn.MultiRNNCell([make_cell(rnn_size) for _ in range(num_layers)])
    # The third positional argument of fully_connected is activation_fn (tf.nn.relu by
    # default); pass None for a linear projection onto the vocabulary. decoding_scope is
    # only bound inside the with-blocks below, but the lambda is not called until then,
    # so the late binding is safe.
    output_fn = lambda x: tf.contrib.layers.fully_connected(x, vocab_size, None,
                                                            scope=decoding_scope)
    with tf.variable_scope("decoding") as decoding_scope:
        train_decoder_output=decoding_layer_train(encoder_state, dec_cell, 
                                                  dec_embed_input, sequence_length, 
                                                  decoding_scope, output_fn, keep_prob)
   
    with tf.variable_scope("decoding",reuse=True) as decoding_scope:
        inference_decoder_output=decoding_layer_infer(encoder_state, dec_cell, 
                                                      dec_embeddings,
                                                      target_vocab_to_int['<GO>'],
                                                      target_vocab_to_int['<EOS>'], 
                                                      sequence_length, vocab_size,
                                                      decoding_scope, output_fn,
                                                      keep_prob)
    
    return train_decoder_output,inference_decoder_output

tests.test_decoding_layer(decoding_layer)

Build the Neural Network

  • Apply embedding to the input data for the encoder.
  • Encode the input using encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob).
  • Process the target data using the process_decoding_input(target_data, target_vocab_to_int, batch_size) function.
  • Apply embedding to the target data for the decoder.
  • Decode the encoded input using decoding_layer(dec_embed_input, dec_embeddings, encoder_state, vocab_size, sequence_length, rnn_size, num_layers, target_vocab_to_int, keep_prob).

def seq2seq_model(input_data, target_data, keep_prob, batch_size, sequence_length,
                  source_vocab_size, target_vocab_size, enc_embedding_size, 
                  dec_embedding_size, rnn_size, num_layers, target_vocab_to_int):
    """
    Build the Sequence-to-Sequence part of the neural network
    :param input_data: Input placeholder
    :param target_data: Target placeholder
    :param keep_prob: Dropout keep probability placeholder
    :param batch_size: Batch Size
    :param sequence_length: Sequence Length
    :param source_vocab_size: Source vocabulary size
    :param target_vocab_size: Target vocabulary size
    :param enc_embedding_size: Encoder embedding size
    :param dec_embedding_size: Decoder embedding size
    :param rnn_size: RNN Size
    :param num_layers: Number of layers
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :return: Tuple of (Training Logits, Inference Logits)
    """
    # TODO: Implement
    # Embed the encoder input and encode it
    enc_embed_input = tf.contrib.layers.embed_sequence(input_data, source_vocab_size,
                                                       enc_embedding_size)
    encoder_state = encoding_layer(enc_embed_input, rnn_size, num_layers, keep_prob)

    # Prepare and embed the decoder input
    dec_input = process_decoding_input(target_data, target_vocab_to_int, batch_size)
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, dec_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)

    # Decode the encoded input into training and inference logits
    train_logits, inference_logits = decoding_layer(dec_embed_input, dec_embeddings, encoder_state,
                                                    target_vocab_size, sequence_length, rnn_size,
                                                    num_layers, target_vocab_to_int, keep_prob)

    return train_logits, inference_logits
    
tests.test_seq2seq_model(seq2seq_model)

Neural Network Training

Hyperparameters

  • Set epochs to the number of epochs.
  • Set batch_size to the batch size.
  • Set rnn_size to the size of the RNN.
  • Set num_layers to the number of layers.
  • Set encoding_embedding_size to the size of the encoder's embedding.
  • Set decoding_embedding_size to the size of the decoder's embedding.
  • Set learning_rate to the learning rate.
  • Set keep_probability to the dropout keep probability.

# Number of Epochs
epochs = 10
# Batch Size
batch_size = 256
# RNN Size
rnn_size = 128
# Number of Layers
num_layers = 2
# Embedding Size
encoding_embedding_size = 256
decoding_embedding_size = 256
# Learning Rate
learning_rate = 0.001
# Dropout Keep Probability
keep_probability = 0.7

Build the Graph

Build the graph using the neural network you implemented.

save_path = './data'
(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = helper.load_preprocess()
max_source_sentence_length = max([len(sentence) for sentence in source_int_text])

train_graph = tf.Graph()
with train_graph.as_default():
    input_data, targets, lr, keep_prob = model_inputs()
    sequence_length = tf.placeholder_with_default(max_source_sentence_length, None,
                                                  name='sequence_length')
    input_shape = tf.shape(input_data)
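    # Note: seq2seq_model below is fed tf.reverse(input_data, [-1]), i.e. the source
    # sequences reversed along the time axis, a common seq2seq trick (Sutskever et al.,
    # 2014) that can make optimization easier for the encoder.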
    
    train_logits, inference_logits = seq2seq_model(
        tf.reverse(input_data, [-1]), targets, keep_prob, batch_size, sequence_length, len(source_vocab_to_int), len(target_vocab_to_int),
        encoding_embedding_size, decoding_embedding_size, rnn_size, num_layers, target_vocab_to_int)

    # Expose the inference logits under the name 'logits' so they can be fetched by
    # name after the graph is re-imported for translation.
    tf.identity(inference_logits, 'logits')
    with tf.name_scope("optimization"):
        # Loss function
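        # sequence_loss computes the average cross-entropy over the batch; the
        # all-ones weight tensor gives every time step of every sequence equal weight.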
        cost = tf.contrib.seq2seq.sequence_loss(
            train_logits,
            targets,
            tf.ones([input_shape[0], sequence_length]))

        # Optimizer
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)

Train

Train the neural network on the preprocessed data.

import time

def get_accuracy(target, logits):
    """
    Calculate accuracy
    """
    max_seq = max(target.shape[1], logits.shape[1])
    if max_seq - target.shape[1]:
        target = np.pad(
            target,
            [(0,0),(0,max_seq - target.shape[1])],
            'constant')
    if max_seq - logits.shape[1]:
        logits = np.pad(
            logits,
            [(0,0),(0,max_seq - logits.shape[1]), (0,0)],
            'constant')

    return np.mean(np.equal(target, np.argmax(logits, 2)))
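
As a quick check of the logic above with made-up numbers: if the logits decode exactly to the target, the accuracy should be 1.0.

toy_target = np.array([[1, 2, 3]])
toy_logits = np.eye(4)[[1, 2, 3]][np.newaxis, :, :]   # shape (1, 3, 4); argmax over axis 2 gives [1, 2, 3]
print(get_accuracy(toy_target, toy_logits))           # 1.0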

train_source = source_int_text[batch_size:]
train_target = target_int_text[batch_size:]

valid_source = helper.pad_sentence_batch(source_int_text[:batch_size])
valid_target = helper.pad_sentence_batch(target_int_text[:batch_size])

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(epochs):
        for batch_i, (source_batch, target_batch) in enumerate(
                helper.batch_data(train_source, train_target, batch_size)):
            start_time = time.time()
            
            _, loss = sess.run(
                [train_op, cost],
                {input_data: source_batch,
                 targets: target_batch,
                 lr: learning_rate,
                 sequence_length: target_batch.shape[1],
                 keep_prob:keep_probability})
            
            batch_train_logits = sess.run(
                inference_logits,
                {input_data: source_batch, keep_prob: 1.0})
            batch_valid_logits = sess.run(
                inference_logits,
                {input_data: valid_source, keep_prob: 1.0})
                
            train_acc = get_accuracy(target_batch, batch_train_logits)
            valid_acc = get_accuracy(np.array(valid_target), batch_valid_logits)
            end_time = time.time()
            print('Epoch {:>3} Batch {:>4}/{} - Train Accuracy: {:>6.3f}, Validation Accuracy: {:>6.3f}, Loss: {:>6.3f}'
                  .format(epoch_i, batch_i, len(source_int_text) // batch_size, train_acc, valid_acc, loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_path)
    print('Model Trained and Saved')

Save Parameters

Save the batch_size and save_path parameters for inference.

# Save parameters for checkpoint
helper.save_params(save_path)

Checkpoint
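
The original checkpoint cell is not shown here; below is a minimal sketch of what it typically contains, assuming the project's helper module provides load_params() (the counterpart of save_params used above) and that the third value returned by load_preprocess() holds the id-to-word dictionaries. The translation cell at the end relies on load_path, source_int_to_vocab, and target_int_to_vocab being defined.

import tensorflow as tf
import numpy as np
import helper
import problem_unittests as tests

_, (source_vocab_to_int, target_vocab_to_int), (source_int_to_vocab, target_int_to_vocab) = helper.load_preprocess()
load_path = helper.load_params()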

Sentence to Sequence

To feed a sentence into the model for translation, you first need to preprocess it. Implement the function sentence_to_seq() to preprocess new sentences:

  • Convert the sentence to lowercase.
  • Convert words to ids using vocab_to_int.
  • If a word is not in the vocabulary, convert it to the <UNK> word id.

def sentence_to_seq(sentence, vocab_to_int):
    """
    Convert a sentence to a sequence of ids
    :param sentence: String
    :param vocab_to_int: Dictionary to go from the words to an id
    :return: List of word ids
    """
    # TODO: Implement
    sentence_lower = sentence.lower()
    word_ids = [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in sentence_lower.split()]
    return word_ids

tests.test_sentence_to_seq(sentence_to_seq)
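
A toy illustration with a made-up vocabulary (the ids here are hypothetical, not the project's real ones) shows the lowercasing and the <UNK> fallback:

toy_vocab = {'<UNK>': 0, 'he': 1, 'saw': 2, 'a': 3, 'truck': 4, '.': 5}
print(sentence_to_seq('He saw a PURPLE truck .', toy_vocab))
# [1, 2, 3, 0, 4, 5]  ('purple' is not in the vocabulary, so it falls back to <UNK>)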

Translation

Translate translate_sentence from English to French.

translate_sentence = 'he saw a old yellow truck .'



translate_sentence = sentence_to_seq(translate_sentence, source_vocab_to_int)

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(load_path + '.meta')
    loader.restore(sess, load_path)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('logits:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')

    translate_logits = sess.run(logits, {input_data: [translate_sentence], keep_prob: 1.0})[0]

print('Input')
print('  Word Ids:      {}'.format([i for i in translate_sentence]))
print('  English Words: {}'.format([source_int_to_vocab[i] for i in translate_sentence]))

print('\nPrediction')
print('  Word Ids:      {}'.format([i for i in np.argmax(translate_logits, 1)]))
print('  French Words: {}'.format([target_int_to_vocab[i] for i in np.argmax(translate_logits, 1)]))