Machine translation with the seq2seq model: different approaches

Natural language processing | Deep learning

Machine translation (MT) is a sub-field of computational linguistics that studies how software can translate text or speech from one language to another. At its simplest, MT mechanically substitutes words in one language for words in another, but this alone rarely yields a good translation, because translation requires understanding whole sentences and finding their nearest counterparts in the target language. Two languages may have completely different structures, a word in one language may have no equivalent in the other, and many words have more than one meaning. Neural techniques for this problem are a fast-growing field; they produce better translations and cope better with idioms and with typological differences between languages.

In this article, we are going to build a translator that translates an English sentence into a Hindi sentence. You can build a translator for other languages simply by changing the dataset used here. We will use a recurrent neural network architecture, seq2seq, also known as the encoder-decoder model. In a separate article, the seq2seq model is used to build a generative chatbot.

Machine translation is more or less similar to what is done in that article. The main difference between building a generative chatbot and building a machine translator lies in the dataset and the text preprocessing. The steps we follow here are therefore similar to the ones in the chatbot article.

There are two approaches we can take when doing machine translation. We will discuss them in the upcoming sections.

Introduction to the seq2seq approach for machine translation

The seq2seq model, also called the encoder-decoder model, uses Long Short-Term Memory (LSTM) layers to generate text from a training corpus. It is also useful in machine translation applications. What does the seq2seq or encoder-decoder model do, in simple words? It predicts a word given the user input, and then predicts each following word from the probability of that word occurring next. We will use this approach to generate the translated text for a given user input.

Figure: Machine translation using the Encoder-Decoder model

The encoder outputs a final state vector (its memory), which becomes the initial state of the decoder. We train the decoder with a method called teacher forcing, which teaches it to predict the next word of the target sequence given the previous words. As shown above, the states are passed from the encoder to the decoder. ‘I’, ‘do’, ‘not’, and ‘know’ are called input tokens, while ‘मुझे’, ‘नहीं’, and ‘पता’ are called target tokens. The likelihood of the token ‘पता’ depends on the previous words and on the encoder states. We add an ‘<END>’ token so that the decoder knows when to stop. You can learn more about the seq2seq model here.
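To make the idea of teacher forcing concrete, here is a minimal sketch. It assumes the target sentence from the figure, wrapped in a ‘<START>’ and an ‘<END>’ token; the variable names are made up for illustration and are not part of the project code. At every step the decoder is fed the correct previous token and is asked to predict the next one.

```python
# Hypothetical token list for one target sentence (illustration only).
example_target = ["<START>", "मुझे", "नहीं", "पता", "<END>"]

# Pair every token with the token that follows it: the first element is what
# the decoder is given at a step, the second is what it should predict.
teacher_forcing_steps = list(zip(example_target[:-1], example_target[1:]))

for step, (given, expected) in enumerate(teacher_forcing_steps):
    print(f"step {step}: decoder is given {given!r} and should predict {expected!r}")
```

The last pair ends in ‘<END>’, which is how the decoder learns when to stop.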

Let’s start building our translator from scratch! The first task we will have to do is preprocess our dataset.

Preprocessing the dataset

The dataset used here was assembled from a dataset available in a public repository on GitHub. You can find the code and the dataset through the project link given at the end of this article. The dataset contains 10,000 English sentences and their Hindi translations.

First, we clean the corpus with regular expressions. Then we form English-Hindi pairs so that we can train our seq2seq model. Both steps are shown below.

import re
import random

data_path = "/Data/English.txt"
data_path2 = "/Data/Hindi.txt"

# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().strip().split('\n')
with open(data_path2, 'r', encoding='utf-8') as f:
    lines2 = f.read().strip().split('\n')

# Keep only the alphanumeric words of the English lines
lines = [" ".join(re.findall(r"[A-Za-z0-9]+", line)) for line in lines]
# Strip brackets, percent signs, Latin letters and underscores from the Hindi lines
lines2 = [re.sub(r"%s|\(|\)|<|>|%|[a-z]|[A-Z]|_", '', line) for line in lines2]

# Grouping lines by response pair
pairs = list(zip(lines, lines2))
random.shuffle(pairs)

After creating pairs we can also shuffle those before training. Our pairs will look like this now:

[('he disliked that old black automobile', 'उन्होंने उस पुराने काले ऑटोमोबाइल को नापसंद किया।'), ('they dislike peaches pears and apples', 'वे आड़ू, नाशपाती और सेब को नापसंद करते हैं।'),...]
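To see what the two cleaning expressions do, here is a small sketch on made-up lines (the raw strings below are assumptions, not lines from the actual dataset). The English expression keeps only runs of letters and digits, so punctuation disappears; the Hindi expression deletes brackets, percent signs, Latin letters, and underscores, and leaves the Devanagari text alone.

```python
import re

# Hypothetical raw lines, for illustration only.
raw_english = "He disliked that old, black automobile."
raw_hindi = "नमस्ते (hello)_"

clean_english = " ".join(re.findall(r"[A-Za-z0-9]+", raw_english))
clean_hindi = re.sub(r"%s|\(|\)|<|>|%|[a-z]|[A-Z]|_", '', raw_hindi)

print(clean_english)      # punctuation is gone, casing is unchanged
print(repr(clean_hindi))  # the Latin word and the brackets are gone
```

Note that neither expression changes the casing, and the Hindi expression leaves the surrounding whitespace in place.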

Here, ‘he disliked that old black automobile’ is the input sequence and ‘उन्होंने उस पुराने काले ऑटोमोबाइल को नापसंद किया।’ is the target sequence. We need separate lists for the input sequences and the target sequences, and we also need lists of the unique tokens (input tokens and target tokens) in our dataset. To each target sequence we add ‘<START>’ at the beginning and ‘<END>’ at the end so that the model knows where to start and stop generating text. This is done as shown below.

import numpy as np

input_docs = []
target_docs = []
input_tokens = set()
target_tokens = set()

for line in pairs:
    input_doc, target_doc = line[0], line[1]
    # Appending each input sentence to input_docs
    input_docs.append(input_doc)
    # Splitting words from punctuation
    target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
    # Redefine target_doc below and append it to target_docs
    target_doc = '<START> ' + target_doc + ' <END>'
    target_docs.append(target_doc)
    # Now we split up each sentence into words and add each unique word to our vocabulary set
    for token in re.findall(r"[\w']+|[^\s\w]", input_doc):
        input_tokens.add(token)
    for token in target_doc.split():
        target_tokens.add(token)

input_tokens = sorted(input_tokens)
target_tokens = sorted(target_tokens)
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)

The two different approaches

A key thing to notice here is that, while creating target_doc, we split words from punctuation. This means that the target sequence ‘वे आड़ू, नाशपाती और सेब को नापसंद करते हैं।’ becomes ‘व े आड ़ ू, न ा शप ा त ी और स े ब क ो न ा पस ं द करत े ह ै ं ।’. We do this when we want character-level predictions. The other option is to append the target sequence as it is, which trains the model to predict whole words from the training corpus (word-level prediction). To use that approach, comment out the statement under the ‘Splitting words from punctuation’ comment in the code above. With character-level prediction we get 200 encoder tokens and 238 decoder tokens, while with word-level prediction we get 200 encoder tokens and 678 decoder tokens. We will compare the performance of the two options in a later section, when we discuss the accuracy and loss of the model. For now, let’s stick to the former (character-level) option.
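Why does a regular expression meant to split off punctuation break Hindi words apart? The sketch below shows it on the single word ‘वे’, written with explicit code points so that nothing depends on the editor. In Python, \w matches the Devanagari letter but not the combining vowel sign that follows it, so the vowel sign falls into the punctuation branch of the expression and becomes a token of its own. The variable names are made up for this example.

```python
import re

# "\u0935\u0947" is the word 'वे': the letter VA followed by the vowel sign E.
word = "\u0935\u0947"

# Word-level option: split on whitespace only, so the word stays whole.
word_level = word.split()

# Option used in the preprocessing code: the vowel sign is not matched by \w,
# so it is picked up by the [^\s\w] branch as a separate token.
split_level = re.findall(r"[\w']+|[^\s\w]", word)

print(word_level)   # one token
print(split_level)  # two tokens
```

This is why the decoder vocabulary is much smaller with the split option: the model learns letters and vowel signs instead of whole words.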

We now have the unique input tokens and target tokens of our dataset. Next we create an input features dictionary that stores the input tokens as key-value pairs, where the word is the key and its index is the value. Similarly, we create a target features dictionary for the target tokens. The features dictionaries help us encode our sentences as one-hot vectors; after all, computers only understand numbers. To decode the sentences we also need reverse features dictionaries, which store the index as the key and the word as the value.

input_features_dict = dict(
    [(token, i) for i, token in enumerate(input_tokens)])
target_features_dict = dict(
    [(token, i) for i, token in enumerate(target_tokens)])

reverse_input_features_dict = dict(
    (i, token) for token, i in input_features_dict.items())
reverse_target_features_dict = dict(
    (i, token) for token, i in target_features_dict.items())
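Here is a tiny round trip through a features dictionary and its reverse, as a sketch on a hypothetical three-word vocabulary (the names toy_tokens, features, and one_hot are made up for this example). Each word becomes a vector with a single 1 at the word's index, and the reverse dictionary turns that index back into the word.

```python
# Hypothetical three-word vocabulary, sorted like the real token lists.
toy_tokens = sorted({"know", "do", "not"})

features = {token: i for i, token in enumerate(toy_tokens)}
reverse_features = {i: token for token, i in features.items()}

def one_hot(token):
    """Return a list with a single 1.0 at the token's index."""
    vector = [0.0] * len(features)
    vector[features[token]] = 1.0
    return vector

encoded = [one_hot(token) for token in ["do", "not", "know"]]
decoded = [reverse_features[vector.index(1.0)] for vector in encoded]

print(encoded)
print(decoded)
```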

Training setup

To train our seq2seq model we use three matrices of one-hot vectors: the encoder input data, the decoder input data, and the decoder target data. We need two matrices for the decoder because of teacher forcing, the method the seq2seq model uses during training. The idea is that the decoder receives the correct token from the previous timestep as input, which helps the model learn to predict the current target token. Let’s create these matrices.

# Maximum length of sentences in input and target documents
max_encoder_seq_length = max([len(re.findall(r"[\w']+|[^\s\w]", input_doc)) for input_doc in input_docs])
max_decoder_seq_length = max([len(re.findall(r"[\w']+|[^\s\w]", target_doc)) for target_doc in target_docs])

encoder_input_data = np.zeros(
    (len(input_docs), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

for line, (input_doc, target_doc) in enumerate(zip(input_docs, target_docs)):
    for timestep, token in enumerate(re.findall(r"[\w']+|[^\s\w]", input_doc)):
        # Assign 1. for the current line, timestep, & word in encoder_input_data
        encoder_input_data[line, timestep, input_features_dict[token]] = 1.
    for timestep, token in enumerate(target_doc.split()):
        # decoder_input_data holds the token at the current timestep
        decoder_input_data[line, timestep, target_features_dict[token]] = 1.
        if timestep > 0:
            # decoder_target_data is one timestep ahead and does not include <START>
            decoder_target_data[line, timestep - 1, target_features_dict[token]] = 1.

The encoder_input_data array has three dimensions: the first indexes the sentence, the second the timestep (the position of the token in the sentence), and the third the token in the vocabulary. The below figure from the above-mentioned article illustrates this. The decoder_input_data and decoder_target_data arrays have the same layout.

Figure: dimensions of encoder_input_data
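If the three dimensions are hard to picture, this small sketch may help. It builds the same kind of array for a hypothetical corpus of two short sentences over a four-token vocabulary (toy_docs, toy_vocab, and toy_input_data are made-up names). Shorter sentences are simply left as zeros in the unused timesteps.

```python
import numpy as np

# Hypothetical toy corpus: two tokenized sentences and a four-token vocabulary.
toy_docs = [["i", "do", "not", "know"], ["i", "know"]]
toy_vocab = {"do": 0, "i": 1, "know": 2, "not": 3}

max_len = max(len(doc) for doc in toy_docs)
toy_input_data = np.zeros((len(toy_docs), max_len, len(toy_vocab)), dtype="float32")

for line, doc in enumerate(toy_docs):
    for timestep, token in enumerate(doc):
        toy_input_data[line, timestep, toy_vocab[token]] = 1.0

# Shape is (number of sentences, timesteps, vocabulary size).
print(toy_input_data.shape)
# One value per timestep of the second sentence: 1 where a token is set, 0 for padding.
print(toy_input_data[1].sum(axis=1))
```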

Training setup for the Encoder-decoder model

Our encoder model requires an input layer, which defines a matrix for holding the one-hot vectors, and an LSTM layer with some number of hidden units. The decoder model has almost the same structure, but here we pass in the encoder’s state data along with the decoder inputs.

from tensorflow import keras
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# Dimensionality
dimensionality = 256
# The batch size and number of epochs
batch_size = 256
epochs = 100

# Encoder
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(dimensionality, return_state=True)
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)
encoder_states = [state_hidden, state_cell]

# Decoder
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(dimensionality, return_sequences=True, return_state=True)
decoder_outputs, decoder_state_hidden, decoder_state_cell = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

You can learn more about how to code the encoder-decoder model here as a full explanation of it is out of scope for this article.

Building and training seq2seq model

Now we will create our seq2seq model and train it with encoder and decoder data as shown below.

# Model
training_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# Compiling
training_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Training
training_model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=batch_size, epochs=epochs, validation_split=0.2)
# Saving the trained model so that the testing setup can load it
training_model.save('training_model.h5')

Here, we use adam as the optimizer and categorical_crossentropy as the loss function. We call the .fit() method with the encoder and decoder input data (X/input) and the decoder target data (Y/label).
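To get a feel for what the loss measures, here is a hand-rolled sketch of categorical cross-entropy for a single timestep. It is a simplification for illustration, not the Keras implementation (which also clips the probabilities and averages over timesteps and samples), and the function and variable names are made up. The loss is the negative log of the probability the model gave to the correct token.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy for one timestep: -sum(target * log(probability))."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs) if t > 0)

target = [0.0, 1.0, 0.0]        # the correct token is at index 1
confident = [0.05, 0.90, 0.05]  # the model puts 90% on the correct token
unsure = [0.40, 0.30, 0.30]     # the model puts only 30% on the correct token

print(categorical_crossentropy(target, confident))
print(categorical_crossentropy(target, unsure))
```

The confident prediction gets a much smaller loss than the unsure one, which is what pushes the model toward the correct token during training.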

Two different approaches: performance comparison

With character-level prediction, we get a training accuracy of 53.35% and a validation accuracy of 52.77% after training, while the training and validation losses are 0.0720 and 0.1137 respectively. Look at the plots of accuracy and loss during the training process.

Figures: accuracy and loss during training (character-level prediction)

With word-level prediction, the training and validation accuracies are 71.07% and 72.99% respectively, while the training and validation losses are 0.0185 and 0.0624 respectively. Look at the plots of accuracy and loss during the training process.

Figures: accuracy and loss during training (word-level prediction)

The accuracy curves are very smooth for character-level prediction, while for word-level prediction the curve contains many spikes. With word-level prediction the accuracy is very high at the beginning, when the loss is also high, and as the loss goes down the accuracy tends to fluctuate and drop. This tells us not to rely on the word-level approach even though it gives a higher accuracy than the character-level approach, because the spikes make its performance uncertain.

Testing setup

To handle an input that the model has not seen, we need a model that decodes step by step instead of using teacher forcing, because the model we trained only works when the target sequence is known. In the translator, we do not know in advance what the translation of the user’s input will be. To do this, we have to build the seq2seq model in individual pieces. Let’s first build an encoder model with the encoder inputs and the encoder output states. We do this with the help of the previously trained model.

from tensorflow.keras.models import load_model

training_model = load_model('training_model.h5')

encoder_inputs = training_model.input[0]
# layers[2] is the encoder LSTM of the trained model
encoder_outputs, state_h_enc, state_c_enc = training_model.layers[2].output
encoder_states = [state_h_enc, state_c_enc]
encoder_model = Model(encoder_inputs, encoder_states)

Next, we will need to create placeholders for decoder input states as we do not know what we need to decode or what hidden state we will get.

latent_dim = 256
decoder_state_input_hidden = Input(shape=(latent_dim,))
decoder_state_input_cell = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_hidden, decoder_state_input_cell]

Now, we will create new decoder states and outputs with the help of the decoder LSTM and Dense layer that we trained earlier.

decoder_outputs, state_hidden, state_cell = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_hidden, state_cell]
decoder_outputs = decoder_dense(decoder_outputs)

Finally, we have the decoder input layer, the final states from the encoder, the decoder outputs from the Dense layer, and the decoder output states, which are the network’s memory from one word to the next. We can now bring all of this together and set up the decoder model as shown below.

decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

Testing our model

At last, we create a function that accepts our text input and generates a response using the encoder and decoder we created. In the function below, we pass in the NumPy matrix that represents our sentence and get the generated response back. I have added comments to almost every line of code so that you can understand it quickly. What happens in the function is this:

1. We retrieve the output states from the encoder.
2. We pass the output states to the decoder (as its initial hidden state) to decode the sentence word by word.
3. We update the hidden state of the decoder after decoding each word, so that previously decoded words help decode new ones.

We stop once we encounter the ‘<END>’ token that we added to the target sequences during preprocessing, or once we hit the maximum length of the sequence.

def decode_response(test_input):
    # Getting the output states to pass into the decoder
    states_value = encoder_model.predict(test_input)
    # Generating empty target sequence of length 1
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Setting the first token of target sequence with the start token
    target_seq[0, 0, target_features_dict['<START>']] = 1.
    # A variable to store our response word by word
    decoded_sentence = ''
    # Number of tokens decoded so far
    num_decoded_tokens = 0
    stop_condition = False
    while not stop_condition:
        # Predicting output tokens with probabilities and states
        output_tokens, hidden_state, cell_state = decoder_model.predict([target_seq] + states_value)
        # Choosing the one with highest probability
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = reverse_target_features_dict[sampled_token_index]
        decoded_sentence += " " + sampled_token
        num_decoded_tokens += 1
        # Stop if hit max length (counted in tokens, not characters) or found the stop token
        if (sampled_token == '<END>' or num_decoded_tokens > max_decoder_seq_length):
            stop_condition = True
        # Update the target sequence
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.
        # Update states
        states_value = [hidden_state, cell_state]
    return decoded_sentence
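The loop above is a greedy decoder: at each step it keeps only the most probable token and feeds it back in. The sketch below shows the same loop on a toy problem without any neural network. The probability table is entirely made up and stands in for decoder_model.predict, which in the real function also carries the LSTM states from one step to the next.

```python
# Hypothetical next-token probabilities keyed by the previous token
# (illustration only; the real values come from the trained decoder).
toy_next_token_probs = {
    "<START>": {"मुझे": 0.7, "नहीं": 0.2, "<END>": 0.1},
    "मुझे": {"नहीं": 0.6, "पता": 0.3, "<END>": 0.1},
    "नहीं": {"पता": 0.8, "<END>": 0.2},
    "पता": {"<END>": 0.9, "नहीं": 0.1},
}

def greedy_decode(max_tokens=10):
    """Pick the most probable token at each step until <END> or max_tokens."""
    previous, decoded = "<START>", []
    while len(decoded) < max_tokens:
        probs = toy_next_token_probs[previous]
        best = max(probs, key=probs.get)
        if best == "<END>":
            break
        decoded.append(best)
        previous = best
    return decoded

print(" ".join(greedy_decode()))
```

Greedy decoding is simple and fast, but it never revisits an earlier choice, so one poor token early on affects everything that follows.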

Putting it all together: machine translation

Let’s create a class that contains the methods required for running our translator.

class Translator:
    exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")

    # Method to start the translator
    def start(self):
        user_response = input("Give in an English sentence. :) \n")
        self.translate(user_response)

    # Method to handle the conversation
    def translate(self, reply):
        while not self.make_exit(reply):
            reply = input(self.generate_response(reply) + "\n")

    # Method to convert user input into a matrix
    def string_to_matrix(self, user_input):
        tokens = re.findall(r"[\w']+|[^\s\w]", user_input)
        user_input_matrix = np.zeros(
            (1, max_encoder_seq_length, num_encoder_tokens),
            dtype='float32')
        # Tokens beyond max_encoder_seq_length are dropped so that they fit the matrix
        for timestep, token in enumerate(tokens[:max_encoder_seq_length]):
            # Tokens that were not seen during training are skipped
            if token in input_features_dict:
                user_input_matrix[0, timestep, input_features_dict[token]] = 1.
        return user_input_matrix

    # Method that will create a response using seq2seq model we built
    def generate_response(self, user_input):
        input_matrix = self.string_to_matrix(user_input)
        chatbot_response = decode_response(input_matrix)
        # Remove <START> and <END> tokens from chatbot_response
        chatbot_response = chatbot_response.replace("<START>", '')
        chatbot_response = chatbot_response.replace("<END>", '')
        return chatbot_response

    # Method to check for exit commands
    def make_exit(self, reply):
        for exit_command in self.exit_commands:
            if exit_command in reply:
                print("Ok, have a great day!")
                return True
        return False

translator = Translator()

All methods are self-explanatory in the above code. Below is the final output for our translator!

Two different approaches: final output comparison

Figure: Output for word-level prediction
Figure: Output for character-level prediction

The above snapshots show the translations produced by our translator with the two different approaches.

Photo by Greg Bulla on Unsplash

You can find all of the code above, along with the dataset, on GitHub. You can also connect with me on LinkedIn. If you have any questions, you can leave a response here or in my LinkedIn inbox.

Conclusion

We managed to get an accuracy of around 53% with character-level prediction and 73% with word-level prediction. Natural language processing is a domain that requires tons of data, and the machine translation task especially so. The field develops and trains neural networks to approximate the approach the human brain takes to language processing. This deep learning strategy allows computers to handle human language much more efficiently. Companies like Google and Microsoft give human-level accuracy in the machine translation task. The networks these companies use are far more complex than the one we created here.

Translated from: https://towardsdatascience.com/machine-translation-with-the-seq2seq-model-different-approaches-f078081aaa37
