Python Voice Chatbot: Building a Language Translation Chatbot in Python, Step by Step


weixin_26750481


Original article: https://medium.com/swlh/building-a-language-translation-chatbot-in-python-step-by-step-40709393a98


In this article, we will build a language translation model and test it by providing input in one language and getting the translated output in the desired language. The model uses a sequence-to-sequence (seq2seq) architecture, implemented in Python.

A sequence-to-sequence model has two parts: an encoder and a decoder. They are two separate neural networks combined into one larger network. The encoder's task is to read the input sequence, after text cleaning has been applied, and create a compact vector representation of it. The encoder then passes that vector to the decoder network, which generates the output sequence.

Data

We will use an English-to-Hindi translation dataset with around 3,000 conversational sentences of the kind used in day-to-day life. The data can be taken from any open-source resource; one is available on Kaggle.


Here I am using a simple text file of space-separated conversations. It contains English-to-Hindi conversations, but you can use your own language pair. If your data is in the same text-file format, you can follow my code without changes; otherwise you may need to make small adjustments for your data format.

Chunks of data

We need to break the data into chunks and train the deep learning model on one chunk at a time so that the machine does not run out of memory.
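One way to do this, sketched here under the assumption that the corpus is already loaded as a list of (English, Hindi) pairs, is a generator that yields fixed-size chunks. The names `iter_chunks` and `chunk_size` are illustrative and are not taken from the original code; the toy pairs use transliterated Hindi as placeholders.

```python
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def iter_chunks(pairs: Sequence[Pair], chunk_size: int) -> Iterator[List[Pair]]:
    """Yield successive slices of at most chunk_size pairs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(pairs), chunk_size):
        yield list(pairs[start:start + chunk_size])


# Toy data standing in for the real corpus (hypothetical examples).
pairs = [("hello", "namaste"), ("thank you", "dhanyavaad"), ("yes", "haan")]
chunks = list(iter_chunks(pairs, chunk_size=2))
```

Because the generator only materializes one chunk at a time, the training loop can process each chunk and release it before asking for the next.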

Setting Vector size

We need to set the vector size. This is the length of the output arrays, defined up front so that every output array has the same size.

embed_size = 100     # size of each word embedding vector
max_features = 6000  # cap on the number of unique words
maxlen = 100         # maximum sequence length, in tokens

Text Processing

As discussed, the encoder and decoder data need to be processed to get better results. For language translation, we will use text cleaning methods such as:

Removing all the stopwords
Changing the word case
Removing all the numeric data
Removing duplicate words
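The four cleaning steps listed above can be sketched in a single function. This is a minimal illustration, not the article's own implementation: the stopword set is a tiny hypothetical sample, and the letter-only regex is suited to English text (Hindi text with combining marks may need a different tokenizer).

```python
import re

# A tiny illustrative stopword set; a real project would use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "at", "and"}


def clean_text(sentence: str) -> str:
    """Lowercase, drop digits and punctuation, stopwords, and repeated words."""
    lowered = sentence.lower()
    # Keep runs of letters only, which also discards digits and punctuation.
    words = re.findall(r"[^\W\d_]+", lowered)
    seen = set()
    kept = []
    for word in words:
        if word in STOPWORDS or word in seen:
            continue
        seen.add(word)
        kept.append(word)
    return " ".join(kept)


cleaned = clean_text("The train is at 9 and the Train is late")
```

Here `cleaned` is `"train late"`: the case is normalized, the digit and stopwords are dropped, and the second "train" is removed as a duplicate.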

Word Embedding

We will use a word2vec-style embedding model to convert the text data into vectors of a defined size.

Word2Vec is a technique for turning words into numbers. Machine learning and deep learning models accept input only in numeric form.

Word2Vec has two well-known training architectures:

CBOW (Continuous Bag of Words)
Skip-Gram

We can use any pre-trained word embedding model. Here we will use GloVe. GloVe combines the strength of the word2vec skip-gram model on word analogy tasks with global word co-occurrence statistics. Pre-trained GloVe vectors can be found online. They are distributed as a .txt file that can be loaded in Python.

GloVe embeddings are available in small vector sizes, which is enough for day-to-day chats.
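As a rough sketch of loading such a file: in the GloVe text format each line holds a word followed by its vector components, separated by spaces. The function name `load_glove` and the in-memory sample are assumptions made for illustration; plain Python lists are used here to avoid extra dependencies, although NumPy arrays are more common in practice.

```python
import io
from typing import Dict, List, TextIO


def load_glove(handle: TextIO) -> Dict[str, List[float]]:
    """Parse GloVe text format: a word followed by its vector values per line."""
    embeddings: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        embeddings[word] = [float(v) for v in values]
    return embeddings


# In-memory stand-in for a real downloaded GloVe .txt file.
sample = io.StringIO("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
embeddings_index = load_glove(sample)
```

With a real file, the same function can be given a handle opened with `open(path, encoding="utf-8")`, where `path` points to wherever the vectors were saved.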

Data Tokenization

After initializing the word embedding, we need to tokenize the data with it. The embedding converts each word into a vector of numbers of the defined size. Machine learning and deep learning models work on numeric data, so any text must be converted to numbers by mapping each word to a specific vector that we can identify later.
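A common first step toward this, shown here as an assumption about how the `max_features` and `maxlen` settings are meant to be used, is to map each word to an integer id and pad every sentence to a fixed length. The helpers `build_vocab` and `encode` are hypothetical names, not part of the original code.

```python
from collections import Counter
from typing import Dict, List

max_features = 6000  # vocabulary cap, as set earlier in the article
maxlen = 100         # fixed sequence length, as set earlier in the article


def build_vocab(sentences: List[str], limit: int) -> Dict[str, int]:
    """Map the most frequent words to integer ids; 0 is reserved for padding."""
    counts = Counter(word for s in sentences for word in s.split())
    most_common = counts.most_common(limit - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}


def encode(sentence: str, vocab: Dict[str, int], length: int) -> List[int]:
    """Convert words to ids, drop unknown words, then pad or truncate."""
    ids = [vocab[w] for w in sentence.split() if w in vocab][:length]
    return ids + [0] * (length - len(ids))


vocab = build_vocab(["good morning", "good night"], max_features)
encoded = encode("good night", vocab, maxlen)
```

Each id can then be used to look up the word's embedding vector, so every sentence becomes an array of the same shape.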

Data Preparation

Finally, we use the data processing steps defined above to clean the data and use tokenized_data.py to convert it into tokens. The input is a set of question and answer pairs. We apply the text cleaning steps and then pass the text through the pre-trained word embedding model to assign each word a vector. We then take the average of the word vectors to get a sentence vector.
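The averaging step can be illustrated with a small sketch. It assumes the embeddings are held in a dictionary from word to vector and that unknown words are simply skipped; both choices, and the name `sentence_vector`, are assumptions rather than details from the original code.

```python
from typing import Dict, List


def sentence_vector(sentence: str, embeddings: Dict[str, List[float]], size: int) -> List[float]:
    """Average the vectors of known words; return zeros if none are known."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return [0.0] * size
    # zip(*vectors) groups the values of each dimension together.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy 2-dimensional embeddings, for illustration only.
toy_embeddings = {"good": [1.0, 3.0], "day": [3.0, 5.0]}
vec = sentence_vector("good day", toy_embeddings, size=2)
```

Here `vec` is `[2.0, 4.0]`, the element-wise mean of the two word vectors. Note that averaging discards word order, so two sentences with the same words in a different order get the same vector.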

We also need to mark the start and end of each chat sentence so that the model can tell where a sentence begins and where it ends, which helps at inference time.
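A minimal sketch of such markers follows. The marker strings `<start>` and `<end>` and both helper names are hypothetical; the article does not say which tokens it uses. The example sentence is transliterated Hindi used as a placeholder.

```python
START, END = "<start>", "<end>"  # hypothetical marker strings


def add_markers(sentence: str) -> str:
    """Wrap a target sentence in start and end markers."""
    return f"{START} {sentence.strip()} {END}"


def strip_markers(sentence: str) -> str:
    """Remove the markers and anything generated after the end marker."""
    words = sentence.split()
    if END in words:
        words = words[:words.index(END)]
    return " ".join(w for w in words if w != START)


marked = add_markers("aap kaise hain")
restored = strip_markers(marked + " extra")
```

At training time the markers are added to the target sentences; at inference time, generation can stop at the end marker and the markers are stripped before the result is shown.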

Train

Finally, it is time to train the model. We pass the cleaned, vectorized data to the sequence-to-sequence model. The model is trained over all the conversations using the data chunks defined at the beginning.

Using our Language Translation Chatbot

Now it is time to use the trained model. Before that, we need to define some helper functions that clean the input, transform it into vectors, pass it to the trained translation model, and decode the output vector into the translated sentence.

enc_model, dec_model = make_inference_models()

Running Chatbot

So, we have trained the model on the chunks of data we created, over several epochs. To check its performance, we can start giving it input and observe the output. Here I am using a loop to ask the model 10 translation questions. For each one, the model takes the input, cleans it, creates word vectors, and takes their mean to produce a sentence vector. The sentence vector goes into the model, which returns another sentence vector that we decode and print.
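The shape of that loop can be sketched as follows. Since the trained model is not available here, the translator is passed in as a function and replaced by a stand-in; `run_chat` and `fake_translate` are hypothetical names, and a real translator would wrap the cleaning, vectorizing, and decoding steps described above.

```python
from typing import Callable, Iterable, List


def run_chat(translate: Callable[[str], str], prompts: Iterable[str], turns: int = 10) -> List[str]:
    """Feed up to `turns` inputs to `translate` and collect the replies."""
    replies = []
    for turn, text in enumerate(prompts):
        if turn >= turns:
            break
        replies.append(translate(text))
    return replies


# Stand-in translator so the loop can run without a trained model.
def fake_translate(text: str) -> str:
    return f"[hi] {text}"


replies = run_chat(fake_translate, ["hello", "good morning"], turns=10)
```

Passing the translator as an argument keeps the loop independent of the model, so it can be tested with a stub and later pointed at the real inference function.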

We now have a language translation model that converts English sentences to Hindi. The same code can be used for other language pairs.

How to Increase Accuracy

The accuracy of your model depends on the data source and on choosing a model that suits your data. The more data you have, the more thoroughly you can train and validate the model.

So, there it is. We have built a language translation chatbot in Python. Try it with your own data and your own language. Feel free to ask questions in the comment section. Happy learning!
