Generating TV Scripts with LSTM

Hello friends, I just completed this project as a part of the Deep Learning Nanodegree at Udacity.

The project is about predicting the next word in the script, given the previous context using LSTM.

I’ll be explaining the major and the very basic concepts that will be required while doing the project. The sections which I’ve covered include:

How can our data be transformed into a form that the model can work with?

  • Preprocessing the data
  • Batching the data

How the model architecture works with the batched data, passing it from one layer to the next, and how it finally leads to predicting the next word in the sequence.

Targeted Audience and Prerequisites

This is a fundamental project done with LSTM; anyone having basic knowledge about neural networks and RNNs can easily understand it. I have tried breaking the project down to the very basic details as simply as possible.

I have put a good amount of focus on the model architecture, the input shapes, the output shapes, and the parameters that each layer considers. I have compiled and explained what each of these input/output shapes means and why they are shaped in a particular way, so that you can visualize them better.

Note: The project has been coded in PyTorch. However, there is not much difference in the parameters that the LSTM model layers consider, so anyone trying to do it in Keras can refer to my explanation as well.

What have I not covered?

The theoretical details of what an LSTM is or why it works, though I have touched on why it works better than a plain neural network for this case. I have attached a link to the main project, but I haven’t explained the whole of it line by line.

However, I have covered the parts that I felt would be sufficient for understanding the main working so that you can generalize the concept to use it for other similar projects.

Let’s jump right in 🐬

Photo by Wilhelm Gunkel on Unsplash

Text preprocessing

We have the text; we first batch the data into a form in which it can be fed into the LSTM.

An important part of this is text preprocessing. We have applied a few very basic, necessary but not sufficient, preprocessing steps to the text (a minimal code sketch follows the list):

  1. Lower-casing the text
  2. Tokenizing the text: a step that splits longer strings of text into smaller pieces, or tokens.
  3. Tokenizing the punctuation, i.e. replacing “great!” with “great ||exclamation||” so that “great!” and “great” are not considered different words.
  4. Creating two lookup tables mapping words to integers (because a neural network understands only the language of numbers) and vice versa.
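
To make the steps above concrete, here is a minimal sketch of such a preprocessing pipeline; the helper names (token_lookup, preprocess) and the exact punctuation table are illustrative, not necessarily the ones used in the project:

```python
# A minimal sketch of the preprocessing steps above; names are illustrative.
from collections import Counter

def token_lookup():
    # Map punctuation to placeholder tokens so that "great!" and "great"
    # end up sharing the same word token.
    return {
        '.': ' ||period|| ', ',': ' ||comma|| ', '"': ' ||quotation_mark|| ',
        ';': ' ||semicolon|| ', '!': ' ||exclamation|| ', '?': ' ||question_mark|| ',
        '(': ' ||left_paren|| ', ')': ' ||right_paren|| ', '-': ' ||dash|| ',
        '\n': ' ||return|| ',
    }

def preprocess(text):
    text = text.lower()                               # 1. lower-case the text
    for punct, token in token_lookup().items():       # 3. tokenize the punctuation
        text = text.replace(punct, token)
    words = text.split()                              # 2. tokenize into words
    counts = Counter(words)
    sorted_vocab = sorted(counts, key=counts.get, reverse=True)
    vocab_to_int = {word: i for i, word in enumerate(sorted_vocab)}   # 4. lookup tables
    int_to_vocab = {i: word for word, i in vocab_to_int.items()}
    int_text = [vocab_to_int[word] for word in words]
    return int_text, vocab_to_int, int_to_vocab
```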

Some other text processing techniques are stemming and lemmatization. This is a good read on text processing.

After having preprocessed the text and converted it to integers, the text data is batched into a form that the LSTM can take as input.

Now before moving further, a very basic yet important question:

Why RNN/LSTM and not a normal neural network?

The two most important terms in RNNs/LSTMs are recurrent units and hidden units, both of which play a part in considering the context of the text. For example, if you join a discussion midway, it is difficult for you to converse; you need to know the context of the discussion before you say something. That is made possible with the help of hidden units: the hidden units carry forward the information from the previous context, so the LSTM/RNN unit considers two inputs, the input word and the hidden state.

A normal neural network could have worked as well, but it would work without considering the previous context, simply giving the probability of a word occurring as the next word given only the previous word.

Input batch: [What, are, you, thinking], Target: [‘asked’] (Image by Author)

In the image above, “h” is the hidden unit, “o” denotes the output from each LSTM unit, and the final output is matched for similarity with the target.

Batching with an example:

Let’s consider a piece of text; this is an extract from Quora:

“What are you thinking?” asked the boss. “We didn’t ask anyone any question here. By asking a few questions we won’t be able to assess the skills of anyone. So our test was to assess the attitude of the person. We kept certain tests based on the behavior of the candidates and we observed everyone through CCTV.”

After selecting a suitable sequence length and batch size (4 and 3 here, respectively), an input to the LSTM cells looks as follows; I am ignoring punctuation here. A code sketch of this batching appears after the example.

First batch:

[what, are, you, thinking] -> [asked]

[are, you, thinking, asked] -> [the]

[you, thinking, asked, the] -> [boss]

Second batch:

[thinking, asked, the, boss] -> [we]

[asked, the, boss, we] -> [didnt]

[the, boss, we, didnt] -> [ask]

Of course, converted to integers.
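
Here is a minimal sketch of this batching scheme, assuming the text has already been converted to a list of integers; the helper name batch_data and the use of PyTorch’s DataLoader are illustrative rather than the exact project code:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

def batch_data(int_words, sequence_length, batch_size):
    # Slide a window of `sequence_length` words over the text;
    # the word right after each window becomes the target.
    features, targets = [], []
    for i in range(len(int_words) - sequence_length):
        features.append(int_words[i:i + sequence_length])
        targets.append(int_words[i + sequence_length])
    dataset = TensorDataset(torch.tensor(features), torch.tensor(targets))
    return DataLoader(dataset, batch_size=batch_size, shuffle=False)

# sequence_length=4 and batch_size=3 reproduce the batches shown above
# (once the words have been mapped to integers).
```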

Batching (Photo by Nick Fewings on Unsplash)

If instead we had used a normal neural network, the batches would have looked something like this (batch size: 3):

First batch:

[what] -> [are]

[are] -> [you]

[you] -> [thinking]

So it is quite clear that the neural network does not consider the previous context.

Model Architecture and Working:

Coming to how these batches are processed by the LSTM.

Note:

Below is a picture of the architecture of our model with the parameters

Batch_size: 2, sequence_len: 3, embedding_dim: 15, output_size = vocab_size: 20, hidden_dim: 10

Please refer to it for a better understanding of the architecture; I have used small numbers for ease of representation.

1. The Embedding Layer

If these integers (words mapped to integers in the vocab_to_int dictionary) were passed into the LSTM unit directly, it might uncover a biased meaning, treating these integers as some kind of weights. To avoid this problem we sometimes one-hot encode these integers. But when the vocabulary dictionary is large, the encodings take up a lot of unnecessary space.

Suppose the vocab dictionary size = 3.

The vectors occupy a space of 3*3 cells, i.e. 9:

[1,0,0]
[0,1,0]
[0,0,1]

Now imagine this for a vocabulary size of 10,000. That would mean 10⁸ cells of space.

Hence comes the concept of the Embedding Layer.

Instead of replacing these words with one-hot encodings, this time we replace them with fixed-length vectors filled with randomly chosen numbers. This matrix of values gets trained along with the model over time and learns a close vector representation of the words, which can help add meaning to them and helps increase the overall accuracy.

The torch.nn.Embedding layer takes in the parameters:

  • num_embeddings (size of the dictionary), which is equal to vocab_size
  • embedding_dim (size of the embedding vectors used for the word representations)

The input to this layer is shaped as (batch_size, seq_length) and the output from this layer is of shape (batch_size, seq_length, embedding_dim).

For a vocab size of 25,000, an embedding vector of size 300 works well, which is considerably smaller than the one-hot encoded vectors of size 25,000 each.
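
As a quick shape check, here is a minimal sketch using the small example parameters from the architecture figure (the variable names are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 20, 15
batch_size, seq_length = 2, 3

embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
x = torch.randint(0, vocab_size, (batch_size, seq_length))  # integer-encoded words
embedded = embedding(x)
print(embedded.shape)  # torch.Size([2, 3, 15]) -> (batch_size, seq_length, embedding_dim)
```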

Some pre-trained word representation models like fastText and GloVe are available and can be used directly instead of generating these representations with the Embedding Layer.

2. The LSTM Layer

The torch.nn.LSTM layer expects the parameters input_size, hidden_size (called hidden_dim here), and num_layers. input_size and hidden_dim are the number of features in the input and in the hidden state respectively.

Note: Anyone who is familiar with CNNs can relate to input_size as the in_channels and hidden_dim as the out_channels of a CNN layer.

Since we have used an embedding layer, the embedding vector length is the number of input features; therefore, input_size = embedding_dim. The parameters hidden_dim and num_layers (the number of LSTM layers, generally 1 or 2) can be chosen arbitrarily.

The nn.LSTM layer returns an output and a hidden state; the output has the shape (batch_size, sequence_length, hidden_dim).
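
Continuing the shape check, here is a minimal sketch of this layer; batch_first=True is assumed so that the batch dimension comes first, matching the shapes discussed here:

```python
import torch
import torch.nn as nn

batch_size, seq_length, embedding_dim, hidden_dim, num_layers = 2, 3, 15, 10, 2

lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim,
               num_layers=num_layers, batch_first=True)
embedded = torch.randn(batch_size, seq_length, embedding_dim)  # stand-in for the embedding output
lstm_out, (h_n, c_n) = lstm(embedded)       # output plus the hidden and cell states
print(lstm_out.shape)  # torch.Size([2, 3, 10]) -> (batch_size, seq_length, hidden_dim)
```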

3. The Linear Layer

This output is reshaped to (batch_size*sequence_length, hidden_dim) and passed on to the Linear layer with in_features = hidden_dim and out_features = output_size.

In our case, output_size = vocabulary size, i.e. the number of output nodes is equal to the total number of words present in the vocabulary, so that the network can choose the most probable next word from the whole vocabulary.

The output from the Linear layer is shaped as (batch_size*sequence_length, output_size), which is further reshaped to (batch_size, sequence_length, output_size). Although we train the model on the whole corpus to predict the next word for every word of the sequence, we only need the next word after the last word of the sequence, so the returned output from our model considers just that.

Thus the final output layer is shaped as (batch_size, output_size).

The output from our model gives us the most probable next word for each sequence in the given batch. 💁
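
Putting the three layers together, here is a minimal sketch of the forward pass described above. The class name and hyperparameters are illustrative, and the hidden state that the project threads between batches is omitted for brevity:

```python
import torch
import torch.nn as nn

class ScriptRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)            # output_size == vocab_size

    def forward(self, x):
        batch_size = x.size(0)
        embedded = self.embedding(x)                           # (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(embedded)                      # (batch, seq_len, hidden_dim)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)  # (batch*seq_len, hidden_dim)
        out = self.fc(lstm_out)                                # (batch*seq_len, vocab_size)
        out = out.view(batch_size, -1, out.size(1))            # (batch, seq_len, vocab_size)
        return out[:, -1]                                      # last time step -> (batch, vocab_size)

# With the example parameters from the figure:
model = ScriptRNN(vocab_size=20, embedding_dim=15, hidden_dim=10, num_layers=2)
x = torch.randint(0, 20, (2, 3))      # batch_size=2, sequence_len=3
print(model(x).shape)                 # torch.Size([2, 20]) -> (batch_size, output_size)
```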

Example architecture with Batch_size: 2, sequence_len: 3, embedding_dim: 15, output_size = vocab_size: 20, hidden_dim: 10 (Image by Author)

CODE:

The full code for this project can be found here:

Try this out! 💻

I have tried to cover all the important and necessary concepts for doing this project, yet if you face difficulty anywhere or feel I have missed a point, do let me know in the comments below.

I’ll be eagerly waiting to see you complete this project. 🙌

Best Wishes.

Adieu 😁

Photo by Jonathan Kemper on Unsplash

Translated from: https://towardsdatascience.com/generating-tv-scripts-with-lstm-e94be65a179
