Practical Text Classification With Python and Keras

Imagine you could know the mood of the people on the Internet. Maybe you are not interested in its entirety, but only in whether people are happy today on your favorite social media platform. After this tutorial, you’ll be equipped to do this. While doing this, you will get a grasp of current advancements in (deep) neural networks and how they can be applied to text.

Reading the mood from text with machine learning is called sentiment analysis, and it is one of the prominent use cases in text classification. This falls into the very active research field of natural language processing (NLP). Other common use cases of text classification include detection of spam, auto tagging of customer queries, and categorization of text into defined topics. So how can you do this?

Choosing a Data Set

Before we start, let’s take a look at what data we have. Go ahead and download the data set from the Sentiment Labelled Sentences Data Set from the UCI Machine Learning Repository.

By the way, this repository is a wonderful source for machine learning data sets when you want to try out some algorithms. This data set includes labeled reviews from IMDb, Amazon, and Yelp. Each review is marked with a score of 0 for a negative sentiment or 1 for a positive sentiment.

Extract the folder into a data folder and go ahead and load the data with Pandas:

import pandas as pd

filepath_dict = {'yelp':   'data/sentiment_analysis/yelp_labelled.txt',
                 'amazon': 'data/sentiment_analysis/amazon_cells_labelled.txt',
                 'imdb':   'data/sentiment_analysis/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])

The result will be as follows:

This looks about right. With this data set, you are able to train a model to predict the sentiment of a sentence. Take a quick moment to think about how you would go about predicting the data.

One way you could do this is to count the frequency of each word in each sentence and tie this count back to the entire set of words in the data set. You would start by taking the data and creating a vocabulary from all the words in all sentences. The collection of texts is also called a corpus in NLP.

The vocabulary in this case is a list of words that occurred in our text where each word has its own index. This enables you to create a vector for a sentence. You would then take the sentence you want to vectorize, and you count each occurrence in the vocabulary. The resulting vector will have the length of the vocabulary and a count for each word in the vocabulary.

The resulting vector is also called a feature vector. In a feature vector, each dimension can be a numeric or categorical feature, like for example the height of a building, the price of a stock, or, in our case, the count of a word in a vocabulary. These feature vectors are a crucial piece in data science and machine learning, as the model you want to train depends on them.

Let’s quickly illustrate this. Imagine you have the following two sentences:

>>> sentences = ['John likes ice cream', 'John hates chocolate.']

Next, you can use the CountVectorizer provided by the scikit-learn library to vectorize sentences. It takes the words of each sentence and creates a vocabulary of all the unique words in the sentences. This vocabulary can then be used to create a feature vector of the count of the words:

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> vectorizer = CountVectorizer(min_df=0, lowercase=False)
>>> vectorizer.fit(sentences)
>>> vectorizer.vocabulary_
{'John': 0, 'chocolate': 1, 'cream': 2, 'hates': 3, 'ice': 4, 'likes': 5}

This vocabulary serves also as an index of each word. Now, you can take each sentence and get the word occurrences based on the previous vocabulary. The vocabulary consists of all six unique words in our sentences, each representing one word in the vocabulary. When you take the previous two sentences and transform them with the CountVectorizer you will get a vector representing the count of each word of the sentence:

>>> vectorizer.transform(sentences).toarray()
array([[1, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0]])

Now, you can see the resulting feature vectors for each sentence based on the previous vocabulary. For example, if you take a look at the first item, you can see that both vectors have a 1 there. This means that both sentences have one occurrence of John, which is in the first place in the vocabulary.

This is considered a Bag-of-words (BOW) model, which is a common way in NLP to create vectors out of text. Each document is represented as a vector. You can use these vectors now as feature vectors for a machine learning model. This leads us to our next part, defining a baseline model.

Defining a Baseline Model

When you work with machine learning, one important step is to define a baseline model. This usually involves a simple model, which is then used as a comparison with the more advanced models that you want to test. In this case, you’ll use the baseline model to compare it to the more advanced methods involving (deep) neural networks, the meat and potatoes of this tutorial.

First, you are going to split the data into a training and testing set, which will allow you to evaluate the accuracy and see if your model generalizes well. That means seeing whether the model is able to perform well on data it has not seen before. This is a way to see if the model is overfitting.

Overfitting is when a model is trained too well on the training data. You want to avoid overfitting, as this would mean that the model mostly just memorized the training data. This would account for a large accuracy with the training data but a low accuracy in the testing data.

We start by taking the Yelp data set which we extract from our concatenated data set. From there, we take the sentences and labels. The .values returns a NumPy array instead of a Pandas Series object which is in this context easier to work with:

>>> from sklearn.model_selection import train_test_split

>>> df_yelp = df[df['source'] == 'yelp']

>>> sentences = df_yelp['sentence'].values
>>> y = df_yelp['label'].values

>>> sentences_train, sentences_test, y_train, y_test = train_test_split(
...    sentences, y, test_size=0.25, random_state=1000)

Here we will again use the previous BOW model to vectorize the sentences. You can use the CountVectorizer again for this task. Since you might not have the testing data available during training, you can create the vocabulary using only the training data. Using this vocabulary, you can create the feature vectors for each sentence of the training and testing set:

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> vectorizer = CountVectorizer()
>>> vectorizer.fit(sentences_train)

>>> X_train = vectorizer.transform(sentences_train)
>>> X_test  = vectorizer.transform(sentences_test)
>>> X_train
<750x1714 sparse matrix of type '<class 'numpy.int64'>'
    with 7368 stored elements in Compressed Sparse Row format>

You can see that the resulting feature vectors have 750 samples, which is the number of training samples we have after the train-test split. Each sample has 1714 dimensions, which is the size of the vocabulary. Also, you can see that we get a sparse matrix. This is a data type that is optimized for matrices with only a few non-zero elements; it only keeps track of the non-zero elements, reducing the memory load.

CountVectorizer performs tokenization which separates the sentences into a set of tokens as you saw previously in the vocabulary. It additionally removes punctuation and special characters and can apply other preprocessing to each word. If you want, you can use a custom tokenizer from the NLTK library with the CountVectorizer or use any number of the customizations which you can explore to improve the performance of your model.
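
If you want to try this, here is a minimal sketch of plugging a custom tokenizer into the vectorizer; it assumes the NLTK package and its punkt tokenizer data are installed, and it is only meant as an illustration of the option:

import nltk
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')  # tokenizer data, only needed once

# Use NLTK's word_tokenize instead of the default regex-based tokenization
vectorizer = CountVectorizer(tokenizer=nltk.word_tokenize)
vectorizer.fit(sentences_train)
print(len(vectorizer.vocabulary_))  # vocabulary size with the custom tokenizer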

Note: There are a lot of additional parameters to CountVectorizer() that we forgo using here, such as adding ngrams, because the goal at first is to build a simple baseline model. The token pattern itself defaults to token_pattern='(?u)\b\w\w+\b', which is a regex pattern that says, “a word is 2 or more Unicode word characters surrounded by word boundaries.”

The classification model we are going to use is logistic regression, which is a simple yet powerful linear model that, mathematically speaking, is in fact a form of regression between 0 and 1 based on the input feature vector. By specifying a cutoff value (by default 0.5), the regression model is used for classification. You can again use the scikit-learn library, which provides the LogisticRegression classifier:

>>> from sklearn.linear_model import LogisticRegression

>>> classifier = LogisticRegression()
>>> classifier.fit(X_train, y_train)
>>> score = classifier.score(X_test, y_test)

>>> print("Accuracy:", score)
Accuracy: 0.796
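
As a quick check of the cutoff behavior described above, you can compare the predicted probabilities against the default 0.5 threshold yourself; this small sketch just assumes the classifier trained in the previous snippet:

import numpy as np

# Probability of the positive class (label 1) for each test sentence
probabilities = classifier.predict_proba(X_test)[:, 1]

# Applying the default 0.5 cutoff should reproduce classifier.predict()
manual_predictions = (probabilities >= 0.5).astype(int)
print(np.array_equal(manual_predictions, classifier.predict(X_test)))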

You can see that the logistic regression reached an impressive 79.6%, but let’s have a look at how this model performs on the other data sets that we have. In this script, we perform and evaluate the whole process for each data set that we have:

for source in df['source'].unique():
    df_source = df[df['source'] == source]
    sentences = df_source['sentence'].values
    y = df_source['label'].values

    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)

    vectorizer = CountVectorizer()
    vectorizer.fit(sentences_train)
    X_train = vectorizer.transform(sentences_train)
    X_test  = vectorizer.transform(sentences_test)

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test)
    print('Accuracy for {} data: {:.4f}'.format(source, score))

Here’s the result:

Great! You can see that this fairly simple model achieves a fairly good accuracy. It would be interesting to see whether we are able to outperform this model. In the next part, we will get familiar with (deep) neural networks and how to apply them to text classification.

A Primer on (Deep) Neural Networks

You might have experienced some of the excitement and fear related to artificial intelligence and deep learning. You might have stumbled across some confusing article or concerned TED talk about the approaching singularity or maybe you saw the backflipping robots and you wonder whether a life in the woods seems reasonable after all.

On a lighter note, AI researchers all agreed that they did not agree with each other on when AI will exceed human-level performance. According to this paper, we should still have some time left.

So you might already be curious how neural networks work. If you already are familiar with neural networks, feel free to skip to the parts involving Keras. Also, there is the wonderful Deep Learning book by Ian Goodfellow which I highly recommend if you want to dig deeper into the math. You can read the whole book online for free. In this section you will get an overview of neural networks and their inner workings, and you will later see how to use neural networks with the outstanding Keras library.

In this article, you don’t have to worry about the singularity, but (deep) neural networks play a crucial role in the latest developments in AI. It all started with a famous paper in 2012 by Geoffrey Hinton and his team, which outperformed all previous models in the famous ImageNet Challenge.

The challenge could be considered the World Cup in computer vision which involves classifying a large set of images based on given labels. Geoffrey Hinton and his team managed to beat the previous models by using a convolutional neural network (CNN), which we will cover in this tutorial as well.

Since then, neural networks have moved into several fields involving classification, regression and even generative models. The most prevalent fields include computer vision, voice recognition and natural language processing (NLP).

Neural networks, sometimes called artificial neural networks (ANNs) or feedforward neural networks, are computational networks which were vaguely inspired by the neural networks in the human brain. They consist of neurons (also called nodes) which are connected like in the graph below.

You start by having a layer of input neurons where you feed in your feature vectors, and the values are then fed forward to a hidden layer. At each connection, you are feeding the value forward, while the value is multiplied by a weight and a bias is added to the value. This happens at every connection, and at the end you reach an output layer with one or more output nodes.

If you want to have a binary classification you can use one node, but if you have multiple categories you should use one node for each category:

Neural network model

You can have as many hidden layers as you wish. In fact, a neural network with more than one hidden layer is considered a deep neural network. Don’t worry: I won’t get into the mathematical depths concerning neural networks here. But if you want to get an intuitive visual understanding of the math involved, you can check out the YouTube Playlist by Grant Sanderson. The formula from one layer to the next is this short equation:

Neural network formula: o = f(sum(a * w) + b)

Let’s slowly unpack what is happening here. You see, we are dealing here with only two layers. The layer with nodes a serves as input for the layer with nodes o. In order to calculate the values for each output node, we have to multiply each input node by a weight w and add a bias b.

All of those have to be then summed and passed to a function f. This function is considered the activation function and there are various different functions that can be used depending on the layer or the problem. It is generally common to use a rectified linear unit (ReLU) for hidden layers, a sigmoid function for the output layer in a binary classification problem, or a softmax function for the output layer of multi-class classification problems.
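
To make these three activation functions a bit more concrete, here is a small NumPy sketch of them; this is only an illustration, since Keras ships its own implementations:

import numpy as np

def relu(x):
    # Rectified linear unit: negative values become 0
    return np.maximum(0, x)

def sigmoid(x):
    # Squashes values into the range (0, 1), useful for binary outputs
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # Turns a vector into probabilities that sum to 1, useful for multi-class outputs
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), softmax(x))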

You might already wonder how the weights are calculated, and this is obviously the most important part of neural networks, but also the most difficult part. The algorithm starts by initializing the weights with random values and they are then trained with a method called backpropagation.

This is done by using optimization methods (also called optimizer) like the gradient descent in order to reduce the error between the computed and the desired output (also called target output). The error is determined by a loss function whose loss we want to minimize with the optimizer. The whole process is too extensive to cover here, but I’ll refer again to the Grant Sanderson playlist and the Deep Learning book by Ian Goodfellow I mentioned before.

What you have to know is that there are various optimization methods that you can use, but the most common optimizer currently used is called Adam which has a good performance in various problems.

You can also use different loss functions, but in this tutorial you will only need the cross entropy loss function or more specifically binary cross entropy which is used for binary classification problems. Be sure to experiment with the various available methods and tools. Some researchers even claim in a recent article that the choice for the best performing methods borders on alchemy. The reason being that many methods are not well explained and consist of a lot of tweaking and testing.

Introducing Keras

Keras is a deep learning and neural networks API by François Chollet which is capable of running on top of Tensorflow (Google), Theano or CNTK (Microsoft). To quote the wonderful book by François Chollet, Deep Learning with Python:

KerasFrançoisChollet开发的深度学习和神经网络API,能够在Tensorflow (Google), TheanoCNTK (Microsoft)之上运行。 引用FrançoisChollet的精彩著作《 Python深度学习》:

Keras is a model-level library, providing high-level building blocks for developing deep-learning models. It doesn’t handle low-level operations such as tensor manipulation and differentiation. Instead, it relies on a specialized, well-optimized tensor library to do so, serving as the backend engine of Keras (Source)

It is a great way to start experimenting with neural networks without having to implement every layer and piece on your own. For example Tensorflow is a great machine learning library, but you have to implement a lot of boilerplate code to have a model running.

Installing Keras

Before installing Keras, you’ll need either Tensorflow, Theano, or CNTK. In this tutorial we will be using Tensorflow so check out their installation guide here, but feel free to use any of the frameworks that works best for you. Keras can be installed using PyPI with the following command:

$ pip install keras

You can choose the backend you want to have by opening the Keras configuration file, which you can find here:

$HOME/.keras/keras.json

If you are a Windows user, you have to replace $HOME with %USERPROFILE%. The configuration file should look as follows:

{
    "image_data_format": "channels_last",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}

You can change the backend field there to "theano", "tensorflow" or "cntk", given that you have installed the backend on your machine. For more details check out the Keras backends documentation.

You might notice that we use float32 data in the configuration file. The reason for this is that neural networks are frequently used on GPUs, and the computational bottleneck is memory. By using 32 bit, we are able to reduce the memory load, and we do not lose too much information in the process.

Your First Keras Model

Now you are finally ready to experiment with Keras. Keras supports two main types of models. There is the Sequential model API, which you are going to see in use in this tutorial, and the functional API, which can do everything the Sequential model can but can also be used for advanced models with complex network architectures.

The Sequential model is a linear stack of layers, where you can use the large variety of available layers in Keras. The most common layer is the Dense layer which is your regular densely connected neural network layer with all the weights and biases that you are already familiar with.

Let’s see if we can achieve some improvement to our previous logistic regression model. You can use the X_train and X_test arrays that you built in our earlier example.

Before we build our model, we need to know the input dimension of our feature vectors. This happens only in the first layer since the following layers can do automatic shape inference. In order to build the Sequential model, you can add layers one by one in order as follows:

>>> from keras.models import Sequential
>>> from keras import layers

>>> input_dim = X_train.shape[1]  # Number of features

>>> model = Sequential()
>>> model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
>>> model.add(layers.Dense(1, activation='sigmoid'))
Using TensorFlow backend.

Before you can start with the training of the model, you need to configure the learning process. This is done with the .compile() method. This method specifies the optimizer and the loss function.

Additionally, you can add a list of metrics which can be later used for evaluation, but they do not influence the training. In this case, we want to use the binary cross entropy and the Adam optimizer you saw in the primer mentioned before. Keras also includes a handy .summary() function to give an overview of the model and the number of parameters available for training:

>>> model.compile(loss='binary_crossentropy', 
...               optimizer='adam', 
...               metrics=['accuracy'])
>>> model.summary()
_________________________________________________________________
Layer (type)                 Output Shape          Param #   
=================================================================
dense_1 (Dense)              (None, 10)            17150     
_________________________________________________________________
dense_2 (Dense)              (None, 1)             11        
=================================================================
Total params: 17,161
Trainable params: 17,161
Non-trainable params: 0
_________________________________________________________________

You might notice that we have 17,150 parameters for the first layer and another 11 in the next one. Where did those come from?

See, we have 1714 dimensions for each feature vector, and then we have 10 nodes. We need a weight for each feature dimension and each node, which accounts for 1714 * 10 = 17,140 parameters, and then we have 10 more for the bias added to each node, which gets us the 17,150 parameters. In the final node, we have another 10 weights and one bias, which gets us to 11 parameters.

Neat! You are almost there. Now it is time to start your training with the .fit() function.

Since the training in neural networks is an iterative process, the training won’t just stop once it has seen every sample. You have to specify the number of iterations you want the model to train for. Those completed iterations are commonly called epochs. We want to run it for 100 epochs to be able to see how the training loss and accuracy are changing after each epoch.

Another parameter you have at your disposal is the batch size. The batch size determines how many samples we want to use in one forward/backward pass. A larger batch size increases the speed of the computation, as fewer update steps are needed per epoch, but it also needs more memory, and the model may degrade with larger batch sizes. Since we have a small training set, we can leave this at a low batch size:

>>> history = model.fit(X_train, y_train,
...                     epochs=100,
...                     verbose=False,
...                     validation_data=(X_test, y_test),
...                     batch_size=10)

Now you can use the .evaluate() method to measure the accuracy of the model. You can do this both for the training data and testing data. We expect that the training data has a higher accuracy than the testing data. The longer you train a neural network, the more likely it is that it starts overfitting.

Note that if you rerun the .fit() method, you’ll start off with the computed weights from the previous training. Make sure to compile the model again before you start training the model again. Now let’s evaluate the accuracy model:

>>> loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
>>> print("Training Accuracy: {:.4f}".format(accuracy))
>>> loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
>>> print("Testing Accuracy:  {:.4f}".format(accuracy))
Training Accuracy: 1.0000
Testing Accuracy:  0.7960

You can already see that the model was overfitting since it reached 100% accuracy for the training set. But this was expected, since the number of epochs was fairly large for this model. However, the accuracy of the testing set has already surpassed our previous logistic regression with the BOW model, which is a great step further in terms of our progress.

To make your life easier, you can use this little helper function to visualize the loss and accuracy for the training and testing data based on the History callback. This callback, which is automatically applied to each Keras model, records the loss and additional metrics that can be added in the .fit() method. In this case, we are only interested in the accuracy. This helper function employs the matplotlib plotting library:
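
The helper itself is not shown above, so here is a minimal sketch of what plot_history() could look like; it assumes matplotlib is installed and uses the 'acc'/'val_acc' keys that this version of Keras stores in history.history (newer versions use 'accuracy'/'val_accuracy'):

import matplotlib.pyplot as plt

def plot_history(history):
    # The History callback stores one value per epoch for each metric
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
    plt.show()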

To use this function, simply call plot_history() with the collected accuracy and loss inside the history dictionary:

>>> plot_history(history)

Accuracy and loss for baseline model

You can see that we have trained our model for too long since the training set reached 100% accuracy. A good way to see when the model starts overfitting is when the loss of the validation data starts rising again. This tends to be a good point to stop the model. You can see this happen around 20-40 epochs in this training run.

Note: When training neural networks, you should use a separate testing and validation set. What you would usually do is take the model with the highest validation accuracy and then test the model with the testing set.

This makes sure that you don’t overfit the model. Using the validation set to choose the best model is a form of data leakage (or “cheating”), since it lets you pick the result that produced the best test score out of hundreds of candidates. Data leakage happens when information from outside the training data set is used in the model.

In this case, our testing and validation set are the same, since we have a smaller sample size. As we have covered before, (deep) neural networks perform best when you have a very large number of samples. In the next part, you’ll see a different way to represent words as vectors. This is a very exciting and powerful way to work with words where you’ll see how to represent words as dense vectors.

What Is a Word Embedding?

Text is considered a form of sequence data similar to time series data that you would have in weather data or financial data. In the previous BOW model, you have seen how to represent a whole sequence of words as a single feature vector. Now you will see how to represent each word as a vector. There are various ways to vectorize text (a short sketch after the following list illustrates the character and n-gram options), such as:

  • Words represented by each word as a vector
  • Characters represented by each character as a vector
  • N-grams of words/characters represented as a vector (N-grams are overlapping groups of multiple succeeding words/characters in the text)
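
As a quick illustration of the character and n-gram options, CountVectorizer can also build its vocabulary from characters or n-grams instead of single words; this sketch is only for illustration, and the rest of the tutorial sticks to word-level vectors:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ['John likes ice cream', 'John hates chocolate.']

# Character-level vocabulary: every single character becomes a feature
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 1))
char_vectorizer.fit(sentences)
print(sorted(char_vectorizer.vocabulary_))

# Word bigrams: overlapping pairs of consecutive words become features
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(sentences)
print(sorted(bigram_vectorizer.vocabulary_))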

In this tutorial, you’ll see how to deal with representing words as vectors which is the common way to use text in neural networks. Two possible ways to represent a word as a vector are one-hot encoding and word embeddings.

One-Hot Encoding

The first way to represent a word as a vector is by creating a so-called one-hot encoding, which is simply done by taking a vector of the length of the vocabulary with an entry for each word in the corpus.

In this way, each word, given it has a spot in the vocabulary, is represented by a vector that is zero everywhere except for the spot corresponding to the word, which is set to one. As you might imagine, this can become a fairly large vector for each word, and it does not give any additional information like the relationship between words.

Let’s say you have a list of cities as in the following example:

>>> cities = ['London', 'Berlin', 'Berlin', 'New York', 'London']
>>> cities
['London', 'Berlin', 'Berlin', 'New York', 'London']

You can use scikit-learn and the LabelEncoder to encode the list of cities into categorical integer values like here:

>>> from sklearn.preprocessing import LabelEncoder

>>> encoder = LabelEncoder()
>>> city_labels = encoder.fit_transform(cities)
>>> city_labels
array([1, 0, 0, 2, 1])

Using this representation, you can use the OneHotEncoder provided by scikit-learn to encode the categorical values we got before into a one-hot encoded numeric array. OneHotEncoder expects each categorical value to be in a separate row, so you’ll need to reshape the array, then you can apply the encoder:

>>> from sklearn.preprocessing import OneHotEncoder

>>> encoder = OneHotEncoder(sparse=False)
>>> city_labels = city_labels.reshape((5, 1))
>>> encoder.fit_transform(city_labels)
array([[0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

You can see that the categorical integer value determines the position of the array that is 1, while the rest are 0. This is often used when you have a categorical feature which you cannot represent as a numeric value but you still want to be able to use it in machine learning. One use case for this encoding is of course words in a text, but it is most prominently used for categories. Such categories can be for example city, department, or other categories.

Word Embeddings

This method represents words as dense word vectors (also called word embeddings) which are trained, unlike the one-hot encoding, which is hardcoded. This means that the word embeddings collect more information into fewer dimensions.

Note that the word embeddings do not understand the text as a human would, but they rather map the statistical structure of the language used in the corpus. Their aim is to map semantic meaning into a geometric space. This geometric space is then called the embedding space.

This would map semantically similar words, like numbers or colors, close to each other in the embedding space. If the embedding captures the relationship between words well, things like vector arithmetic should become possible. A famous example in this field of study is the ability to map King – Man + Woman = Queen.
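
If you want to try this kind of vector arithmetic yourself, one option is the gensim package with its downloadable GloVe vectors; the package and model name here are assumptions for illustration and are not required for the rest of this tutorial:

import gensim.downloader as api

# Downloads the pretrained 50-dimensional GloVe vectors on first use
word_vectors = api.load('glove-wiki-gigaword-50')

# king - man + woman should land close to queen
print(word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))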

How can you get such a word embedding? You have two options for this. One way is to train your word embeddings during the training of your neural network. The other way is by using pretrained word embeddings which you can directly use in your model. There you have the option to either leave these word embeddings unchanged during training or you train them also.

Now you need to tokenize the data into a format that can be used by the word embeddings. Keras offers a couple of convenience methods for text preprocessing and sequence preprocessing which you can employ to prepare your text.

You can start by using the Tokenizer utility class which can vectorize a text corpus into a list of integers. Each integer maps to a value in a dictionary that encodes the entire corpus, with the keys in the dictionary being the vocabulary terms themselves. You can add the parameter num_words, which is responsible for setting the size of the vocabulary. The most common num_words words will be then kept. I have the testing and training data prepared from the previous example:

>>> from keras.preprocessing.text import Tokenizer

>>> tokenizer = Tokenizer(num_words=5000)
>>> tokenizer.fit_on_texts(sentences_train)

>>> X_train = tokenizer.texts_to_sequences(sentences_train)
>>> X_test = tokenizer.texts_to_sequences(sentences_test)

>>> vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

>>> print(sentences_train[2])
>>> print(X_train[2])
Of all the dishes, the salmon was the best, but all were great.
[11, 43, 1, 171, 1, 283, 3, 1, 47, 26, 43, 24, 22]

The indexing is ordered by the most common words in the text, which you can see from the word the having the index 1. It is important to note that the index 0 is reserved and is not assigned to any word. This zero index is used for padding, which I’ll introduce in a moment.

Unknown words (words that are not in the vocabulary) are denoted in Keras with word_count + 1 since they can also hold some information. You can see the index of each word by taking a look at the word_index dictionary of the Tokenizer object:

>>> for word in ['the', 'all', 'happy', 'sad']:
...     print('{}: {}'.format(word, tokenizer.word_index[word]))
the: 1
all: 43
happy: 320
sad: 450

Note: Pay close attention to the difference between this technique and the X_train that was produced by scikit-learn’s CountVectorizer.

With CountVectorizer, we had stacked vectors of word counts, and each vector was the same length (the size of the total corpus vocabulary). With Tokenizer, the resulting vectors equal the length of each text, and the numbers don’t denote counts, but rather correspond to the word values from the dictionary tokenizer.word_index.

One problem that we have is that each text sequence has in most cases a different length of words. To counter this, you can use pad_sequences(), which simply pads the sequence of words with zeros. By default, it prepends zeros, but we want to append them. Typically it does not matter whether you prepend or append zeros.

Additionally you would want to add a maxlen parameter to specify how long the sequences should be. This cuts sequences that exceed that number. In the following code, you can see how to pad sequences with Keras:

>>> from keras.preprocessing.sequence import pad_sequences

>>> maxlen = 100

>>> X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
>>> X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

>>> print(X_train[0, :])
[  1  10   3 282 739  25   8 208  30  64 459 230  13   1 124   5 231   8
  58   5  67   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0]

The first values represent the index in the vocabulary as you have learned from the previous examples. You can also see that the resulting feature vector contains mostly zeros, since you have a fairly short sentence. In the next part you will see how to work with word embeddings in Keras.

Keras Embedding Layer

Notice that, at this point, our data is still hardcoded. We have not told Keras to learn a new embedding space through successive tasks. Now you can use the Embedding Layer of Keras which takes the previously calculated integers and maps them to a dense vector of the embedding. You will need the following parameters:

  • input_dim: the size of the vocabulary
  • output_dim: the size of the dense vector
  • input_length: the length of the sequence

With the Embedding layer we have now a couple of options. One way would be to take the output of the embedding layer and plug it into a Dense layer. In order to do this you have to add a Flatten layer in between that prepares the sequential input for the Dense layer:

from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

The result will be as follows:

You can now see that we have 87350 new parameters to train. This number comes from vocab_size times the embedding_dim. These weights of the embedding layer are initialized with random weights and are then adjusted through backpropagation during training. This model takes the words as they come in the order of the sentences as input vectors. You can train it with the following:

现在您可以看到我们有87350个新参数要训练。 这个数字来自vocab_size embedding_dim 。 使用随机权重初始化嵌入层的这些权重,然后在训练过程中通过反向传播进行调整。 该模型将按句子顺序出现的单词作为输入向量。 您可以使用以下方法进行训练:

history = model.fit(X_train, y_train,
                    epochs=20,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

The result will be as follows:

Accuracy and loss for first model

As you can see in the performance, this is typically not a very reliable way to work with sequential data. When working with sequential data you want to focus on methods that look at local and sequential information instead of absolute positional information.

Another way to work with embeddings is by using a MaxPooling1D/AveragePooling1D or a GlobalMaxPooling1D/GlobalAveragePooling1D layer after the embedding. You can think of the pooling layers as a way to downsample (a way to reduce the size of) the incoming feature vectors.

In the case of max pooling you take the maximum value of all features in the pool for each feature dimension. In the case of average pooling you take the average, but max pooling seems to be more commonly used as it highlights large values.

Global max/average pooling takes the maximum/average of all features whereas in the other case you have to define the pool size. Keras has again its own layer that you can add in the sequential model:
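
To get a feel for what global max pooling does, here is a tiny NumPy illustration: for one sentence represented as a sequence of word vectors, it keeps the maximum of each embedding dimension across all positions:

import numpy as np

# One sentence of 4 words, each embedded into 3 dimensions
sequence = np.array([[0.1, 0.7, 0.3],
                     [0.9, 0.2, 0.4],
                     [0.5, 0.5, 0.8],
                     [0.2, 0.1, 0.6]])

# Global max pooling: one value per embedding dimension
print(sequence.max(axis=0))  # [0.9 0.7 0.8]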

from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

The result will be as follows:

The procedure for training does not change:

history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

The result will be as follows:

Accuracy and loss for max pooling model

You can already see some improvements in our models. Next you’ll see how we can employ pretrained word embeddings and if they help us with our model.

Using Pretrained Word Embeddings

We just saw an example of jointly learning word embeddings incorporated into the larger model that we want to solve.

An alternative is to use a precomputed embedding space that utilizes a much larger corpus. It is possible to precompute word embeddings by simply training them on a large corpus of text. Among the most popular methods are Word2Vec developed by Google and GloVe (Global Vectors for Word Representation) developed by the Stanford NLP Group.

Note that those are different approaches with the same goal. Word2Vec achieves this by employing neural networks and GloVe achieves this with a co-occurrence matrix and by using matrix factorization. In both cases you are dealing with dimensionality reduction, but Word2Vec is more accurate and GloVe is faster to compute.

In this tutorial, you’ll see how to work with the GloVe word embeddings from the Stanford NLP Group as their size is more manageable than the Word2Vec word embeddings provided by Google. Go ahead and download the 6B (trained on 6 billion words) word embeddings from here (822 MB).

You can find other word embeddings also on the main GloVe page. You can find the pretrained Word2Vec embeddings by Google here. If you want to train your own word embeddings, you can do so efficiently with the gensim Python package which uses Word2Vec for calculation. More details on how to do this here.

Now that we got you covered, you can start using the word embeddings in your models. You can see in the next example how you can load the embedding matrix. Each line in the file starts with the word and is followed by the embedding vector for the particular word.

This is a large file with 400000 lines, with each line representing a word followed by its vector as a stream of floats. For example, here are the first 50 characters of the first line:

$ head -n 1 data/glove_word_embeddings/glove.6B.50d.txt | cut -c-50
    the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.04445

Since you don’t need all words, you can focus on only the words that we have in our vocabulary. Since we have only a limited number of words in our vocabulary, we can skip most of the 400000 words in the pretrained word embeddings:
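
The helper that does this is not shown above, so here is a minimal sketch of what a create_embedding_matrix() function could look like; it assumes the GloVe file format shown above and the tokenizer.word_index from earlier:

import numpy as np

def create_embedding_matrix(filepath, word_index, embedding_dim):
    # Index 0 is reserved for padding, hence the +1
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix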

You can use this function now to retrieve the embedding matrix:

>>> embedding_dim = 50
>>> embedding_matrix = create_embedding_matrix(
...     'data/glove_word_embeddings/glove.6B.50d.txt',
...     tokenizer.word_index, embedding_dim)

Wonderful! Now you are ready to use the embedding matrix in training. Let’s go ahead and use the previous network with global max pooling and see if we can improve this model. When you use pretrained word embeddings you have the choice to either allow the embedding to be updated during training or only use the resulting embedding vectors as they are.

First, let’s have a quick look how many of the embedding vectors are nonzero:

>>> nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
>>> nonzero_elements / vocab_size
0.9507727532913566

This means 95.1% of the vocabulary is covered by the pretrained model, which is a good coverage of our vocabulary. Let’s have a look at the performance when using the GlobalMaxPool1D layer:

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=False))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

Now train and evaluate this model as before:

history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

The result will be as follows:


[Figure: Accuracy and loss for untrained word embeddings]

Since the word embeddings are not additionally trained, the accuracy is expected to be lower. But let's now see how this performs if we allow the embedding to be trained by using trainable=True:


model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=True))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

Train and evaluate the model again:

history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

The result will be as follows:


[Figure: Accuracy and loss for pretrained word embeddings]

You can see that it is most effective to allow the embeddings to be trained. When dealing with large training sets, pretrained embeddings can also make the training process much faster than it would be without them. In our case they seemed to help, but not by much, and the gain is not necessarily due to the pretrained word embeddings themselves.


Now it is time to focus on a more advanced neural network model to see if it is possible to boost the model and give it an edge over the previous models.


Convolutional Neural Networks (CNN)

Convolutional neural networks, also called convnets, are one of the most exciting developments in machine learning in recent years.


They have revolutionized image classification and computer vision by being able to extract features from images and use them in neural networks. The properties that make them useful in image processing also make them handy for sequence processing. You can imagine a CNN as a specialized neural network that is able to detect specific patterns.


If it is just another neural network, what differentiates it from what you have previously learned?


A CNN has hidden layers which are called convolutional layers. When you think of images, a computer has to deal with a two-dimensional matrix of numbers, and therefore you need some way to detect features in this matrix. These convolutional layers are able to detect edges, corners, and other kinds of textures, which makes them such a special tool. A convolutional layer consists of multiple filters which are slid across the image and are able to detect specific features.


This is the very core of the technique, the mathematical process of convolution. With each convolutional layer the network is able to detect more complex patterns. In Feature Visualization by Chris Olah you can get a good intuition of what these features can look like.


When you are working with sequential data, like text, you work with one-dimensional convolutions, but the idea and the application stay the same. You still want to pick up on patterns in the sequence, which become more complex with each added convolutional layer.


In the next figure you can see how such a convolution works. It starts by taking a patch of input features with the size of the filter kernel. With this patch you take the dot product with the weights of the filter. The one-dimensional convnet is invariant to translations, which means that certain sequences can be recognized at a different position. This can be helpful for certain patterns in the text; a small sketch of this sliding dot product follows the figure:


[Figure: One-dimensional convolution (image source)]
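As a purely illustrative sketch (not part of the original code), here is how that sliding dot product looks in plain NumPy for a toy sequence and a single filter of size 3, without padding:

import numpy as np

# Toy input sequence (one feature per time step) and a filter of size 3
sequence = np.array([1.0, 2.0, 0.5, -1.0, 3.0, 0.0])
kernel = np.array([0.2, 0.5, -0.1])

# Slide the kernel over the sequence and take the dot product at each position
output = np.array([
    np.dot(sequence[i:i + len(kernel)], kernel)
    for i in range(len(sequence) - len(kernel) + 1)
])
print(output)  # one value per valid filter position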

Now let's have a look at how you can use this network in Keras. Keras again offers various convolutional layers which you can use for this task. The layer you'll need is the Conv1D layer. This layer again has various parameters to choose from. The ones you are interested in for now are the number of filters, the kernel size, and the activation function. You can add this layer between the Embedding layer and the GlobalMaxPool1D layer:


embedding_dim = 100

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

Train and evaluate this model as before:

history = model.fit(X_train, y_train,
                    epochs=10,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

The result will be as follows:


[Figure: Accuracy and loss for the convolutional neural network]

You can see that 80% accuracy seems to be a tough hurdle to overcome with this data set, and a CNN might not be well equipped for it. The reasons for such a plateau might be that:


  • There are not enough training samples
  • The data you have does not generalize well
  • Missing focus on tweaking the hyperparameters

CNNs work best with large training sets, where they are able to find generalizations that a simple model like logistic regression won't be able to.


Hyperparameter Optimization

One crucial step of deep learning and working with neural networks is hyperparameter optimization.


As you saw in the models that we have used so far, even the simpler ones have a large number of parameters to tweak and choose from. Those parameters are called hyperparameters. This is the most time-consuming part of machine learning, and sadly there are no one-size-fits-all solutions ready.


When you have a look at the competitions on Kaggle, one of the largest places to compete against other fellow data scientists, you can see that many of the winning teams and models have gone through a lot of tweaking and experimenting until they reached their prime. So don’t get discouraged when it gets tough and you reach a plateau, but rather think about the ways you could optimize the model or the data.


One popular method for hyperparameter optimization is grid search. What this method does is take lists of parameters and run the model with every parameter combination that it can find. It is the most thorough way, but also the most computationally heavy way, to do this. Another common method, random search, which you'll see in action here, simply takes random combinations of parameters.


In order to apply random search with Keras, you will need to use the KerasClassifier, which serves as a wrapper for the scikit-learn API. With this wrapper you are able to use the various tools available with scikit-learn, like cross-validation. The class that you need is RandomizedSearchCV, which implements random search with cross-validation. Cross-validation is a way to validate the model by taking the whole data set and separating it into multiple testing and training data sets.


There are various types of cross-validation. One type is k-fold cross-validation, which you'll see in this example. In this type the data set is partitioned into k equal-sized sets, where one set is used for testing and the rest of the partitions are used for training. This enables you to run k different runs, where each partition is used once as a testing set. So, the higher k is, the more accurate the model evaluation is, but the smaller each testing set is. A short sketch of this splitting follows below.

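As an illustrative sketch (not part of the original code), this is how the k-fold splitting itself can be demonstrated with scikit-learn's KFold on a toy array; each of the k iterations holds out a different fold for testing:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # a toy "data set" of 10 samples
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each fold uses a different 20% of the samples as the testing set
    print("Fold {}: train={}, test={}".format(fold, X[train_idx], X[test_idx]))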

The first step for the KerasClassifier is to have a function that creates a Keras model. We will use the previous model, but we will allow various parameters to be set for the hyperparameter optimization:


def create_model(num_filters, kernel_size, vocab_size, embedding_dim, maxlen):
    model = Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
    model.add(layers.Conv1D(num_filters, kernel_size, activation='relu'))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

Next, you want to define the parameter grid that you want to use in training. This consists of a dictionary with each parameter named as in the previous function. The number of points on the grid is 3 * 3 * 1 * 1 * 1, where each of those numbers is the number of different choices for a given parameter.


You can see how this could get computationally expensive very quickly, but luckily both grid search and random search are embarrassingly parallel, and the classes come with an n_jobs parameter that lets you test grid points in parallel. The parameter grid is initialized with the following dictionary:

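This is the same dictionary that appears inside the full script further below, shown here on its own; the lists for num_filters and kernel_size hold the choices to search over, while the remaining entries are fixed:

param_grid = dict(num_filters=[32, 64, 128],
                  kernel_size=[3, 5, 7],
                  vocab_size=[vocab_size],
                  embedding_dim=[embedding_dim],
                  maxlen=[maxlen])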

Now you are ready to start running the random search. In this example we iterate over each data set and preprocess the data in the same way as previously. Afterwards you take the previous function and add it to the KerasClassifier wrapper class, including the number of epochs.


The resulting instance and the parameter grid are then used as the estimator in the RandomizedSearchCV class. Additionally, you can choose the number of folds for the k-fold cross-validation, which in this case is 4. You have seen most of the code in this snippet before in our previous examples. Besides RandomizedSearchCV and KerasClassifier, I have added a little block of code that handles the evaluation:


from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV

# Main settings
epochs = 20
embedding_dim = 50
maxlen = 100
output_file = 'data/output.txt'

# Run grid search for each source (yelp, amazon, imdb)
for source, frame in df.groupby('source'):
    print('Running grid search for data set :', source)
    # Use only the sentences and labels of the current source
    sentences = frame['sentence'].values
    y = frame['label'].values

    # Train-test split
    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)

    # Tokenize words
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(sentences_train)
    X_train = tokenizer.texts_to_sequences(sentences_train)
    X_test = tokenizer.texts_to_sequences(sentences_test)

    # Adding 1 because of reserved 0 index
    vocab_size = len(tokenizer.word_index) + 1

    # Pad sequences with zeros
    X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
    X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

    # Parameter grid for grid search
    param_grid = dict(num_filters=[32, 64, 128],
                      kernel_size=[3, 5, 7],
                      vocab_size=[vocab_size],
                      embedding_dim=[embedding_dim],
                      maxlen=[maxlen])
    model = KerasClassifier(build_fn=create_model,
                            epochs=epochs, batch_size=10,
                            verbose=False)
    grid = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                              cv=4, verbose=1, n_iter=5)
    grid_result = grid.fit(X_train, y_train)

    # Evaluate testing set
    test_accuracy = grid.score(X_test, y_test)

    # Save and evaluate results
    prompt = input(f'finished {source}; write to file and proceed? [y/n]')
    if prompt.lower() not in {'y', 'true', 'yes'}:
        break
    with open(output_file, 'a') as f:
        s = ('Running {} data set\nBest Accuracy : '
             '{:.4f}\n{}\nTest Accuracy : {:.4f}\n\n')
        output_string = s.format(
            source,
            grid_result.best_score_,
            grid_result.best_params_,
            test_accuracy)
        print(output_string)
        f.write(output_string)

This takes a while, which is a perfect chance to go outside and get some fresh air or even go on a hike, depending on how many models you want to run. Let's take a look at what we have got:


Interesting! For some reason the testing accuracy is higher than the training accuracy, which might be because there is a large variance in the scores during cross-validation. We can see that we were still not able to break much past the dreaded 80%, which seems to be a natural limit for this data at its given size. Remember that we have a small data set, and convolutional neural networks tend to perform best with large data sets.


Another method for cross-validation is nested cross-validation (shown here), which is used when the hyperparameters also need to be optimized. It is used because the resulting non-nested CV model has a bias toward the data set, which can lead to an overly optimistic score. You see, when doing hyperparameter optimization as we did in the previous example, we are picking the best hyperparameters for that specific training set, but this does not mean that these hyperparameters generalize best. A minimal sketch of the nested setup follows.

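As an illustrative sketch (not part of the original code), nested cross-validation can be set up by wrapping the randomized search itself in an outer cross-validation loop, for example with scikit-learn's cross_val_score, reusing the model, param_grid, X_train, and y_train from the script above:

from sklearn.model_selection import cross_val_score, RandomizedSearchCV

# Inner loop: hyperparameter search with its own cross-validation
inner_search = RandomizedSearchCV(estimator=model,
                                  param_distributions=param_grid,
                                  cv=4, n_iter=5)

# Outer loop: estimates how well the whole search procedure generalizes
nested_scores = cross_val_score(inner_search, X_train, y_train, cv=4)
print('Nested CV accuracy: {:.4f}'.format(nested_scores.mean()))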

Conclusion

There you have it: you have learned how to work with text classification with Keras, and we have gone from a bag-of-words model with logistic regression to increasingly advanced methods, ending with convolutional neural networks.


You should be now familiar with word embeddings, why they are useful, and also how to use pretrained word embeddings for your training. You have also learned how to work with neural networks and how to use hyperparameter optimization to squeeze more performance out of your model.


One big topic which we have not covered here, and which is left for another time, is recurrent neural networks, more specifically LSTM and GRU. Those are other powerful and popular tools for working with sequential data like text or time series. Another interesting current development is neural networks that employ attention, which are under active research and seem to be a promising next step, since LSTMs tend to be heavy on computation.


You now have an understanding of a crucial cornerstone in natural language processing which you can use for text classification of all sorts. Sentiment analysis is the most prominent example of this, but there are many other applications, such as:


  • Spam detection in emails
  • Automatic tagging of texts
  • Categorization of news articles with predefined topics

You can use this knowledge and the models that you have trained on an advanced project, as in this tutorial, to employ sentiment analysis on a continuous stream of Twitter data with Kibana and Elasticsearch. You could also combine sentiment analysis or text classification with speech recognition, as in this handy tutorial using the SpeechRecognition library in Python.


Further Reading

If you want to delve deeper into the various topics from this article you can take a look at these links:


Translated from: https://www.pybloggers.com/2018/10/practical-text-classification-with-python-and-keras/
