Word embeddings for sentiment analysis
When applying one-hot encoding to words, we end up with sparse (containing many zeros) vectors of high dimensionality. On large data sets, this could cause performance issues.
Additionally, one-hot encoding does not take into account the semantics of the words. So words like airplane and aircraft are considered to be two different features while we know that they have a very similar meaning. Word embeddings address these two issues.
First of all, word embeddings are dense vectors of much lower dimensionality. Secondly, the semantic relationships between words are reflected in the distance and direction of the vectors.
We will work with the TwitterAirlineSentiment data set on Kaggle. This data set contains roughly 15K tweets with 3 possible classes for the sentiment (positive, negative and neutral). In my previous post, we tried to classify the tweets by tokenizing the words and applying two classifiers. Let’s see if word embeddings can outperform that.
After reading this tutorial you will know how to compute task-specific word embeddings with the Embedding layer of Keras. Secondly, we will investigate whether word embeddings trained on a larger corpus can improve the accuracy of our model.
The structure of this tutorial is:
- Intuition behind word embeddings
- Project set-up
- Data preparation
- Keras and its Embedding layer
- Pre-trained word embeddings — GloVe
- Training word embeddings with more dimensions
Intuition behind word embeddings
Before we can use words in a classifier, we need to convert them into numbers. One way to do that is to simply map words to integers. Another way is to one-hot encode the words. Each tweet could then be represented as a vector with a dimensionality equal to the number of words in the corpus (or a limited subset of them). The words occurring in the tweet have a value of 1 in the vector; all other vector values equal zero.
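To make this concrete, here is a tiny, purely illustrative sketch of such a one-hot (bag-of-words) representation; the vocabulary and the tweet are made up, not taken from the data set.
# Purely illustrative: a hand-rolled one-hot (bag-of-words) encoding of a tweet
# over a tiny, made-up vocabulary.
corpus_words = ['flight', 'delayed', 'great', 'service', 'airplane', 'aircraft']
tweet = 'great service on my flight'
one_hot = [1 if word in tweet.split() else 0 for word in corpus_words]
print(one_hot)  # [1, 0, 1, 1, 0, 0] -- mostly zeros, and 'airplane'/'aircraft' stay unrelated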
Word embeddings are computed differently. Each word is positioned into a multi-dimensional space. The number of dimensions in this space is chosen by the data scientist. You can experiment with different dimensions and see what provides the best result.
The vector values for a word represent its position in this embedding space. Synonyms are found close to each other while words with opposite meanings have a large distance between them. You can also apply mathematical operations on the vectors which should produce semantically correct results. A typical example is that the sum of the word embeddings of king and female produces the word embedding of queen.
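As a toy illustration of this closeness, consider the cosine similarity between vectors. The three-dimensional "embeddings" below are invented values, not real GloVe vectors; they only serve to show that related words end up close together while unrelated words are far apart.
import numpy as np

# Toy, made-up 3-dimensional "embeddings" for illustration only.
toy_emb = {
    'airplane': np.array([0.90, 0.10, 0.05]),
    'aircraft': np.array([0.85, 0.15, 0.10]),
    'pizza':    np.array([0.05, 0.20, 0.95]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 means same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(toy_emb['airplane'], toy_emb['aircraft']))  # close to 1: very similar
print(cosine_similarity(toy_emb['airplane'], toy_emb['pizza']))     # much smaller: unrelated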
Project set-up
Let’s start by importing all packages for this project.
import pandas as pd
import numpy as np
import re
import collections
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from keras import models
from keras import layers
We define some parameters and paths used throughout the project. Most of them are self-explanatory. But others will be explained further in the code.
NB_WORDS = 10000 # Parameter indicating the number of words we'll put in the dictionary
VAL_SIZE = 1000 # Size of the validation set
NB_START_EPOCHS = 10 # Number of epochs we usually start to train with
BATCH_SIZE = 512 # Size of the batches used in the mini-batch gradient descent
MAX_LEN = 24 # Maximum number of words in a sequence
GLOVE_DIM = 100 # Number of dimensions of the GloVe word embeddings
root = Path('../')
input_path = root / 'input/'
output_path = root / 'output/'
source_path = root / 'source/'
Throughout this code, we will also use some helper functions for data preparation, modeling and visualization. These function definitions are not shown here to keep the blog post clutter-free. You can always refer to the notebook on GitHub to look at the code.
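For reference, the sketches below show roughly what these helpers could look like. They are my assumptions based on how the functions are called later in this post, not the exact code from the notebook; remove_stopwords, remove_mentions, deep_model and test_model may differ in detail there.
# Hypothetical sketches of the helper functions used in this post.
# The exact definitions live in the notebook on GitHub and may differ.

def remove_stopwords(text):
    # Drop common English stop words (requires nltk.download('stopwords')).
    stop = set(stopwords.words('english'))
    return ' '.join(word for word in text.split() if word.lower() not in stop)

def remove_mentions(text):
    # Strip Twitter @ mentions such as "@united".
    return re.sub(r'@\w+', '', text)

def deep_model(model, X_train, y_train, X_valid, y_valid):
    # Compile and train for NB_START_EPOCHS epochs, tracking validation performance.
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model.fit(X_train, y_train,
                     epochs=NB_START_EPOCHS,
                     batch_size=BATCH_SIZE,
                     validation_data=(X_valid, y_valid),
                     verbose=0)

def test_model(model, X_train, y_train, X_test, y_test, epoch_stop):
    # Train on all training data for epoch_stop epochs and evaluate on the
    # test set; returns [loss, accuracy].
    model.fit(X_train, y_train, epochs=epoch_stop, batch_size=BATCH_SIZE, verbose=0)
    return model.evaluate(X_test, y_test, verbose=0)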
Data preparation
Reading the data and cleaning
We read in the CSV file with the tweets and randomly shuffle its indexes. After that, we remove stop words and @ mentions. A test set of 10% is split off to evaluate the model on new data.
df = pd.read_csv(input_path / 'Tweets.csv')
df = df.reindex(np.random.permutation(df.index))
df = df[['text', 'airline_sentiment']]
df.text = df.text.apply(remove_stopwords).apply(remove_mentions)
X_train, X_test, y_train, y_test = train_test_split(df.text, df.airline_sentiment, test_size=0.1, random_state=37)
Convert words into integers
With the Tokenizer from Keras, we convert the tweets into sequences of integers. We limit the number of words to the NB_WORDS most frequent words. Additionally, the tweets are cleaned with some filters, set to lowercase and split on spaces.
tk = Tokenizer(num_words=NB_WORDS,
               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
               lower=True, split=" ")
tk.fit_on_texts(X_train)
X_train_seq = tk.texts_to_sequences(X_train)
X_test_seq = tk.texts_to_sequences(X_test)
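An optional sanity check helps to see what the tokenizer produced; the exact indices depend on the word frequencies in your training split, so the output shown in the comments is only indicative.
# Optional sanity check: inspect the integer encoding of one tweet.
print(X_train.iloc[0])      # the cleaned tweet text
print(X_train_seq[0])       # the same tweet as a list of word indices
print(len(tk.word_index))   # number of distinct words seen by the tokenizer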
Equal length of sequences
Each batch needs to provide sequences of equal length. We achieve this with the pad_sequences method. By specifying maxlen, the sequences are either padded with zeros or truncated.
X_train_seq_trunc = pad_sequences(X_train_seq, maxlen=MAX_LEN)
X_test_seq_trunc = pad_sequences(X_test_seq, maxlen=MAX_LEN)
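After padding, every tweet is represented by exactly MAX_LEN integers. A quick, optional check:
# Every sequence now has length MAX_LEN; with the default settings,
# shorter tweets are padded with leading zeros.
print(X_train_seq_trunc.shape)   # (number of training tweets, MAX_LEN)
print(X_train_seq_trunc[0])      # an example padded sequence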
Encoding the target variable
The target classes are strings which need to be converted into numeric vectors. This is done with the LabelEncoder from Sklearn and the to_categorical method from Keras.
le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_test_le = le.transform(y_test)
y_train_oh = to_categorical(y_train_le)
y_test_oh = to_categorical(y_test_le)
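The mapping between the sentiment labels and the one-hot columns can be inspected like this (an illustrative check):
# LabelEncoder orders the classes alphabetically, which determines the
# one-hot columns: ['negative', 'neutral', 'positive'].
print(le.classes_)
print(y_train_oh[0])   # e.g. [0. 0. 1.] for a positive tweet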
Splitting off the validation set
From the training data, we split off a validation set of 10% to use during training.
X_train_emb, X_valid_emb, y_train_emb, y_valid_emb = train_test_split(X_train_seq_trunc, y_train_oh, test_size=0.1, random_state=37)
Modeling
Keras and the Embedding layer
Keras provides a convenient way to convert each word into a multi-dimensional vector. This can be done with the Embedding layer. It will compute the word embeddings (or use pre-trained embeddings) and look up each word in a dictionary to find its vector representation. Here we will train word embeddings with 8 dimensions.
emb_model = models.Sequential()
emb_model.add(layers.Embedding(NB_WORDS, 8, input_length=MAX_LEN))
emb_model.add(layers.Flatten())
emb_model.add(layers.Dense(3, activation='softmax'))
emb_history = deep_model(emb_model, X_train_emb, y_train_emb, X_valid_emb, y_valid_emb)
We have a validation accuracy of about 74%. The number of words in the tweets is rather low, so this result is quite good. By comparing the training and validation loss, we see that the model starts overfitting from epoch 6.
In a previous article, I discussed how we can avoid overfitting. You might want to read it if you want to dive deeper into that topic.
When we train the model on all data (including the validation data, but excluding the test data) and set the number of epochs to 6, we get a test accuracy of 78%. This test result is OK, but let’s see if we can improve with pre-trained word embeddings.
emb_results = test_model(emb_model, X_train_seq_trunc, y_train_oh, X_test_seq_trunc, y_test_oh, 6)
print('\n')
print('Test accuracy of word embeddings model: {0:.2f}%'.format(emb_results[1]*100))
Pre-trained word embeddings — GloVe
Because the training data is not that large, the model might not be able to learn good embeddings for sentiment analysis. Alternatively, we can load pre-trained word embeddings that were built on much larger training data.
The GloVe database contains multiple pre-trained word embeddings, including more specific embeddings trained on tweets. So these might be useful for the task at hand.
First, we put the word embeddings in a dictionary where the keys are the words and the values are the word embeddings.
glove_file = 'glove.twitter.27B.' + str(GLOVE_DIM) + 'd.txt'
emb_dict = {}
glove = open(input_path / glove_file)
for line in glove:
    values = line.split()
    word = values[0]
    vector = np.asarray(values[1:], dtype='float32')
    emb_dict[word] = vector
glove.close()
With the GloVe embeddings loaded into a dictionary, we can look up the embedding for each word in the corpus of the airline tweets. These will be stored in a matrix with shape (NB_WORDS, GLOVE_DIM). If a word is not found in the GloVe dictionary, its word embedding values remain zero.
emb_matrix = np.zeros((NB_WORDS, GLOVE_DIM))
for w, i in tk.word_index.items():
    if i < NB_WORDS:
        vect = emb_dict.get(w)
        if vect is not None:
            emb_matrix[i] = vect
    else:
        break
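It can also be useful to check how many of the NB_WORDS most frequent words actually received a GloVe vector; a quick, optional check:
# Rows of emb_matrix that stayed all-zero correspond to words that were not
# found in the GloVe vocabulary.
found = np.count_nonzero(np.count_nonzero(emb_matrix, axis=1))
print('{} out of {} words have a GloVe vector'.format(found, NB_WORDS))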
Then we specify the model just like we did with the model above.
glove_model = models.Sequential()
glove_model.add(layers.Embedding(NB_WORDS, GLOVE_DIM, input_length=MAX_LEN))
glove_model.add(layers.Flatten())
glove_model.add(layers.Dense(3, activation='softmax'))
In the Embedding layer (which is layer 0 here) we set the weights for the words to those found in the GloVe word embeddings. By setting trainable to False we make sure that the GloVe word embeddings cannot be changed. After that, we run the model.
glove_model.layers[0].set_weights([emb_matrix])
glove_model.layers[0].trainable = False
glove_history = deep_model(glove_model, X_train_emb, y_train_emb, X_valid_emb, y_valid_emb)
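As an aside (a sketch, not taken from the original post), the same set-up can also be expressed by passing the GloVe matrix and the trainable flag directly to the Embedding constructor instead of mutating layer 0 afterwards:
# Equivalent construction: initialise the Embedding layer with the GloVe
# matrix and freeze it at build time.
glove_model_alt = models.Sequential()
glove_model_alt.add(layers.Embedding(NB_WORDS, GLOVE_DIM,
                                     weights=[emb_matrix],
                                     input_length=MAX_LEN,
                                     trainable=False))
glove_model_alt.add(layers.Flatten())
glove_model_alt.add(layers.Dense(3, activation='softmax'))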
The model starts overfitting quickly, after 3 epochs. Furthermore, the validation accuracy is lower than that of the embeddings trained on the training data.
glove_results = test_model(glove_model, X_train_seq_trunc, y_train_oh, X_test_seq_trunc, y_test_oh, 3)
print('\n')
print('Test accuracy of word glove model: {0:.2f}%'.format(glove_results[1]*100))
As a final exercise, let’s see what results we get when we train the embeddings with the same number of dimensions as the GloVe data.
Training word embeddings with more dimensions
We will train the word embeddings with the same number of dimensions as the GloVe embeddings (i.e. GLOVE_DIM).
emb_model2 = models.Sequential()
emb_model2.add(layers.Embedding(NB_WORDS, GLOVE_DIM, input_length=MAX_LEN))
emb_model2.add(layers.Flatten())
emb_model2.add(layers.Dense(3, activation='softmax'))
emb_history2 = deep_model(emb_model2, X_train_emb, y_train_emb, X_valid_emb, y_valid_emb)
emb_results2 = test_model(emb_model2, X_train_seq_trunc, y_train_oh, X_test_seq_trunc, y_test_oh, 3)
print('\n')
print('Test accuracy of word embedding model 2: {0:.2f}%'.format(emb_results2[1]*100))
On the test data we get good results, but we do not outperform the LogisticRegression with the CountVectorizer. So there is still room for improvement.
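For completeness, a minimal sketch of that baseline is shown below. This is an assumed set-up for comparison; the details in the previous post may differ.
# Bag-of-words baseline: CountVectorizer + LogisticRegression on the same split.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(CountVectorizer(max_features=NB_WORDS),
                         LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print('Baseline test accuracy: {0:.2f}%'.format(baseline.score(X_test, y_test) * 100))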
Conclusion
The best result is achieved with 100-dimensional word embeddings that are trained on the available data. This even outperforms the use of word embeddings that were trained on a much larger Twitter corpus.
Until now we have just put a Dense layer on the flattened embeddings. By doing this, we do not take into account the relationships between the words in the tweet. This can be achieved with a recurrent neural network or a 1D convolutional network. But that’s something for a future post :)
Source: https://www.freecodecamp.org/news/word-embeddings-for-sentiment-analysis/