Word Embeddings: Word2Vec with Gensim, NLTK, and t-SNE Visualization


What are Word Embeddings?

In extremely simplified terms, word embeddings are texts converted into numbers, and there may be different numerical representations of the same text. But before we dive into the details of word embeddings, the following question should be asked first: why do we need word embeddings at all?

As it turns out, many machine learning algorithms and practically all deep learning architectures are incapable of processing strings or raw text in their raw form. They require numbers as inputs to perform any kind of job, be it classification, regression, and so on in the broader sense. What's more, with the tremendous amount of data available in text format, it is essential to be able to extract information from it and build applications on top of it.

Some real-world uses of text applications are sentiment analysis of reviews by Myntra, Amazon, and so on, and document or news classification or clustering by Google and others. Several word embedding approaches currently exist, and all of them have their pros and cons. We will discuss one of them here: Word2Vec.

For instance, take our corpus to be the single sentence "The quick brown fox jumps over the lazy dog". Our sentence is ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']. The one-hot encodings for the individual words are then:

The -> [1,0,0,0,0,0,0,0,0]
quick -> [0,1,0,0,0,0,0,0,0]
brown -> [0,0,1,0,0,0,0,0,0]
fox -> [0,0,0,1,0,0,0,0,0]
jumps -> [0,0,0,0,1,0,0,0,0]
over -> [0,0,0,0,0,1,0,0,0]
the -> [0,0,0,0,0,0,1,0,0]
lazy -> [0,0,0,0,0,0,0,1,0]
dog -> [0,0,0,0,0,0,0,0,1]
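As a quick sanity check, here is a minimal sketch (my own illustration, not part of the original article) that builds these one-hot vectors in Python; the variable names are purely illustrative.

sentence = "The quick brown fox jumps over the lazy dog".lower().split()

# one vector position per token occurrence, matching the listing above
one_hot = {}
for i, word in enumerate(sentence):
    vec = [0] * len(sentence)
    vec[i] = 1
    one_hot[f"{word}_{i}"] = vec

print(one_hot["fox_3"])   # [0, 0, 0, 1, 0, 0, 0, 0, 0]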

[Image: Word2Vec example]

Word2Vec:

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.

Word vectors are positioned in the vector space in such a way that words sharing common contexts in the corpus are located close to one another in the space.

[Image: network layout, with an input layer first, hidden layers in the middle, and an output layer last]

"A man can be judged by the company he keeps"; in much the same way, a word can be characterized by the words that frequently appear with it. This is the idea that Word2Vec is built on. Word2Vec comes in two variants, one based on the Skip-Gram model and the other based on the Continuous Bag of Words model.

Skip-Gram Model:

For the Skip-Gram model, the task of the simple neural network is: given an input word in a sentence, the network predicts how likely each word in the vocabulary is to be a nearby word of that input word. The training examples fed to the neural network are word pairs consisting of the input word and its nearby words.

For instance, consider the sentence "The quick brown fox jumps over the lazy dog." and a window size of 2. The training examples are then pairs of each word with every word within two positions of it. For these examples to be processed by the neural network, we need to represent the words in some numerical form. We use one-hot vectors, in which the position of the input word is "1" and every other position is "0". So the inputs to the neural network are simply one-hot vectors, and the output is also a vector with the dimension of the one-hot vector, containing, for each word in the vocabulary, the probability that a randomly chosen nearby word is that vocabulary word.
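To make the training pairs concrete, here is a small sketch (my own illustration, not from the original post) that generates (input word, nearby word) pairs for a window size of 2:

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2
pairs = []
for i, center in enumerate(sentence):
    # every word within `window` positions of the center word becomes a context word
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]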

Now let's take a look at the architecture of the neural network. For instance, assume we use a vocabulary of size V and a hidden layer of size N; the following chart shows the network's design:

[Image: how the Skip-Gram model works]

Continuous Bag of Words Model:

The Continuous Bag-of-Words model (CBOW) is just the opposite of Skip-Gram. For the CBOW model, the task of the simple neural network is: given a context of words (surrounding a word) in a sentence, the network predicts how likely each word in the vocabulary is to be that word.

In the Continuous Bag-of-Words model, we attempt to predict a word using its surrounding words (context words). The input to the model is the one-hot encoded vectors of the context words inside the window; the window size is a hyperparameter and refers to the number of context words on either side of the current word that are used to predict it.

Take "The quick brown fox jumps over the lazy dog." again and suppose the word in question is 'lazy'. For a window size of 2, the input vector will have ones at the positions corresponding to the words 'over', 'the', and 'dog', as shown in the sketch below.
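Analogously to the Skip-Gram sketch above, here is a rough illustration (again my own, not from the original post) of how CBOW training examples look: the context words inside the window predict the center word.

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2
examples = []
for i, target in enumerate(sentence):
    # collect the words within `window` positions on either side of the target
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    examples.append((context, target))

# the example for 'lazy' uses the words around it
print(examples[7])  # (['over', 'the', 'dog'], 'lazy')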

[Image: how the CBOW model works]

Implementation:

Below I describe the main parameters that we use to define a Word2Vec model (a short sketch of the full call follows the list):

· size: The dimensionality of the word vectors, i.e. how many numbers are used to represent each word. For example, in the classic King/Queen/Woman/Princess illustration, the size would be equal to 4: each word is represented by a vector of 4 numbers. Rule of thumb: if the dataset is small, the size should be small too; if the dataset is large, the size should be larger as well. It is a question of tuning.

· window: The maximum distance between the target word and its neighboring words. For example, take the phrase "agama is a reptile" with 4 words (suppose we do not exclude the stop words). If the window size is 2, then the vector of the word "agama" is directly affected by the words "is" and "a". Rule of thumb: a smaller window should give terms that are more closely related (of course, the exclusion of stop words should be considered).

· min_count: Ignore all words with a total frequency lower than this. For example, if a word's frequency is extremely low, the word might be considered unimportant.

· sg: Selects the training algorithm: 1 for Skip-Gram, 0 for CBOW (Continuous Bag of Words).

· workers: The number of worker threads used to train the model.
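Putting those parameters together, a typical call looks roughly like this. This is only a sketch with a made-up toy corpus; note that the article uses the gensim 3.x parameter names, and in gensim 4.x `size` was renamed to `vector_size`.

from gensim.models import word2vec

# hypothetical corpus: a list of tokenized sentences
toy_corpus = [["quick", "brown", "fox"], ["lazy", "dog"]]

model = word2vec.Word2Vec(
    toy_corpus,
    size=100,      # dimensionality of the word vectors (vector_size in gensim >= 4.0)
    window=5,      # max distance between the target word and its neighbours
    min_count=1,   # ignore words that appear fewer times than this
    sg=1,          # 1 = Skip-Gram, 0 = CBOW (the default)
    workers=4,     # number of worker threads
)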

Model building:

We use the hotel-reviews dataset from the Kaggle repository. Click here for the dataset.

Steps:

  1. Clean the data
  2. Build a corpus
  3. Train a Word2Vec model
  4. Visualize t-SNE representations of the most common words
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import re
import nltk
import gensim
from gensim.models import word2vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline

nltk.download('stopwords')

Loading the hotel-reviews dataset into data and viewing its top 5 rows.

data = pd.read_csv('/content/hotel-reviews.csv', sep=',', encoding='utf-8', error_bad_lines=False)
data.head()

[Image: the top 5 rows of the dataset]

Viewing the columns of the dataset:

data.columns
[Image: the 5 columns of the dataset]

To remove all the stop words:

STOP_WORDS = nltk.corpus.stopwords.words()
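Note that calling stopwords.words() with no argument returns the stop words of every language that NLTK ships. If only English reviews matter, a slightly more targeted variant (my suggestion, not in the original notebook) would be:

STOP_WORDS = set(nltk.corpus.stopwords.words('english'))  # a set also makes the membership check faster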

Defining clean_sentence to clean each sentence of the dataset:

def clean_sentence(val):
    "remove chars that are not letters or numbers, downcase, then remove stop words"
    regex = re.compile('([^\s\w]|_)+')
    sentence = regex.sub('', val).lower()
    sentence = sentence.split(" ")
    for word in list(sentence):
        if word in STOP_WORDS:
            sentence.remove(word)
    sentence = " ".join(sentence)
    return sentence
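As a quick sanity check (my own example, not from the original notebook), the function should behave roughly as follows:

print(clean_sentence("The room was clean, and the staff were friendly!"))
# expected (roughly): 'room clean staff friendly' -- the exact output depends on the stop-word lists loaded above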

Drop NaNs, then apply the clean_sentence function to the Description column:

def clean_dataframe(data):
    "drop nans, then apply 'clean_sentence' function to Description"
    data = data.dropna(how="any")
    for col in ['Description']:
        data[col] = data[col].apply(clean_sentence)
    return data

Cleaning the data:

data = clean_dataframe(data)
data.head(5)
[Image: cleaned data in the Description column]

Building the corpus of the dataset — Creates a list of lists containing words from each sentence


def build_corpus(data):
    "Creates a list of lists containing words from each sentence"
    corpus = []
    for col in ['Description']:
        for sentence in data[col].iteritems():
            word_list = sentence[1].split(" ")
            corpus.append(word_list)
    return corpus

Viewing the built corpus:

corpus = build_corpus(data)
corpus[0:10]
[Image: the first 10 entries of the corpus]

Importing word2vec from gensim, training the model on the corpus, and looking up the word vector of a word:

model = word2vec.Word2Vec(corpus, size=100, window=20, min_count=2, workers=4)
model.wv['luxurious']
[Image: the word vector of 'luxurious']
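Once the model is trained, the vectors can also be compared directly. For example (purely illustrative; both words must have met the min_count threshold and therefore be in the vocabulary):

# cosine similarity between two word vectors from the trained model
print(model.wv.similarity('luxurious', 'comfortable'))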

t-SNE: t-Distributed Stochastic Neighbor Embedding:

t-Distributed Stochastic Neighbor Embedding is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. It maps multi-dimensional data down to two or more dimensions suitable for human observation.

How does t-SNE work?

Here is the intuition of what t-SNE does and how it works.

Suppose you have a 50-dimensional dataset; visualizing it and getting a sense of it is next to impossible for us. We have to convert that 50-D dataset into something we can visualize or play around with. This is where t-SNE comes into the picture: it converts the higher-dimensional data into lower-dimensional data through the following steps (a toy sketch follows the list).

  1. It measures the similarity between two data points, and it does so for every pair. Similar data points get a higher similarity value and dissimilar data points a lower one.
  2. It then converts that similarity distance into a probability (a joint probability) according to the normal distribution.
  3. As said in the first point, it does the similarity check for every point, so it ends up with a similarity matrix S1 over all points. This is all the computation it does for the data points lying in the higher-dimensional space.
  4. Now, t-SNE arranges all of the data points randomly in the required lower dimension (let's suppose 2).
  5. It performs the same computation for the lower-dimensional data points as for the higher-dimensional ones, calculating similarity distances, but with one major difference: it assigns probabilities according to the t-distribution instead of the normal distribution, which is why it is called t-SNE rather than plain SNE.
  6. Now we also have a similarity matrix for the lower-dimensional data points. Let's call it S2.
  7. t-SNE then compares the matrices S1 and S2 and tries to make the difference between them as small as possible by doing some complex mathematics (minimizing the Kullback-Leibler divergence between the two distributions).
  8. In the end, we obtain lower-dimensional data points that try to capture even the complex relationships at which PCA fails.
  9. So, at a very high level, this is how t-SNE works.
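Here is a heavily simplified toy sketch of those steps (my own illustration; real t-SNE normalizes each point's Gaussian with a perplexity-tuned bandwidth and then minimizes the KL divergence by gradient descent rather than just printing it):

import numpy as np

def gaussian_similarities(X, sigma=1.0):
    # steps 1-3: pairwise distances in the high-dimensional space turned into probabilities (matrix S1)
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    P = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def student_t_similarities(Y):
    # steps 5-6: in the low-dimensional map, use a heavy-tailed Student-t kernel instead (matrix S2)
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    Q = 1.0 / (1.0 + d2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

X = np.random.rand(50, 10)   # 50 points in 10 dimensions
Y = np.random.rand(50, 2)    # step 4: random 2-D starting positions
P, Q = gaussian_similarities(X), student_t_similarities(Y)
kl = np.sum(P * np.log((P + 1e-12) / (Q + 1e-12)))  # step 7: the quantity t-SNE tries to minimize
print(kl)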
def tsne_plot(model):
    "Creates a TSNE model and plots it"
    labels = []
    tokens = []
    for word in model.wv.vocab:
        tokens.append(model[word])
        labels.append(word)
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)
    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
    plt.figure(figsize=(16, 16))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

tsne_plot(model)

Now, let's look at a more selective model:

# A more selective model
model1 = word2vec.Word2Vec(corpus, size=100, window=20, min_count=3, workers=4)
tsne_plot(model1)
[Image: t-SNE plot for the more selective model]

The words most similar to a target word:

model.most_similar('walking')
[Image: words similar to 'walking']
model.most_similar('pretty')
[Image: words similar to 'pretty']

Further improvements:

Training word2vec is a very computationally expensive process. With millions of words, training may take a lot of time. Some methods to counter this are negative sampling and hierarchical softmax. A good link for understanding both can be found here.
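In gensim these tricks are exposed as constructor arguments (hs and negative; the calls below use the gensim 3.x parameter names seen earlier in this post and are shown only as an illustration):

# hierarchical softmax instead of negative sampling
model_hs = word2vec.Word2Vec(corpus, size=100, window=20, min_count=2, workers=4, hs=1, negative=0)

# negative sampling with 5 "noise" words per positive example (this is in fact gensim's default)
model_ns = word2vec.Word2Vec(corpus, size=100, window=20, min_count=2, workers=4, negative=5)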

Hope this helps :)


Follow if you like my posts.


For more help, check my Github :- https://github.com/Afaf-Athar/Word2Vec


Additional Resources I found Useful:
1. https://www.kaggle.com/harmanpreet93/train-word2vec-on-hotel-reviews-dataset
2. https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1
3. https://github.com/nltk/nltk/blob/develop/nltk/test/gensim.doctest
4. Kullback-Leibler divergence: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
5. Good hyperparameter information: https://distill.pub/2016/misread-tsne/
6. L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.

Please leave comments for any clarifications or questions.


Happy learning 😃


Source: https://medium.com/@afafathar3007/word-embedding-word2vec-with-genism-nltk-and-t-sne-visualization-43eae8ab3e2e
