Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings Using t-SNE

cullen2012

Original article: https://habr.com/en/company/mailru/blog/449984/

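This article explores how to visualize Word2Vec word embeddings with the t-SNE algorithm and how to understand and interpret the mathematical relations between words in a text, with a focus on word vectors from the Google News dataset and from the works of Leo Tolstoy.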

Everyone perceives texts in a unique way, whether reading news on the Internet or world-famous classic novels. The same applies to algorithms and machine learning techniques, which understand texts in a more mathematical way, namely, as points in a high-dimensional vector space.

This article is devoted to visualizing high-dimensional Word2Vec word embeddings using t-SNE. The visualization helps to understand how Word2Vec works and how to interpret the relations between vectors learned from your texts before using them in neural networks or other machine learning algorithms. As training data, we will use articles from Google News and classic literary works by Leo Tolstoy, the Russian writer who is regarded as one of the greatest authors of all time.

We start with a brief overview of the t-SNE algorithm, then move on to computing word embeddings with Word2Vec, and finally visualize the word vectors with t-SNE in 2D and 3D space. We will write our scripts in Python using Jupyter Notebook.

t-Distributed Stochastic Neighbor Embedding

t-SNE is a machine learning algorithm for data visualization based on a nonlinear dimensionality reduction technique. The basic idea of t-SNE is to reduce the number of dimensions while keeping the relative pairwise distances between points. In other words, the algorithm maps multi-dimensional data to a low-dimensional space (usually two or three dimensions) so that points that were close to each other stay close, and points that were far apart tend to stay far apart. Put differently, t-SNE looks for a new data representation in which the neighborhood relations are preserved. A detailed description of the full t-SNE logic can be found in the original article [1].
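
To make the notion of "neighborhood" in the low-dimensional map more concrete, here is a minimal sketch of the Student-t similarities that [1] defines between mapped points. It uses only the standard library; the function name `student_t_similarities` and the toy points are assumptions made for illustration and are not part of scikit-learn or of the original article.

```python
def student_t_similarities(points):
    """Return q[i][j], the low-dimensional similarities used by t-SNE.

    q[i][j] is proportional to 1 / (1 + squared distance between i and j),
    normalized so that all off-diagonal entries sum to 1.
    """
    n = len(points)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                kernel[i][j] = 1.0 / (1.0 + d2)
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


# Hypothetical 2D map: the first two points are neighbors, the third is far away.
points_2d = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
q = student_t_similarities(points_2d)
print(q[0][1] > q[0][2])  # the close pair gets the larger similarity
```

t-SNE moves the mapped points until these similarities match, as closely as possible, the similarities measured in the original high-dimensional space.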

The Word2Vec Model

To begin with, we need vector representations of words. For this purpose, I selected Word2Vec [2], a computationally efficient predictive model for learning multi-dimensional word embeddings from raw text. The key idea of Word2Vec is to place words that share common contexts in the training corpus close to each other in the vector space.
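
"Close to each other" is usually measured with cosine similarity, which is also what Gensim's `most_similar` ranks by. The snippet below is a small standard-library sketch of that measure; the three-dimensional `toy_vectors` are invented for illustration and are not real Word2Vec output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: "war" and "battle" point in a similar direction.
toy_vectors = {
    "war": [0.9, 0.1, 0.0],
    "battle": [0.8, 0.2, 0.1],
    "food": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["war"], toy_vectors["battle"])
sim_far = cosine_similarity(toy_vectors["war"], toy_vectors["food"])
print(sim_close > sim_far)
```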

As input data for visualization, we will use articles from Google News and a few novels by Leo Tolstoy. Google has published vectors pre-trained on part of the Google News dataset (about 100 billion words) on the official page, so we will use them.

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In addition to the pre-trained model, we will train another model on Tolstoy's novels using the Gensim [3] library. Word2Vec takes sentences as input and produces word vectors as output. First, we need the pre-trained Punkt sentence tokenizer, which splits a text into a list of sentences, taking into account abbreviations, collocations, and words that probably mark the start or end of a sentence. By default, the NLTK data package does not include a pre-trained Punkt tokenizer for Russian, so we will use third-party models from github.com/mhq/train_punkt.

import re
import codecs
import nltk


def preprocess_text(text):
    # Keep Latin and Cyrillic letters (including ё/Ё, which lie outside а-я) and all digits 0-9.
    text = re.sub('[^a-zA-Zа-яА-ЯёЁ0-9]+', ' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip()


def prepare_for_w2v(filename_from, filename_to, lang):
    raw_text = codecs.open(filename_from, "r", encoding='windows-1251').read()
    with open(filename_to, 'w', encoding='utf-8') as f:
        for sentence in nltk.sent_tokenize(raw_text, lang):
            print(preprocess_text(sentence.lower()), file=f)

The following hyperparameters were used at the Word2Vec training stage:

The dimensionality of the feature vectors is 200.
The maximum distance between analyzed words within a sentence is 5.
Words with a total frequency lower than 5 in the corpus are ignored.
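
The effect of the last two settings can be sketched in plain Python. This is a simplified illustration, not Gensim's implementation (which, among other things, subsamples frequent words and randomly shrinks the window); the helper names `filter_by_min_count` and `context_pairs` are hypothetical.

```python
from collections import Counter


def filter_by_min_count(sentences, min_count):
    """Drop every word whose total frequency in the corpus is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in sentences]


def context_pairs(sentence, window):
    """List (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


corpus = [["all", "happy", "families", "are", "alike"], ["happy", "families"]]
filtered = filter_by_min_count(corpus, min_count=2)
pairs = context_pairs(["a", "b", "c"], window=1)
print(filtered)
print(pairs)
```

With `min_count=2`, only "happy" and "families" survive in the toy corpus, and with `window=1` each word is paired only with its immediate neighbors.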

import multiprocessing
from gensim.models import Word2Vec


def train_word2vec(filename):
    data = gensim.models.word2vec.LineSentence(filename)
    # Gensim 3.x API: in Gensim 4.0 and later the `size` argument is named `vector_size`.
    return Word2Vec(data, size=200, window=5, min_count=5, workers=multiprocessing.cpu_count())

Visualizing Word Embeddings using t-SNE

t-SNE is quite useful when we need to visualize the similarity between objects located in a multidimensional space. The larger the dataset, the harder it becomes to produce an easy-to-read t-SNE plot, so it is common practice to visualize groups of the most similar words.

Let us select a few words from the vocabulary of the pre-trained Google News model and prepare word vectors for visualization.

keys = ['Paris', 'Python', 'Sunday', 'Tolstoy', 'Twitter', 'bachelor', 'delivery', 'election', 'expensive',
        'experience', 'financial', 'food', 'iOS', 'peace', 'release', 'war']

embedding_clusters = []
word_clusters = []
for word in keys:
    embeddings = []
    words = []
    for similar_word, _ in model.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(model[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)

Fig. 1. The effect of various perplexity values on the shape of word clusters.

Next, we proceed to the fascinating part of this article, the configuration of t-SNE. Here we should pay attention to the following hyperparameters.

The number of components, i.e. the dimension of the output space.
The perplexity value, which in the context of t-SNE may be viewed as a smooth measure of the effective number of neighbors. It is related to the number of nearest neighbors used in many other manifold learners (see Fig. 1). According to [1], it is recommended to select a value between 5 and 50.
The type of initialization for the embeddings.
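
The "effective number of neighbors" reading of perplexity follows from its definition in [1]: two raised to the Shannon entropy of the neighbor distribution of a point. A minimal sketch, assuming the neighbor probabilities are already given (the distributions below are made up):

```python
import math


def perplexity(probabilities):
    """Perplexity of a discrete distribution: 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


# Four equally likely neighbors behave like exactly four neighbors.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# One dominant neighbor: the effective number of neighbors is close to one.
peaked = perplexity([0.97, 0.01, 0.01, 0.01])
print(uniform, peaked)
```

In t-SNE, the width of the Gaussian kernel around each point is tuned so that this quantity matches the `perplexity` value you pass in.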

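from sklearn.manifold import TSNE
import numpy as np

# Note: scikit-learn 1.5 and later rename the `n_iter` argument to `max_iter`.
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)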
embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)
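
The reshaping around `fit_transform` may be easier to follow on a toy example. t-SNE expects one flat list of vectors, so the clusters are flattened into `n * m` rows, projected together, and then split back into `n` groups of `m` points. The sketch below mirrors that bookkeeping with plain lists; `flatten_clusters` and `regroup` are hypothetical helpers, and keeping the first two coordinates is only a stand-in for the real t-SNE projection.

```python
def flatten_clusters(clusters):
    """Turn n clusters of m vectors each into one list of n * m vectors."""
    return [vector for cluster in clusters for vector in cluster]


def regroup(flat_points, n_clusters, cluster_size):
    """Split a flat list of points back into n_clusters groups of cluster_size."""
    return [flat_points[i * cluster_size:(i + 1) * cluster_size]
            for i in range(n_clusters)]


# Two clusters, three words each, four-dimensional toy vectors.
clusters = [
    [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]],
    [[9, 8, 7, 6], [8, 7, 6, 5], [7, 6, 5, 4]],
]
flat = flatten_clusters(clusters)

# Placeholder for t-SNE: keep the first two coordinates of every vector.
projected = [vector[:2] for vector in flat]

regrouped = regroup(projected, n_clusters=2, cluster_size=3)
print(len(flat), len(regrouped), len(regrouped[0]), len(regrouped[0][0]))
```

Projecting all clusters in a single call matters: every word ends up in the same 2D coordinate system, so the clusters can be compared on one plot.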

It should be mentioned that t-SNE has a non-convex objective function, which is minimized by gradient descent starting from a random initialization, so different runs can produce slightly different results unless the random seed is fixed.

Below you can find a script for creating a 2D scatter plot using Matplotlib, one of the most popular libraries for data visualization in Python.

Fig. 2. Clusters of similar words from Google News (perplexity=15).

from sklearn.manifold import TSNE

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
%matplotlib inline


def tsne_plot_similar_words(labels, embedding_clusters, word_clusters, a=0.7):
    plt.figure(figsize=(16, 9))
    colors = cm.rainbow(np.linspace(0, 1, len(labels)))
    for label, embeddings, words, color in zip(labels, embedding_clusters, word_clusters, colors):
        x = embeddings[:,0]
        y = embeddings[:,1]
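        # `color=` takes a single RGBA value; passing it via `c=` is ambiguous and triggers a Matplotlib warning.
        plt.scatter(x, y, color=color, alpha=a, label=label)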
        for i, word in enumerate(words):
            plt.annotate(word, alpha=0.5, xy=(x[i], y[i]), xytext=(5, 2), 
                         textcoords='offset points', ha='right', va='bottom', size=8)
    plt.legend(loc=4)
    plt.grid(True)
    plt.savefig("f/г.png", format='png', dpi=150, bbox_inches='tight')
    plt.show()


tsne_plot_similar_words(keys, embeddings_en_2d, word_clusters)

In some cases, it can be useful to plot all word vectors at once in order to see the whole picture. Let us now analyze Anna Karenina, an epic novel of passion, intrigue, tragedy, and redemption.

prepare_for_w2v('data/Anna Karenina by Leo Tolstoy (ru).txt', 'train_anna_karenina_ru.txt', 'russian')
model_ak = train_word2vec('train_anna_karenina_ru.txt')

words = []
embeddings = []
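# Gensim 3.x API: in Gensim 4.0 and later, iterate over model_ak.wv.index_to_key instead of wv.vocab.
for word in list(model_ak.wv.vocab):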
    embeddings.append(model_ak.wv[word])
    words.append(word)
    
tsne_ak_2d = TSNE(n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_ak_2d = tsne_ak_2d.fit_transform(embeddings)

def tsne_plot_2d(label, embeddings, words=[], a=1):
    plt.figure(figsize=(16, 9))
    colors = cm.rainbow(np.linspace(0, 1, 1))
    x = embeddings[:,0]
    y = embeddings[:,1]
    plt.scatter(x, y, c=colors, alpha=a, label=label)
    for i, word in enumerate(words):
        plt.annotate(word, alpha=0.3, xy=(x[i], y[i]), xytext=(5, 2), 
                     textcoords='offset points', ha='right', va='bottom', size=10)
    plt.legend(loc=4)
    plt.grid(True)
    plt.savefig("hhh.png", format='png', dpi=150, bbox_inches='tight')
    plt.show()


tsne_plot_2d('Anna Karenina by Leo Tolstoy', embeddings_ak_2d, a=0.1)

Fig. 3. Visualization of the Word2Vec model trained on Anna Karenina.

The whole picture can be even more informative if we map the initial embeddings into 3D space. This time, let us have a look at War and Peace, one of the most important novels in world literature and one of Tolstoy's greatest literary achievements.

prepare_for_w2v('data/War and Peace by Leo Tolstoy (ru).txt', 'train_war_and_peace_ru.txt', 'russian')
model_wp = train_word2vec('train_war_and_peace_ru.txt')

words_wp = []
embeddings_wp = []
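# Gensim 3.x API: in Gensim 4.0 and later, iterate over model_wp.wv.index_to_key instead of wv.vocab.
for word in list(model_wp.wv.vocab):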
    embeddings_wp.append(model_wp.wv[word])
    words_wp.append(word)
    
tsne_wp_3d = TSNE(perplexity=30, n_components=3, init='pca', n_iter=3500, random_state=12)
embeddings_wp_3d = tsne_wp_3d.fit_transform(embeddings_wp)

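from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib versions


def tsne_plot_3d(title, label, embeddings, a=1):
    fig = plt.figure()
    # Create the 3D axes through the figure: recent Matplotlib no longer attaches Axes3D(fig) automatically.
    ax = fig.add_subplot(111, projection='3d')
    colors = cm.rainbow(np.linspace(0, 1, 1))
    # Call scatter on the 3D axes so that the third argument is read as the z coordinate, not the marker size.
    ax.scatter(embeddings[:, 0], embeddings[:, 1], embeddings[:, 2], c=colors, alpha=a, label=label)
    ax.legend(loc=4)
    ax.set_title(title)
    plt.show()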


tsne_plot_3d('Visualizing Embeddings using t-SNE', 'War and Peace', embeddings_wp_3d, a=0.1)

Fig. 4. Visualization of the Word2Vec model trained on War and Peace.

The Results

This is what texts look like from the Word2Vec and t-SNE perspective. We plotted a quite informative chart for similar words from Google News and two diagrams for Tolstoy's novels. One more thing: GIFs! GIFs are awesome, but plotting GIFs is almost the same as plotting regular graphs, so I decided not to cover them in the article. You can find the code for generating the animations in the sources.

The source code is available on GitHub.

The article was originally published in Towards Data Science.

References

[1] L. van der Maaten and G. Hinton, "Visualizing Data using t-SNE", Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality", Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.
[3] R. Rehurek and P. Sojka, "Software Framework for Topic Modelling with Large Corpora", Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.