How to Draw a Map Using Python and Word2Vec

Word2vec is definitely the most playful concept I’ve met during my Natural Language Processing studies so far. Imagine an algorithm that can really successfully mimic an understanding of the meanings of words and their functions in language, that can measure the closeness of words along the lines of hundreds of different topics, and that can answer more complicated questions like “who was to literature what Beethoven was to music”.

I thought it would be interesting to visually represent word2vec vectors: essentially, we can take the vectors of countries or cities, apply principal component analysis to reduce the dimensions, and put them on a 2-D chart. And then, we can observe how close we are to an actual geographical map.

In this post, we are going to:

  • discuss the word2vec theory in broad terms;
  • download the original pre-trained vectors;
  • check out a few playful applications: finding the odd one out of a list or doing arithmetical operations on words, like the famous king — man + woman = queen example;
  • see how accurately we can draw the capitals of Europe based on nothing else but the word2vec vectors.

The original word2vec research paper and the pre-trained model are from 2013, and considering the rate at which the NLP literature is expanding, it’s old technology at this point. Newer approaches include GloVe (faster training, different algorithm, can be trained on a smaller corpus) and fastText (capable of handling character n-grams). I’m sticking with the results of the original algorithm for now.

Quick Word2Vec Intro

One of the core concepts of Natural Language Processing is how we can quantify words and expressions in order to be able to work with them in a model setting. This mapping of language elements to numerical representations is called word embedding.

Word2vec is a word embedding process. The concept is relatively simple: it loops through the corpus sentence by sentence and fits a model that predicts each word based on its neighbouring words within a window of pre-defined size. To do that, it uses a neural network, but it doesn’t actually use the predictions: once the model is trained, it only keeps the weights from the first layer. In the original model, the one we are going to use, there are 300 weights, so every word is represented by a 300-dimensional vector.

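Just to make the mechanics concrete, here is a minimal sketch of what fitting such a model yourself looks like with gensim’s Word2Vec class (assuming gensim 4.x; the toy corpus and parameter values are purely illustrative, not what Google used):

from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens.
sentences = [
    ['the', 'dog', 'barks', 'at', 'the', 'cat'],
    ['the', 'cat', 'sleeps', 'on', 'the', 'sofa'],
]

# vector_size = 300 mirrors the pre-trained model; window sets how many
# neighbouring words count as context; sg = 1 selects skip-gram (sg = 0 is CBOW).
model = Word2Vec(sentences, vector_size = 300, window = 5, sg = 1, min_count = 1)

# Only the learned weights, i.e. the word vectors, are kept for downstream use.
vectors = model.wv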

Note that two words don’t have to be in each other’s proximity to be deemed similar. If two words never appear in the same sentence, but they are usually surrounded by the same words, it is safe to assume that they have a similar meaning.

There are two modelling approaches within word2vec: skip-gram and continuous bag-of-words, both with their own advantages and sensitivities to certain hyperparameters… but you know what? We aren’t going to fit our own model, so I’m not going to spend more time on it; you can read more about the different approaches and parameters in this article or on the wiki site.

Naturally, the word vectors you get depend on the corpus you train your model on. Generally, you do need a huge corpus; there are versions trained on Wikipedia or on news articles from various sources. The results that we are going to use were trained on Google News.

How to Download & Install

First, you will need to download pre-trained word2vec vectors. You can choose from a wide variety of models, trained on different types of documents. I went with the original model, trained on Google News, which you can download from many sources; just search for “Google News vectors negative 300”. For example, this GitHub repository is a convenient option: https://github.com/mmihaltz/word2vec-GoogleNews-vectors.

Be careful, the file is 1.66 GB, but in its defence, it contains 300-dimensional representations of 3 million words and phrases.

When it comes to working with word2vec in Python, once again you have a lot of packages to choose from; we are going to use the gensim library. Assuming you have the file saved in the word2vec_pretrained folder, you can load it in Python like so:

from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format(
    './word2vec_pretrained/GoogleNews-vectors-negative300.bin.gz',
    binary = True, limit = 1000000)

The limit parameter defines how many words you are importing; 1 million was plenty for my purposes.

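If you want to sanity-check how much of the vocabulary was actually loaded, a quick way (assuming gensim 4.x, where the vocabulary is exposed as key_to_index) is:

# Number of words loaded; with limit = 1000000 this should print 1000000.
print(len(word_vectors.key_to_index))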

Playing with Words

Now that we have the word2vec vectors in place, we can check out some of its applications.

First of all, you can actually check the vector representation of any word:

word_vectors['dog']

The result is, as we expected, a 300-dimensional vector that is quite difficult to interpret. But that’s the basis of the whole concept: we make calculations on these vectors by adding and subtracting them from each other, and then we calculate the cosine similarities to find the closest matching words.

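As a rough illustration of what the cosine similarity part means, here is a small sketch that computes it by hand with numpy and compares it to gensim’s built-in similarity method (the word pair is just an example):

import numpy as np

v1, v2 = word_vectors['dog'], word_vectors['cat']

# Cosine similarity: dot product divided by the product of the vector norms.
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Should match gensim's own helper up to floating point precision.
print(cosine, word_vectors.similarity('dog', 'cat'))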

You can find synonyms with the most_similar function; the topn parameter defines how many words you want listed:

word_vectors.most_similar(positive = ['nice'], topn = 5)

results in

[('good', 0.6836092472076416),
('lovely', 0.6676311492919922),
('neat', 0.6616737246513367),
('fantastic', 0.6569241285324097),
('wonderful', 0.6561347246170044)]

Now, you might think that with a similar approach you can also find antonyms: you just need to enter the word ‘nice’ as a negative, right? Not really.

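The query would look something like this, with the word passed on the negative side of the same most_similar call, and this is what comes back:

word_vectors.most_similar(negative = ['nice'], topn = 5)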

[('J.Gordon_###-###', 0.38660115003585815),
('M.Kenseth_###-###', 0.35581791400909424),
('D.Earnhardt_Jr._###-###', 0.34227001667022705),
('G.Biffle_###-###', 0.3420777916908264),
('HuMax_TAC_TM', 0.3141660690307617)]

These are the words that are farthest away from the word ‘nice’, suggesting that it does not always work as you would expect.

You can find odd ones out using the doesnt_match function:

word_vectors.doesnt_match(
    ['Hitler', 'Churchill', 'Stalin', 'Beethoven'])

returns Beethoven. Which is handy, I guess.

And finally, let’s see a couple of examples of the operations that made the algorithm famous by giving it a false sense of intelligence. If we want to combine the values of the word vectors father and woman but subtract the values assigned to the word vector man:

word_vectors.most_similar(
    positive = ['father', 'woman'], negative = ['man'], topn = 1)

we get:

[('mother', 0.8462507128715515)]

It’s a bit difficult to wrap your head around this operation at first, and I don’t think phrasing the question as “what is to a woman as father is to a man?” is really that helpful. Imagine that we have only 2 dimensions: parentness and gender. The word ‘woman’ might be represented by the vector [0, 1], ‘man’ is [0, -1], ‘father’ would be [1, -1], and ‘mother’ would be [1, 1]. Now, if we do the same operations as we did above with the word vectors, we get the same result. Of course, the difference is that we have 300 dimensions instead of the mere 2 in the example, and the dimensions’ meanings are nigh impossible to interpret.

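A tiny sketch of that toy example (purely illustrative 2-D vectors, not taken from the model) shows that the arithmetic lands exactly on ‘mother’:

import numpy as np

# Toy 2-D embeddings: [parentness, gender]
woman  = np.array([0,  1])
man    = np.array([0, -1])
father = np.array([1, -1])
mother = np.array([1,  1])

# father - man + woman = [1, 1], which is exactly the 'mother' vector.
print(father - man + woman)                           # [1 1]
print(np.array_equal(father - man + woman, mother))   # True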

There was a famous example of gender bias in word2vec operations: the ‘woman’ version of the word ‘doctor’ (which is, as we know, a gender-neutral word) used to be calculated as ‘nurse’. I tried replicating it, but did not get the same result:

word_vectors.most_similar(
    positive = ['doctor', 'woman'], negative = ['man'], topn = 1)

[('gynecologist', 0.7093892097473145)]

So, progress, I guess?

All right, now that we’ve checked out a few of the basic possibilities, let’s work on our map!

Mapping Function

First, we need a plan of what we want our mapping function to do. Assuming we have a list of strings we want to visualise and a word vector object, we want to:

  1. find the word vector representation of each word in the list;
  2. reduce the dimensions to 2 using principal component analysis;
  3. create a scatter plot, add the words as labels to each data point;
  4. as an added bonus, make it possible to “flip” the results by any dimension — the vectors from the principal component analysis are of an arbitrary direction, which we might want to change when we plot geographical words to better align with the real-world directions.

We will need the following libraries:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import adjustText

One library from the list that is not commonly used is adjustText; it’s a very handy package that makes it simple to place text labels on scatter plots without them overlapping. It was surprisingly hard for me to find this solution, and as far as I know, there is no way to do this in matplotlib or seaborn.

Without further ado, this is the function that will do exactly what we need:

def plot_2d_representation_of_words(
        word_list,
        word_vectors,
        flip_x_axis = False,
        flip_y_axis = False,
        label_x_axis = "x",
        label_y_axis = "y",
        label_label = "city"):

    pca = PCA(n_components = 2)

    # collect the word and its 300-dimensional vector, one row per word
    word_plus_coordinates = []
    for word in word_list:
        current_row = []
        current_row.append(word)
        current_row.extend(word_vectors[word])
        word_plus_coordinates.append(current_row)

    word_plus_coordinates = pd.DataFrame(word_plus_coordinates)

    # column 0 holds the word itself, the remaining columns hold the vector
    coordinates_2d = pca.fit_transform(
        word_plus_coordinates.iloc[:, 1:])
    coordinates_2d = pd.DataFrame(
        coordinates_2d, columns = [label_x_axis, label_y_axis])
    coordinates_2d[label_label] = word_plus_coordinates.iloc[:, 0]

    # the principal components have an arbitrary sign, so allow flipping them
    if flip_x_axis:
        coordinates_2d[label_x_axis] = \
            coordinates_2d[label_x_axis] * (-1)
    if flip_y_axis:
        coordinates_2d[label_y_axis] = \
            coordinates_2d[label_y_axis] * (-1)

    plt.figure(figsize = (15, 10))
    p1 = sns.scatterplot(
        data = coordinates_2d, x = label_x_axis, y = label_y_axis)

    x = coordinates_2d[label_x_axis]
    y = coordinates_2d[label_y_axis]
    label = coordinates_2d[label_label]

    # place the word labels and let adjustText untangle any overlaps
    texts = [plt.text(x[i], y[i], label[i]) for i in range(len(x))]
    adjustText.adjust_text(texts)

Now it’s time to test the function. I plotted the capitals of the European countries, but you can use literally any list (names of presidents or other historical figures, car brands, cooking ingredients, rock bands, and so on); just pass it in the word_list parameter. I had some fun with it; it’s interesting to see the clusters that form and to try to come up with a meaning behind the two axes.

In case you want to reproduce the results, here are the cities:

capitals = [
    'Amsterdam', 'Athens', 'Belgrade', 'Berlin', 'Bern',
    'Bratislava', 'Brussels', 'Bucharest', 'Budapest',
    'Chisinau', 'Copenhagen', 'Dublin', 'Helsinki', 'Kiev',
    'Lisbon', 'Ljubljana', 'London', 'Luxembourg', 'Madrid',
    'Minsk', 'Monaco', 'Moscow', 'Nicosia', 'Nuuk', 'Oslo',
    'Paris', 'Podgorica', 'Prague', 'Reykjavik', 'Riga',
    'Rome', 'San_Marino', 'Sarajevo', 'Skopje', 'Sofia',
    'Stockholm', 'Tallinn', 'Tirana', 'Vaduz', 'Valletta',
    'Vatican', 'Vienna', 'Vilnius', 'Warsaw', 'Zagreb']

(Andorra’s capital, Andorra la Vella, is missing from the list; I couldn’t find a format that word2vec recognises. We will live with that.)

Assuming you still have the word_vectors object we created in the previous section, you can call the function like so:

plot_2d_representation_of_words(
    word_list = capitals,
    word_vectors = word_vectors,
    flip_y_axis = True)

(The y axis is flipped in order to create a representation that better resembles a real map.)

And the result is:

[Image: Word2vec map of Europe]

I don’t know how you feel, but when I first saw the map, I couldn’t believe how well it turned out! Yes, sure, the longer you look, the more “mistakes” you find; one ominous outcome is that Moscow is not as far to the east as it should be… Still, east and west are almost perfectly separated, the Scandinavian and Baltic countries are nicely grouped, so are the capitals around Italy, and the list goes on.

It’s important to emphasise that this was never meant to be a purely geographical map; for example, Athens is very far to the west, but there’s a reason for that. Let’s recap how the map above was derived, just so we can fully appreciate it:

  • a group of researchers at Google trained a gigantic neural network that predicted words based on their context;
  • they saved the weights of every single word in a 300-dimensional vector representation;
  • we took the vectors of the European capitals;
  • reduced the dimensions to 2 by using principal component analysis;
  • put the calculated components on a chart.

And I think that’s pretty awesome!

Original article: https://towardsdatascience.com/how-to-draw-a-map-using-python-and-word2vec-e9627b4eae34
