Implementing the OddOneOut Algorithm with word2vec in Python

Firstly, let's talk about what a Word2Vec model is. Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network. It was developed by Tomas Mikolov at Google in 2013. For the Odd One Out algorithm that we are going to implement, we will use the Google pre-trained model 'GoogleNews-vectors-negative300.bin', which can be downloaded from here. This model can be loaded using the gensim module with the following code:

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# model loaded

The model contains 300-dimensional vectors for 3 million words and phrases.

samsung_vector = model["samsung"]  # word vector of the word "samsung"
apple_vector = model["apple"]      # word vector of the word "apple"
print(samsung_vector.shape, apple_vector.shape)
# (300,) (300,)  -- printed result, both vectors are 300-dimensional

To get a good idea of what word2vec is, you can refer to this article.

In this implementation, we will use KeyedVectors (from the gensim module) and the cosine similarity function (provided by sklearn). Import these two with the following code:

from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

Now, coming to the algorithm. What do I mean by OddOneOut? Let's take an example so you can understand better. Assume we have a list of 5 words: ["apple", "mango", "banana", "red", "papaya"]. If we have to tell which of these five words is the odd one out, we can quickly tell it's "red", because all the other words are names of fruits (all those words share the same context → fruits). That's what we are going to implement: our program will take a list of words as input and tell which word among them is the odd one out.

The cosine similarity function plays the primary role in implementing this algorithm. What does cosine_similarity do? It computes similarity as the normalized dot product of X and Y. In simple words, we can use it to tell how closely two terms are related to each other. Let us see some examples.
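To make "normalized dot product" concrete, here is a minimal NumPy sketch that computes it by hand on made-up 3-dimensional vectors (not taken from the model); sklearn's cosine_similarity returns the same value for the same inputs:

```python
import numpy as np

def cos_sim(x, y):
    # normalized dot product: dot(x, y) / (||x|| * ||y||)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 0.0, 1.0])
y = np.array([1.0, 1.0, 0.0])
print(cos_sim(x, y))  # → 0.5
```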

# similarity between the two words "samsung" and "apple"
# here the model interprets apple and samsung in the context of mobile companies
print(cosine_similarity([samsung_vector], [apple_vector]))
# array([[0.2928738]], dtype=float32)  -- printed result

As we can see, the similarity came out to be 0.29, which is fairly close to zero. Cosine similarity ranges from -1 to 1: the closer it is to 1, the more similar the two words are, while a value near zero means the two words are only weakly related.
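As a quick sanity check of that range, using plain NumPy on toy vectors (made up for illustration, not from the model): identical directions give 1, opposite directions give -1, and orthogonal vectors give 0:

```python
import numpy as np

def cos_sim(x, y):
    # normalized dot product of two vectors
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 1.0])
c = np.array([1.0, 0.0])

print(cos_sim(a, a))   # identical direction → 1.0
print(cos_sim(a, -a))  # opposite direction → -1.0
print(cos_sim(b, c))   # orthogonal vectors → 0.0
```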

[Figure: visual image of word vectors, with contexts as axes]

Let's discuss the OddOneOut algorithm. We pass a list of words to our program and take the average of the word vectors of all the words: if the word vectors of the words in the list are v1, v2, v3, ..., vn (n = number of words in the list), the average vector can be found by taking the mean of all the word vectors with np.mean([v1, v2, v3, ..., vn], axis=0). Then we set a variable mini and give it a considerably high value, which will help in the comparisons we will see shortly. Then we start a for loop, iterate over all the words in the list, and check the cosine similarity between each word and the average vector we calculated. The word with the minimum similarity to the average vector will be our odd one out: the average vector is dominated by the n-k words sharing the same context, with only k words (where k is a small number, here 1) from a very different context, so the out-of-context word ends up farthest from the average. With this, we are done with our implementation. You can follow the code below to implement this algorithm.

import numpy as np

def oddoneout(words):
    word_vectors = [model[w] for w in words]
    avg = np.mean(word_vectors, axis=0)  # average of all word vectors
    ret = None
    mini = 9999  # start higher than any possible cosine similarity
    # iterate over the words, tracking the one least similar to the average
    for w in words:
        cos_sim = cosine_similarity([model[w]], [avg])
        if cos_sim < mini:
            mini = cos_sim
            ret = w
    return ret

words = ["apple", "mango", "banana", "red", "papaya"]
print(oddoneout(words))
# red  -- printed result

I hope you liked this. Try implementing the algorithm on your own first; it's a pretty cool algorithm, so I hope you had fun learning about it. Keep learning.

Translated from: https://medium.com/swlh/implementing-oddoneout-algorithm-with-word2vec-in-python-bb8c314baa44
