Word2Vec (CBOW + Skip-gram): Concepts and Code Implementation

Word2Vec

Word2vec represents a word through its neighboring words (the context), which addresses the context-loss issue of simple count-based representations.

The two major architectures for word2vec are continuous bag-of-words (CBOW) and skip-gram (SG).

Continuous bag-of-words (CBOW)
[Figure: CBOW architecture]

  1. Input layer: the one-hot encoded vectors of the context words, where $V$ is the size of the vocabulary and $C$ is the number of context words.

  2. Initialize a weight matrix $W_{V \times N}$ and left-multiply it with every input one-hot vector to obtain $N$-dimensional vectors $\omega_1, \omega_2, \ldots, \omega_C$, where $N$ (the embedding dimension) is chosen according to the task.

  3. Sum the vectors $\omega_1, \omega_2, \ldots, \omega_C$ and take their average as the hidden-layer vector $h$.

  4. Initialize another weight matrix $W'_{N \times V}$, multiply it by the hidden-layer vector $h$, and pass the resulting $V$-dimensional scores through the activation function (softmax) to obtain the vector $y$. Each element of $y$ is the predicted probability of the corresponding word.

  5. The word indicated by the largest element of $y$ is the predicted center word (target word); it is compared with the one-hot vector of the true label, and the smaller the error the better (both weight matrices are updated according to the error).

Before training, a loss function (usually cross-entropy) is defined, and gradient descent is used to update $W$ and $W'$. After training, multiplying the one-hot vector of each input word by the matrix $W$ yields that word's distributed representation, also called its word embedding (a minimal sketch of this forward pass is given below).
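Below is a minimal NumPy sketch of the CBOW forward pass and cross-entropy loss described above; the vocabulary size, embedding dimension, and word indices are toy assumptions, not values from the text.

import numpy as np

V, N, C = 10, 4, 2                 # vocab size, embedding dim, number of context words (toy values)
rng = np.random.default_rng(0)

W = rng.normal(size=(V, N))        # input weight matrix  W_{V x N}
W_prime = rng.normal(size=(N, V))  # output weight matrix W'_{N x V}

context_ids = [3, 7]               # indices of the C context words (hypothetical)
target_id = 5                      # index of the true center word (hypothetical)

# Steps 2-3: a row lookup is equivalent to one-hot x W; averaging gives the hidden vector h
h = W[context_ids].mean(axis=0)    # shape (N,)

# Step 4: project back to vocabulary space and apply softmax
scores = h @ W_prime               # shape (V,)
y = np.exp(scores - scores.max())
y /= y.sum()                       # probability distribution over the V words

# Step 5: cross-entropy loss against the one-hot true label
loss = -np.log(y[target_id])
print("predicted word id:", y.argmax(), "loss:", round(float(loss), 4))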

Skip-gram (SG)
[Figure: Skip-gram architecture]

In Skip-gram, the model iterates over the words in the corpus and predicts the neighbors of each word (i.e. its context). In other words, given a word, the model predicts the context in which it is likely to occur. By training on a large corpus, the weight matrix from the input layer to the hidden layer is learned, and its rows serve as the word embeddings.
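As a small illustration (not from the references below), the sketch here shows how skip-gram turns a sentence into (center word, context word) training pairs; the toy sentence and window size are assumptions.

def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs within `window` positions of each word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]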

# Python program to generate word vectors using Word2Vec
 
# importing all necessary modules
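# NOTE: sent_tokenize / word_tokenize need NLTK's 'punkt' tokenizer data,
# which can be downloaded once with nltk.download('punkt')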
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
 
warnings.filterwarnings(action = 'ignore')
 
import gensim
from gensim.models import Word2Vec
 
#  Reads 'alice.txt' file
sample = open("alice.txt", encoding="utf-8")
s = sample.read()
sample.close()
 
# Replaces newline characters with spaces
f = s.replace("\n", " ")
 
data = []
 
# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []
     
    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())
 
    data.append(temp)
 
# Create CBOW model (sg=0 is gensim's default, which selects CBOW)
model1 = gensim.models.Word2Vec(data, min_count = 1,
                              vector_size = 100, window = 5)


# Print results
print("Cosine similarity between 'alice' " +
               "and 'wonderland' - CBOW : ",
    model1.wv.similarity('alice', 'wonderland'))
     
print("Cosine similarity between 'alice' " +
                 "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))



# Create Skip Gram model
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100,
                                             window = 5, sg = 1)
 
# Print results
print("Cosine similarity between 'alice' " +
          "and 'wonderland' - Skip Gram : ",
    model2.wv.similarity('alice', 'wonderland'))
     
print("Cosine similarity between 'alice' " +
            "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))

# output
Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.9845774
Cosine similarity between 'alice' and 'machines' - CBOW :  0.94986236
Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.7097177
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.81548774
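
As a quick follow-up (not part of the original script), the trained word vectors can also be queried for nearest neighbours with gensim's most_similar; the exact neighbours depend on your local training run.

# Words whose vectors are closest to 'alice' in the CBOW model;
# results vary between runs because training here is not fully deterministic.
print(model1.wv.most_similar('alice', topn=5))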

References

  1. Getting started with Word2vec
  2. Word2Vec
  3. Python | Word Embedding using Word2Vec
  4. Data(alice.txt) can be downloaded here