Word2Vec (CBOW + Skip-gram): Concepts and Code Implementation

Word2Vec

Word2vec represents a word through its neighboring words (the context), which addresses the context-loss issue of simple count-based representations.

The two major architectures for word2vec are continuous bag-of-words (CBOW) and skip-gram (SG).

Continuous bag-of-words (CBOW)
[Figure: CBOW architecture]

  1. Input layer: the one-hot encoded vectors of the context words, where $V$ is the size of the vocabulary and $C$ is the number of context words.

  2. Initialize a weight matrix $W_{V \times N}$ and left-multiply it with every input one-hot vector to obtain $N$-dimensional vectors $\omega_1, \omega_2, \ldots, \omega_C$, where $N$ (the embedding dimension) is chosen according to the task.

  3. Sum the vectors $\omega_1, \omega_2, \ldots, \omega_C$ and take their average as the hidden-layer vector $h$.

  4. Initialize another weight matrix $W'_{N \times V}$, multiply it by the hidden-layer vector $h$, and pass the resulting $V$-dimensional scores through the activation function (softmax) to obtain the vector $y$. Each element of $y$ is the predicted probability of the corresponding word.

  5. The word indicated by the largest element of $y$ is the predicted center word (target word); it is compared with the one-hot vector of the true label, and the smaller the error the better (both weight matrices are updated according to the error).

Before training, a loss function (usually cross-entropy) is defined, and gradient descent is used to update $W$ and $W'$. After training, multiplying the one-hot vector of each input word by the matrix $W$ yields that word's distributed representation, also called its word embedding (a minimal sketch of this forward pass is given below).
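Below is a minimal NumPy sketch of the CBOW forward pass and cross-entropy loss described above; the vocabulary size, embedding dimension, and word indices are toy assumptions, not values from the text.

import numpy as np

V, N, C = 10, 4, 2                 # vocab size, embedding dim, number of context words (toy values)
rng = np.random.default_rng(0)

W = rng.normal(size=(V, N))        # input weight matrix  W_{V x N}
W_prime = rng.normal(size=(N, V))  # output weight matrix W'_{N x V}

context_ids = [3, 7]               # indices of the C context words (hypothetical)
target_id = 5                      # index of the true center word (hypothetical)

# Steps 2-3: a row lookup is equivalent to one-hot x W; averaging gives the hidden vector h
h = W[context_ids].mean(axis=0)    # shape (N,)

# Step 4: project back to vocabulary space and apply softmax
scores = h @ W_prime               # shape (V,)
y = np.exp(scores - scores.max())
y /= y.sum()                       # probability distribution over the V words

# Step 5: cross-entropy loss against the one-hot true label
loss = -np.log(y[target_id])
print("predicted word id:", y.argmax(), "loss:", round(float(loss), 4))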

Skip-gram (SG)
[Figure: Skip-gram architecture]

In Skip-gram, the model iterates over the words in the corpus and predicts the neighbors of each word (i.e. its context). In other words, given a word, the model predicts the context in which it is likely to occur. By training on a large corpus, the weight matrix from the input layer to the hidden layer is learned, and its rows serve as the word embeddings.
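As a small illustration (not from the references below), the sketch here shows how skip-gram turns a sentence into (center word, context word) training pairs; the toy sentence and window size are assumptions.

def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs within `window` positions of each word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]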

# Python program to generate word vectors using Word2Vec
 
# importing all necessary modules
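# NOTE: sent_tokenize / word_tokenize need NLTK's 'punkt' tokenizer data,
# which can be downloaded once with nltk.download('punkt')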
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
 
warnings.filterwarnings(action = 'ignore')
 
import gensim
from gensim.models import Word2Vec
 
#  Reads 'alice.txt' file
sample = open("alice.txt", encoding="utf-8")
s = sample.read()
sample.close()
 
# Replaces newline characters with spaces
f = s.replace("\n", " ")
 
data = []
 
# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []
     
    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())
 
    data.append(temp)
 
# Create CBOW model (sg=0 is gensim's default, which selects CBOW)
model1 = gensim.models.Word2Vec(data, min_count = 1,
                              vector_size = 100, window = 5)


# Print results
print("Cosine similarity between 'alice' " +
               "and 'wonderland' - CBOW : ",
    model1.wv.similarity('alice', 'wonderland'))
     
print("Cosine similarity between 'alice' " +
                 "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))



# Create Skip Gram model
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100,
                                             window = 5, sg = 1)
 
# Print results
print("Cosine similarity between 'alice' " +
          "and 'wonderland' - Skip Gram : ",
    model2.wv.similarity('alice', 'wonderland'))
     
print("Cosine similarity between 'alice' " +
            "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))

# output
Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.9845774
Cosine similarity between 'alice' and 'machines' - CBOW :  0.94986236
Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.7097177
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.81548774
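
As a quick follow-up (not part of the original script), the trained word vectors can also be queried for nearest neighbours with gensim's most_similar; the exact neighbours depend on your local training run.

# Words whose vectors are closest to 'alice' in the CBOW model;
# results vary between runs because training here is not fully deterministic.
print(model1.wv.most_similar('alice', topn=5))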

References

  1. Getting started with Word2vec
  2. Word2Vec
  3. Python | Word Embedding using Word2Vec
  4. Data(alice.txt) can be downloaded here