Based on:
cs224n notes 1:
http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf
1. word2vec background theory
- distributional semantics: represent the meaning of a word by the contexts in which it usually appears. The obtained representations (word embeddings) are dense and capture similarity better than sparse one-hot vectors.
- distributional similarity: the idea that similar words appear in similar contexts.
2. word2vec
A language model can assign a probability to a sequence of tokens w1, w2, …, wn.
e.g. unigram, bigram model
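Concretely, the unigram model treats tokens as independent, while the bigram model conditions each token on the previous one:

```latex
P_{\text{unigram}}(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i)
\qquad
P_{\text{bigram}}(w_1, \dots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1})
```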
word2vec contains:
language models:
CBOW: predict center word from the context
skip-gram: predict the context from the center word
training methods:
negative sampling: defines an objective by sampling negative examples
hierarchical softmax: defines an objective that uses an efficient tree structure to compute probabilities over the whole vocabulary.
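For reference, the standard negative-sampling objective for one skip-gram pair (center word c, observed outside word o), with K negative words whose output vectors are written u~_k, drawn from a noise distribution P_n(w), and σ the sigmoid:

```latex
J = -\log \sigma(u_o^{\top} v_c) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^{\top} v_c)
```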
word2vec is iteration-based:
- model parameters are the word vectors
- train the model on a certain objective
- at every iteration, evaluate the errors and follow an update rule that penalizes the model parameters that caused them; in this way we learn the word vectors (the model parameters)
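In other words, each iteration applies a gradient-based update of the form

```latex
\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J(\theta)
```

where θ stacks all the word vectors, J is the training objective, and α is the learning rate.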
2.1 CBOW
steps:
- generate one-hot word vectors for the input context of size m: (x_c-m, x_c-m+1, …, x_c-1, x_c+1, …, x_c+m-1, x_c+m), where c indexes the center word and each x is a |V|x1 vector
- get our embedded word vectors for the context: v_c-m = V x_c-m, v_c-m+1 = V x_c-m+1, …, v_c+m = V x_c+m, each an nx1 vector (V is the nx|V| input word matrix)
- average these 2m vectors to get v_ave
- generate a score vector z = U v_ave, where U is the |V|xn output word matrix
- turn the scores into probabilities: ŷ = softmax(z)
- we want the generated probabilities ŷ ∈ R^|V| to match the true probabilities y, the one-hot vector of the actual center word
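A minimal NumPy sketch of this forward pass (the sizes, indices, and variable names below are made-up toy values for illustration; in practice V and U are learned):

```python
import numpy as np

# toy sizes: |V| = 6 words in the vocabulary, n = 4 embedding dimensions, window m = 2
vocab_size, n, m = 6, 4, 2
rng = np.random.default_rng(0)
V = rng.normal(size=(n, vocab_size))   # input (context) word matrix, n x |V|
U = rng.normal(size=(vocab_size, n))   # output (center) word matrix, |V| x n

def one_hot(i, size):
    x = np.zeros(size)
    x[i] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())            # shift scores for numerical stability
    return e / e.sum()

context_ids = [0, 1, 3, 4]             # toy indices of the 2m context words

# steps 1-2: one-hot context vectors, embedded with V
context_vecs = [V @ one_hot(i, vocab_size) for i in context_ids]
# step 3: average the 2m context embeddings
v_ave = np.mean(context_vecs, axis=0)
# steps 4-5: score vector and softmax over the vocabulary
z = U @ v_ave
y_hat = softmax(z)                     # y_hat[c] ~ P(center word = c | context)
print(y_hat)
```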
how do we measure how well ŷ matches y? cross-entropy.
So our optimization objective is the cross-entropy H(ŷ, y). Since y is one-hot, it reduces to −log ŷ_c, the negative log probability of the correct center word given its context, and that probability comes from the softmax over dot products between the center word's output vector u_c and the averaged context v_ave.
So, the objective of updating the word vectors of the center word and the context words is to maximize the similarity (dot product) between the center word and its context.
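Written out (following the cs224n notes, with u_c the output vector of the correct center word), the quantity to minimize is:

```latex
J = -\log P(w_c \mid w_{c-m}, \dots, w_{c-1}, w_{c+1}, \dots, w_{c+m})
  = -u_c^{\top} v_{\text{ave}} + \log \sum_{j=1}^{|V|} \exp(u_j^{\top} v_{\text{ave}})
```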
We use stochastic gradient descent (SGD) to update: for each window of size m (2m context words around a center word), SGD computes the gradients and updates the parameters.
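With α the learning rate, the updates take the form

```latex
U \leftarrow U - \alpha \, \nabla_{U} J, \qquad V \leftarrow V - \alpha \, \nabla_{V} J
```

where, for V, only the 2m context word vectors in the window receive non-zero gradients.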
2.2 skip-gram
given the center word, predict its context words in the window
steps:
- generate the one-hot input vector x of the center word, a |V|x1 vector
- get our embedded word vector for the center word: v_c = V x, an nx1 vector
- generate a score vector: z = U v_c
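Mirroring the CBOW sketch above, a minimal sketch of the skip-gram forward pass for one window (it reuses the toy V, U, vocab_size, one_hot, and softmax defined there; the remaining steps turn z into probabilities and score the observed context words, which all share the same distribution):

```python
# assumes numpy (np), V, U, vocab_size, one_hot, softmax from the CBOW sketch above
center_id = 2                          # toy index of the center word
context_ids = [0, 1, 3, 4]             # the 2m surrounding words we try to predict

x = one_hot(center_id, vocab_size)     # one-hot input vector for the center word
v_c = V @ x                            # embedded center word, n x 1
z = U @ v_c                            # score vector, |V| x 1
y_hat = softmax(z)                     # y_hat[j] ~ P(context word = j | center word)

# skip-gram loss for this window: negative log probability of each observed
# context word (treated as conditionally independent given the center word)
loss = -sum(np.log(y_hat[j]) for j in context_ids)
print(loss)
```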