CS 224n Assignment #2: word2vec (written part)
Understanding word2vec
==The key insight behind word2vec is that ‘a word is known by the company it keeps’.== Concretely, suppose we have a ‘center’ word c and a contextual window surrounding c. We shall refer to words that lie in this contextual window as ‘outside words’. For example, in Figure 1 we see that the center word c is ‘banking’. Since the context window size is 2, the outside words are ‘turning’, ‘into’, ‘crises’, and ‘as’.
The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution $P(O \mid C)$. Given a specific word $o$ and a specific word $c$, we want to calculate $P(O = o \mid C = c)$, which is the probability that word $o$ is an ‘outside’ word for $c$, i.e., the probability that $o$ falls within the contextual window of $c$.
In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:
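$$P(O = o \mid C = c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^{\top} v_c)} \tag{1}$$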
Here, $u_o$ is the ‘outside’ vector representing outside word $o$, and $v_c$ is the ‘center’ vector representing center word $c$. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the ‘outside’ vectors $u_w$. The columns of $V$ are all of the ‘center’ vectors $v_w$. Both $U$ and $V$ contain a vector for every $w \in \text{Vocabulary}$.$^1$
Recall from lectures that, for a single pair of words c and o, the loss is given by:
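$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c) \tag{2}$$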
Another way to view this loss is as the cross-entropy$^2$ between the true distribution $y$ and the predicted distribution $\hat{y}$. Here, both $y$ and $\hat{y}$ are vectors with length equal to the number of words in the vocabulary. Furthermore, the $k^{th}$ entry in these vectors indicates the conditional probability of the $k^{th}$ word being an ‘outside word’ for the given $c$. The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word $o$, and 0 everywhere else. The predicted distribution $\hat{y}$ is the probability distribution $P(O \mid C = c)$ given by our model in equation (1).
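To make the shapes concrete, here is a minimal NumPy sketch (not from the original handout) that computes $\hat{y}$ and the naive-softmax loss for one (center, outside) pair, using the columns-are-vectors convention for $U$ described above:

```python
import numpy as np

def naive_softmax_loss(v_c, o, U):
    """Naive-softmax loss -log P(O = o | C = c) for one (center, outside) pair.

    v_c : (d,)   center word vector
    o   : int    index of the true outside word
    U   : (d, V) matrix whose columns are the outside vectors u_w
    """
    scores = U.T @ v_c                  # (V,) dot products u_w^T v_c
    scores -= scores.max()              # shift for numerical stability
    y_hat = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
    return -np.log(y_hat[o]), y_hat

# Toy example: embedding dimension d = 3, vocabulary of 5 words.
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 5))
v_c = rng.normal(size=3)
loss, y_hat = naive_softmax_loss(v_c, o=2, U=U)
```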
$^1$ Assume that every word in our vocabulary is matched to an integer number $k$. $u_k$ is both the $k^{th}$ column of $U$ and the ‘outside’ word vector for the word indexed by $k$. $v_k$ is both the $k^{th}$ column of $V$ and the ‘center’ word vector for the word indexed by $k$. In order to simplify notation we shall interchangeably use $k$ to refer to the word and the index-of-the-word.
$^2$ The Cross Entropy Loss between the true (discrete) probability distribution $p$ and another distribution $q$ is $-\sum_{i} p_i \log(q_i)$.
Questions and Answers
(a) (3 points) Show that the naive-softmax loss given in Equation (2) is the same as the cross-entropy loss between $y$ and $\hat{y}$; i.e., show that

$$-\sum_{w \in \text{Vocab}} y_w \log(\hat{y}_w) = -\log(\hat{y}_o) \tag{3}$$
Answer:
$y_w$ is one-hot (a column of the identity matrix): only the position of the true outside word $o$ is 1, and every other position is 0. Splitting off that one term:

$$-\sum_{w \in \text{Vocab}} y_w \log(\hat{y}_w) = -y_o \log(\hat{y}_o) - \sum_{\substack{w \in \text{Vocab} \\ w \neq o}} y_w \log(\hat{y}_w) = -\log(\hat{y}_o) - 0 = -\log(\hat{y}_o)$$
(b) (5 points) Compute the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to $v_c$. Please write your answer in terms of $y$, $\hat{y}$, and $U$.
Answer:
First, some clarifying notation:
- $U$ holds each word's vector when it acts as a context (outside) word;
- $V$ holds each word's vector when it acts as a center word;
- $y$ is the training label (the one-hot vector for the observed outside word);
- $\hat{y}$ is the model's predicted distribution for that training example.

Expanding the loss, $J_{\text{naive-softmax}} = -\log \hat{y}_o = -u_o^{\top} v_c + \log \sum_{w \in \text{Vocab}} \exp(u_w^{\top} v_c)$, and differentiating with respect to $v_c$:

$$\frac{\partial J_{\text{naive-softmax}}}{\partial v_c} = -u_o + \sum_{w \in \text{Vocab}} P(w \mid c)\, u_w$$

Since $y$ is one-hot, $u_o = U y$ picks out the outside vector of the word $o$ (recall that the columns of $U$ are the $u_w$), and since $P(w \mid c) = \hat{y}_w$ is our prediction, $\sum_w P(w \mid c)\, u_w = U \hat{y}$. The expression above therefore becomes:

$$\frac{\partial J_{\text{naive-softmax}}}{\partial v_c} = U(\hat{y} - y)$$
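A quick finite-difference check (a sketch building on the `naive_softmax_loss` helper and toy arrays defined above; not part of the original post) confirms this gradient:

```python
def grad_v_c(v_c, o, U):
    """Analytic gradient dJ/dv_c = U (y_hat - y) under the columns convention."""
    _, y_hat = naive_softmax_loss(v_c, o, U)
    y = np.zeros(U.shape[1])
    y[o] = 1.0
    return U @ (y_hat - y)

# Compare against a centered finite difference.
eps = 1e-6
numeric = np.zeros_like(v_c)
for i in range(v_c.size):
    step = np.zeros_like(v_c)
    step[i] = eps
    numeric[i] = (naive_softmax_loss(v_c + step, 2, U)[0]
                  - naive_softmax_loss(v_c - step, 2, U)[0]) / (2 * eps)
assert np.allclose(numeric, grad_v_c(v_c, 2, U), atol=1e-5)
```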
(c) (5 points) Compute the partial derivatives of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to each of the ‘outside’ word vectors, $u_w$'s. There will be two cases: when $w = o$, the true ‘outside’ word vector, and $w \neq o$, for all other words. Please write your answer in terms of $y$, $\hat{y}$, and $v_c$.
Answer:
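Expanding the loss as in part (b), $J = -u_o^{\top} v_c + \log \sum_{x \in \text{Vocab}} \exp(u_x^{\top} v_c)$, and differentiating with respect to a single outside vector $u_w$:

$$\frac{\partial J_{\text{naive-softmax}}}{\partial u_w} = -\mathbb{1}[w = o]\, v_c + \hat{y}_w v_c = (\hat{y}_w - y_w)\, v_c$$

So for the two cases:

$$\frac{\partial J}{\partial u_o} = (\hat{y}_o - 1)\, v_c, \qquad \frac{\partial J}{\partial u_w} = \hat{y}_w v_c \quad (w \neq o)$$

Stacking these as the columns of the full gradient, $\frac{\partial J}{\partial U} = v_c (\hat{y} - y)^{\top}$.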
(d) (3 points) The sigmoid function is given by Equation 4:
$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} \tag{4}$$
Please compute the derivative of σ(x) with respect to x, where x is a vector.
Answer:
$$\frac{\mathrm{d}\sigma(x)}{\mathrm{d}x} = \sigma(x)\big(1 - \sigma(x)\big)$$

where the product is taken element-wise, since $\sigma$ is applied element-wise to the vector $x$; the full Jacobian is the diagonal matrix $\mathrm{diag}\big(\sigma(x)(1 - \sigma(x))\big)$.
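This identity is easy to verify numerically element by element (a small sketch, reusing `np`, `rng`, and `eps` from the snippets above):

```python
def sigmoid(x):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

x = rng.normal(size=4)
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # centered difference
assert np.allclose(numeric, sigmoid(x) * (1 - sigmoid(x)), atol=1e-5)
```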
(e) (4 points) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \ldots, w_K$ and their outside vectors as $u_1, \ldots, u_K$. Note that $o \notin \{w_1, \ldots, w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:
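$$J_{\text{neg-sample}}(v_c, o, U) = -\log\big(\sigma(u_o^{\top} v_c)\big) - \sum_{k=1}^{K} \log\big(\sigma(-u_k^{\top} v_c)\big)$$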
for a sample $w_1, \ldots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.$^3$ Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$, and $u_k$, where $k \in [1, K]$. After you've done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (d) to help compute the necessary gradients here.
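Answer:
Using $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ from part (d) and the identity $1 - \sigma(-x) = \sigma(x)$:

$$\frac{\partial J_{\text{neg-sample}}}{\partial v_c} = \big(\sigma(u_o^{\top} v_c) - 1\big)\, u_o + \sum_{k=1}^{K} \sigma(u_k^{\top} v_c)\, u_k$$

$$\frac{\partial J_{\text{neg-sample}}}{\partial u_o} = \big(\sigma(u_o^{\top} v_c) - 1\big)\, v_c$$

$$\frac{\partial J_{\text{neg-sample}}}{\partial u_k} = \sigma(u_k^{\top} v_c)\, v_c, \qquad k \in [1, K]$$

This loss is much more efficient than naive-softmax because each gradient touches only the $K + 1$ vectors $u_o, u_1, \ldots, u_K$, i.e., a constant number of dot products, rather than a sum over the entire vocabulary.

A NumPy sketch of these gradients (reusing the `sigmoid` helper above; `neg_idx`, an array of the $K$ sampled indices, is an assumed name, not from the original):

```python
def neg_sample_grads(v_c, o, neg_idx, U):
    """Gradients of J_neg-sample; touches only K + 1 columns of U."""
    u_o = U[:, o]                   # (d,)   true outside vector
    U_k = U[:, neg_idx]             # (d, K) negative-sample vectors
    s_o = sigmoid(u_o @ v_c)        # scalar sigma(u_o^T v_c)
    s_k = sigmoid(U_k.T @ v_c)      # (K,)   sigma(u_k^T v_c)
    grad_v_c = (s_o - 1.0) * u_o + U_k @ s_k
    grad_u_o = (s_o - 1.0) * v_c
    grad_U_k = np.outer(v_c, s_k)   # column k is sigma(u_k^T v_c) * v_c
    return grad_v_c, grad_u_o, grad_U_k
```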
(f) (3 points) Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:
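$$J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) = \sum_{\substack{-m \le j \le m \\ j \neq 0}} J(v_c, w_{t+j}, U)$$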
Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$. $J(v_c, w_{t+j}, U)$ could be $J_{\text{naive-softmax}}(v_c, w_{t+j}, U)$ or $J_{\text{neg-sample}}(v_c, w_{t+j}, U)$, depending on your implementation.

Write down three partial derivatives:
(i) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) / \partial U$
(ii) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) / \partial v_c$
(iii) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) / \partial v_w$ when $w \neq c$

Write your answers in terms of $\partial J(v_c, w_{t+j}, U) / \partial U$ and $\partial J(v_c, w_{t+j}, U) / \partial v_c$.
Answer:
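Because the total loss is a sum of per-pair losses, each derivative distributes over the sum:

$$\frac{\partial J_{\text{skip-gram}}}{\partial U} = \sum_{\substack{-m \le j \le m \\ j \neq 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial U}$$

$$\frac{\partial J_{\text{skip-gram}}}{\partial v_c} = \sum_{\substack{-m \le j \le m \\ j \neq 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}$$

$$\frac{\partial J_{\text{skip-gram}}}{\partial v_w} = 0 \quad \text{when } w \neq c,$$

since every loss term in the window depends on $V$ only through the center vector $v_c$.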