CS 224n Assignment #2: word2vec (written part)

Understanding word2vec

==The key insight behind word2vec is that ‘a word is known by the company it keeps’.== Concretely, suppose we have a ‘center’ word $c$ and a contextual window surrounding $c$. We shall refer to words that lie in this contextual window as ‘outside words’. For example, in Figure 1 we see that the center word $c$ is ‘banking’. Since the context window size is 2, the outside words are ‘turning’, ‘into’, ‘crises’, and ‘as’.

The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution $P(O \mid C)$. Given a specific word $o$ and a specific word $c$, we want to calculate $P(O = o \mid C = c)$, which is the probability that word $o$ is an ‘outside’ word for $c$, i.e., the probability that $o$ falls within the contextual window of $c$.
In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:

$$P(O = o \mid C = c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)} \tag{1}$$

Here, $u_o$ is the ‘outside’ vector representing outside word $o$, and $v_c$ is the ‘center’ vector representing center word $c$. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the ‘outside’ vectors $u_w$. The columns of $V$ are all of the ‘center’ vectors $v_w$. Both $U$ and $V$ contain a vector for every $w \in \text{Vocabulary}$.$^1$
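
As a rough illustration of the softmax over dot products described above, the sketch below uses a made-up three-word vocabulary with 2-dimensional vectors. The function name `softmax_prob`, the variable names, and all numeric values are assumptions for illustration only; they are not part of the assignment.

```python
import math

def softmax_prob(outside_vectors, v_c):
    """Return P(O = w | C = c) for every word w in the vocabulary."""
    # Dot product u_w^T v_c for each outside vector u_w.
    scores = [sum(u_i * v_i for u_i, v_i in zip(u_w, v_c)) for u_w in outside_vectors]
    # Subtracting the max score avoids overflow and does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy data: each inner list is one outside vector u_w (a column of U).
outside_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_c = [0.5, -0.5]  # hypothetical center vector

y_hat = softmax_prob(outside_vectors, v_c)
```

The resulting list `y_hat` plays the role of the predicted distribution: its entries are positive and sum to 1, and the word whose outside vector has the largest dot product with `v_c` receives the largest probability.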

Recall from lectures that, for a single pair of words $c$ and $o$, the loss is given by:

$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c) \tag{2}$$

Another way to view this loss is as the cross-entropy$^2$ between the true distribution $y$ and the predicted distribution $\hat y$. Here, both $y$ and $\hat y$ are vectors with length equal to the number of words in the vocabulary. Furthermore, the $k^{th}$ entry in these vectors indicates the conditional probability of the $k^{th}$ word being an ‘outside word’ for the given $c$. The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word $o$, and 0 everywhere else. The predicted distribution $\hat y$ is the probability distribution $P(O \mid C = c)$ given by our model in equation (1).

$^1$ Assume that every word in our vocabulary is matched to an integer number $k$. $u_k$ is both the $k^{th}$ column of $U$ and the ‘outside’ word vector for the word indexed by $k$. $v_k$ is both the $k^{th}$ column of $V$ and the ‘center’ word vector for the word indexed by $k$. In order to simplify notation we shall interchangeably use $k$ to refer to the word and the index-of-the-word.
$^2$ The cross-entropy loss between the true (discrete) probability distribution $p$ and another distribution $q$ is $-\sum_{i} p_i \log(q_i)$.

Question and Answer

(a) (3 points) Show that the naive-softmax loss given in Equation (2) is the same as the cross-entropy loss between $y$ and $\hat y$; i.e., show that

$$-\sum_{w \in \text{Vocab}} y_w \log(\hat y_w) = -\log(\hat y_o) \tag{3}$$

$y$ is a column of the identity matrix (one-hot): it is 1 only at the position of the true outside word $o$ and 0 everywhere else. Every term with $w \neq o$ therefore vanishes:

$$-\sum_{w\in \text{Vocab}} y_w\log(\hat y_w) = -\sum_{w \in \text{Vocab},\, w\neq o} 0 \cdot \log(\hat y_w) - y_o\log(\hat y_o) = 0 - \log(\hat y_o) = -\log(\hat y_o)$$
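
The identity in part (a) can be checked numerically. The sketch below is only an illustration under assumed values: the prediction `y_hat`, the index `o`, and the helper name `cross_entropy` are hypothetical.

```python
import math

def cross_entropy(y, y_hat):
    """Cross-entropy -sum_w y_w * log(y_hat_w); terms with y_w = 0 contribute nothing."""
    return -sum(y_w * math.log(p) for y_w, p in zip(y, y_hat) if y_w != 0)

y_hat = [0.7, 0.1, 0.2]  # hypothetical model prediction over a 3-word vocabulary
o = 2                    # hypothetical index of the true outside word

# One-hot true distribution: 1 at position o, 0 elsewhere.
y = [1.0 if w == o else 0.0 for w in range(len(y_hat))]

ce = cross_entropy(y, y_hat)
naive_softmax_loss = -math.log(y_hat[o])
```

Because only the term at index `o` survives the sum, `ce` and `naive_softmax_loss` agree.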

(b) (5 points) Compute the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to $v_c$. Please write your answer in terms of $y$, $\hat y$, and $U$.


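  1. $U$ holds each word's coordinates when it acts as a context (outside) word;
  2. $V$ holds each word's coordinates when it acts as a center word;
  3. $y$ is the input (taken from the training set);
  4. $\hat y$ is the estimated output (the prediction for that training example).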

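Write the loss as $J_{\text{naive-softmax}} = -u_o^\top v_c + \log \sum_w \exp(u_w^\top v_c)$. Differentiating with respect to $v_c$ gives $-u_o + \sum_w P(w \mid c)\, u_w$, where $P(w \mid c)$ is the model's probability distribution for the given center word $c$. Since $y$ is one-hot, $u_o = Uy$ picks out the vector of word $o$ (the $o$-th column of $U$), and since $P(w \mid c) = \hat y_w$ is our prediction, $\sum_w \hat y_w u_w = U\hat y$. The expression therefore becomes:
$$\frac{\partial J_{\text{naive-softmax}}}{\partial v_c} = U(\hat y - y)$$
This uses the convention stated earlier that the outside vectors are the columns of $U$; if they were stored as rows instead, the same result would read $U^\top(\hat y - y)$.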
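
A finite-difference check is one way to gain confidence in this gradient. The sketch below treats the outside vectors as the columns of $U$, so the gradient with respect to $v_c$ is the weighted sum $\sum_w (\hat y_w - y_w)\, u_w$. All names (`loss_and_grad`, `numeric_grad`, `U_cols`) and the toy values are assumptions made for illustration.

```python
import math

def loss_and_grad(U_cols, v_c, o):
    """Naive-softmax loss and its gradient w.r.t. v_c.

    U_cols[w] is the outside vector u_w, i.e. column w of U.
    """
    scores = [sum(a * b for a, b in zip(u, v_c)) for u in U_cols]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    y_hat = [e / total for e in exps]
    loss = -math.log(y_hat[o])
    # (y_hat - y), with y one-hot at index o.
    diff = [p - (1.0 if w == o else 0.0) for w, p in enumerate(y_hat)]
    # Weighted sum of the columns u_w with weights (y_hat_w - y_w).
    grad = [sum(d * u[i] for d, u in zip(diff, U_cols)) for i in range(len(v_c))]
    return loss, grad

def numeric_grad(U_cols, v_c, o, eps=1e-6):
    """Central finite-difference estimate of dJ/dv_c."""
    grad = []
    for i in range(len(v_c)):
        plus = list(v_c)
        plus[i] += eps
        minus = list(v_c)
        minus[i] -= eps
        delta = loss_and_grad(U_cols, plus, o)[0] - loss_and_grad(U_cols, minus, o)[0]
        grad.append(delta / (2 * eps))
    return grad

# Hypothetical toy data.
U_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_c = [0.5, -0.5]
o = 1

_, analytic = loss_and_grad(U_cols, v_c, o)
numeric = numeric_grad(U_cols, v_c, o)
```

If the analytic expression is right, `analytic` and `numeric` should agree to within the finite-difference error.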

(c) (5 points) Compute the partial derivatives of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to each of the ‘outside’ word vectors, $u_w$'s. There will be two cases: when $w = o$, the true ‘outside’ word vector, and $w \neq o$, for all other words. Please write your answer in terms of $y$, $\hat y$, and $v_c$.

(d) (3 Points) The sigmoid function is given by Equation 4:
$$\sigma(x)=\frac{1}{1+e^{-x}}=\frac{e^x}{e^x+1} \tag{4}$$
Please compute the derivative of σ(x) with respect to x, where x is a vector.

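$$\frac{{\rm d}\sigma(x)}{{\rm d}x}=\sigma(x)(1-\sigma(x))$$

When $x$ is a vector, $\sigma$ is applied elementwise, so this holds for each component and the Jacobian is the diagonal matrix $\mathrm{diag}\big(\sigma(x) \odot (1-\sigma(x))\big)$.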
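
The derivative can be sanity-checked against a central finite difference. This is a minimal sketch; the helper names `sigmoid` and `sigmoid_grad` and the sample points are assumptions for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^(-x)) for a scalar x."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = [-2.0, 0.0, 3.0]  # hypothetical sample points; each is treated elementwise
analytic = [sigmoid_grad(x) for x in xs]

eps = 1e-6
numeric = [(sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps) for x in xs]
```

At $x = 0$ the sigmoid equals $0.5$, so the derivative there is $0.5 \cdot 0.5 = 0.25$, which is also its maximum.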

(e) (4 points) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \dots, w_K$ and their outside vectors as $u_1, \dots, u_K$. Note that $o \notin \{w_1, \dots, w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:

$$J_{\text{neg-sample}}(v_c, o, U) = -\log\big(\sigma(u_o^\top v_c)\big) - \sum_{k=1}^{K} \log\big(\sigma(-u_k^\top v_c)\big) \tag{5}$$

for a sample $w_1, \dots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.$^3$ Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$, and $u_k$, where $k \in [1, K]$. After you’ve done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (d) to help compute the necessary gradients here.
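
To make the cost of this loss concrete, the sketch below evaluates it for one center word, assuming the standard form $-\log\sigma(u_o^\top v_c) - \sum_{k=1}^{K}\log\sigma(-u_k^\top v_c)$. The function name `neg_sampling_loss`, the helper `dot`, and the toy vectors are hypothetical; only the loss is computed, not the gradients asked for in the question.

```python
import math

def sigmoid(x):
    """Logistic sigmoid for a scalar x."""
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(v_c, u_o, negative_vectors):
    """Negative sampling loss for center vector v_c, true outside vector u_o,
    and the outside vectors of K sampled negative words."""
    loss = -math.log(sigmoid(dot(u_o, v_c)))
    for u_k in negative_vectors:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss

# Hypothetical toy data with K = 2 negative samples.
v_c = [0.5, -0.5]
u_o = [1.0, 0.0]
negatives = [[0.0, 1.0], [1.0, 1.0]]

loss = neg_sampling_loss(v_c, u_o, negatives)
```

Note that the function touches only $K + 1$ outside vectors, whereas the softmax denominator requires a dot product with every vector in the vocabulary.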

(f) (3 points) Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, \dots, w_{t-1}, w_t, w_{t+1}, \dots, w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:

$$J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U) = \sum_{\substack{-m \le j \le m \\ j \neq 0}} J(v_c, w_{t+j}, U) \tag{6}$$

Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$. $J(v_c, w_{t+j}, U)$ could be $J_{\text{naive-softmax}}(v_c, w_{t+j}, U)$ or $J_{\text{neg-sample}}(v_c, w_{t+j}, U)$, depending on your implementation.
Write down three partial derivatives:

  • 1
