CS 224n Assignment #2: word2vec (written part)

Understanding word2vec

The key insight behind word2vec is that ‘a word is known by the company it keeps’. Concretely, suppose we have a ‘center’ word c and a contextual window surrounding c. We shall refer to words that lie in this contextual window as ‘outside words’. For example, in Figure 1 we see that the center word c is ‘banking’. Since the context window size is 2, the outside words are ‘turning’, ‘into’, ‘crises’, and ‘as’.

The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution $P(O|C)$. Given a specific word o and a specific word c, we want to calculate $P(O = o|C = c)$, which is the probability that word o is an ‘outside’ word for c, i.e., the probability that o falls within the contextual window of c.
[Figure 1: The word2vec skip-gram prediction model, shown with center word ‘banking’ and window size 2]
In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:
$$P(O = o \mid C = c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^{\top} v_c)} \qquad (1)$$
Here, $u_o$ is the ‘outside’ vector representing outside word o, and $v_c$ is the ‘center’ vector representing center word c. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the ‘outside’ vectors $u_w$. The columns of $V$ are all of the ‘center’ vectors $v_w$. Both $U$ and $V$ contain a vector for every $w \in \text{Vocabulary}$.¹
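As a concrete illustration of equation (1), here is a minimal NumPy sketch of the predicted distribution $\hat y = P(O \mid C = c)$. The function name and the shape convention (rows of `outside_vectors` as the $u_w$, matching the coding part of the assignment rather than the column convention stated above) are my own assumptions.

```python
import numpy as np

def naive_softmax_probs(center_vec, outside_vectors):
    """P(O | C = c) from equation (1).

    center_vec:      (d,)      the center vector v_c
    outside_vectors: (|V|, d)  row w is the outside vector u_w
    Returns a (|V|,) vector y_hat with y_hat[w] = P(O = w | C = c).
    """
    scores = outside_vectors @ center_vec   # u_w^T v_c for every w
    scores -= scores.max()                  # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```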

Recall from lectures that, for a single pair of words c and o, the loss is given by:

$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c) \qquad (2)$$
Another way to view this loss is as the cross-entropy² between the true distribution $y$ and the predicted distribution $\hat y$. Here, both $y$ and $\hat y$ are vectors with length equal to the number of words in the vocabulary. Furthermore, the $k^{\text{th}}$ entry in these vectors indicates the conditional probability of the $k^{\text{th}}$ word being an ‘outside word’ for the given c. The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word o, and 0 everywhere else. The predicted distribution $\hat y$ is the probability distribution $P(O|C = c)$ given by our model in equation (1).


¹Assume that every word in our vocabulary is matched to an integer number k. $u_k$ is both the $k^{\text{th}}$ column of $U$ and the ‘outside’ word vector for the word indexed by k. $v_k$ is both the $k^{\text{th}}$ column of $V$ and the ‘center’ word vector for the word indexed by k. In order to simplify notation we shall interchangeably use k to refer to the word and the index-of-the-word.
²The Cross Entropy Loss between the true (discrete) probability distribution p and another distribution q is $-\sum_{i} p_i \log(q_i)$.

Question and Answer

(a) (3 points) Show that the naive-softmax loss given in Equation (2) is the same as the cross-entropy loss between $y$ and $\hat y$; i.e., show that
$$-\sum_{w \in \text{Vocab}} y_w \log(\hat y_w) = -\log(\hat y_o) \qquad (3)$$
Answer:
$y$ is a column of the identity matrix (one-hot), so $y_w$ is 1 only at the position of the true outside word o and 0 everywhere else:
$$-\sum_{w \in \text{Vocab}} y_w \log(\hat y_w) = -y_o \log(\hat y_o) - \sum_{\substack{w \in \text{Vocab} \\ w \neq o}} y_w \log(\hat y_w) = -\log(\hat y_o) - 0 = -\log(\hat y_o)$$
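A quick numeric sanity check of this identity, reusing the hypothetical `naive_softmax_probs` sketch above (the vocabulary size, dimension, and word index below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, o = 8, 5, 3

center_vec = rng.normal(size=dim)
outside_vectors = rng.normal(size=(vocab_size, dim))

y_hat = naive_softmax_probs(center_vec, outside_vectors)
y = np.eye(vocab_size)[o]                   # one-hot true distribution

cross_entropy = -np.sum(y * np.log(y_hat))  # -sum_w y_w log(y_hat_w)
naive_softmax_loss = -np.log(y_hat[o])      # -log(y_hat_o)
assert np.isclose(cross_entropy, naive_softmax_loss)
```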

(b) (5 points) Compute the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to $v_c$. Please write your answer in terms of $y$, $\hat y$, and $U$.

Answer:
First, some clarification:

  1. $U$ holds each word's vector in its role as an outside (context) word;
  2. $V$ holds each word's vector in its role as a center word;
  3. $y$ is the true label from the training data (one-hot);
  4. $\hat y$ is the model's predicted distribution for this training example;

$$\frac{\partial J_{\text{naive-softmax}}}{\partial v_c} = \frac{\partial}{\partial v_c}\left(-u_o^{\top} v_c + \log \sum_{w \in \text{Vocab}} \exp(u_w^{\top} v_c)\right) = -u_o + \sum_{w \in \text{Vocab}} P(w \mid c)\, u_w$$
Here $u_o = U^{\top} y$ picks out the vector of word o (because $y$ is one-hot), and in $\sum_w u_w P(w \mid c)$ the term $P(w \mid c)$ is the predicted distribution for the given center word c, i.e. $P(w \mid c) = \hat y_w$, so $\sum_w u_w P(w \mid c) = U^{\top} \hat y$. The expression above therefore becomes:
$$\frac{\partial J_{\text{naive-softmax}}}{\partial v_c} = U^{\top}(\hat y - y)$$
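Below is a minimal NumPy sketch of this gradient together with a finite-difference check; the function name and the convention that rows of `U` are the outside vectors (as in the coding part of the assignment) are my own assumptions.

```python
import numpy as np

def grad_center_vec(y, y_hat, U):
    # dJ_naive-softmax/dv_c = U^T (y_hat - y); rows of U are the u_w
    return U.T @ (y_hat - y)

# Finite-difference check of the formula on a small random example
rng = np.random.default_rng(1)
vocab_size, dim, o = 8, 5, 3
v_c = rng.normal(size=dim)
U = rng.normal(size=(vocab_size, dim))

def loss(v):
    scores = U @ v
    return -scores[o] + np.log(np.sum(np.exp(scores)))

y = np.eye(vocab_size)[o]
y_hat = np.exp(U @ v_c) / np.sum(np.exp(U @ v_c))

analytic = grad_center_vec(y, y_hat, U)
eps = 1e-6
numeric = np.array([(loss(v_c + eps * e) - loss(v_c - eps * e)) / (2 * eps)
                    for e in np.eye(dim)])
assert np.allclose(analytic, numeric, atol=1e-5)
```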

(c) (5 points) Compute the partial derivatives of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to each of the ‘outside’ word vectors, $u_w$’s. There will be two cases: when $w = o$, the true ‘outside’ word vector, and $w \neq o$, for all other words. Please write your answer in terms of $y$, $\hat y$, and $v_c$.

Answer:
For $w = o$:
$$\frac{\partial J_{\text{naive-softmax}}}{\partial u_o} = (\hat y_o - 1)\, v_c$$
For $w \neq o$:
$$\frac{\partial J_{\text{naive-softmax}}}{\partial u_w} = \hat y_w\, v_c$$
Both cases can be written together as $\frac{\partial J_{\text{naive-softmax}}}{\partial u_w} = (\hat y_w - y_w)\, v_c$.
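Stacking the per-word gradients gives the full matrix derivative; here is a small NumPy sketch under the same assumed row convention for `U` (the helper name is hypothetical):

```python
import numpy as np

def grad_outside_vectors(y, y_hat, v_c):
    # dJ/du_w = (y_hat_w - y_w) v_c for every w, stacked as rows:
    # dJ/dU = (y_hat - y) v_c^T, with shape (|V|, d)
    return np.outer(y_hat - y, v_c)
```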
(d) (3 points) The sigmoid function is given by Equation 4:
$$\sigma(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{e^x+1} \qquad (4)$$
Please compute the derivative of σ(x) with respect to x, where x is a vector.

Answer:
$$\frac{\mathrm{d}\sigma(x)}{\mathrm{d}x} = \sigma(x)\,\big(1-\sigma(x)\big)$$
where the product is taken elementwise, since x is a vector.
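A minimal NumPy sketch of the sigmoid and its elementwise derivative (the function names are my own):

```python
import numpy as np

def sigmoid(x):
    """Elementwise sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Elementwise derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```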

(e) (4 points) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that K negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \dots, w_K$ and their outside vectors as $u_1, \dots, u_K$. Note that $o \notin \{w_1, \dots, w_K\}$. For a center word c and an outside word o, the negative sampling loss function is given by:
$$J_{\text{neg-sample}}(v_c, o, U) = -\log\big(\sigma(u_o^{\top} v_c)\big) - \sum_{k=1}^{K} \log\big(\sigma(-u_k^{\top} v_c)\big)$$
for a sample $w_1, \dots, w_K$, where σ(·) is the sigmoid function.³ Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$, and $u_k$, where $k \in [1, K]$. After you’ve done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (d) to help compute the necessary gradients here.

Answer:
$$\frac{\partial J_{\text{neg-sample}}}{\partial v_c} = \big(\sigma(u_o^{\top} v_c) - 1\big)\, u_o + \sum_{k=1}^{K} \big(1 - \sigma(-u_k^{\top} v_c)\big)\, u_k$$
$$\frac{\partial J_{\text{neg-sample}}}{\partial u_o} = \big(\sigma(u_o^{\top} v_c) - 1\big)\, v_c$$
$$\frac{\partial J_{\text{neg-sample}}}{\partial u_k} = \big(1 - \sigma(-u_k^{\top} v_c)\big)\, v_c, \qquad k \in [1, K]$$
This loss is much more efficient to compute than the naive-softmax loss because each evaluation only involves the K + 1 outside vectors $u_o, u_1, \dots, u_K$, instead of summing over every word in the vocabulary.
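A NumPy sketch of the negative sampling loss and all three gradients; the function name, argument layout, and the row convention for `U` are assumptions of mine:

```python
import numpy as np

def neg_sampling_loss_and_grads(v_c, o, U, neg_indices):
    """Negative sampling loss for one (center, outside) pair.

    v_c:         (d,)      center vector
    o:           int       index of the true outside word
    U:           (|V|, d)  rows are the outside vectors u_w
    neg_indices: K sampled word indices, none equal to o
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    u_o = U[o]                      # (d,)
    u_neg = U[neg_indices]          # (K, d)

    s_o = sigmoid(u_o @ v_c)        # sigma(u_o^T v_c)
    s_neg = sigmoid(-u_neg @ v_c)   # sigma(-u_k^T v_c) for each k

    loss = -np.log(s_o) - np.sum(np.log(s_neg))

    grad_v_c = (s_o - 1.0) * u_o + (1.0 - s_neg) @ u_neg
    grad_u_o = (s_o - 1.0) * v_c
    grad_u_neg = np.outer(1.0 - s_neg, v_c)   # row k is dJ/du_k

    return loss, grad_v_c, grad_u_o, grad_u_neg
```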
(f) (3 points) Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, \dots, w_{t-1}, w_t, w_{t+1}, \dots, w_{t+m}]$, where m is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:
$$J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U) = \sum_{\substack{-m \le j \le m \\ j \neq 0}} J(v_c, w_{t+j}, U)$$
Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$. $J(v_c, w_{t+j}, U)$ could be $J_{\text{naive-softmax}}(v_c, w_{t+j}, U)$ or $J_{\text{neg-sample}}(v_c, w_{t+j}, U)$, depending on your implementation.
Write down three partial derivatives:
(i) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)\, /\, \partial U$
(ii) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)\, /\, \partial v_c$
(iii) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)\, /\, \partial v_w$ when $w \neq c$
Write your answers in terms of $\partial J(v_c, w_{t+j}, U)\, /\, \partial U$ and $\partial J(v_c, w_{t+j}, U)\, /\, \partial v_c$.
Answer:
$$\frac{\partial J_{\text{skip-gram}}}{\partial U} = \sum_{\substack{-m \le j \le m \\ j \neq 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial U}$$
$$\frac{\partial J_{\text{skip-gram}}}{\partial v_c} = \sum_{\substack{-m \le j \le m \\ j \neq 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}$$
$$\frac{\partial J_{\text{skip-gram}}}{\partial v_w} = 0 \quad \text{for } w \neq c$$
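A short NumPy sketch of how these window-level gradients accumulate in practice; the function name and the `loss_and_grads` callback signature (returning the per-pair loss, $\partial J/\partial v_c$, and $\partial J/\partial U$) are assumptions of mine:

```python
import numpy as np

def skipgram_loss_and_grads(center_idx, outside_indices, V, U, loss_and_grads):
    """Accumulate loss and gradients over one context window.

    center_idx:      index t of the center word
    outside_indices: indices of the 2m outside words w_{t+j}, j != 0
    V, U:            (|Vocab|, d) center / outside vector matrices
    loss_and_grads:  per-pair loss (naive-softmax or neg-sampling),
                     returning (loss, dJ/dv_c, dJ/dU)
    """
    total_loss = 0.0
    grad_V = np.zeros_like(V)   # stays zero except the row of v_c
    grad_U = np.zeros_like(U)

    v_c = V[center_idx]
    for o in outside_indices:
        loss, d_vc, d_U = loss_and_grads(v_c, o, U)
        total_loss += loss
        grad_V[center_idx] += d_vc   # only the center vector gets gradient
        grad_U += d_U

    return total_loss, grad_V, grad_U
```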
