I have split my Assignment 1 solutions into 4 parts, covering problems 1, 2, 3, and 4 respectively. This part contains the solution to problem 3.
3. word2vec (40 points + 5 bonus)
(a). (3 points) Assume you are given a predicted word vector $v_c$ corresponding to the center word $c$ for skipgram, and word prediction is made with the softmax function found in word2vec models

$$\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{W} \exp(u_w^\top v_c)}$$

where $w$ denotes the $w$-th word and $u_w$ ($w = 1, \ldots, W$) are the "output" word vectors for all words in the vocabulary. Assume cross entropy cost is applied to this prediction and word $o$ is the expected word; derive the gradients with respect to $v_c$.

Hint: It will be helpful to use notation from question 2. For instance, letting $\hat{y}$ be the vector of softmax predictions for every word, $y$ the expected word vector (a one-hot vector), and the loss function

$$J_{\text{softmax-CE}}(o, v_c, U) = \mathrm{CE}(y, \hat{y}),$$

where $U = [u_1, u_2, \ldots, u_W]$ is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.
Solution: Let the word-vector dimension be $n_{dim}$, and take all word vectors to be column vectors, so $v_c$ has shape $n_{dim} \times 1$ and $U$ has shape $n_{dim} \times W$. Let $\theta = U^\top v_c$, so that $\hat{y} = \mathrm{softmax}(\theta)$. From question 2 we know $\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial \theta} = \hat{y} - y$, so by the chain rule:

$$\frac{\partial J}{\partial v_c} = \frac{\partial \theta}{\partial v_c}\,\frac{\partial J}{\partial \theta} = U(\hat{y} - y).$$
(b). (3 points) As in the previous part, derive gradients for the "output" word vectors $u_w$ (including $u_o$).
Solution: As in (a), let $\theta = U^\top v_c$, so that $\frac{\partial J}{\partial \theta} = \hat{y} - y$. Here $\theta_k = u_k^\top v_c$, so

$$\frac{\partial \theta_k}{\partial U_{ij}} = \begin{cases} v_i & j = k \\ 0 & j \neq k \end{cases}$$

where $v_i$ denotes the $i$-th element of $v_c$. Then:

$$\frac{\partial J}{\partial U_{ij}} = \sum_k \frac{\partial J}{\partial \theta_k}\,\frac{\partial \theta_k}{\partial U_{ij}} = (\hat{y}_j - y_j)\, v_i.$$

Therefore:

$$\frac{\partial J}{\partial U} = v_c\,(\hat{y} - y)^\top.$$
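To make (a) and (b) concrete, here is a minimal numpy sketch of these two gradients. It is my own illustration, not the assignment's starter code (`softmax_ce_grads` is a hypothetical name, and the real q3 code may store output vectors in a different orientation); it follows the column-vector convention used above, where $U$ is $n_{dim} \times W$:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_ce_grads(v_c, U, o):
    """Softmax-CE loss and gradients from parts (a) and (b).

    v_c : (n_dim,)   center word vector
    U   : (n_dim, W) columns are the output vectors u_w
    o   : index of the expected (target) word
    """
    theta = U.T @ v_c              # theta_k = u_k^T v_c, shape (W,)
    y_hat = softmax(theta)         # predicted distribution, shape (W,)
    loss = -np.log(y_hat[o])       # cross entropy with one-hot y
    delta = y_hat.copy()
    delta[o] -= 1.0                # y_hat - y
    grad_vc = U @ delta            # dJ/dv_c = U (y_hat - y), shape (n_dim,)
    grad_U = np.outer(v_c, delta)  # dJ/dU = v_c (y_hat - y)^T, shape (n_dim, W)
    return loss, grad_vc, grad_U
```

Checking both outputs against a finite-difference gradient check, as in question 2, is a quick way to validate the derivation.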
(c). (6 points) Repeat part (a) and (b) assuming we are using the negative sampling loss for the predicted vector $v_c$, and the expected output word is $o$. Assume that $K$ negative samples (words) are drawn and, for simplicity of notation, that their indices are $1, \ldots, K$ ($o \notin \{1, \ldots, K\}$). Again, for a given word $o$, denote its output vector as $u_o$. The negative sampling loss function in this case is

$$J_{\text{neg-sample}}(o, v_c, U) = -\log\bigl(\sigma(u_o^\top v_c)\bigr) - \sum_{k=1}^{K} \log\bigl(\sigma(-u_k^\top v_c)\bigr),$$

where $\sigma(\cdot)$ is the sigmoid function.
After you’ve done this, describe with one sentence why this cost function is much more efficient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e. the runtime of the softmax-CE loss divided by the runtime of the negative sampling loss).
Note: the cost function here is the negative of what Mikolov et al had in their original paper, because we are doing a minimization instead of maximization in our code.
Solution: Let the set of the $K$ sampled indices be $S = \{1, \ldots, K\}$. Using $\sigma(-x) = 1 - \sigma(x)$ and $\sigma'(x) = \sigma(x)(1 - \sigma(x))$:

$$\frac{\partial J}{\partial v_c} = \bigl(\sigma(u_o^\top v_c) - 1\bigr)\, u_o + \sum_{k \in S} \sigma(u_k^\top v_c)\, u_k$$

$$\frac{\partial J}{\partial u_o} = \bigl(\sigma(u_o^\top v_c) - 1\bigr)\, v_c, \qquad \frac{\partial J}{\partial u_k} = \sigma(u_k^\top v_c)\, v_c \quad (k \in S),$$

and $\frac{\partial J}{\partial u_w} = 0$ for every word $w \notin S \cup \{o\}$. The negative sampling loss is much more efficient to compute than softmax-CE because $\frac{\text{runtime of softmax-CE}}{\text{runtime of negative sampling loss}} = \frac{O(W)}{O(K)}$: softmax-CE touches all $W$ output vectors, while negative sampling touches only $K + 1$ of them. (I'm not sure this statement is fully precise; corrections from experts are welcome.)
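Here is a small numpy sketch of these gradients as well, under the same column-vector convention and assumptions as the previous snippet (`neg_sampling_grads` and `neg_idx` are my own hypothetical names; `neg_idx` stands in for the sampled set $S$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_grads(v_c, U, o, neg_idx):
    """Negative-sampling loss and gradients from part (c).

    v_c     : (n_dim,)   center word vector
    U       : (n_dim, W) columns are the output vectors u_w
    o       : index of the expected word
    neg_idx : the K sampled negative indices (o not among them)
    """
    grad_U = np.zeros_like(U)
    s_o = sigmoid(U[:, o] @ v_c)             # sigma(u_o^T v_c)
    loss = -np.log(s_o)
    grad_vc = (s_o - 1.0) * U[:, o]          # (sigma(u_o^T v_c) - 1) u_o
    grad_U[:, o] = (s_o - 1.0) * v_c
    for k in neg_idx:
        s_k = sigmoid(U[:, k] @ v_c)         # sigma(u_k^T v_c) = 1 - sigma(-u_k^T v_c)
        loss -= np.log(1.0 - s_k)            # -log(sigma(-u_k^T v_c))
        grad_vc += s_k * U[:, k]
        grad_U[:, k] += s_k * v_c            # += so repeated samples accumulate
    return loss, grad_vc, grad_U
```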
(d). (8 points) Derive gradients for all of the word vectors for skip-gram and CBOW given the previous parts and given a set of context words $[\text{word}_{c-m}, \ldots, \text{word}_{c-1}, \text{word}_c, \text{word}_{c+1}, \ldots, \text{word}_{c+m}]$, where $m$ is the context size. Denote the "input" and "output" word vectors for $\text{word}_k$ as $v_k$ and $u_k$ respectively.

Hint: feel free to use $F(o, v_c)$ (where $o$ is the expected word) as a placeholder for the $J_{\text{softmax-CE}}(o, v_c, \ldots)$ or $J_{\text{neg-sample}}(o, v_c, \ldots)$ cost functions in this part; you'll see that this is a useful abstraction for the coding part.

Recall that for skip-gram, the cost for a context centered around $c$ is

$$J_{\text{skip-gram}}(\text{word}_{c-m \ldots c+m}) = \sum_{-m \le j \le m,\ j \ne 0} F(w_{c+j}, v_c),$$

where $w_{c+j}$ refers to the word at the $j$-th index from the center.

CBOW is slightly different. Instead of using $v_c$ as the predicted vector, we use $\hat{v}$, the sum of the input word vectors in the context:

$$\hat{v} = \sum_{-m \le j \le m,\ j \ne 0} v_{c+j};$$

then the CBOW cost is

$$J_{\text{CBOW}}(\text{word}_{c-m \ldots c+m}) = F(w_c, \hat{v}).$$

Note: To be consistent with the $\hat{v}$ notation such as for the code portion, for skip-gram $\hat{v} = v_c$.
Solution: Let $v_k$ and $u_k$ be the input and output vectors for word $k$, respectively.
Answer for skip-gram (a numpy sketch follows below):

$$\frac{\partial J_{\text{skip-gram}}}{\partial v_c} = \sum_{-m \le j \le m,\ j \ne 0} \frac{\partial F(w_{c+j}, v_c)}{\partial v_c}, \qquad \frac{\partial J_{\text{skip-gram}}}{\partial v_j} = 0 \quad (j \ne c),$$

$$\frac{\partial J_{\text{skip-gram}}}{\partial U} = \sum_{-m \le j \le m,\ j \ne 0} \frac{\partial F(w_{c+j}, v_c)}{\partial U},$$

where $w_{c+j}$ is the one-hot vector of the word at the $j$-th position from the center.
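As a sketch of the skip-gram accumulation (my own illustration under the same assumptions as before; `skipgram_grads` and `cost_fn` are hypothetical names, with `cost_fn` standing in for $F$, e.g. `softmax_ce_grads` above; a negative-sampling $F$ would additionally need the sampled indices):

```python
import numpy as np

def skipgram_grads(c, context_idx, V, U, cost_fn):
    """Skip-gram loss and gradients for one window, from part (d).

    c           : index of the center word
    context_idx : indices of the 2m context words (center excluded)
    V, U        : (n_dim, W) input / output vector matrices (columns)
    cost_fn     : F with the signature of softmax_ce_grads above
    """
    loss = 0.0
    grad_V = np.zeros_like(V)   # only column c receives any gradient
    grad_U = np.zeros_like(U)
    for o in context_idx:
        l, g_vc, g_U = cost_fn(V[:, c], U, o)
        loss += l
        grad_V[:, c] += g_vc    # dJ/dv_c sums over the window
        grad_U += g_U
    return loss, grad_V, grad_U
```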
Answer for CBOW:

$$\frac{\partial J_{\text{CBOW}}}{\partial U} = \frac{\partial F(w_c, \hat{v})}{\partial U},$$

$$\frac{\partial J_{\text{CBOW}}}{\partial v_j} = \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}} \quad (c - m \le j \le c + m,\ j \ne c), \qquad \frac{\partial J_{\text{CBOW}}}{\partial v_j} = 0 \quad \text{otherwise}.$$
PS: this answer looks really simple; why is it worth 8 points?
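And a matching sketch for CBOW under the same assumptions (`cbow_grads` is again a hypothetical name; note how the single gradient with respect to $\hat{v}$ is copied to every context word, which is exactly the answer above):

```python
import numpy as np

def cbow_grads(c, context_idx, V, U, cost_fn):
    """CBOW loss and gradients for one window, from part (d).

    The predicted vector is v_hat, the sum of the context input
    vectors; each context word receives the same gradient dF/dv_hat,
    and the center word's input vector receives none.
    """
    v_hat = V[:, context_idx].sum(axis=1)    # (n_dim,)
    loss, g_vhat, grad_U = cost_fn(v_hat, U, c)
    grad_V = np.zeros_like(V)
    for j in context_idx:
        grad_V[:, j] += g_vhat               # += so repeated context words accumulate
    return loss, grad_V, grad_U
```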
(e)(f)(g)(h). See the code; omitted here.
Attached is a plot from my training run, i.e., the figure that appears after running q3_run.py; there is a discussion on reddit about how to judge whether this plot looks reasonable: