CS 224n Assignment #2: word2vec (written part)
Understanding word2vec
==The key insight behind word2vec is that ‘a word is known by the company it keeps’.== Concretely, suppose we have a ‘center’ word c and a contextual window surrounding c. We shall refer to words that lie in this contextual window as ‘outside words’. For example, in Figure 1 we see that the center word c is ‘banking’. Since the context window size is 2, the outside words are ‘turning’, ‘into’, ‘crises’, and ‘as’.
The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution $P(O \mid C)$. Given a specific word $o$ and a specific word $c$, we want to calculate $P(O = o \mid C = c)$, which is the probability that word $o$ is an ‘outside’ word for $c$, i.e., the probability that $o$ falls within the contextual window of $c$.
In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:
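$$P(O = o \mid C = c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^{\top} v_c)} \tag{1}$$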
Here, $u_o$ is the ‘outside’ vector representing outside word $o$, and $v_c$ is the ‘center’ vector representing center word $c$. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the ‘outside’ vectors $u_w$. The columns of $V$ are all of the ‘center’ vectors $v_w$. Both $U$ and $V$ contain a vector for every $w \in \text{Vocabulary}$.$^1$
Recall from lectures that, for a single pair of words c and o, the loss is given by:
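$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c) \tag{2}$$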
Another way to view this loss is as the cross-entropy$^2$ between the true distribution $y$ and the predicted distribution $\hat{y}$. Here, both $y$ and $\hat{y}$ are vectors with length equal to the number of words in the vocabulary. Furthermore, the $k^{th}$ entry in these vectors indicates the conditional probability of the $k^{th}$ word being an ‘outside word’ for the given $c$. The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word $o$, and 0 everywhere else. The predicted distribution $\hat{y}$ is the probability distribution $P(O \mid C = c)$ given by our model in equation (1).
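To make the shapes concrete, here is a minimal NumPy sketch (not from the original handout) that computes $\hat{y}$ and the naive-softmax loss for one (center, outside) pair, using the columns-are-vectors convention for $U$ described above:

```python
import numpy as np

def naive_softmax_loss(v_c, o, U):
    """Naive-softmax loss -log P(O = o | C = c) for one (center, outside) pair.

    v_c : (d,)   center word vector
    o   : int    index of the true outside word
    U   : (d, V) matrix whose columns are the outside vectors u_w
    """
    scores = U.T @ v_c                  # (V,) dot products u_w^T v_c
    scores -= scores.max()              # shift for numerical stability
    y_hat = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
    return -np.log(y_hat[o]), y_hat

# Toy example: embedding dimension d = 3, vocabulary of 5 words.
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 5))
v_c = rng.normal(size=3)
loss, y_hat = naive_softmax_loss(v_c, o=2, U=U)
```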
$^1$ Assume that every word in our vocabulary is matched to an integer number $k$. $u_k$ is both the $k^{th}$ column of $U$ and the ‘outside’ word vector for the word indexed by $k$. $v_k$ is both the $k^{th}$ column of $V$ and the ‘center’ word vector for the word indexed by $k$. In order to simplify notation we shall interchangeably use $k$ to refer to the word and the index-of-the-word.
$^2$ The Cross Entropy Loss between the true (discrete) probability distribution $p$ and another distribution $q$ is $-\sum_{i} p_i \log(q_i)$.
Questions and Answers
(a) (3 points) Show that the naive-softmax loss given in Equation (2) is the same as the cross-entropy loss between $y$ and $\hat{y}$; i.e., show that

$$-\sum_{w \in \text{Vocab}} y_w \log(\hat{y}_w) = -\log(\hat{y}_o) \tag{3}$$
Answer:
$y_w$ is one-hot (a column of the identity matrix): only the position of the true outside word $o$ is 1, and every other position is 0. Splitting off that one term:

$$-\sum_{w \in \text{Vocab}} y_w \log(\hat{y}_w) = -y_o \log(\hat{y}_o) - \sum_{\substack{w \in \text{Vocab} \\ w \neq o}} y_w \log(\hat{y}_w) = -\log(\hat{y}_o) - 0 = -\log(\hat{y}_o)$$
(b) (5 points) Compute the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to $v_c$. Please write your answer in terms of $y$, $\hat{y}$, and $U$.
Answer:
First, some clarifying notation:
- $U$ holds each word's vector when it acts as a context (outside) word;
- $V$ holds each word's vector when it acts as a center word;
- $y$ is the training label (the one-hot vector for the observed outside word);
- $\hat{y}$ is the model's predicted distribution for that training example.

Expanding the loss, $J_{\text{naive-softmax}} = -\log \hat{y}_o = -u_o^{\top} v_c + \log \sum_{w \in \text{Vocab}} \exp(u_w^{\top} v_c)$, and differentiating with respect to $v_c$:

$$\frac{\partial J_{\text{naive-softmax}}}{\partial v_c} = -u_o + \sum_{w \in \text{Vocab}} P(w \mid c)\, u_w$$

Since $y$ is one-hot, $u_o = U y$ picks out the outside vector of the word $o$ (recall that the columns of $U$ are the $u_w$), and since $P(w \mid c) = \hat{y}_w$ is our prediction, $\sum_w P(w \mid c)\, u_w = U \hat{y}$. The expression above therefore becomes:

$$\frac{\partial J_{\text{naive-softmax}}}{\partial v_c} = U(\hat{y} - y)$$
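A quick finite-difference check (a sketch building on the `naive_softmax_loss` helper and toy arrays defined above; not part of the original post) confirms this gradient:

```python
def grad_v_c(v_c, o, U):
    """Analytic gradient dJ/dv_c = U (y_hat - y) under the columns convention."""
    _, y_hat = naive_softmax_loss(v_c, o, U)
    y = np.zeros(U.shape[1])
    y[o] = 1.0
    return U @ (y_hat - y)

# Compare against a centered finite difference.
eps = 1e-6
numeric = np.zeros_like(v_c)
for i in range(v_c.size):
    step = np.zeros_like(v_c)
    step[i] = eps
    numeric[i] = (naive_softmax_loss(v_c + step, 2, U)[0]
                  - naive_softmax_loss(v_c - step, 2, U)[0]) / (2 * eps)
assert np.allclose(numeric, grad_v_c(v_c, 2, U), atol=1e-5)
```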
(c) (5 points) Compute the partial derivatives of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to each of the ‘outside’ word vectors, $u_w$'s. There will be two cases: when $w = o$, the true ‘outside’ word vector, and $w \neq o$, for all other words. Please write your answer in terms of $y$, $\hat{y}$, and $v_c$.
Answer:
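Expanding the loss as in part (b), $J = -u_o^{\top} v_c + \log \sum_{x \in \text{Vocab}} \exp(u_x^{\top} v_c)$, and differentiating with respect to a single outside vector $u_w$:

$$\frac{\partial J_{\text{naive-softmax}}}{\partial u_w} = -\mathbb{1}[w = o]\, v_c + \hat{y}_w v_c = (\hat{y}_w - y_w)\, v_c$$

So for the two cases:

$$\frac{\partial J}{\partial u_o} = (\hat{y}_o - 1)\, v_c, \qquad \frac{\partial J}{\partial u_w} = \hat{y}_w v_c \quad (w \neq o)$$

Stacking these as the columns of the full gradient, $\frac{\partial J}{\partial U} = v_c (\hat{y} - y)^{\top}$.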
(d) (3 points) The sigmoid function is given by Equation 4:
$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} \tag{4}$$
Please compute the derivative of σ(x) with respect to x, where x is a vector.
Answer:
$$\frac{\mathrm{d}\sigma(x)}{\mathrm{d}x} = \sigma(x)\big(1 - \sigma(x)\big)$$

where the product is taken element-wise, since $\sigma$ is applied element-wise to the vector $x$; the full Jacobian is the diagonal matrix $\mathrm{diag}\big(\sigma(x)(1 - \sigma(x))\big)$.
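This identity is easy to verify numerically element by element (a small sketch, reusing `np`, `rng`, and `eps` from the snippets above):

```python
def sigmoid(x):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

x = rng.normal(size=4)
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # centered difference
assert np.allclose(numeric, sigmoid(x) * (1 - sigmoid(x)), atol=1e-5)
```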
(e) (4 points) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \ldots, w_K$ and their outside vectors as $u_1, \ldots, u_K$. Note that $o \notin \{w_1, \ldots, w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:
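$$J_{\text{neg-sample}}(v_c, o, U) = -\log\big(\sigma(u_o^{\top} v_c)\big) - \sum_{k=1}^{K} \log\big(\sigma(-u_k^{\top} v_c)\big)$$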
for a sample $w_1, \ldots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.$^3$ Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$, and $u_k$, where $k \in [1, K]$. After you've done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (d) to help compute the necessary gradients here.
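Answer:
Using $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ from part (d) and the identity $1 - \sigma(-x) = \sigma(x)$:

$$\frac{\partial J_{\text{neg-sample}}}{\partial v_c} = \big(\sigma(u_o^{\top} v_c) - 1\big)\, u_o + \sum_{k=1}^{K} \sigma(u_k^{\top} v_c)\, u_k$$

$$\frac{\partial J_{\text{neg-sample}}}{\partial u_o} = \big(\sigma(u_o^{\top} v_c) - 1\big)\, v_c$$

$$\frac{\partial J_{\text{neg-sample}}}{\partial u_k} = \sigma(u_k^{\top} v_c)\, v_c, \qquad k \in [1, K]$$

This loss is much more efficient than naive-softmax because each gradient touches only the $K + 1$ vectors $u_o, u_1, \ldots, u_K$, i.e., a constant number of dot products, rather than a sum over the entire vocabulary.

A NumPy sketch of these gradients (reusing the `sigmoid` helper above; `neg_idx`, an array of the $K$ sampled indices, is an assumed name, not from the original):

```python
def neg_sample_grads(v_c, o, neg_idx, U):
    """Gradients of J_neg-sample; touches only K + 1 columns of U."""
    u_o = U[:, o]                   # (d,)   true outside vector
    U_k = U[:, neg_idx]             # (d, K) negative-sample vectors
    s_o = sigmoid(u_o @ v_c)        # scalar sigma(u_o^T v_c)
    s_k = sigmoid(U_k.T @ v_c)      # (K,)   sigma(u_k^T v_c)
    grad_v_c = (s_o - 1.0) * u_o + U_k @ s_k
    grad_u_o = (s_o - 1.0) * v_c
    grad_U_k = np.outer(v_c, s_k)   # column k is sigma(u_k^T v_c) * v_c
    return grad_v_c, grad_u_o, grad_U_k
```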
(f) (3 points) Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:
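$$J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) = \sum_{\substack{-m \le j \le m \\ j \neq 0}} J(v_c, w_{t+j}, U)$$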
Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$. $J(v_c, w_{t+j}, U)$ could be $J_{\text{naive-softmax}}(v_c, w_{t+j}, U)$ or $J_{\text{neg-sample}}(v_c, w_{t+j}, U)$, depending on your implementation.

Write down three partial derivatives:
(i) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) / \partial U$
(ii) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) / \partial v_c$
(iii) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) / \partial v_w$ when $w \neq c$

Write your answers in terms of $\partial J(v_c, w_{t+j}, U) / \partial U$ and $\partial J(v_c, w_{t+j}, U) / \partial v_c$.
Answer:
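Because the total loss is a sum of per-pair losses, each derivative distributes over the sum:

$$\frac{\partial J_{\text{skip-gram}}}{\partial U} = \sum_{\substack{-m \le j \le m \\ j \neq 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial U}$$

$$\frac{\partial J_{\text{skip-gram}}}{\partial v_c} = \sum_{\substack{-m \le j \le m \\ j \neq 0}} \frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}$$

$$\frac{\partial J_{\text{skip-gram}}}{\partial v_w} = 0 \quad \text{when } w \neq c,$$

since every loss term in the window depends on $V$ only through the center vector $v_c$.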