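CS224n: Assignment 2 Reference Solutions
This post gives reference solutions for the written derivation part of Assignment 2 of the CS224n course (Winter 2019). Questions and corrections are welcome.
Assignment 2 handout (original)
Assignment 2 coding part reference solutions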
Variable notation
Note: all variable dimensions here match those used in the coding part of Assignment 2, so the derivations are easier to relate to the code.
$\boldsymbol U$: matrix of shape (vocab_size, embedding_dim), containing all the ‘outside’ vectors.
$\boldsymbol V$: matrix of shape (vocab_size, embedding_dim), containing all the ‘center’ vectors.
$\boldsymbol y$: vector of shape (vocab_size, 1), the true empirical distribution. It is a one-hot vector with a 1 for the true outside word $o$ and 0 everywhere else.
$\hat{\boldsymbol y}$: vector of shape (vocab_size, 1), the predicted distribution $P(O \mid C = c)$ given by our model.
question a
Given outside word $o$ and center word $c$, the entries of the one-hot vector $\boldsymbol y$ are:
$$
y_w =
\begin{cases}
1 & w = o \\
0 & w \neq o
\end{cases}
$$
Every term with $w \neq o$ vanishes, so the cross-entropy reduces to a single term:
$$
-\sum_{w=1}^{V} y_w \log(\hat{y}_w) = -y_o \log(\hat{y}_o) = -\log(\hat{y}_o)
$$
Here, $V$ denotes vocab_size.
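The identity above can be checked numerically. The following is a minimal sketch, assuming a toy vocabulary of 5 words and a randomly generated predicted distribution; the helper `softmax` and the variable names are illustrative and not taken from the assignment code.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical toy setup: 5 words, true outside word at index 2.
rng = np.random.default_rng(0)
vocab_size, o = 5, 2
y_hat = softmax(rng.normal(size=vocab_size))   # predicted distribution
y = np.zeros(vocab_size)
y[o] = 1.0                                     # one-hot true distribution

cross_entropy = -np.sum(y * np.log(y_hat))     # full sum over the vocabulary
neg_log_prob = -np.log(y_hat[o])               # the single surviving term
```

Both quantities should agree up to floating-point error, since only the term with $w = o$ contributes to the sum.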
question b
$$
\begin{aligned}
\frac{\partial J_{\text{naive-softmax}}(\boldsymbol v_c, o, \boldsymbol U)}{\partial \boldsymbol v_c}
&= -\frac{\partial \log P(O=o \mid C=c)}{\partial \boldsymbol v_c} \\
&= -\frac{\partial \log \exp(\boldsymbol u_o^T \boldsymbol v_c)}{\partial \boldsymbol v_c} + \frac{\partial \log \sum_{w=1}^{V} \exp(\boldsymbol u_w^T \boldsymbol v_c)}{\partial \boldsymbol v_c} \\
&= -\boldsymbol u_o + \sum_{w=1}^{V} \frac{\exp(\boldsymbol u_w^T \boldsymbol v_c)}{\sum_{x=1}^{V} \exp(\boldsymbol u_x^T \boldsymbol v_c)} \boldsymbol u_w \\
&= -\boldsymbol u_o + \sum_{w=1}^{V} P(O=w \mid C=c)\, \boldsymbol u_w \\
&= \boldsymbol U^T (\hat{\boldsymbol y} - \boldsymbol y)
\end{aligned}
$$
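A finite-difference check is a convenient way to gain confidence in this result. The sketch below assumes small random matrices and stores $\boldsymbol v_c$ as a 1-D array; the function names `naive_softmax_loss` and `grad_center_vec` are hypothetical and are not the assignment's starter-code names.

```python
import numpy as np

def naive_softmax_loss(v_c, o, U):
    """J = -log softmax(U v_c)[o]; U has shape (vocab_size, embedding_dim)."""
    scores = U @ v_c
    scores = scores - scores.max()              # stabilise the exponentials
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[o]

def grad_center_vec(v_c, o, U):
    """Analytic gradient U^T (y_hat - y) with respect to v_c."""
    scores = U @ v_c
    scores = scores - scores.max()
    y_hat = np.exp(scores) / np.exp(scores).sum()
    y_hat[o] -= 1.0                             # y_hat - y for one-hot y
    return U.T @ y_hat

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(1)
vocab_size, embedding_dim, o = 6, 4, 3
U = rng.normal(size=(vocab_size, embedding_dim))
v_c = rng.normal(size=embedding_dim)

analytic = grad_center_vec(v_c, o, U)

# Central finite differences, one coordinate of v_c at a time.
eps = 1e-6
numeric = np.zeros_like(v_c)
for i in range(embedding_dim):
    step = np.zeros_like(v_c)
    step[i] = eps
    numeric[i] = (naive_softmax_loss(v_c + step, o, U)
                  - naive_softmax_loss(v_c - step, o, U)) / (2 * eps)
```

If the derivation is right, `analytic` and `numeric` should match to within the finite-difference error.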
question c
$$
\frac{\partial J_{\text{naive-softmax}}(\boldsymbol v_c, o, \boldsymbol U)}{\partial \boldsymbol u_w}
= -\frac{\partial \log \exp(\boldsymbol u_o^T \boldsymbol v_c)}{\partial \boldsymbol u_w} + \frac{\partial \log \sum_{x=1}^{V} \exp(\boldsymbol u_x^T \boldsymbol v_c)}{\partial \boldsymbol u_w}
$$
When $w = o$,
$$
\begin{aligned}
\frac{\partial J_{\text{naive-softmax}}(\boldsymbol v_c, o, \boldsymbol U)}{\partial \boldsymbol u_o}
&= -\boldsymbol v_c + \frac{1}{\sum_{x=1}^{V} \exp(\boldsymbol u_x^T \boldsymbol v_c)} \frac{\partial \sum_{x=1}^{V} \exp(\boldsymbol u_x^T \boldsymbol v_c)}{\partial \boldsymbol u_o} \\
&= -\boldsymbol v_c + \frac{1}{\sum_{x=1}^{V} \exp(\boldsymbol u_x^T \boldsymbol v_c)} \frac{\partial \exp(\boldsymbol u_o^T \boldsymbol v_c)}{\partial \boldsymbol u_o} \\
&= -\boldsymbol v_c + \frac{\exp(\boldsymbol u_o^T \boldsymbol v_c)}{\sum_{x=1}^{V} \exp(\boldsymbol u_x^T \boldsymbol v_c)} \boldsymbol v_c \\
&= (P(O=o \mid C=c) - 1)\, \boldsymbol v_c
\end{aligned}
$$
When $w \neq o$, the first term does not depend on $\boldsymbol u_w$, so
$$
\begin{aligned}
\frac{\partial J_{\text{naive-softmax}}(\boldsymbol v_c, o, \boldsymbol U)}{\partial \boldsymbol u_w}
&= \frac{\exp(\boldsymbol u_w^T \boldsymbol v_c)}{\sum_{x=1}^{V} \exp(\boldsymbol u_x^T \boldsymbol v_c)} \boldsymbol v_c \\
&= P(O=w \mid C=c)\, \boldsymbol v_c
\end{aligned}
$$
In summary, both cases give $(\hat{y}_w - y_w)\, \boldsymbol v_c$ for word $w$. Stacking these as the rows of a matrix with the same shape as $\boldsymbol U$, (vocab_size, embedding_dim), gives
$$
\frac{\partial J_{\text{naive-softmax}}(\boldsymbol v_c, o, \boldsymbol U)}{\partial \boldsymbol U}
= (\hat{\boldsymbol y} - \boldsymbol y)\, \boldsymbol v_c^T
$$
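The two cases can be stacked row by row into a single matrix. As a rough sketch, assuming $\boldsymbol v_c$ is stored as a 1-D NumPy array, the stacked gradient is an outer product; the toy sizes and variable names below are illustrative only.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(2)
vocab_size, embedding_dim, o = 5, 3, 1
U = rng.normal(size=(vocab_size, embedding_dim))
v_c = rng.normal(size=embedding_dim)

# Predicted distribution y_hat = softmax(U v_c), computed stably.
scores = U @ v_c
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()

y = np.zeros(vocab_size)
y[o] = 1.0                            # one-hot true distribution

# Row w of the result is (y_hat[w] - y[w]) * v_c.
grad_U = np.outer(y_hat - y, v_c)     # shape (vocab_size, embedding_dim)
```

Row `o` of `grad_U` should equal `(y_hat[o] - 1) * v_c`, and every other row `w` should equal `y_hat[w] * v_c`, matching the two cases derived above.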
question d
$$
\frac{\partial \sigma(x)}{\partial x}
= \frac{\partial}{\partial x} \frac{e^x}{e^x + 1}
= \frac{e^x (e^x + 1) - e^x e^x}{(e^x + 1)^2}
= \frac{e^x}{(e^x + 1)^2}
= \sigma(x)\,(1 - \sigma(x))
$$
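This identity is easy to confirm numerically. The snippet below is a small sketch that compares $\sigma(x)(1 - \sigma(x))$ with a central finite difference on a few sample points; the grid of points is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)                   # sample points -4, -3, ..., 4
analytic = sigmoid(x) * (1.0 - sigmoid(x))      # closed-form derivative

eps = 1e-5
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
```

The derivative peaks at $x = 0$, where $\sigma(0) = 0.5$ and the derivative is $0.25$.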
question e
i)
$$
\begin{aligned}
\frac{\partial J_{\text{neg-sample}}(\boldsymbol v_c, o, \boldsymbol U)}{\partial \boldsymbol v_c}
&= \frac{\partial \left( -\log \sigma(\boldsymbol u_o^T \boldsymbol v_c) - \sum_{k=1}^{K} \log \sigma(-\boldsymbol u_k^T \boldsymbol v_c) \right)}{\partial \boldsymbol v_c} \\
&= -\frac{\sigma(\boldsymbol u_o^T \boldsymbol v_c)\,(1 - \sigma(\boldsymbol u_o^T \boldsymbol v_c))}{\sigma(\boldsymbol u_o^T \boldsymbol v_c)} \frac{\partial \boldsymbol u_o^T \boldsymbol v_c}{\partial \boldsymbol v_c} - \sum_{k=1}^{K} \frac{\partial \log \sigma(-\boldsymbol u_k^T \boldsymbol v_c)}{\partial \boldsymbol v_c} \\
&= -(1 - \sigma(\boldsymbol u_o^T \boldsymbol v_c))\, \boldsymbol u_o + \sum_{k=1}^{K} (1 - \sigma(-\boldsymbol u_k^T \boldsymbol v_c))\, \boldsymbol u_k
\end{aligned}
$$
ii)
$$
\frac{\partial J_{\text{neg-sample}}(\boldsymbol v_c, o, \boldsymbol U)}{\partial \boldsymbol u_o}
= \frac{\partial \left( -\log \sigma(\boldsymbol u_o^T \boldsymbol v_c) \right)}{\partial \boldsymbol u_o}
= -(1 - \sigma(\boldsymbol u_o^T \boldsymbol v_c))\, \boldsymbol v_c
$$
iii)
$$
\frac{\partial J_{\text{neg-sample}}(\boldsymbol v_c, o, \boldsymbol U)}{\partial \boldsymbol u_k}
= \frac{\partial \left( -\log \sigma(-\boldsymbol u_k^T \boldsymbol v_c) \right)}{\partial \boldsymbol u_k}
= (1 - \sigma(-\boldsymbol u_k^T \boldsymbol v_c))\, \boldsymbol v_c
$$
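The three negative-sampling gradients can be sketched together in NumPy. This is a minimal illustration under stated assumptions: the $K$ negative samples are taken to be distinct words whose outside vectors form the rows of a hypothetical matrix `U_neg`, and the function names are not those of the assignment's starter code. If the same word were sampled more than once, its row gradients would have to be accumulated.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """J = -log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c)."""
    return -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-U_neg @ v_c)))

def neg_sampling_grads(v_c, u_o, U_neg):
    pos = 1.0 - sigmoid(u_o @ v_c)          # scalar: 1 - sigma(u_o^T v_c)
    neg = 1.0 - sigmoid(-U_neg @ v_c)       # shape (K,): 1 - sigma(-u_k^T v_c)
    grad_v_c = -pos * u_o + U_neg.T @ neg   # part i)
    grad_u_o = -pos * v_c                   # part ii)
    grad_U_neg = np.outer(neg, v_c)         # part iii), one row per negative sample
    return grad_v_c, grad_u_o, grad_U_neg

# Hypothetical toy sizes: K = 4 negative samples, 3-dimensional embeddings.
rng = np.random.default_rng(3)
K, embedding_dim = 4, 3
v_c = rng.normal(size=embedding_dim)
u_o = rng.normal(size=embedding_dim)
U_neg = rng.normal(size=(K, embedding_dim))

grad_v_c, grad_u_o, grad_U_neg = neg_sampling_grads(v_c, u_o, U_neg)

# Central finite differences with respect to v_c as a sanity check.
eps = 1e-6
numeric_v_c = np.array([
    (neg_sampling_loss(v_c + eps * e, u_o, U_neg)
     - neg_sampling_loss(v_c - eps * e, u_o, U_neg)) / (2 * eps)
    for e in np.eye(embedding_dim)
])
```

Unlike the naive-softmax loss, each gradient here touches only $K + 1$ outside vectors rather than the whole vocabulary, which is the reason negative sampling is cheaper to compute.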
question f
i)
$$
\frac{\partial J_{\text{skip-gram}}(\boldsymbol v_c, w_{t-m}, \ldots, w_{t+m}, \boldsymbol U)}{\partial \boldsymbol U}
= \sum_{-m \le j \le m,\; j \ne 0} \frac{\partial J(\boldsymbol v_c, w_{t+j}, \boldsymbol U)}{\partial \boldsymbol U}
$$
ii)
When $w = c$,
$$
\frac{\partial J_{\text{skip-gram}}(\boldsymbol v_c, w_{t-m}, \ldots, w_{t+m}, \boldsymbol U)}{\partial \boldsymbol v_c}
= \sum_{-m \le j \le m,\; j \ne 0} \frac{\partial J(\boldsymbol v_c, w_{t+j}, \boldsymbol U)}{\partial \boldsymbol v_c}
$$
iii)
When $w \neq c$, the loss for this window does not depend on $\boldsymbol v_w$, so
$$
\frac{\partial J_{\text{skip-gram}}(\boldsymbol v_c, w_{t-m}, \ldots, w_{t+m}, \boldsymbol U)}{\partial \boldsymbol v_w}
= \boldsymbol 0
$$
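The three results translate into a simple accumulation loop over the context window. The sketch below assumes the naive-softmax loss as the per-pair loss $J$ and uses hypothetical function names and toy sizes; it is meant only to show where each gradient lands, not to reproduce the assignment's implementation.

```python
import numpy as np

def naive_softmax_grads(v_c, o, U):
    """Return (dJ/dv_c, dJ/dU) for one (center, outside) pair."""
    scores = U @ v_c
    y_hat = np.exp(scores - scores.max())
    y_hat /= y_hat.sum()
    delta = y_hat.copy()
    delta[o] -= 1.0                         # y_hat - y for one-hot y
    return U.T @ delta, np.outer(delta, v_c)

def skipgram_grads(c, outside_indices, V, U):
    """Sum per-pair gradients over the window around center word c."""
    grad_V = np.zeros_like(V)               # rows w != c stay zero (part iii)
    grad_U = np.zeros_like(U)
    for o in outside_indices:
        g_vc, g_U = naive_softmax_grads(V[c], o, U)
        grad_V[c] += g_vc                   # part ii: only row c is updated
        grad_U += g_U                       # part i: sum over the window
    return grad_V, grad_U

# Hypothetical toy setup: window size m = 2 around center word index 2.
rng = np.random.default_rng(4)
vocab_size, embedding_dim = 6, 3
V = rng.normal(size=(vocab_size, embedding_dim))
U = rng.normal(size=(vocab_size, embedding_dim))
c, outside_indices = 2, [0, 1, 3, 4]

grad_V, grad_U = skipgram_grads(c, outside_indices, V, U)
```

After the loop, every row of `grad_V` other than row `c` is exactly zero, while `grad_U` is generally dense because the naive-softmax loss involves every outside vector.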