I have split my Assignment 1 solutions into 4 parts, covering problems 1, 2, 3, and 4 respectively. This part contains the solution to problem 3.
3. word2vec (40 points + 5 bonus)
(a). (3 points) Assume you are given a predicted word vector $v_c$ corresponding to the center word $c$ for skipgram, and word prediction is made with the softmax function found in word2vec models

$$\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{W} \exp(u_w^\top v_c)}$$

where $w$ denotes the $w$-th word and $u_w$ ($w = 1, \ldots, W$) are the "output" word vectors for all words in the vocabulary. Assume cross entropy cost is applied to this prediction and word $o$ is the expected word; derive the gradients with respect to $v_c$.

Hint: It will be helpful to use notation from question 2. For instance, letting $\hat{y}$ be the vector of softmax predictions for every word, $y$ the expected word vector (a one-hot vector), and the loss function

$$J_{\text{softmax-CE}}(o, v_c, U) = \mathrm{CE}(y, \hat{y}),$$

where $U = [u_1, u_2, \ldots, u_W]$ is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.
Solution: Let the word-vector dimension be $n_{dim}$, and take all word vectors to be column vectors, so $v_c$ has shape $n_{dim} \times 1$ and $U$ has shape $n_{dim} \times W$. Let $\theta = U^\top v_c$, so that $\hat{y} = \mathrm{softmax}(\theta)$. From question 2 we know $\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial \theta} = \hat{y} - y$, so by the chain rule:

$$\frac{\partial J}{\partial v_c} = \frac{\partial \theta}{\partial v_c}\,\frac{\partial J}{\partial \theta} = U(\hat{y} - y).$$
(b). (3 points) As in the previous part, derive gradients for the "output" word vectors $u_w$ (including $u_o$).
Solution: As in (a), let $\theta = U^\top v_c$, so that $\frac{\partial J}{\partial \theta} = \hat{y} - y$. Here $\theta_k = u_k^\top v_c$, so

$$\frac{\partial \theta_k}{\partial U_{ij}} = \begin{cases} v_i & j = k \\ 0 & j \neq k \end{cases}$$

where $v_i$ denotes the $i$-th element of $v_c$. Then:

$$\frac{\partial J}{\partial U_{ij}} = \sum_k \frac{\partial J}{\partial \theta_k}\,\frac{\partial \theta_k}{\partial U_{ij}} = (\hat{y}_j - y_j)\, v_i.$$

Therefore:

$$\frac{\partial J}{\partial U} = v_c\,(\hat{y} - y)^\top.$$
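To make (a) and (b) concrete, here is a minimal numpy sketch of these two gradients. It is my own illustration, not the assignment's starter code (`softmax_ce_grads` is a hypothetical name, and the real q3 code may store output vectors in a different orientation); it follows the column-vector convention used above, where $U$ is $n_{dim} \times W$:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_ce_grads(v_c, U, o):
    """Softmax-CE loss and gradients from parts (a) and (b).

    v_c : (n_dim,)   center word vector
    U   : (n_dim, W) columns are the output vectors u_w
    o   : index of the expected (target) word
    """
    theta = U.T @ v_c              # theta_k = u_k^T v_c, shape (W,)
    y_hat = softmax(theta)         # predicted distribution, shape (W,)
    loss = -np.log(y_hat[o])       # cross entropy with one-hot y
    delta = y_hat.copy()
    delta[o] -= 1.0                # y_hat - y
    grad_vc = U @ delta            # dJ/dv_c = U (y_hat - y), shape (n_dim,)
    grad_U = np.outer(v_c, delta)  # dJ/dU = v_c (y_hat - y)^T, shape (n_dim, W)
    return loss, grad_vc, grad_U
```

Checking both outputs against a finite-difference gradient check, as in question 2, is a quick way to validate the derivation.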
(c). (6 points) Repeat part (a) and (b) assuming we are using the negative sampling loss for the predicted vector $v_c$, and the expected output word is $o$. Assume that $K$ negative samples (words) are drawn and, for simplicity of notation, that their indices are $1, \ldots, K$ ($o \notin \{1, \ldots, K\}$). Again, for a given word $o$, denote its output vector as $u_o$. The negative sampling loss function in this case is

$$J_{\text{neg-sample}}(o, v_c, U) = -\log\bigl(\sigma(u_o^\top v_c)\bigr) - \sum_{k=1}^{K} \log\bigl(\sigma(-u_k^\top v_c)\bigr),$$

where $\sigma(\cdot)$ is the sigmoid function.
After you’ve done this, describe with one sentence why this cost function is much more efficient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e. the runtime of the softmax-CE loss divided by the runtime of the negative sampling loss).
Note: the cost function here is the negative of what Mikolov et al had in their original paper, because we are doing a minimization instead of maximization in our code.
Solution: Let the set of the $K$ sampled indices be $S = \{1, \ldots, K\}$. Using $\sigma(-x) = 1 - \sigma(x)$ and $\sigma'(x) = \sigma(x)(1 - \sigma(x))$:

$$\frac{\partial J}{\partial v_c} = \bigl(\sigma(u_o^\top v_c) - 1\bigr)\, u_o + \sum_{k \in S} \sigma(u_k^\top v_c)\, u_k$$

$$\frac{\partial J}{\partial u_o} = \bigl(\sigma(u_o^\top v_c) - 1\bigr)\, v_c, \qquad \frac{\partial J}{\partial u_k} = \sigma(u_k^\top v_c)\, v_c \quad (k \in S),$$

and $\frac{\partial J}{\partial u_w} = 0$ for every word $w \notin S \cup \{o\}$. The negative sampling loss is much more efficient to compute than softmax-CE because $\frac{\text{runtime of softmax-CE}}{\text{runtime of negative sampling loss}} = \frac{O(W)}{O(K)}$: softmax-CE touches all $W$ output vectors, while negative sampling touches only $K + 1$ of them. (I'm not sure this statement is fully precise; corrections from experts are welcome.)
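Here is a small numpy sketch of these gradients as well, under the same column-vector convention and assumptions as the previous snippet (`neg_sampling_grads` and `neg_idx` are my own hypothetical names; `neg_idx` stands in for the sampled set $S$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_grads(v_c, U, o, neg_idx):
    """Negative-sampling loss and gradients from part (c).

    v_c     : (n_dim,)   center word vector
    U       : (n_dim, W) columns are the output vectors u_w
    o       : index of the expected word
    neg_idx : the K sampled negative indices (o not among them)
    """
    grad_U = np.zeros_like(U)
    s_o = sigmoid(U[:, o] @ v_c)             # sigma(u_o^T v_c)
    loss = -np.log(s_o)
    grad_vc = (s_o - 1.0) * U[:, o]          # (sigma(u_o^T v_c) - 1) u_o
    grad_U[:, o] = (s_o - 1.0) * v_c
    for k in neg_idx:
        s_k = sigmoid(U[:, k] @ v_c)         # sigma(u_k^T v_c) = 1 - sigma(-u_k^T v_c)
        loss -= np.log(1.0 - s_k)            # -log(sigma(-u_k^T v_c))
        grad_vc += s_k * U[:, k]
        grad_U[:, k] += s_k * v_c            # += so repeated samples accumulate
    return loss, grad_vc, grad_U
```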
(d). (8 points) Derive gradients for all of the word vectors for skip-gram and CBOW given the previous parts and given a set of context words $[\text{word}_{c-m}, \ldots, \text{word}_{c-1}, \text{word}_c, \text{word}_{c+1}, \ldots, \text{word}_{c+m}]$, where $m$ is the context size. Denote the "input" and "output" word vectors for $\text{word}_k$ as $v_k$ and $u_k$ respectively.

Hint: feel free to use $F(o, v_c)$ (where $o$ is the expected word) as a placeholder for the $J_{\text{softmax-CE}}(o, v_c, \ldots)$ or $J_{\text{neg-sample}}(o, v_c, \ldots)$ cost functions in this part; you'll see that this is a useful abstraction for the coding part.

Recall that for skip-gram, the cost for a context centered around $c$ is

$$J_{\text{skip-gram}}(\text{word}_{c-m \ldots c+m}) = \sum_{-m \le j \le m,\ j \ne 0} F(w_{c+j}, v_c),$$

where $w_{c+j}$ refers to the word at the $j$-th index from the center.

CBOW is slightly different. Instead of using $v_c$ as the predicted vector, we use $\hat{v}$, the sum of the input word vectors in the context:

$$\hat{v} = \sum_{-m \le j \le m,\ j \ne 0} v_{c+j};$$

then the CBOW cost is

$$J_{\text{CBOW}}(\text{word}_{c-m \ldots c+m}) = F(w_c, \hat{v}).$$

Note: To be consistent with the $\hat{v}$ notation such as for the code portion, for skip-gram $\hat{v} = v_c$.
Solution: Let $v_k$ and $u_k$ be the input and output vectors for word $k$, respectively.
Answer for skip-gram (a numpy sketch follows below):

$$\frac{\partial J_{\text{skip-gram}}}{\partial v_c} = \sum_{-m \le j \le m,\ j \ne 0} \frac{\partial F(w_{c+j}, v_c)}{\partial v_c}, \qquad \frac{\partial J_{\text{skip-gram}}}{\partial v_j} = 0 \quad (j \ne c),$$

$$\frac{\partial J_{\text{skip-gram}}}{\partial U} = \sum_{-m \le j \le m,\ j \ne 0} \frac{\partial F(w_{c+j}, v_c)}{\partial U},$$

where $w_{c+j}$ is the one-hot vector of the word at the $j$-th position from the center.
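As a sketch of the skip-gram accumulation (my own illustration under the same assumptions as before; `skipgram_grads` and `cost_fn` are hypothetical names, with `cost_fn` standing in for $F$, e.g. `softmax_ce_grads` above; a negative-sampling $F$ would additionally need the sampled indices):

```python
import numpy as np

def skipgram_grads(c, context_idx, V, U, cost_fn):
    """Skip-gram loss and gradients for one window, from part (d).

    c           : index of the center word
    context_idx : indices of the 2m context words (center excluded)
    V, U        : (n_dim, W) input / output vector matrices (columns)
    cost_fn     : F with the signature of softmax_ce_grads above
    """
    loss = 0.0
    grad_V = np.zeros_like(V)   # only column c receives any gradient
    grad_U = np.zeros_like(U)
    for o in context_idx:
        l, g_vc, g_U = cost_fn(V[:, c], U, o)
        loss += l
        grad_V[:, c] += g_vc    # dJ/dv_c sums over the window
        grad_U += g_U
    return loss, grad_V, grad_U
```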
Answer for CBOW:

$$\frac{\partial J_{\text{CBOW}}}{\partial U} = \frac{\partial F(w_c, \hat{v})}{\partial U},$$

$$\frac{\partial J_{\text{CBOW}}}{\partial v_j} = \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}} \quad (c - m \le j \le c + m,\ j \ne c), \qquad \frac{\partial J_{\text{CBOW}}}{\partial v_j} = 0 \quad \text{otherwise}.$$
PS: this answer looks really simple; why is it worth 8 points?
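And a matching sketch for CBOW under the same assumptions (`cbow_grads` is again a hypothetical name; note how the single gradient with respect to $\hat{v}$ is copied to every context word, which is exactly the answer above):

```python
import numpy as np

def cbow_grads(c, context_idx, V, U, cost_fn):
    """CBOW loss and gradients for one window, from part (d).

    The predicted vector is v_hat, the sum of the context input
    vectors; each context word receives the same gradient dF/dv_hat,
    and the center word's input vector receives none.
    """
    v_hat = V[:, context_idx].sum(axis=1)    # (n_dim,)
    loss, g_vhat, grad_U = cost_fn(v_hat, U, c)
    grad_V = np.zeros_like(V)
    for j in context_idx:
        grad_V[:, j] += g_vhat               # += so repeated context words accumulate
    return loss, grad_V, grad_U
```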
(e)(f)(g)(h). See the code; omitted here.
Attached is a plot from my training run, i.e., the figure that appears after running q3_run.py; there is a discussion on reddit about how to judge whether this plot looks reasonable: