Naive Softmax & Negative Sampling

1 Naive Softmax

We introduce two vectors per word: for every word $w$ in the vocabulary $V$, a $d$-dimensional vector $\mathbf{v}_w$ represents $w$ when it is the center word, and a vector $\mathbf{u}_w$ represents it when it is a context (outside) word.

The probability of the context word $o$ appearing given the center word $c$ is then:
$$p(o \mid c) = \frac{\exp(\mathbf{u}_o^T\mathbf{v}_c)}{\sum_{w \in V}\exp(\mathbf{u}_w^T\mathbf{v}_c)}$$
Here $V$ is the vocabulary and $\mathbf{u}_o^T\mathbf{v}_c$ is a dot product whose magnitude indicates how similar $o$ and $c$ are; the exponential is introduced so that the final values form a proper probability distribution (the softmax idea).
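As a concrete toy example (random vectors and illustrative names only, not from the original post), $p(o \mid c)$ can be computed directly:

import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                               # toy vocabulary size and vector dimension
U = rng.normal(size=(V, d))               # row w is the outside vector u_w
v_c = rng.normal(size=d)                  # center vector v_c
o = 2                                     # index of the outside word

scores = U @ v_c                          # u_w^T v_c for every w in the vocabulary
p = np.exp(scores) / np.exp(scores).sum() # softmax over the vocabulary
print(p[o], p.sum())                      # p(o|c), and the distribution sums to 1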

The loss is defined as:
$$\begin{aligned} J &= -\log \frac{\exp(\mathbf{u}_o^T\mathbf{v}_c)}{\sum_{w=1}^{V} \exp(\mathbf{u}_w^T\mathbf{v}_c)} \\ &= -\log \exp(\mathbf{u}_o^T\mathbf{v}_c) + \log \sum_{w=1}^{V} \exp(\mathbf{u}_w^T\mathbf{v}_c) \\ &= -\mathbf{u}_o^T\mathbf{v}_c + \log \sum_{w=1}^{V} \exp(\mathbf{u}_w^T\mathbf{v}_c) \end{aligned}$$
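Equivalently, writing $\mathbf{y}$ for the one-hot vector of the true outside word $o$ and $\hat{\mathbf{y}}$ for the vector of softmax probabilities over the vocabulary, this loss is just the cross entropy between $\mathbf{y}$ and $\hat{\mathbf{y}}$:
$$J = -\sum_{w \in V} y_w \log \hat{y}_w = -\log \hat{y}_o$$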

The gradient with respect to $\mathbf{v}_c$ is:
$$\begin{aligned} \frac{\partial}{\partial \mathbf{v}_c}\left(-\log\frac{\exp(\mathbf{u}_o^T\mathbf{v}_c)}{\sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c)}\right) &= -\frac{\partial}{\partial \mathbf{v}_c}\log \exp(\mathbf{u}_o^T\mathbf{v}_c) + \frac{\partial}{\partial \mathbf{v}_c}\log \sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c) \\ &= -\frac{\partial}{\partial \mathbf{v}_c}\mathbf{u}_o^T\mathbf{v}_c + \frac{\partial}{\partial \mathbf{v}_c}\log \sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c) \end{aligned}$$
The first term $\frac{\partial}{\partial \mathbf{v}_c}\mathbf{u}_o^T\mathbf{v}_c$ evaluates to $\mathbf{u}_o$; the second term $\frac{\partial}{\partial \mathbf{v}_c}\log \sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c)$ requires the chain rule:
$$\begin{aligned} \frac{\partial}{\partial \mathbf{v}_c}\log \sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c) &= \frac{1}{\sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c)}\,\frac{\partial}{\partial \mathbf{v}_c}\sum_{x\in V}\exp(\mathbf{u}_x^T\mathbf{v}_c) \\ &= \frac{1}{\sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c)}\sum_{x\in V}\frac{\partial}{\partial \mathbf{v}_c}\exp(\mathbf{u}_x^T\mathbf{v}_c) \\ &= \frac{1}{\sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c)}\sum_{x\in V}\exp(\mathbf{u}_x^T\mathbf{v}_c)\,\frac{\partial}{\partial \mathbf{v}_c}\mathbf{u}_x^T\mathbf{v}_c \\ &= \frac{1}{\sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c)}\sum_{x\in V}\exp(\mathbf{u}_x^T\mathbf{v}_c)\,\mathbf{u}_x \\ &= \frac{\sum_{x\in V}\exp(\mathbf{u}_x^T\mathbf{v}_c)\,\mathbf{u}_x}{\sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c)} \\ &= \sum_{x\in V}\frac{\exp(\mathbf{u}_x^T\mathbf{v}_c)}{\sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c)}\,\mathbf{u}_x \end{aligned}$$

Substituting both results back into the original expression:
$$\begin{aligned} \frac{\partial J}{\partial \mathbf{v}_c} &= -\frac{\partial}{\partial \mathbf{v}_c}\log \exp(\mathbf{u}_o^T\mathbf{v}_c) + \frac{\partial}{\partial \mathbf{v}_c}\log \sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c) \\ &= -\mathbf{u}_o + \sum_{x\in V}\frac{\exp(\mathbf{u}_x^T\mathbf{v}_c)}{\sum_{w\in V}\exp(\mathbf{u}_w^T\mathbf{v}_c)}\,\mathbf{u}_x \\ &= -\mathbf{u}_o + \sum_{x\in V}p(x \mid c)\,\mathbf{u}_x \end{aligned}$$
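Writing $\mathbf{U}$ for the matrix whose rows are the outside vectors, $\hat{\mathbf{y}}$ for the vector of probabilities $p(x \mid c)$, and $\mathbf{y}$ for the one-hot vector of the true outside word, this result can be written compactly as
$$\frac{\partial J}{\partial \mathbf{v}_c} = \mathbf{U}^T(\hat{\mathbf{y}} - \mathbf{y}),$$
which is the form the implementation in section 3 uses (`np.dot(outsideVectors.T, cProb)`).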

Next, the gradient with respect to $\mathbf{u}_o$ (note that in the sum over $w$, only the $w=o$ term depends on $\mathbf{u}_o$):
$$\begin{aligned} \frac{\partial J}{\partial \mathbf{u}_o} &= \frac{\partial}{\partial \mathbf{u}_o}\left(-\mathbf{u}_o^T\mathbf{v}_c + \log \sum_{w=1}^{V} \exp(\mathbf{u}_w^T\mathbf{v}_c)\right) \\ &= -\frac{\partial}{\partial \mathbf{u}_o}\mathbf{u}_o^T\mathbf{v}_c + \frac{\partial}{\partial \mathbf{u}_o}\log \sum_{w=1}^{V} \exp(\mathbf{u}_w^T\mathbf{v}_c) \\ &= -\mathbf{v}_c + \frac{1}{\sum_{w=1}^{V} \exp(\mathbf{u}_w^T\mathbf{v}_c)}\sum_{w=1}^{V}\frac{\partial}{\partial \mathbf{u}_o}\exp(\mathbf{u}_w^T\mathbf{v}_c) \\ &= -\mathbf{v}_c + \frac{1}{\sum_{w=1}^{V} \exp(\mathbf{u}_w^T\mathbf{v}_c)}\exp(\mathbf{u}_o^T\mathbf{v}_c)\,\mathbf{v}_c \\ &= -\mathbf{v}_c + p(o \mid c)\,\mathbf{v}_c \\ &= \left(p(o \mid c)-1\right)\mathbf{v}_c \end{aligned}$$
Now take the gradient with respect to an outside vector other than $\mathbf{u}_o$; without loss of generality, call it $\mathbf{u}_x$:
$$\begin{aligned} \frac{\partial J}{\partial \mathbf{u}_x} &= \frac{\partial}{\partial \mathbf{u}_x}\left(-\mathbf{u}_o^T\mathbf{v}_c + \log \sum_{w=1}^{V} \exp(\mathbf{u}_w^T\mathbf{v}_c)\right) \\ &= \frac{\partial}{\partial \mathbf{u}_x}\log \sum_{w=1}^{V} \exp(\mathbf{u}_w^T\mathbf{v}_c) \\ &= \frac{1}{\sum_{w=1}^{V} \exp(\mathbf{u}_w^T\mathbf{v}_c)}\sum_{w=1}^{V}\frac{\partial}{\partial \mathbf{u}_x}\exp(\mathbf{u}_w^T\mathbf{v}_c) \\ &= \frac{1}{\sum_{w=1}^{V} \exp(\mathbf{u}_w^T\mathbf{v}_c)}\exp(\mathbf{u}_x^T\mathbf{v}_c)\,\mathbf{v}_c \\ &= p(x \mid c)\,\mathbf{v}_c \end{aligned}$$
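As a sanity check on the three gradients above, here is a minimal finite-difference sketch with toy dimensions (the helper and variable names are illustrative, not part of the assignment code):

import numpy as np

def naive_softmax_loss(v_c, U, o):
    """J = -log softmax(U v_c)[o], where row w of U is the outside vector u_w."""
    z = U @ v_c
    z = z - z.max()                        # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[o]), p

rng = np.random.default_rng(0)
V, d, o = 7, 5, 2                          # toy vocabulary size, dimension, outside index
U = rng.normal(size=(V, d))
v_c = rng.normal(size=d)

loss, p = naive_softmax_loss(v_c, U, o)
y = np.zeros(V); y[o] = 1.0
grad_vc = U.T @ (p - y)                    # analytic: -u_o + sum_x p(x|c) u_x
grad_U = np.outer(p - y, v_c)              # analytic: row x is (p(x|c) - [x == o]) v_c

eps = 1e-6
num_grad_vc = np.array([
    (naive_softmax_loss(v_c + eps * np.eye(d)[i], U, o)[0]
     - naive_softmax_loss(v_c - eps * np.eye(d)[i], U, o)[0]) / (2 * eps)
    for i in range(d)])
print(np.allclose(grad_vc, num_grad_vc, atol=1e-5))   # expect True; grad_U checks the same way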

2 Negative Sampling

One very computationally expensive part of training is evaluating the softmax:
$$p(o \mid c) = \frac{\exp(\mathbf{u}_o^T\mathbf{v}_c)}{\sum_{w \in V}\exp(\mathbf{u}_w^T\mathbf{v}_c)}$$
The denominator requires computing the dot product of the center vector with every word vector in the vocabulary and then summing their exponentials, so Mikolov et al. proposed the negative sampling trick in Distributed Representations of Words and Phrases and their Compositionality to address this.

The purpose of negative sampling is to avoid computing dot products between $\mathbf{v}_c$ and the entire outside-vector matrix $\mathbf{U}$. Instead, a small number of words with low co-occurrence probability are drawn from the vocabulary as negative samples, and only those terms enter the loss. Optimizing this loss increases the co-occurrence probability of the outside word $o$ with the center word $c$, while decreasing the co-occurrence probability of the $K$ negative-sample words with $c$.
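As a rough comparison, counting only the dot products needed for one (center, outside) training pair:
$$\underbrace{O(|V|\,d)}_{\text{naive softmax}} \quad \text{vs.} \quad \underbrace{O\!\left((K+1)\,d\right)}_{\text{negative sampling}}, \qquad K \ll |V|$$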

The negative sampling loss is defined as follows:
$$J_{\text{neg-sample}}(o, \mathbf{v}_c, \mathbf{U}) = -\log\sigma(\mathbf{u}_o^T\mathbf{v}_c) - \sum_{j=1}^{K}\log\sigma(-\mathbf{u}_j^T\mathbf{v}_c)$$
Since $\sigma(-\mathbf{u}_j^T\mathbf{v}_c)=1-\sigma(\mathbf{u}_j^T\mathbf{v}_c)$, the loss can be rewritten as:
$$J_{\text{neg-sample}}(o, \mathbf{v}_c, \mathbf{U}) = -\log\sigma(\mathbf{u}_o^T\mathbf{v}_c) - \sum_{j=1}^{K}\log\left(1-\sigma(\mathbf{u}_j^T\mathbf{v}_c)\right)$$
To vectorize this, let $\mathbf{W}$ of shape $(K+1) \times d$ hold the parameters involved, $\mathbf{W}=\begin{pmatrix} \mathbf{u}_o^T \\ \mathbf{u}_1^T \\ \vdots \\ \mathbf{u}_K^T \end{pmatrix}$, let $\mathbf{y}=\begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$, and define, with all operations applied element-wise, $\mathbf{L}(\mathbf{W}, \mathbf{v}_c, \mathbf{y}) = \mathbf{y}\cdot\log\sigma(\mathbf{W}\mathbf{v}_c) + (1-\mathbf{y})\cdot\log\left(1-\sigma(\mathbf{W}\mathbf{v}_c)\right)$. The loss above can then be written as:
$$J_{\text{neg-sample}}(\mathbf{L}) = -\sum_{i=1}^{K+1} L_i$$
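A quick toy-sized check (illustrative names, not assignment code) that this vectorized form agrees with the scalar definition:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d, K = 4, 3
v_c = rng.normal(size=d)
u_o = rng.normal(size=d)
U_neg = rng.normal(size=(K, d))            # the K negative-sample vectors u_1..u_K

# Scalar form: -log sigma(u_o.v_c) - sum_j log(1 - sigma(u_j.v_c))
loss_scalar = -np.log(sigmoid(u_o @ v_c)) - np.log(1 - sigmoid(U_neg @ v_c)).sum()

# Vectorized form: W stacks u_o and the negatives, y is one-hot on the first row
W = np.vstack([u_o, U_neg])                # (K+1, d)
y = np.zeros(K + 1); y[0] = 1.0
L = y * np.log(sigmoid(W @ v_c)) + (1 - y) * np.log(1 - sigmoid(W @ v_c))
loss_vec = -L.sum()

print(np.isclose(loss_scalar, loss_vec))   # expect True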
First compute the partial derivative with respect to $\mathbf{v}_c$, using the derivative of the sigmoid, $\sigma'(z)=\sigma(z)\left[1-\sigma(z)\right]$:
$$\begin{aligned} \frac{\partial J}{\partial \mathbf{v}_c} &= -\frac{\partial}{\partial \mathbf{v}_c}\log\sigma(\mathbf{u}_o^T\mathbf{v}_c) - \sum_{j=1}^{K}\frac{\partial}{\partial \mathbf{v}_c}\log\left(1-\sigma(\mathbf{u}_j^T\mathbf{v}_c)\right) \\ &= -\frac{1}{\sigma(\mathbf{u}_o^T\mathbf{v}_c)}\frac{\partial}{\partial \mathbf{v}_c}\sigma(\mathbf{u}_o^T\mathbf{v}_c) + \sum_{j=1}^{K}\frac{1}{1-\sigma(\mathbf{u}_j^T\mathbf{v}_c)}\frac{\partial}{\partial \mathbf{v}_c}\sigma(\mathbf{u}_j^T\mathbf{v}_c) \\ &= -\frac{1}{\sigma(\mathbf{u}_o^T\mathbf{v}_c)}\sigma(\mathbf{u}_o^T\mathbf{v}_c)\left(1-\sigma(\mathbf{u}_o^T\mathbf{v}_c)\right)\mathbf{u}_o + \sum_{j=1}^{K}\frac{1}{1-\sigma(\mathbf{u}_j^T\mathbf{v}_c)}\sigma(\mathbf{u}_j^T\mathbf{v}_c)\left(1-\sigma(\mathbf{u}_j^T\mathbf{v}_c)\right)\mathbf{u}_j \\ &= -\left(1-\sigma(\mathbf{u}_o^T\mathbf{v}_c)\right)\mathbf{u}_o + \sum_{j=1}^{K}\sigma(\mathbf{u}_j^T\mathbf{v}_c)\,\mathbf{u}_j \\ &= \left(\sigma(\mathbf{u}_o^T\mathbf{v}_c)-1\right)\mathbf{u}_o + \sum_{j=1}^{K}\sigma(\mathbf{u}_j^T\mathbf{v}_c)\,\mathbf{u}_j \end{aligned}$$
Next, the derivative with respect to $\mathbf{u}_o$:
$$\begin{aligned} \frac{\partial J}{\partial \mathbf{u}_o} &= -\frac{\partial}{\partial \mathbf{u}_o}\log\sigma(\mathbf{u}_o^T\mathbf{v}_c) \\ &= -\frac{1}{\sigma(\mathbf{u}_o^T\mathbf{v}_c)}\frac{\partial}{\partial \mathbf{u}_o}\sigma(\mathbf{u}_o^T\mathbf{v}_c) \\ &= -\frac{1}{\sigma(\mathbf{u}_o^T\mathbf{v}_c)}\sigma(\mathbf{u}_o^T\mathbf{v}_c)\left(1-\sigma(\mathbf{u}_o^T\mathbf{v}_c)\right)\mathbf{v}_c \\ &= -\left(1-\sigma(\mathbf{u}_o^T\mathbf{v}_c)\right)\mathbf{v}_c \\ &= \left(\sigma(\mathbf{u}_o^T\mathbf{v}_c)-1\right)\mathbf{v}_c \end{aligned}$$
And the derivative with respect to the negative samples $\mathbf{u}_j$:
$$\begin{aligned} \frac{\partial J}{\partial \mathbf{u}_j} &= -\frac{\partial}{\partial \mathbf{u}_j}\log\left(1-\sigma(\mathbf{u}_j^T\mathbf{v}_c)\right) \\ &= \frac{1}{1-\sigma(\mathbf{u}_j^T\mathbf{v}_c)}\frac{\partial}{\partial \mathbf{u}_j}\sigma(\mathbf{u}_j^T\mathbf{v}_c) \\ &= \frac{1}{1-\sigma(\mathbf{u}_j^T\mathbf{v}_c)}\sigma(\mathbf{u}_j^T\mathbf{v}_c)\left(1-\sigma(\mathbf{u}_j^T\mathbf{v}_c)\right)\mathbf{v}_c \\ &= \sigma(\mathbf{u}_j^T\mathbf{v}_c)\,\mathbf{v}_c \end{aligned}$$
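These formulas can also be verified numerically; a minimal self-contained sketch with toy dimensions (illustrative names, not assignment code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(v_c, u_o, U_neg):
    return (-np.log(sigmoid(u_o @ v_c))
            - np.log(1 - sigmoid(U_neg @ v_c)).sum())

rng = np.random.default_rng(1)
d, K = 4, 3
v_c, u_o = rng.normal(size=d), rng.normal(size=d)
U_neg = rng.normal(size=(K, d))            # rows are the negative samples u_1..u_K

# Analytic gradients from the derivations above
s_o, s_neg = sigmoid(u_o @ v_c), sigmoid(U_neg @ v_c)
grad_vc = (s_o - 1) * u_o + U_neg.T @ s_neg
grad_uo = (s_o - 1) * v_c
grad_Uneg = np.outer(s_neg, v_c)

# Finite-difference check for grad_vc (grad_uo and grad_Uneg check the same way)
eps = 1e-6
num_grad_vc = np.array([
    (neg_sample_loss(v_c + eps * np.eye(d)[i], u_o, U_neg)
     - neg_sample_loss(v_c - eps * np.eye(d)[i], u_o, U_neg)) / (2 * eps)
    for i in range(d)])
print(np.allclose(grad_vc, num_grad_vc, atol=1e-5))   # expect True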

3 An implementation of Naive Softmax

import numpy as np

def softmax(x):
    """Compute the softmax function for each row of the input x.

    Arguments:
    x -- A D dimensional vector or N x D dimensional numpy matrix.
    Return:
    x -- softmax of the input, same shape as the original x (x may be modified in-place)
    """
    orig_shape = x.shape

    if len(x.shape) > 1:
        # Matrix: subtract the row-wise max for numerical stability, then normalize each row
        tmp = np.max(x, axis=1)
        x -= tmp.reshape((x.shape[0], 1))
        x = np.exp(x)
        tmp = np.sum(x, axis=1)
        x /= tmp.reshape((x.shape[0], 1))
    else:
        # Vector: subtract the max for numerical stability, then normalize
        tmp = np.max(x)
        x -= tmp
        x = np.exp(x)
        tmp = np.sum(x)
        x /= tmp

    assert x.shape == orig_shape
    return x
    
def naiveSoftmaxLossAndGradient(
    centerWordVec,
    outsideWordIdx,
    outsideVectors,
    dataset
):
    """ Naive Softmax loss & gradient function for word2vec models

    Arguments:
    centerWordVec -- numpy ndarray, center word's embedding
                    (v_c in the pdf handout)
    outsideWordIdx -- integer, the index of the outside word
                    (o of u_o in the pdf handout)
    outsideVectors -- outside vectors (rows of matrix) for all words in vocab
                      (U in the pdf handout)
    dataset -- needed for negative sampling, unused here.

    Return:
    loss -- naive softmax loss
    gradCenterVec -- the gradient with respect to the center word vector
                     (dJ / dv_c)
    gradOutsideVecs -- the gradient with respect to all the outside word vectors
                    (dJ / dU)
    """

    # Forward
    centerWordVec = centerWordVec.reshape((centerWordVec.shape[0], 1))        # (d, 1)
    z = np.dot(outsideVectors, centerWordVec)       # (V, 1) = (V, d)x(d, 1)
    prob = softmax(z.reshape(-1)).reshape(-1, 1)        # (V, 1)
    loss = -np.log(prob[outsideWordIdx]).item()        # negative log-probability of the true outside word

    # Back propagation
    cProb = prob.copy() # (V, 1)
    cProb[outsideWordIdx] -= 1.0
    gradCenterVec = np.dot(outsideVectors.T, cProb)     # (d, 1) = (d, V)x(V, 1)
    gradOutsideVecs = np.dot(cProb, centerWordVec.T)        # (V, d) = (V, 1)x(1, d)
    gradCenterVec = gradCenterVec.flatten()

    return loss, gradCenterVec, gradOutsideVecs
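A quick way to exercise naiveSoftmaxLossAndGradient on toy inputs (shapes chosen here only for illustration; the dataset argument is unused, so None is passed):

import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 5                                 # toy vocabulary size and embedding dimension
outsideVectors = rng.normal(size=(V, d))    # U, one outside vector per row
centerWordVec = rng.normal(size=d)          # v_c

loss, gradCenter, gradOutside = naiveSoftmaxLossAndGradient(
    centerWordVec, outsideWordIdx=3, outsideVectors=outsideVectors, dataset=None)
print(loss, gradCenter.shape, gradOutside.shape)   # scalar loss, (5,), (8, 5)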

4 An implementation of Negative Sampling

def sigmoid(x):
    """
    Compute the sigmoid function for the input here.
    Arguments:
    x -- A scalar or numpy array.
    Return:
    s -- sigmoid(x)
    """

    s = 1/(1 + np.exp(-x))

    return s
    
def negSamplingLossAndGradient(
    centerWordVec,
    outsideWordIdx,
    outsideVectors,
    dataset,
    K=10
):
    """ Negative sampling loss function for word2vec models

    Implement the negative sampling loss and gradients for a centerWordVec
    and a outsideWordIdx word vector as a building block for word2vec
    models. K is the number of negative samples to take.

    Note: The same word may be negatively sampled multiple times. For
    example if an outside word is sampled twice, you shall have to
    double count the gradient with respect to this word. Thrice if
    it was sampled three times, and so forth.

    Arguments/Return Specifications: same as naiveSoftmaxLossAndGradient
    """

    # Negative sampling of words is done for you. Do not modify this if you
    # wish to match the autograder and receive points!
    negSampleWordIndices = getNegativeSamples(outsideWordIdx, dataset, K)
    indices = [outsideWordIdx] + negSampleWordIndices       # len(indices) = K+1

    # Fill up W (outside word + negative samples) using the indices array
    W = np.zeros((len(indices), outsideVectors.shape[1]))       # (K+1, d)
    for i in range(len(indices)):           
        W[i] = outsideVectors[indices[i]]

    # Forward
    centerWordVec = centerWordVec.reshape((centerWordVec.shape[0], 1))      # (d, 1)
    z = np.dot(W, centerWordVec)        # (K+1, 1)
    prob = sigmoid(z)

    # Backprop
    y = np.zeros((prob.shape[0], 1))        # (K+1, 1)
    y[0] = 1  # index 0 is target
    
    loss = -(y * np.log(prob) + (1 - y) * np.log(1 - prob)).sum()

    delta = prob - y
    gradCenterVec = np.dot(W.T, delta)      # (d, 1) = (d, K+1)x(K+1, 1)
    gradW = np.dot(delta, centerWordVec.T)      # (K+1, d) = (K+1, 1)x(1, d)
    gradCenterVec = gradCenterVec.flatten()

    gradOutsideVecs = np.zeros_like(outsideVectors)
    for i in range(len(indices)):
        gradOutsideVecs[indices[i]] += gradW[i]

    return loss, gradCenterVec, gradOutsideVecs
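To run negSamplingLossAndGradient outside the assignment scaffolding, getNegativeSamples and the dataset object need stand-ins; a hypothetical minimal version (the real assignment samples negatives from a unigram-based distribution, which this toy stub does not attempt) could look like:

import numpy as np

class ToyDataset:
    """Hypothetical stand-in for the assignment's dataset object."""
    def __init__(self, vocab_size, seed=0):
        self.vocab_size = vocab_size
        self.rng = np.random.default_rng(seed)

    def sampleTokenIdx(self):
        return int(self.rng.integers(self.vocab_size))

def getNegativeSamples(outsideWordIdx, dataset, K):
    """Draw K sampled word indices that differ from the true outside word."""
    negSampleWordIndices = []
    while len(negSampleWordIndices) < K:
        idx = dataset.sampleTokenIdx()
        if idx != outsideWordIdx:
            negSampleWordIndices.append(idx)
    return negSampleWordIndices

V, d = 8, 5
dataset = ToyDataset(V)
rng = np.random.default_rng(1)
outsideVectors = rng.normal(size=(V, d))
centerWordVec = rng.normal(size=d)

loss, gradCenter, gradOutside = negSamplingLossAndGradient(
    centerWordVec, 3, outsideVectors, dataset, K=5)
print(loss, gradCenter.shape, gradOutside.shape)   # loss, (5,), (8, 5)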

The code above is from an implementation of assignment 2 of CS224n: Natural Language Processing with Deep Learning.
