CS224n - Assignment1 - Word2vec

Code link: a few small changes were made to the original starter code so that it runs under Python 3.6.

3 Word2vec (40 points + 2 bonus)

(a) Suppose the vector v_c for the center word c is given; the word2vec model predicts a word with the softmax function:

{\widehat y}_o=p(o\vert c)=\frac{exp(u_o^Tv_c)}{\sum_{\omega=1}^Wexp(u_\omega^Tv_c)}

where \omega indexes the words of the vocabulary and u_\omega (\omega=1,\dots,W) are the "output" word vectors for all words in the vocabulary. Assume the cross-entropy cost is applied to this prediction and o is the expected word (the o-th element of the one-hot label vector is 1); derive the gradient with respect to v_c.

Hint: \widehat y is the vector of softmax predictions over all words, y is the expected (one-hot) word vector, and the loss function is:

J_{softmax-CE}(o,v_c,U)=CE(y,\widehat y)=-\sum_iy_i\log({\widehat y}_i)

where U=\lbrack u_1,u_2,\cdots,u_W\rbrack is the matrix composed of all the output vectors.

Answer: \frac{\partial J}{\partial v_c}=U(\widehat y-y)

\frac{\partial J}{\partial v_c}=-u_o+\sum_{\omega=1}^W{\widehat y}_\omega u_\omega
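This follows by writing out the loss with the softmax definition: since y is one-hot at position o,

J=-\log{\widehat y}_o=-u_o^Tv_c+\log\sum_{\omega=1}^Wexp(u_\omega^Tv_c)

and differentiating the log-sum-exp term recovers the softmax probabilities, giving exactly -u_o+\sum_\omega{\widehat y}_\omega u_\omega=U(\widehat y-y).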

(b) Derive the gradients with respect to the "output" word vectors u_\omega.

Answer: \frac{\partial J}{\partial U}=v_c(\widehat y-y)^T

\frac{\partial J}{\partial u_\omega}=\left\{\begin{array}{lc}({\widehat y}_\omega-1)v_c,&\omega=o\\{\widehat y}_\omega v_c,&otherwise\end{array}\right.

(c) For the predicted vector v_c, we now use the negative sampling loss, where o is the expected output word. Assume K negative samples (words) are drawn, indexed 1,\cdots,K (with o\not\in\{1,\cdots,K\}). Given the output vector u_o of word o, the negative sampling loss is:

J_{neg-sample}(o,v_c,U)=-\log(\sigma(u_o^Tv_c))-\sum_{k=1}^K\log(\sigma(-u_k^Tv_c))

Repeat the derivations of parts (a) and (b) for this loss.

Answer:

\frac{\partial J}{\partial v_c}=(\sigma(u_o^Tv_c)-1)u_o-\sum_{k=1}^K(\sigma(-u_k^Tv_c)-1)u_k

\frac{\partial J}{\partial u_o}=(\sigma(u_o^Tv_c)-1)v_c

\frac{\partial J}{\partial u_k}=-(\sigma(-u_k^Tv_c)-1)v_c,\;\mathrm{for}\;\mathrm{all}\;k=1,\dots,K
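These follow from the derivative \sigma'(z)=\sigma(z)(1-\sigma(z)) together with the identity 1-\sigma(-z)=\sigma(z); for instance, for a negative sample k,

\frac{\partial J}{\partial u_k}=-\frac{\partial}{\partial u_k}\log(\sigma(-u_k^Tv_c))=(1-\sigma(-u_k^Tv_c))v_c=\sigma(u_k^Tv_c)v_c=-(\sigma(-u_k^Tv_c)-1)v_c

which matches the gradient accumulated for the sampled words in the code further below.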

(d) Given the results above and a context window \lbrack word_{c-m},\dots,word_{c-1},word_c,word_{c+1},\dots,word_{c+m}\rbrack, derive the gradients for all of the word vectors. The "input" and "output" word vectors for word k are denoted v_k and u_k respectively.

Hint: let F(o,v_c) (where o is the expected output word) stand for the cost function, which may be either J_{softmax-CE}(o,v_c,\dots) or J_{neg-sample}(o,v_c,\dots).

For skip-gram, the cost for the context window word_{c-m...c+m} is:

J_{skip-gram}(word_{c-m\dots c+m})=\sum_{-m\leq j\leq m,j\neq0}F(\omega_{c+j},v_c)

For CBOW, we sum the input word vectors over the context to obtain:

\widehat v=\sum_{-m\leq j\leq m,j\neq0}v_{c+j}

The CBOW cost is then: J_{CBOW}(word_{c-m\dots c+m})=F(\omega_c,\widehat v)

From (a), (b) and (c) we already know the derivatives \frac{\partial F(\omega_i,\widehat v)}{\partial U} and \frac{\partial F(\omega_i,\widehat v)}{\partial\widehat v}.

For skip-gram, the gradients of the cost over one context window are:

\frac{\displaystyle\partial J_{skip-gram}(word_{c-m\dots c+m})}{\displaystyle\partial U}=\sum_{-m\leqslant j\leqslant m,j\neq0}\frac{\displaystyle\partial F(\omega_{c+j},v_c)}{\displaystyle\partial U}

\frac{\displaystyle\partial J_{skip-gram}(word_{c-m\dots c+m})}{\displaystyle\partial v_c}=\sum_{-m\leqslant j\leqslant m,j\neq0}\frac{\displaystyle\partial F(\omega_{c+j},v_c)}{\displaystyle\partial v_c}

\frac{\displaystyle\partial J_{skip-gram}(word_{c-m\dots c+m})}{\displaystyle\partial v_j}=0,\;\mathrm{for}\;\mathrm{all}\;j\neq c

Similarly, for CBOW we have:

\frac{\displaystyle\partial J_{CBOW}(word_{c-m\dots c+m})}{\displaystyle\partial U}=\frac{\displaystyle\partial F(\omega_c,\widehat v)}{\displaystyle\partial U},\;(\mathrm{using}\;\mathrm{the}\;\mathrm{definition}\;\mathrm{of}\;\widehat v\;\mathrm{in}\;\mathrm{the}\;\mathrm{problem})

\frac{\displaystyle\partial J_{CBOW}(word_{c-m\dots c+m})}{\displaystyle\partial v_j}=\frac{\displaystyle\partial F(\omega_c,\widehat v)}{\displaystyle\partial\widehat v}, \mathrm{for}\;\mathrm{all}\;j\in\{c-m,\dots,c-1,c+1,\dots,c+m\}

\frac{\displaystyle\partial J_{CBOW}(word_{c-m\dots c+m})}{\displaystyle\partial v_j}=0,\mathrm{for}\;\mathrm{all}\;j\not\in\{c-m,\dots,c-1,c+1,\dots,c+m\}

(e) Implement the word2vec models in q3_word2vec.py and train the word vectors with stochastic gradient descent (SGD).

1. First, normalize the row vectors of a matrix:

Each row is divided by its L2 norm, i.e. the square root of the sum of the squares of its elements.

def normalizeRows(x):
    ### YOUR CODE HERE
    # L2 norm of each row: apply the lambda along axis=1 (per row); axis=0 would apply it per column
    denom=np.apply_along_axis(lambda x:np.sqrt(x.T.dot(x)),1,x)
    x/=denom[:,None] # broadcast denom from shape (N,) to (N,1) so each row is divided by its own norm
    ### END YOUR CODE
    return x
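A quick check on a small array (assuming numpy is imported as np; each output row has unit L2 norm):

import numpy as np

x=np.array([[3.0,4.0],
            [1.0,2.0]])
print(normalizeRows(x))
# [[0.6        0.8       ]
#  [0.4472136  0.89442719]]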

2. Then fill in the cost and gradient functions for the softmax and negative-sampling losses.

h=sigmoid(xW_1+b_1),\;\hat y=sigmoid(hW_2+b_2)

(1) softmax:

Cost function (cross entropy): J_{softmax-CE}(o,v_c,U)=CE(y,\widehat y)=-\sum_iy_i\log({\widehat y}_i)

Gradient with respect to the input vector: \frac{\partial J}{\partial v_c}=U(\widehat y-y), i.e. \frac{\partial J}{\partial v_c}=-u_o+\sum_{\omega=1}^W{\widehat y}_\omega u_\omega

Gradient with respect to the output vectors: \frac{\partial J}{\partial U}=v_c(\widehat y-y)^T, i.e. \frac{\partial J}{\partial u_\omega}=\left\{\begin{array}{lc}({\widehat y}_\omega-1)v_c,&\omega=o\\{\widehat y}_\omega v_c,&otherwise\end{array}\right.

def softmaxCostAndGradient(predicted,target,outputVectors,dataset):
    ### YOUR CODE HERE

    ## Gradient for $\hat{v}$
    # Calculate the prediction:
    vhat=predicted # (3,)
    z=np.dot(outputVectors,vhat) # (5,)
    preds=softmax(z) # the column vector of the softmax prediction of words

    # Calculate the cost: the cross entropy function
    cost=-np.log(preds[target])

    # Gradients
    z=preds.copy()
    z[target]-=1.0

    grad=np.outer(z,vhat) # (5, 3) gradients for the "output" word vectors U (outputVectors)
    gradPred=np.dot(outputVectors.T,z) # (3,) gradients for the "input" word vectors v (predicted)
    ### END YOUR CODE
    return cost,gradPred,grad
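As an independent sanity check (not part of the assignment, which ships its own gradcheck_naive), the analytic gradient U^T(\widehat y-y) for v_c can be compared against a central-difference estimate; _softmax below is a small stand-in for the q1 softmax, and the shapes mirror the comments in the function above:

import numpy as np

def _softmax(z):
    # numerically stable softmax for a 1-D array
    z=z-np.max(z)
    e=np.exp(z)
    return e/e.sum()

np.random.seed(0)
vhat=np.random.randn(3)   # center word vector v_c, shape (3,)
U=np.random.randn(5,3)    # outputVectors, shape (5, 3)
target=2

def cost_fn(v):
    return -np.log(_softmax(U.dot(v))[target])

yhat=_softmax(U.dot(vhat))
y=np.zeros(5); y[target]=1.0
gradPred=U.T.dot(yhat-y)  # analytic gradient for v_c

eps=1e-6
numgrad=np.array([(cost_fn(vhat+eps*np.eye(3)[i])-cost_fn(vhat-eps*np.eye(3)[i]))/(2*eps)
                  for i in range(3)])
print(np.allclose(gradPred,numgrad,atol=1e-5)) # expect True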

(2) negative sampling:

Cost function (negative sampling): J_{neg-sample}(o,v_c,U)=-\log(\sigma(u_o^Tv_c))-\sum_{k=1}^K\log(\sigma(-u_k^Tv_c))

Gradient with respect to the input vector: \frac{\partial J}{\partial v_c}=(\sigma(u_o^Tv_c)-1)u_o-\sum_{k=1}^K(\sigma(-u_k^Tv_c)-1)u_k

Gradient with respect to the output vectors: \frac{\partial J}{\partial u_o}=(\sigma(u_o^Tv_c)-1)v_c

\frac{\partial J}{\partial u_k}=-(\sigma(-u_k^Tv_c)-1)v_c,\;\mathrm{for}\;\mathrm{all}\;k=1,\dots,K

def negSamplingCostAndGradient(predicted,target,outputVectors,dataset,K=10):

    # Generate the K negative samples (words), which aren't the expected output
    indices=[target]
    indices.extend(getNegativeSamples(target,dataset,K))

    ### YOUR CODE HERE
    grad=np.zeros(outputVectors.shape) # (5,3) the gradients for the "output" word vectors U (outputVectors)
    gradPred=np.zeros(predicted.shape) # (3,) the gradient for the predicted vector v_c (predicted)
    cost=0
    z=sigmoid(np.dot(outputVectors[target],predicted))
    cost-=np.log(z)
    grad[target]+=predicted*(z-1.0) # (3,) the gradients for u_o
    gradPred+=outputVectors[target]*(z-1.0)

    for k in range(K):
        samp=indices[k+1]
        z=sigmoid(np.dot(outputVectors[samp],predicted))
        cost-=np.log(1.0-z)
        grad[samp]+=predicted*z # (3,) the gradients for u_k
        gradPred+=outputVectors[samp]*z # (3,) the gradients for v_c
    ### END YOUR CODE

    return cost,gradPred,grad

3. Finally, implement the cost and gradient function for the skip-gram model.

Cost function: J_{skip-gram}(word_{c-m\dots c+m})=\sum_{-m\leq j\leq m,j\neq0}F(\omega_{c+j},v_c)

Gradient with respect to the output vectors: \frac{\displaystyle\partial J_{skip-gram}(word_{c-m\dots c+m})}{\displaystyle\partial U}=\sum_{-m\leqslant j\leqslant m,j\neq0}\frac{\displaystyle\partial F(\omega_{c+j},v_c)}{\displaystyle\partial U}

Gradient with respect to the input vectors: \frac{\displaystyle\partial J_{skip-gram}(word_{c-m\dots c+m})}{\displaystyle\partial v_c}=\sum_{-m\leqslant j\leqslant m,j\neq0}\frac{\displaystyle\partial F(\omega_{c+j},v_c)}{\displaystyle\partial v_c}

\frac{\displaystyle\partial J_{skip-gram}(word_{c-m\dots c+m})}{\displaystyle\partial v_j}=0,\;\mathrm{for}\;\mathrm{all}\;j\neq c

def skipgram(currentWord,C,contextWords,tokens,inputVectors,outputVectors,
             dataset,word2vecCostAndGradient=softmaxCostAndGradient):
    cost=0.0
    gradIn=np.zeros(inputVectors.shape) # (5,3)
    gradOut=np.zeros(outputVectors.shape) # (5,3)

    ### YOUR CODE HERE
    cword_idx=tokens[currentWord]
    vhat=inputVectors[cword_idx] # the input word vector for the center word_c

    for j in contextWords: # loop over all 2*C context words
        u_idx=tokens[j] # index of the context word, i.e. the expected output o
        c_cost,c_grad_in,c_grad_out=\
            word2vecCostAndGradient(vhat,u_idx,outputVectors,dataset)
        cost+=c_cost
        gradIn[cword_idx]+=c_grad_in # accumulate the (3,) gradient into the center word's row of gradIn
        gradOut+=c_grad_out # (5,3) gradients for the output vectors U
    ### END YOUR CODE

    return cost,gradIn,gradOut
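For reference, a toy call that exercises the interface (a hypothetical 5-word vocabulary with 3-dimensional vectors; dataset can be None here because softmaxCostAndGradient never uses it):

import numpy as np

np.random.seed(1)
tokens={"a":0,"b":1,"c":2,"d":3,"e":4} # hypothetical vocabulary
inputVectors=np.random.randn(5,3)
outputVectors=np.random.randn(5,3)

cost,gradIn,gradOut=skipgram("c",1,["a","b"],tokens,
                             inputVectors,outputVectors,None)
print(cost,gradIn.shape,gradOut.shape) # scalar cost, (5, 3), (5, 3)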

4. Implement the cost and gradient function for the CBOW model.
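The post does not include code for this step; a minimal sketch, assuming the same interface as skipgram above (one possible implementation, not the official solution): sum the context input vectors into \widehat v, predict the center word, and route the gradient \frac{\partial F}{\partial\widehat v} back to every context word.

def cbow(currentWord,C,contextWords,tokens,inputVectors,outputVectors,
         dataset,word2vecCostAndGradient=softmaxCostAndGradient):
    gradIn=np.zeros(inputVectors.shape)

    # \hat v: sum of the input vectors of all context words
    context_idxs=[tokens[w] for w in contextWords]
    vhat=np.sum(inputVectors[context_idxs],axis=0)

    # the expected output is the center word
    target=tokens[currentWord]
    cost,gradVhat,gradOut=word2vecCostAndGradient(vhat,target,outputVectors,dataset)

    # every context word receives the same gradient dF/dvhat
    for idx in context_idxs:
        gradIn[idx]+=gradVhat

    return cost,gradIn,gradOut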

(f) Implement the stochastic gradient descent optimizer.

See q3_sgd.py for details:

def sgd(f,x0,step,iterations,postprocessing=None,useSaved=False,
        PRINT_EVERY=10):
    """
    :param f: the function to optimize
    :param x0: the initial point to start SGD from
    :param step: the step size for SGD
    :param iterations: total iterations to run SGD for
    :param postprocessing: postprocessing function for the parameters if necessary;
    for word2vec we use it to normalize the word vectors to unit length
    :param useSaved: whether to load previously saved parameters and resume from them
    :param PRINT_EVERY: print the (smoothed) loss every this many iterations
    :return: x -- the parameter value after SGD finishes
    """
    # Anneal learning rate every several iterations
    ANNEAL_EVERY=20000
    if useSaved:
        start_iter,oldx,state=load_saved_params()
        if start_iter>0:
            x0=oldx
            step*=0.5**(start_iter//ANNEAL_EVERY) # integer (floor) division so the resumed step matches the annealing schedule below
        if state:
            random.setstate(state)
    else:
        start_iter=0

    x=x0

    if not postprocessing:
        postprocessing=lambda x:x

    expcost=None

    for iter in range(start_iter+1,iterations+1):

        # Don't forget to apply the postprocessing after every iteration!
        # You might want to print the process every few iterations.

        cost=None
        ### YOUR CODE HERE
        cost,grad=f(x)
        x-=step*grad
        x=postprocessing(x) # e.g. re-normalize the word vectors after each update
        ### END YOUR CODE

        if iter%PRINT_EVERY==0:
            if not expcost:
                expcost=cost
            else:
                expcost=.95*expcost+0.05*cost
            print("iter %d: %f" % (iter, expcost))

        if iter % SAVE_PARAMS_EVERY==0 and useSaved:
            save_params(iter,x)

        if iter % ANNEAL_EVERY==0:
            step*=0.5

    return x
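A quick way to sanity-check the optimizer is to minimize a simple quadratic, along the lines of the check shipped with q3_sgd.py (note that the module-level constant SAVE_PARAMS_EVERY defined in that file must be in scope for sgd to run):

import numpy as np

# f(x) = sum(x^2) with gradient 2x; the minimum is at 0
quad=lambda x:(np.sum(x**2),x*2)

t1=sgd(quad,0.5,0.01,1000,PRINT_EVERY=100)
print("result:",t1) # should be very close to 0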

Below is a quick note on serialization and deserialization with the pickle module:

(1) pickle.dump(obj, file[, protocol])

Serializes a Python object from the running program and writes it to a file for persistent storage.

-- obj: the object to save to the file file

-- protocol: the pickle protocol version (0: ASCII; 1: old binary; 2: newer binary; Python 2 defaults to 0, while Python 3 defaults to a binary protocol)

-- file: must provide a write() interface (under Python 3 this means a file opened in binary write mode 'wb', an io.BytesIO object, or any other object exposing such an interface)

(2) pickle.load(file)

Deserialization: reconstructs, from the file, the object saved by a previous run of the program.

Reads the pickled data from file and rebuilds the original Python object.

file: a file-like object with read() and readline() interfaces (opened in binary read mode 'rb' under Python 3).
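A minimal round-trip example (the file name is arbitrary; under Python 3 the file must be opened in binary mode):

import pickle

params={"iter":100,"vectors":[0.1,0.2,0.3]}

with open("params.pkl","wb") as f: # 'wb': pickle writes bytes under Python 3
    pickle.dump(params,f)

with open("params.pkl","rb") as f:
    restored=pickle.load(f)

print(restored==params) # True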

(g) Train word vectors on real data: use the Stanford Sentiment Treebank (SST) as the dataset to train word vectors and then run a simple semantic analysis.

See q3_run.py for details:

import random
import numpy as np
from utils.treebank import StanfordSentiment
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import time

from q3_word2vec import *
from q3_sgd import *

"数据加载"
# Reset the random seed to make sure that everyone gets the same results
random.seed(314)
dataset=StanfordSentiment()
tokens=dataset.tokens()
nWords=len(tokens) # 19539 the number of different words
# We are going to train 10-dimensional vectors for this assignment
dimVectors=10

# Context size
C=5

# Reset the random seed to make sure that everyone gets the same results
random.seed(31415)
np.random.seed(9265)

"随机生成词向量,利用skipgram不断训练"
startTime=time.time()
wordVectors=np.concatenate(
    ((np.random.rand(nWords,dimVectors)-0.5)/
     dimVectors,np.zeros((nWords,dimVectors))),
    axis=0) # (2*19539, 10) = (39078, 10)
wordVectors=sgd(
    lambda vec:word2vec_sgd_wrapper(skipgram,tokens,vec,dataset,C,
        negSamplingCostAndGradient),
    wordVectors,0.3,40000,None,True,PRINT_EVERY=10)
# Note that normalization is not called here. This is not a bug,
# normalizing during training loses the notion of length

print("sanity check: cost at convergence should be around or below 10")
print("training took %d seconds" % (time.time()-startTime))

# concatenate the input and output word vectors
wordVectors=np.concatenate(
    (wordVectors[:nWords,:],wordVectors[nWords:,:]),
    axis=0)
# wordVectors=wordVectors[:nWords,:]+wordVectors[nWords:,:]

"数据可视化:降维,利用奇异值分解将词向量由25维降到2维,从而实现平面上二维可视化"
visualizeWords=[
    "the", "a", "an", ",", ".", "?", "!", "``", "''", "--",
    "good", "great", "cool", "brilliant", "wonderful", "well", "amazing",
    "worth", "sweet", "enjoyable", "boring", "bad", "waste", "dumb",
    "annoying"] # 25

visualizeIdx=[tokens[word] for word in visualizeWords]
visualizeVecs=wordVectors[visualizeIdx,:]
temp=(visualizeVecs-np.mean(visualizeVecs,axis=0)) # 25*10
covariance=1.0/len(visualizeIdx)*temp.T.dot(temp)
U,S,V=np.linalg.svd(covariance) # U-(10, 10) S-(10,) V-(10, 10)
coord=temp.dot(U[:,0:2]) # (25, 2)

for i in range(len(visualizeWords)):
    plt.text(coord[i,0],coord[i,1],visualizeWords[i],
        bbox=dict(facecolor='green',alpha=0.1))

plt.xlim((np.min(coord[:,0]),np.max(coord[:,0])))
plt.ylim((np.min(coord[:,1]),np.max(coord[:,1])))

plt.savefig('q3_word_vectors.png')

Training ran for 40000 iterations and took 20309 seconds in total; the resulting figure is saved as q3_word_vectors.png.
