Code link: the original code has been lightly modified so it runs under Python 3.6.
3 Word2vec (40' + 2 bonus)
(a) Assume the vector $v_c$ for the center word $c$ is given. The word2vec model predicts a word with the softmax function:

$$\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{W}\exp(u_w^T v_c)}$$

where $w$ indexes the $w$-th word and $u_w$ ($w = 1, \ldots, W$) are the "output" word vectors of all words in the vocabulary. Assume the cross-entropy cost is used for this prediction and word $o$ is the expected word (the $o$-th element of the one-hot label vector $y$ equals 1). Derive the gradient with respect to $v_c$.
Hint: $\hat{y}$ is the vector of softmax predictions over all words, $y$ is the expected (one-hot) label vector, and the loss function is:

$$J_{\text{softmax-CE}}(o, v_c, U) = CE(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i)$$

where $U$ is the matrix of all the output vectors (one output vector $u_w$ per row, matching outputVectors in the code below).
Answer:

$$\frac{\partial J}{\partial v_c} = -u_o + \sum_{w=1}^{W}\hat{y}_w u_w$$

i.e.

$$\frac{\partial J}{\partial v_c} = U^T(\hat{y} - y)$$
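For readability, here is the intermediate chain-rule step behind this answer (a standard expansion of the softmax cross-entropy loss, spelled out here):

$$\frac{\partial J}{\partial v_c} = \frac{\partial}{\partial v_c}\left(-u_o^T v_c + \log\sum_{w=1}^{W}\exp(u_w^T v_c)\right) = -u_o + \sum_{w=1}^{W}\frac{\exp(u_w^T v_c)}{\sum_{x=1}^{W}\exp(u_x^T v_c)}\,u_w = -u_o + \sum_{w=1}^{W}\hat{y}_w u_w$$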
(b)对"output"词向量求导
Answer:

$$\frac{\partial J}{\partial u_w} = (\hat{y}_w - y_w)\,v_c$$

i.e., stacking over all words (with one output vector per row of $U$),

$$\frac{\partial J}{\partial U} = (\hat{y} - y)\,v_c^T$$
(c) For the predicted vector $v_c$, we now use the negative-sampling loss, where the expected output word is $o$. Assume $K$ negative samples $w_1, \ldots, w_K$ are drawn, with $o \notin \{w_1, \ldots, w_K\}$. Denoting the output vector of word $o$ by $u_o$, the negative-sampling loss is:

$$J_{\text{neg-sample}}(o, v_c, U) = -\log(\sigma(u_o^T v_c)) - \sum_{k=1}^{K}\log(\sigma(-u_k^T v_c))$$

where $\sigma$ is the sigmoid function.
Repeat the derivations of parts (a) and (b) for this loss.
Answer (using $\sigma'(x) = \sigma(x)(1-\sigma(x))$ and $1 - \sigma(-x) = \sigma(x)$):

$$\frac{\partial J}{\partial v_c} = (\sigma(u_o^T v_c) - 1)\,u_o + \sum_{k=1}^{K}\sigma(u_k^T v_c)\,u_k$$

$$\frac{\partial J}{\partial u_o} = (\sigma(u_o^T v_c) - 1)\,v_c, \qquad \frac{\partial J}{\partial u_k} = \sigma(u_k^T v_c)\,v_c, \quad k = 1, \ldots, K$$
(d) Given the results above and a set of context words $[\text{word}_{c-m}, \ldots, \text{word}_{c-1}, \text{word}_c, \text{word}_{c+1}, \ldots, \text{word}_{c+m}]$ (window size $m$), derive the gradients for all of the word vectors. The "input" and "output" vectors of word $k$ are written $v_k$ and $u_k$ respectively.
Hint: use $F(o, \hat{v})$ (where $o$ is the expected output word) as a placeholder for the cost functions $J_{\text{softmax-CE}}(o, \hat{v}, \ldots)$ and $J_{\text{neg-sample}}(o, \hat{v}, \ldots)$.
For skip-gram, the cost of the window centered at word $c$ is:

$$J_{\text{skip-gram}}(\text{word}_{c-m \ldots c+m}) = \sum_{\substack{-m \le j \le m \\ j \ne 0}} F(w_{c+j}, v_c)$$

For CBOW, we sum the input vectors of the context words to obtain:

$$\hat{v} = \sum_{\substack{-m \le j \le m \\ j \ne 0}} v_{c+j}$$

and the CBOW cost is then:

$$J_{\text{CBOW}}(\text{word}_{c-m \ldots c+m}) = F(w_c, \hat{v})$$

From (a), (b) and (c) we already know the derivatives $\partial F(w, \hat{v})/\partial \hat{v}$ and $\partial F(w, \hat{v})/\partial U$.
For skip-gram, the derivatives of the cost of one context window are:

$$\frac{\partial J_{\text{skip-gram}}}{\partial U} = \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{\partial F(w_{c+j}, v_c)}{\partial U}, \qquad \frac{\partial J_{\text{skip-gram}}}{\partial v_c} = \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{\partial F(w_{c+j}, v_c)}{\partial v_c}, \qquad \frac{\partial J_{\text{skip-gram}}}{\partial v_j} = 0 \;\; (j \ne c)$$

Similarly, for CBOW we have:

$$\frac{\partial J_{\text{CBOW}}}{\partial U} = \frac{\partial F(w_c, \hat{v})}{\partial U}, \qquad \frac{\partial J_{\text{CBOW}}}{\partial v_j} = \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}} \;\; (c-m \le j \le c+m,\ j \ne c), \qquad \frac{\partial J_{\text{CBOW}}}{\partial v_j} = 0 \;\; \text{otherwise}$$
(e) Implement the word2vec models in q3_word2vec.py and train the word vectors with stochastic gradient descent (SGD).
1. First, normalize the rows of a matrix: divide every row by its L2 norm (the square root of the sum of squares of its elements).
def normalizeRows(x):
    ### YOUR CODE HERE
    # Apply the lambda along axis=1, i.e. compute the L2 norm of every row
    # (axis=0 would apply it to the columns instead)
    denom = np.apply_along_axis(lambda row: np.sqrt(row.T.dot(row)), 1, x)
    x /= denom[:, None]  # broadcast: denom (N,) -> (N, 1)
    ### END YOUR CODE
    return x
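A quick sanity check (toy numbers of my own, not part of the starter code): every row of the result should end up with unit L2 norm.

import numpy as np

x = np.array([[3.0, 4.0],
              [1.0, 2.0]])
print(normalizeRows(x))
# [[0.6        0.8       ]
#  [0.4472136  0.89442719]]
print(np.linalg.norm(x, axis=1))  # rows now have norm 1: [1. 1.]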
2. Next, fill in the cost and gradient functions for softmax prediction and negative sampling.
(1) softmax:
Cost function: cross entropy, $J = CE(y, \hat{y}) = -\log(\hat{y}_o)$
Gradient w.r.t. the input vector: $\frac{\partial J}{\partial v_c} = -u_o + \sum_{w}\hat{y}_w u_w$, i.e. $U^T(\hat{y} - y)$
Gradient w.r.t. the output vectors: $\frac{\partial J}{\partial u_w} = (\hat{y}_w - y_w)\,v_c$, i.e. $\frac{\partial J}{\partial U} = (\hat{y} - y)\,v_c^T$
def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
    ### YOUR CODE HERE
    # Forward pass: score every output word against the predicted vector
    vhat = predicted                       # (3,) the predicted ("input") vector v_c
    z = np.dot(outputVectors, vhat)        # (5,) scores u_w^T v_c
    preds = softmax(z)                     # (5,) softmax prediction y_hat over all words
    # Cost: cross entropy, -log(y_hat_o)
    cost = -np.log(preds[target])
    # Gradients
    z = preds.copy()
    z[target] -= 1.0                       # y_hat - y
    grad = np.outer(z, vhat)               # (5, 3) gradients for the "output" word vectors U (outputVectors)
    gradPred = np.dot(outputVectors.T, z)  # (3,)  gradient for the "input" word vector v_c (predicted)
    ### END YOUR CODE
    return cost, gradPred, grad
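A minimal usage sketch (toy shapes of my own, assuming it runs inside q3_word2vec.py so that softmax and normalizeRows are in scope; a vocabulary of 5 words and 3-dimensional vectors, matching the shape comments above):

import numpy as np

np.random.seed(0)
U = normalizeRows(np.random.randn(5, 3))  # outputVectors, one row per word
v_c = np.random.randn(3)                  # predicted (center word) vector
cost, gradPred, grad = softmaxCostAndGradient(v_c, 2, U, dataset=None)  # dataset is unused here
print(cost)            # scalar cross-entropy cost
print(gradPred.shape)  # (3,)  gradient w.r.t. v_c
print(grad.shape)      # (5, 3) gradient w.r.t. U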
(2) negative sampling:
Cost function: the negative-sampling loss from (c), $J = -\log(\sigma(u_o^T v_c)) - \sum_{k=1}^{K}\log(\sigma(-u_k^T v_c))$
Gradient w.r.t. the input vector: $\frac{\partial J}{\partial v_c} = (\sigma(u_o^T v_c) - 1)\,u_o + \sum_{k=1}^{K}\sigma(u_k^T v_c)\,u_k$
Gradient w.r.t. the output vectors: $\frac{\partial J}{\partial u_o} = (\sigma(u_o^T v_c) - 1)\,v_c$ and $\frac{\partial J}{\partial u_k} = \sigma(u_k^T v_c)\,v_c$
def negSamplingCostAndGradient(predicted, target, outputVectors, dataset, K=10):
    # Generate the K negative samples (words), which aren't the expected output
    indices = [target]
    indices.extend(getNegativeSamples(target, dataset, K))
    ### YOUR CODE HERE
    grad = np.zeros(outputVectors.shape)   # (5, 3) gradients for the "output" word vectors U (outputVectors)
    gradPred = np.zeros(predicted.shape)   # (3,)  gradient for the predicted vector v_c (predicted)
    cost = 0
    # Positive (expected) output word o
    z = sigmoid(np.dot(outputVectors[target], predicted))
    cost -= np.log(z)
    grad[target] += predicted * (z - 1.0)          # (3,) gradient for u_o
    gradPred += outputVectors[target] * (z - 1.0)
    # K negative samples
    for k in range(K):
        samp = indices[k + 1]
        z = sigmoid(np.dot(outputVectors[samp], predicted))
        cost -= np.log(1.0 - z)
        grad[samp] += predicted * z                # (3,) gradient for u_k
        gradPred += outputVectors[samp] * z        # (3,) contribution to the gradient for v_c
    ### END YOUR CODE
    return cost, gradPred, grad
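A small finite-difference check of the analytic gradient for the predicted vector (my own sketch, run in the same file so that getNegativeSamples, sigmoid and normalizeRows are in scope; the random seed is re-fixed before every evaluation so the negative samples stay identical):

import random
import numpy as np

# Mock dataset exposing sampleTokenIdx(), which getNegativeSamples uses to draw samples.
class DummyDataset:
    def sampleTokenIdx(self):
        return random.randint(0, 4)

np.random.seed(1)
dataset = DummyDataset()
U = normalizeRows(np.random.randn(5, 3))   # toy output vectors
v_c = np.random.randn(3)                   # toy predicted vector

def cost_at(v):
    random.seed(7)  # same negative samples on every call
    c, _, _ = negSamplingCostAndGradient(v, 0, U, dataset, K=3)
    return c

random.seed(7)
_, gradPred, _ = negSamplingCostAndGradient(v_c, 0, U, dataset, K=3)

eps = 1e-6
numGrad = np.array([(cost_at(v_c + eps * e) - cost_at(v_c - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.max(np.abs(numGrad - gradPred)))  # should be tiny (~1e-8 or smaller)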
3. Finally, implement the cost and gradient function for the skip-gram model.
Cost function: the sum of $F(w_{c+j}, v_c)$ over the context words, as in part (d)
Gradient w.r.t. the output vectors: $\frac{\partial J}{\partial U} = \sum_{j} \frac{\partial F(w_{c+j}, v_c)}{\partial U}$
Gradient w.r.t. the input vectors: $\frac{\partial J}{\partial v_c} = \sum_{j} \frac{\partial F(w_{c+j}, v_c)}{\partial v_c}$; all other input vectors receive zero gradient
def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
             dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)    # (5, 3)
    gradOut = np.zeros(outputVectors.shape)  # (5, 3)
    ### YOUR CODE HERE
    cword_idx = tokens[currentWord]
    vhat = inputVectors[cword_idx]           # the input word vector of the center word v_c
    for j in contextWords:                   # up to 2*C context words
        u_idx = tokens[j]                    # index of the expected output word u_o
        c_cost, c_grad_in, c_grad_out = \
            word2vecCostAndGradient(vhat, u_idx, outputVectors, dataset)
        cost += c_cost
        gradIn[cword_idx] += c_grad_in       # accumulate the gradient for the center input vector
        gradOut += c_grad_out                # (5, 3) accumulate the gradients for the output vectors
    ### END YOUR CODE
    return cost, gradIn, gradOut
4. Implement the cost and gradient function for the CBOW model (see the sketch below).
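A minimal sketch of one possible CBOW implementation, following part (d): sum the context input vectors into $\hat{v}$, evaluate the cost once for the center word, and copy the gradient of $\hat{v}$ to every context word. The function name cbow and the skipgram-like signature are assumptions, not something given above.

def cbow(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
         dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)
    # hat{v}: sum of the input vectors of the context words (part (d))
    context_idxs = [tokens[w] for w in contextWords]
    vhat = np.sum(inputVectors[context_idxs], axis=0)
    # One cost/gradient evaluation for the expected center word
    target = tokens[currentWord]
    cost, gradPred, gradOut = word2vecCostAndGradient(vhat, target, outputVectors, dataset)
    # The gradient w.r.t. hat{v} flows equally into every context word's input vector
    for idx in context_idxs:
        gradIn[idx] += gradPred
    return cost, gradIn, gradOut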
(f) Implement the stochastic gradient descent optimizer.
See q3_sgd.py for the details:
def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
        PRINT_EVERY=10):
    """
    :param f: the function to optimize
    :param x0: the initial point to start SGD from
    :param step: the step size for SGD
    :param iterations: total iterations to run SGD for
    :param postprocessing: postprocessing function for the parameters if necessary.
        In the case of word2vec we will need to normalize the word vectors to have unit length.
    :param useSaved: whether to load previously saved parameters
    :param PRINT_EVERY: specifies how many iterations to output loss
    :return: x -- the parameter value after SGD finishes
    """
    # Anneal learning rate every several iterations
    ANNEAL_EVERY = 20000

    if useSaved:
        start_iter, oldx, state = load_saved_params()
        if start_iter > 0:
            x0 = oldx
            step *= 0.5 ** (start_iter / ANNEAL_EVERY)
        if state:
            random.setstate(state)
    else:
        start_iter = 0

    x = x0
    if not postprocessing:
        postprocessing = lambda x: x

    expcost = None
    for iter in range(start_iter + 1, iterations + 1):
        # Don't forget to apply the postprocessing after every iteration!
        # You might want to print the process every few iterations.
        cost = None
        ### YOUR CODE HERE
        cost, grad = f(x)
        x -= step * grad
        postprocessing(x)
        ### END YOUR CODE
        if iter % PRINT_EVERY == 0:
            if not expcost:
                expcost = cost
            else:
                expcost = .95 * expcost + 0.05 * cost
            print("iter %d: %f" % (iter, expcost))
        if iter % SAVE_PARAMS_EVERY == 0 and useSaved:
            save_params(iter, x)
        if iter % ANNEAL_EVERY == 0:
            step *= 0.5
    return x
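A quick way to sanity-check sgd (a toy example of mine, assuming it runs inside q3_sgd.py where the module-level constant SAVE_PARAMS_EVERY from the starter code is defined): minimize a simple quadratic whose cost and gradient are returned as a tuple, just as the word2vec wrapper does.

import numpy as np

# Toy objective: f(x) = sum(x^2) with gradient 2x; the minimum is at x = 0.
quad = lambda x: (np.sum(x ** 2), 2 * x)

x0 = np.array([1.5, -2.0])
x_opt = sgd(quad, x0, step=0.01, iterations=1000, PRINT_EVERY=200)
print(x_opt)  # both entries should be essentially 0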
A few notes on serialization and deserialization with the pickle module:
(1) pickle.dump(obj, file[, protocol])
Serializes a live Python object and writes it to a file for persistent storage.
-- obj: the object to save to file
-- protocol: the pickle protocol version (0: ASCII; 1: old binary; 2: newer binary; in Python 3 the default is no longer 0 but a binary protocol, protocol 3 in Python 3.6)
-- file: must expose a write() interface (a file opened in binary write mode 'wb', an io.BytesIO object, or any other object implementing write())
(2) pickle.load(file)
Deserializes: reads the pickled byte stream from file and reconstructs the Python object that was previously saved.
file: a file-like object with read() and readline() interfaces.
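A minimal dump/load round trip (the file name and contents here are just an illustration):

import pickle

params = {"iter": 20000, "vectors": [0.1, -0.2, 0.3]}  # any picklable object

with open("saved_params.pkl", "wb") as f:   # binary mode: the binary protocols write bytes
    pickle.dump(params, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("saved_params.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == params)  # True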
(g) Load a real dataset and train word vectors: use the Stanford Sentiment Treebank (SST) as the corpus to train word vectors and run a simple semantic analysis.
See q3_run.py for the details:
import random
import numpy as np
from utils.treebank import StanfordSentiment
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import time

from q3_word2vec import *
from q3_sgd import *

# Load the dataset
# Reset the random seed to make sure that everyone gets the same results
random.seed(314)
dataset = StanfordSentiment()
tokens = dataset.tokens()
nWords = len(tokens)  # 19539 distinct words

# We are going to train 10-dimensional vectors for this assignment
dimVectors = 10

# Context size
C = 5

# Reset the random seed to make sure that everyone gets the same results
random.seed(31415)
np.random.seed(9265)

# Randomly initialize the word vectors and train them with skip-gram
startTime = time.time()
wordVectors = np.concatenate(
    ((np.random.rand(nWords, dimVectors) - 0.5) / dimVectors,
     np.zeros((nWords, dimVectors))),
    axis=0)  # (2*19539, 10) = (39078, 10)
wordVectors = sgd(
    lambda vec: word2vec_sgd_wrapper(skipgram, tokens, vec, dataset, C,
                                     negSamplingCostAndGradient),
    wordVectors, 0.3, 40000, None, True, PRINT_EVERY=10)
# Note that normalization is not called here. This is not a bug,
# normalizing during training loses the notion of length.
print("sanity check: cost at convergence should be around or below 10")
print("training took %d seconds" % (time.time() - startTime))

# Concatenate the input and output word vectors
wordVectors = np.concatenate(
    (wordVectors[:nWords, :], wordVectors[nWords:, :]),
    axis=0)
# wordVectors = wordVectors[:nWords, :] + wordVectors[nWords:, :]

# Visualization: use SVD to project the 10-dimensional vectors of 25 selected
# words down to 2 dimensions so they can be plotted in the plane
visualizeWords = [
    "the", "a", "an", ",", ".", "?", "!", "``", "''", "--",
    "good", "great", "cool", "brilliant", "wonderful", "well", "amazing",
    "worth", "sweet", "enjoyable", "boring", "bad", "waste", "dumb",
    "annoying"]  # 25 words
visualizeIdx = [tokens[word] for word in visualizeWords]
visualizeVecs = wordVectors[visualizeIdx, :]
temp = (visualizeVecs - np.mean(visualizeVecs, axis=0))  # (25, 10)
covariance = 1.0 / len(visualizeIdx) * temp.T.dot(temp)
U, S, V = np.linalg.svd(covariance)  # U-(10, 10) S-(10,) V-(10, 10)
coord = temp.dot(U[:, 0:2])  # (25, 2)

for i in range(len(visualizeWords)):
    plt.text(coord[i, 0], coord[i, 1], visualizeWords[i],
             bbox=dict(facecolor='green', alpha=0.1))

plt.xlim((np.min(coord[:, 0]), np.max(coord[:, 0])))
plt.ylim((np.min(coord[:, 1]), np.max(coord[:, 1])))
plt.savefig('q3_word_vectors.png')
Training for 40000 iterations took 20309 seconds in total; the resulting figure is saved as q3_word_vectors.png.