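Since I could not find a translation of this assignment online, I put together a rough one with the help of translation tools. The machine output had many errors, so I rewrote it as I went (when typing the symbols I often mixed up the subscripts i, t, and j). The overall meaning should be correct.
Course homepage: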
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/
Lecture notes:
Original assignment:
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/assignments/a2.pdf
1 Written: Understanding word2vec
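Let's quickly review the word2vec algorithm. The key insight behind word2vec is that "a word is known by the company it keeps". Concretely, suppose we have a center word c and a context window surrounding c. We refer to the words that lie in this context window as "outside words".
For example, in Figure 1 the center word c is "banking". Since the context window size is 2, the outside words are "turning", "into", "crises", and "as".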
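The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution $P(O \mid C)$. Given a specific word o and a specific word c, we want to calculate $P(O = o \mid C = c)$, the probability that word o is an "outside" word for c, i.e. the probability that o falls within the context window of c.
In word2vec, the conditional probability distribution is obtained by taking vector dot products and applying the softmax function:
$$P(O = o \mid C = c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)}$$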
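Here, $u_o$ is the "outside" vector representing outside word o, and $v_c$ is the "center" vector representing center word c. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the "outside" vectors $u_w$, and the columns of $V$ are all the "center" vectors $v_w$.
Both $U$ and $V$ contain one vector for every word w in the vocabulary. Recall that for a single pair of words c and o, the loss is given by:
$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c)$$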
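We can view this loss as the cross-entropy between the true distribution $y$ and the predicted distribution $\hat{y}$. Here, $y$ and $\hat{y}$ are both vectors whose length equals the number of words in the vocabulary. Furthermore, the k-th entry of these vectors gives the conditional probability that the k-th word is an "outside word" for the given c.
The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word o and 0 everywhere else. The predicted distribution $\hat{y}$ is the probability distribution $P(O \mid C = c)$ given by the first equation above.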
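To make the softmax probability and the loss described above concrete, here is a minimal NumPy sketch. The toy sizes, the random values, and the helper name `softmax` are my own assumptions for illustration (the assignment ships its own `softmax` in `utils`); `U` stores the outside vectors as columns, as in the text.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)   # subtracting a constant leaves the result unchanged
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
d, vocab_size = 3, 5                    # toy sizes, for illustration only
U = rng.normal(size=(d, vocab_size))    # column w is the outside vector u_w
v_c = rng.normal(size=d)                # center vector of word c
o = 2                                   # index of the true outside word

y_hat = softmax(U.T @ v_c)              # y_hat[w] = P(O = w | C = c)
loss = -np.log(y_hat[o])                # naive-softmax loss for the pair (c, o)

y = np.zeros(vocab_size)
y[o] = 1.0                              # one-hot true distribution
cross_entropy = -np.sum(y * np.log(y_hat))

print(y_hat.sum(), loss, cross_entropy)
```

Because `y` is one-hot, every term of the cross-entropy sum except the one for o vanishes, so `loss` and `cross_entropy` agree up to rounding.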
1)
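Show that the naive-softmax loss given above is the same as the cross-entropy loss between $y$ and $\hat{y}$; that is, show that:
$$-\sum_{w \in \text{Vocab}} y_w \log(\hat{y}_w) = -\log(\hat{y}_o)$$
Your answer should be one line.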
2)
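Compute the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to $v_c$. Please write your answer in terms of $y$, $\hat{y}$, and $U$. Note that in this course we expect your final answers to follow the shape convention: the partial derivative of any function $f(x)$ with respect to $x$ should have the same shape as $x$. For this subpart, please present your answer in vectorized form. In particular, you may not refer to specific elements of $y$, $\hat{y}$, and $U$ in your final answer (such as $y_1, y_2, \ldots$).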
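One way to sanity-check a candidate answer for this part is to compare it against central differences. The sketch below is not the official solution: it assumes the standard result $\partial J / \partial v_c = U(\hat{y} - y)$ and checks it on random toy data; all names and sizes are illustrative.

```python
import numpy as np

def naive_softmax_loss(v_c, o, U):
    """-log softmax(U^T v_c)[o], with the columns of U as outside vectors."""
    scores = U.T @ v_c
    shifted = scores - scores.max()
    return np.log(np.exp(shifted).sum()) - shifted[o]

rng = np.random.default_rng(1)
d, vocab_size, o = 4, 6, 3              # toy sizes and outside-word index
U = rng.normal(size=(d, vocab_size))
v_c = rng.normal(size=d)

scores = U.T @ v_c
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()
y = np.zeros(vocab_size)
y[o] = 1.0

grad_v_c = U @ (y_hat - y)              # candidate gradient, same shape as v_c

# Central-difference estimate, one coordinate of v_c at a time
eps = 1e-6
numeric = np.zeros(d)
for i in range(d):
    bump = np.zeros(d)
    bump[i] = eps
    numeric[i] = (naive_softmax_loss(v_c + bump, o, U)
                  - naive_softmax_loss(v_c - bump, o, U)) / (2 * eps)

max_err = np.max(np.abs(grad_v_c - numeric))
print("max abs difference:", max_err)
```

The result has shape `(d,)`, matching `v_c`, which is what the shape convention asks for.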
3)
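Compute the partial derivatives of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to each of the "outside" word vectors $u_w$. There are two cases: $w = o$, the true "outside" word vector, and $w \neq o$, for all other words. Please write your answer in terms of $y$, $\hat{y}$, and $v_c$. In this subpart you may also use specific elements of these terms, such as $y_1, y_2, \ldots$.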
4)
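Compute the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to $U$. Please write your answer in terms of:
$$\frac{\partial J(v_c, o, U)}{\partial u_1}, \frac{\partial J(v_c, o, U)}{\partial u_2}, \ldots, \frac{\partial J(v_c, o, U)}{\partial u_{|\text{Vocab}|}}$$
The solution should be one or two lines long.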
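The per-vector gradients and the gradient with respect to the whole matrix can be checked numerically in the same way. The sketch below assumes the standard result that column w of $\partial J / \partial U$ is $(\hat{y}_w - y_w) v_c$, which gives the outer product $v_c (\hat{y} - y)^\top$. It is an illustration on toy data, not the reference solution.

```python
import numpy as np

def naive_softmax_loss(v_c, o, U):
    """-log softmax(U^T v_c)[o], with the columns of U as outside vectors."""
    scores = U.T @ v_c
    shifted = scores - scores.max()
    return np.log(np.exp(shifted).sum()) - shifted[o]

rng = np.random.default_rng(2)
d, vocab_size, o = 3, 5, 1              # toy sizes and outside-word index
U = rng.normal(size=(d, vocab_size))
v_c = rng.normal(size=d)

scores = U.T @ v_c
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()
y = np.zeros(vocab_size)
y[o] = 1.0

grad_U = np.outer(v_c, y_hat - y)       # column w is (y_hat[w] - y[w]) * v_c

# Central-difference estimate, one entry of U at a time
eps = 1e-6
numeric = np.zeros_like(U)
for i in range(d):
    for w in range(vocab_size):
        bump = np.zeros_like(U)
        bump[i, w] = eps
        numeric[i, w] = (naive_softmax_loss(v_c, o, U + bump)
                         - naive_softmax_loss(v_c, o, U - bump)) / (2 * eps)

max_err = np.max(np.abs(grad_U - numeric))
print("max abs difference:", max_err)
```

Note that the starter code stores outside vectors as rows (the transpose of `U`), so the corresponding array there would be the transpose of `grad_U`.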
5)
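The sigmoid function is given by:
$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1}$$
Please compute the derivative of $\sigma(x)$ with respect to x, where x is a scalar. Hint: you may want to write your answer in terms of $\sigma(x)$.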
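A quick numerical check of the well-known identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ can look like the sketch below. The function names and test points are arbitrary choices of mine.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function; works for scalars and numpy arrays."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid, written in terms of sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-2.0, 0.0, 1.5])         # arbitrary test points
eps = 1e-5
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)
max_err = np.max(np.abs(sigmoid_grad(xs) - numeric))
print("max abs difference:", max_err)
```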
6)
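Now we will consider the negative sampling loss, which is an alternative to the naive-softmax loss. Assume that K negative samples (words) are drawn from the vocabulary. For simplicity of notation we refer to them as $w_1, w_2, \ldots, w_K$ and to their outside vectors as $u_1, \ldots, u_K$. For this question, assume that the K negative samples are distinct. In other words, $i \neq j$ implies $w_i \neq w_j$ for $i, j \in \{1, \ldots, K\}$. Note that $o \notin \{w_1, \ldots, w_K\}$. For a center word c and an outside word o, the negative sampling loss function is given by:
$$J_{\text{neg-sample}}(v_c, o, U) = -\log(\sigma(u_o^\top v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^\top v_c))$$
for a sample $w_1, \ldots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.
Please repeat parts 2 and 3, computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$, and $u_k$, where $k \in [1, K]$. After you have done this, describe in one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note that you should be able to use your solution to part 5 to help compute the necessary gradients here.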
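The following sketch evaluates the negative sampling loss and a candidate set of gradients for distinct samples, then checks the gradient with respect to $v_c$ by central differences. The gradient formulas are the standard ones obtained with $\sigma'(x) = \sigma(x)(1 - \sigma(x))$; the function names and the row layout of `U_neg` are assumptions made for this example only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(v_c, u_o, U_neg):
    """U_neg holds one negative-sample outside vector per row, shape (K, d)."""
    return -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-(U_neg @ v_c))))

def neg_sample_grads(v_c, u_o, U_neg):
    pos = sigmoid(u_o @ v_c)             # sigma(u_o . v_c), a scalar
    neg = sigmoid(U_neg @ v_c)           # sigma(u_k . v_c) = 1 - sigma(-u_k . v_c), shape (K,)
    grad_v_c = (pos - 1.0) * u_o + neg @ U_neg
    grad_u_o = (pos - 1.0) * v_c
    grad_U_neg = np.outer(neg, v_c)      # row k is the gradient for u_k
    return grad_v_c, grad_u_o, grad_U_neg

rng = np.random.default_rng(3)
d, K = 4, 5                              # toy sizes
v_c, u_o = rng.normal(size=d), rng.normal(size=d)
U_neg = rng.normal(size=(K, d))

grad_v_c, grad_u_o, grad_U_neg = neg_sample_grads(v_c, u_o, U_neg)

# Central-difference check of the gradient with respect to v_c
eps = 1e-6
numeric = np.zeros(d)
for i in range(d):
    bump = np.zeros(d)
    bump[i] = eps
    numeric[i] = (neg_sample_loss(v_c + bump, u_o, U_neg)
                  - neg_sample_loss(v_c - bump, u_o, U_neg)) / (2 * eps)
max_err = np.max(np.abs(grad_v_c - numeric))
print("max abs difference:", max_err)
```

Only K + 1 outside vectors are touched here, whereas the naive-softmax loss needs a dot product with every vector in the vocabulary to compute its normalizer.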
7)
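Now we will repeat the previous exercise, but without the assumption that the K sampled words are distinct. Assume that K negative samples (words) are drawn from the vocabulary. For simplicity of notation we refer to them as $w_1, w_2, \ldots, w_K$ and to their outside vectors as $u_1, \ldots, u_K$. In this question, you may not assume that the words are distinct. In other words, $w_i = w_j$ may be true when $i \neq j$ is true. Note that $o \notin \{w_1, \ldots, w_K\}$. For a center word c and an outside word o, the negative sampling loss function is given by:
$$J_{\text{neg-sample}}(v_c, o, U) = -\log(\sigma(u_o^\top v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^\top v_c))$$
for a sample $w_1, \ldots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.
Compute the partial derivative of $J_{\text{neg-sample}}$ with respect to a negative sample $u_k$. Please write your answer in terms of the vectors $v_c$ and $u_j$, where $j \in [1, K]$. Hint: break up the sum in the loss function into two sums: a sum over all sampled words equal to $u_k$ and a sum over all sampled words not equal to $u_k$.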
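When a word is sampled more than once, its gradient contributions add up. In NumPy this needs an unbuffered accumulation such as `np.add.at`, because a plain fancy-indexed `+=` writes each repeated row only once. The sketch below uses made-up indices and the row layout of the starter code (one outside vector per row); it illustrates the accumulation and is not the reference solution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
vocab_size, d = 6, 3                         # toy sizes
outside = rng.normal(size=(vocab_size, d))   # rows are outside vectors (U transposed)
v_c = rng.normal(size=d)
neg_idx = [4, 1, 4, 4]                       # hypothetical draw: word 4 sampled three times

coeff = sigmoid(outside[neg_idx] @ v_c)      # sigma(u_k . v_c) for each draw, shape (K,)
per_draw = np.outer(coeff, v_c)              # gradient contribution of each draw

grad_outside = np.zeros_like(outside)
np.add.at(grad_outside, neg_idx, per_draw)   # accumulates over repeated indices

naive = np.zeros_like(outside)
naive[neg_idx] += per_draw                   # buffered: word 4 is counted only once

print(grad_outside[4], naive[4])
```

Here the row for word 4 ends up three times as large with `np.add.at` as with the plain `+=`, matching the "double count, thrice if sampled three times" note in the starter code.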
8)
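Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+m}]$, where m is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:
$$J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) = \sum_{\substack{-m \le j \le m \\ j \neq 0}} J(v_c, w_{t+j}, U)$$
Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and the outside word $w_{t+j}$. $J(v_c, w_{t+j}, U)$ could be $J_{\text{naive-softmax}}$ or $J_{\text{neg-sample}}$, depending on your implementation.
Write down the following three partial derivatives:
(i) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) / \partial U$
(ii) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) / \partial v_c$
(iii) $\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) / \partial v_w$ when $w \neq c$
Write your answers in terms of $\partial J(v_c, w_{t+j}, U) / \partial U$ and $\partial J(v_c, w_{t+j}, U) / \partial v_c$. This is very simple; each solution should be one line.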
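Since the window loss is a plain sum of per-pair losses, its gradients are sums of the per-pair gradients, and center vectors other than $v_c$ receive no gradient. The sketch below illustrates this with the naive-softmax loss as the per-pair term. The window indices, the sizes, and the helper name `pair_loss_and_grads` are made up for the example.

```python
import numpy as np

def pair_loss_and_grads(v_c, o, U):
    """Naive-softmax loss for one (center, outside) pair and its gradients."""
    scores = U.T @ v_c
    y_hat = np.exp(scores - scores.max())
    y_hat /= y_hat.sum()
    y = np.zeros_like(y_hat)
    y[o] = 1.0
    return -np.log(y_hat[o]), U @ (y_hat - y), np.outer(v_c, y_hat - y)

rng = np.random.default_rng(4)
d, vocab_size = 3, 6                     # toy sizes
U = rng.normal(size=(d, vocab_size))     # columns: outside vectors
V = rng.normal(size=(d, vocab_size))     # columns: center vectors
c = 2                                    # index of the center word w_t
window = [0, 5, 1, 5]                    # indices of w_{t+j}, j != 0 (toy values)

total_loss = 0.0
grad_U = np.zeros_like(U)
grad_V = np.zeros_like(V)
for o in window:
    loss, g_v_c, g_U = pair_loss_and_grads(V[:, c], o, U)
    total_loss += loss
    grad_U += g_U
    grad_V[:, c] += g_v_c                # only the column of c receives gradient

print(total_loss)
```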
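Once you're done:
Given that you computed the derivatives of $J(v_c, w_{t+j}, U)$ with respect to all the model parameters $U$ and $V$ in parts 1 to 3, you have now computed the derivatives of the full loss function $J_{\text{skip-gram}}$ with respect to all parameters. You're ready to implement word2vec!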
2 Coding: Implementing word2vec
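I am skipping the translation of the opening part, which recommends setting up a conda environment; roughly, it creates a dedicated environment with scikit-learn and other packages. (In fact, you can skip most of the problem text that follows, since the code comments already say what needs to be written.)
For each method you need to implement, the code comments state approximately how many lines of code our solution has. These numbers are there to guide you. You don't have to stick to them; you can write shorter or longer code as you wish. If you think your implementation is significantly longer than ours, it is a signal that there are some numpy methods you could use to make your code both shorter and faster. for loops in Python take a long time to complete when used over large arrays, so we expect you to use numpy methods. We will be checking the efficiency of your code.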
1)
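We will start by implementing methods in word2vec.py. You can test a particular method by running python word2vec.py m, where m is the method you would like to test. For example, you can test the sigmoid method by running python word2vec.py sigmoid.
(1)
Implement the sigmoid method, which takes in a vector and applies the sigmoid function to it.
(2)
Implement the softmax loss and gradient in the naiveSoftmaxLossAndGradient method.
(3)
Implement the negative sampling loss and gradient in the negSamplingLossAndGradient method.
(4)
Implement the skip-gram model in the skipgram method.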
2)
Complete the implementation of the SGD optimizer in the sgd method of sgd.py. Test your implementation by running python sgd.py.
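The core of SGD is a single update, `x = x - step * grad`. The sketch below strips away the saving, annealing, and printing logic of the starter `sgd` and keeps only that update; `sgd_sketch` is a hypothetical name, and the quadratic test function mirrors the one in `sanity_check`.

```python
import numpy as np

def sgd_sketch(f, x0, step, iterations):
    """Plain SGD: repeatedly move against the gradient returned by f."""
    x = x0
    for _ in range(iterations):
        loss, grad = f(x)        # f returns (loss, gradient), as the sgd docstring describes
        x = x - step * grad      # gradient step
    return x

quad = lambda x: (np.sum(x ** 2), x * 2)   # same quadratic as sanity_check in sgd.py
result = sgd_sketch(quad, 0.5, 0.01, 1000)
print("result:", result)
```

With this quadratic, each step multiplies x by 0.98, so after 1000 steps the value is far below the 1e-6 tolerance used by the sanity check.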
3)
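Show time! Now we are going to load some real data and train word vectors with everything you just implemented! We will use the Stanford Sentiment Treebank (SST) dataset to train word vectors, and then apply them to a simple sentiment analysis task. You will need to fetch the dataset first. To do this, run sh get_datasets.sh. There is no additional code to write for this part; just run python run.py.
Note: The training process may take a long time, depending on the efficiency of your implementation and the compute power of your machine (an efficient implementation takes around one to two hours). Plan accordingly!
After 40,000 iterations, the script will finish and a visualization of your word vectors will appear. It will also be saved as word_vectors.png in your project directory. Include the plot in your homework write-up. Briefly explain, in at most three sentences, what you see in the plot.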
The code is as follows:
word2vec.py
#!/usr/bin/env python
import argparse
import numpy as np
import random
from utils.gradcheck import gradcheck_naive, grad_tests_softmax, grad_tests_negsamp
from utils.utils import normalizeRows, softmax

def sigmoid(x):
    """
    Compute the sigmoid function for the input here.
    Arguments:
    x -- A scalar or numpy array.
    Return:
    s -- sigmoid(x)
    """

    ### YOUR CODE HERE (~1 Line)

    ### END YOUR CODE

    return s

def naiveSoftmaxLossAndGradient(
    centerWordVec,
    outsideWordIdx,
    outsideVectors,
    dataset
):
    """ Naive Softmax loss & gradient function for word2vec models

    Implement the naive softmax loss and gradients between a center word's
    embedding and an outside word's embedding. This will be the building block
    for our word2vec models. For those unfamiliar with numpy notation, note
    that a numpy ndarray with a shape of (x, ) is a one-dimensional array, which
    you can effectively treat as a vector with length x.

    Arguments:
    centerWordVec -- numpy ndarray, center word's embedding
                    in shape (word vector length, )
                    (v_c in the pdf handout)
    outsideWordIdx -- integer, the index of the outside word
                    (o of u_o in the pdf handout)
    outsideVectors -- outside vectors is
                    in shape (num words in vocab, word vector length)
                    for all words in vocab (transpose of U in the pdf handout)
    dataset -- needed for negative sampling, unused here.

    Return:
    loss -- naive softmax loss
    gradCenterVec -- the gradient with respect to the center word vector
                     in shape (word vector length, )
                     (dJ / dv_c in the pdf handout)
    gradOutsideVecs -- the gradient with respect to all the outside word vectors
                    in shape (num words in vocab, word vector length)
                    (dJ / dU)
    """

    ### YOUR CODE HERE (~6-8 Lines)

    ### Please use the provided softmax function (imported earlier in this file)
    ### This numerically stable implementation helps you avoid issues pertaining
    ### to numerical overflow.

    ### END YOUR CODE

    return loss, gradCenterVec, gradOutsideVecs

def getNegativeSamples(outsideWordIdx, dataset, K):
    """ Samples K indexes which are not the outsideWordIdx """

    negSampleWordIndices = [None] * K
    for k in range(K):
        newidx = dataset.sampleTokenIdx()
        while newidx == outsideWordIdx:
            newidx = dataset.sampleTokenIdx()
        negSampleWordIndices[k] = newidx
    return negSampleWordIndices

def negSamplingLossAndGradient(
    centerWordVec,
    outsideWordIdx,
    outsideVectors,
    dataset,
    K=10
):
    """ Negative sampling loss function for word2vec models

    Implement the negative sampling loss and gradients for a centerWordVec
    and an outsideWordIdx word vector as a building block for word2vec
    models. K is the number of negative samples to take.

    Note: The same word may be negatively sampled multiple times. For
    example if an outside word is sampled twice, you shall have to
    double count the gradient with respect to this word. Thrice if
    it was sampled three times, and so forth.

    Arguments/Return Specifications: same as naiveSoftmaxLossAndGradient
    """

    # Negative sampling of words is done for you. Do not modify this if you
    # wish to match the autograder and receive points!
    negSampleWordIndices = getNegativeSamples(outsideWordIdx, dataset, K)
    indices = [outsideWordIdx] + negSampleWordIndices

    ### YOUR CODE HERE (~10 Lines)

    ### Please use your implementation of sigmoid in here.

    ### END YOUR CODE

    return loss, gradCenterVec, gradOutsideVecs

def skipgram(currentCenterWord, windowSize, outsideWords, word2Ind,
             centerWordVectors, outsideVectors, dataset,
             word2vecLossAndGradient=naiveSoftmaxLossAndGradient):
    """ Skip-gram model in word2vec

    Implement the skip-gram model in this function.

    Arguments:
    currentCenterWord -- a string of the current center word
    windowSize -- integer, context window size
    outsideWords -- list of no more than 2*windowSize strings, the outside words
    word2Ind -- a dictionary that maps words to their indices in
              the word vector list
    centerWordVectors -- center word vectors (as rows) is in shape
                        (num words in vocab, word vector length)
                        for all words in vocab (V in pdf handout)
    outsideVectors -- outside vectors is in shape
                        (num words in vocab, word vector length)
                        for all words in vocab (transpose of U in the pdf handout)
    word2vecLossAndGradient -- the loss and gradient function for
                               a prediction vector given the outsideWordIdx
                               word vectors, could be one of the two
                               loss functions you implemented above.

    Return:
    loss -- the loss function value for the skip-gram model
            (J in the pdf handout)
    gradCenterVec -- the gradient with respect to the center word vector
                     in shape (word vector length, )
                     (dJ / dv_c in the pdf handout)
    gradOutsideVecs -- the gradient with respect to all the outside word vectors
                    in shape (num words in vocab, word vector length)
                    (dJ / dU)
    """

    loss = 0.0
    gradCenterVecs = np.zeros(centerWordVectors.shape)
    gradOutsideVectors = np.zeros(outsideVectors.shape)

    ### YOUR CODE HERE (~8 Lines)

    ### END YOUR CODE

    return loss, gradCenterVecs, gradOutsideVectors

#############################################
# Testing functions below. DO NOT MODIFY! #
#############################################
def word2vec_sgd_wrapper(word2vecModel, word2Ind, wordVectors, dataset,
                         windowSize,
                         word2vecLossAndGradient=naiveSoftmaxLossAndGradient):
    batchsize = 50
    loss = 0.0
    grad = np.zeros(wordVectors.shape)
    N = wordVectors.shape[0]
    centerWordVectors = wordVectors[:int(N/2),:]
    outsideVectors = wordVectors[int(N/2):,:]
    for i in range(batchsize):
        windowSize1 = random.randint(1, windowSize)
        centerWord, context = dataset.getRandomContext(windowSize1)

        c, gin, gout = word2vecModel(
            centerWord, windowSize1, context, word2Ind, centerWordVectors,
            outsideVectors, dataset, word2vecLossAndGradient
        )
        loss += c / batchsize
        grad[:int(N/2), :] += gin / batchsize
        grad[int(N/2):, :] += gout / batchsize

    return loss, grad

def test_sigmoid():
    """ Test sigmoid function """
    print("=== Sanity check for sigmoid ===")
    assert sigmoid(0) == 0.5
    assert np.allclose(sigmoid(np.array([0])), np.array([0.5]))
    assert np.allclose(sigmoid(np.array([1,2,3])), np.array([0.73105858, 0.88079708, 0.95257413]))
    print("Tests for sigmoid passed!")

def getDummyObjects():
    """ Helper method for naiveSoftmaxLossAndGradient and negSamplingLossAndGradient tests """

    def dummySampleTokenIdx():
        return random.randint(0, 4)

    def getRandomContext(C):
        tokens = ["a", "b", "c", "d", "e"]
        return tokens[random.randint(0,4)], \
            [tokens[random.randint(0,4)] for i in range(2*C)]

    dataset = type('dummy', (), {})()
    dataset.sampleTokenIdx = dummySampleTokenIdx
    dataset.getRandomContext = getRandomContext

    random.seed(31415)
    np.random.seed(9265)
    dummy_vectors = normalizeRows(np.random.randn(10,3))
    dummy_tokens = dict([("a",0), ("b",1), ("c",2),("d",3),("e",4)])

    return dataset, dummy_vectors, dummy_tokens

def test_naiveSoftmaxLossAndGradient():
    """ Test naiveSoftmaxLossAndGradient """
    dataset, dummy_vectors, dummy_tokens = getDummyObjects()

    print("==== Gradient check for naiveSoftmaxLossAndGradient ====")
    def temp(vec):
        loss, gradCenterVec, gradOutsideVecs = naiveSoftmaxLossAndGradient(vec, 1, dummy_vectors, dataset)
        return loss, gradCenterVec
    gradcheck_naive(temp, np.random.randn(3), "naiveSoftmaxLossAndGradient gradCenterVec")

    centerVec = np.random.randn(3)
    def temp(vec):
        loss, gradCenterVec, gradOutsideVecs = naiveSoftmaxLossAndGradient(centerVec, 1, vec, dataset)
        return loss, gradOutsideVecs
    gradcheck_naive(temp, dummy_vectors, "naiveSoftmaxLossAndGradient gradOutsideVecs")

def test_negSamplingLossAndGradient():
    """ Test negSamplingLossAndGradient """
    dataset, dummy_vectors, dummy_tokens = getDummyObjects()

    print("==== Gradient check for negSamplingLossAndGradient ====")
    def temp(vec):
        loss, gradCenterVec, gradOutsideVecs = negSamplingLossAndGradient(vec, 1, dummy_vectors, dataset)
        return loss, gradCenterVec
    gradcheck_naive(temp, np.random.randn(3), "negSamplingLossAndGradient gradCenterVec")

    centerVec = np.random.randn(3)
    def temp(vec):
        loss, gradCenterVec, gradOutsideVecs = negSamplingLossAndGradient(centerVec, 1, vec, dataset)
        return loss, gradOutsideVecs
    gradcheck_naive(temp, dummy_vectors, "negSamplingLossAndGradient gradOutsideVecs")

def test_skipgram():
    """ Test skip-gram with naiveSoftmaxLossAndGradient """
    dataset, dummy_vectors, dummy_tokens = getDummyObjects()

    print("==== Gradient check for skip-gram with naiveSoftmaxLossAndGradient ====")
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
        skipgram, dummy_tokens, vec, dataset, 5, naiveSoftmaxLossAndGradient),
        dummy_vectors, "naiveSoftmaxLossAndGradient Gradient")
    grad_tests_softmax(skipgram, dummy_tokens, dummy_vectors, dataset)

    print("==== Gradient check for skip-gram with negSamplingLossAndGradient ====")
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
        skipgram, dummy_tokens, vec, dataset, 5, negSamplingLossAndGradient),
        dummy_vectors, "negSamplingLossAndGradient Gradient")
    grad_tests_negsamp(skipgram, dummy_tokens, dummy_vectors, dataset, negSamplingLossAndGradient)

def test_word2vec():
    """ Test the two word2vec implementations, before running on Stanford Sentiment Treebank """
    test_sigmoid()
    test_naiveSoftmaxLossAndGradient()
    test_negSamplingLossAndGradient()
    test_skipgram()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Test your implementations.')
    parser.add_argument('function', nargs='?', type=str, default='all',
                        help='Name of the function you would like to test.')

    args = parser.parse_args()
    if args.function == 'sigmoid':
        test_sigmoid()
    elif args.function == 'naiveSoftmaxLossAndGradient':
        test_naiveSoftmaxLossAndGradient()
    elif args.function == 'negSamplingLossAndGradient':
        test_negSamplingLossAndGradient()
    elif args.function == 'skipgram':
        test_skipgram()
    elif args.function == 'all':
        test_word2vec()
sgd.py
#!/usr/bin/env python
# Save parameters every a few SGD iterations as fail-safe
SAVE_PARAMS_EVERY = 5000
import pickle
import glob
import random
import numpy as np
import os.path as op

def load_saved_params():
    """
    A helper function that loads previously saved parameters and resets
    iteration start.
    """
    st = 0
    for f in glob.glob("saved_params_*.npy"):
        iter = int(op.splitext(op.basename(f))[0].split("_")[2])
        if (iter > st):
            st = iter

    if st > 0:
        params_file = "saved_params_%d.npy" % st
        state_file = "saved_state_%d.pickle" % st
        params = np.load(params_file)
        with open(state_file, "rb") as f:
            state = pickle.load(f)
        return st, params, state
    else:
        return st, None, None


def save_params(iter, params):
    params_file = "saved_params_%d.npy" % iter
    np.save(params_file, params)
    with open("saved_state_%d.pickle" % iter, "wb") as f:
        pickle.dump(random.getstate(), f)

def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
        PRINT_EVERY=10):
    """ Stochastic Gradient Descent

    Implement the stochastic gradient descent method in this function.

    Arguments:
    f -- the function to optimize, it should take a single
         argument and yield two outputs, a loss and the gradient
         with respect to the arguments
    x0 -- the initial point to start SGD from
    step -- the step size for SGD
    iterations -- total iterations to run SGD for
    postprocessing -- postprocessing function for the parameters
                      if necessary. In the case of word2vec we will need to
                      normalize the word vectors to have unit length.
    PRINT_EVERY -- specifies how many iterations to output loss

    Return:
    x -- the parameter value after SGD finishes
    """

    # Anneal learning rate every several iterations
    ANNEAL_EVERY = 20000

    if useSaved:
        start_iter, oldx, state = load_saved_params()
        if start_iter > 0:
            x0 = oldx
            step *= 0.5 ** (start_iter / ANNEAL_EVERY)

        if state:
            random.setstate(state)
    else:
        start_iter = 0

    x = x0

    if not postprocessing:
        postprocessing = lambda x: x

    exploss = None

    for iter in range(start_iter + 1, iterations + 1):
        # You might want to print the progress every few iterations.

        loss = None
        ### YOUR CODE HERE (~2 lines)

        ### END YOUR CODE

        x = postprocessing(x)
        if iter % PRINT_EVERY == 0:
            if not exploss:
                exploss = loss
            else:
                exploss = .95 * exploss + .05 * loss
            print("iter %d: %f" % (iter, exploss))

        if iter % SAVE_PARAMS_EVERY == 0 and useSaved:
            save_params(iter, x)

        if iter % ANNEAL_EVERY == 0:
            step *= 0.5

    return x

def sanity_check():
    quad = lambda x: (np.sum(x ** 2), x * 2)

    print("Running sanity checks...")
    t1 = sgd(quad, 0.5, 0.01, 1000, PRINT_EVERY=100)
    print("test 1 result:", t1)
    assert abs(t1) <= 1e-6

    t2 = sgd(quad, 0.0, 0.01, 1000, PRINT_EVERY=100)
    print("test 2 result:", t2)
    assert abs(t2) <= 1e-6

    t3 = sgd(quad, -1.5, 0.01, 1000, PRINT_EVERY=100)
    print("test 3 result:", t3)
    assert abs(t3) <= 1e-6

    print("-" * 40)
    print("ALL TESTS PASSED")
    print("-" * 40)


if __name__ == "__main__":
    sanity_check()
run.py
#!/usr/bin/env python
import random
import numpy as np
from utils.treebank import StanfordSentiment
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import time
from word2vec import *
from sgd import *
# Check Python Version
import sys
assert sys.version_info[0] == 3
assert sys.version_info[1] >= 5
# Reset the random seed to make sure that everyone gets the same results
random.seed(314)
dataset = StanfordSentiment()
tokens = dataset.tokens()
nWords = len(tokens)
# We are going to train 10-dimensional vectors for this assignment
dimVectors = 10
# Context size
C = 5
# Reset the random seed to make sure that everyone gets the same results
random.seed(31415)
np.random.seed(9265)
startTime=time.time()
wordVectors = np.concatenate(
    ((np.random.rand(nWords, dimVectors) - 0.5) /
       dimVectors, np.zeros((nWords, dimVectors))),
    axis=0)
wordVectors = sgd(
    lambda vec: word2vec_sgd_wrapper(skipgram, tokens, vec, dataset, C,
        negSamplingLossAndGradient),
    wordVectors, 0.3, 40000, None, True, PRINT_EVERY=10)
# Note that normalization is not called here. This is not a bug,
# normalizing during training loses the notion of length.
print("sanity check: cost at convergence should be around or below 10")
print("training took %d seconds" % (time.time() - startTime))
# concatenate the input and output word vectors
wordVectors = np.concatenate(
    (wordVectors[:nWords,:], wordVectors[nWords:,:]),
    axis=0)

visualizeWords = [
    "great", "cool", "brilliant", "wonderful", "well", "amazing",
    "worth", "sweet", "enjoyable", "boring", "bad", "dumb",
    "annoying", "female", "male", "queen", "king", "man", "woman", "rain", "snow",
    "hail", "coffee", "tea"]
visualizeIdx = [tokens[word] for word in visualizeWords]
visualizeVecs = wordVectors[visualizeIdx, :]
temp = (visualizeVecs - np.mean(visualizeVecs, axis=0))
covariance = 1.0 / len(visualizeIdx) * temp.T.dot(temp)
U,S,V = np.linalg.svd(covariance)
coord = temp.dot(U[:,0:2])
for i in range(len(visualizeWords)):
    plt.text(coord[i,0], coord[i,1], visualizeWords[i],
        bbox=dict(facecolor='green', alpha=0.1))
plt.xlim((np.min(coord[:,0]), np.max(coord[:,0])))
plt.ylim((np.min(coord[:,1]), np.max(coord[:,1])))
plt.savefig('word_vectors.png')