Word vectors are often used as a basic building block for downstream NLP tasks such as question answering, text generation, and machine translation. Here we will study two kinds of word vectors: those built from a word co-occurrence matrix, and those from GloVe.
1. Count-Based Word Vectors
Most approaches to building word vectors rest on the following idea: similar words appear in similar contexts. Consequently, similar words tend to co-occur with a shared set of context words, and by examining these contexts we can construct word embeddings.
A word co-occurrence matrix counts, for each pair of words, how many times they appear together within a given window size, as shown in the figure:
In NLP, we usually mark the beginning and end of each document in the corpus with the tokens <START> and <END>. These tokens are included in the co-occurrence matrix as well, wrapping each document.
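To make the counting concrete, here is a minimal sketch of window-based co-occurrence counting on a toy one-sentence corpus; the corpus, window size, and variable names are illustrative only, not part of the assignment:

```python
from collections import Counter

# Toy corpus: one document, wrapped with <START>/<END> tokens.
corpus = [["<START>", "all", "that", "glitters", "is", "not", "gold", "<END>"]]
window_size = 1

counts = Counter()
for sentence in corpus:
    for i, word in enumerate(sentence):
        # Count every word within `window_size` positions on either side.
        left = sentence[max(0, i - window_size):i]
        right = sentence[i + 1:i + window_size + 1]
        for context in left + right:
            counts[(word, context)] += 1

print(counts[("glitters", "that")])  # → 1
```

Note that the counts are symmetric: ("glitters", "that") and ("that", "glitters") both get incremented, which is why the full matrix below is symmetric too.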
The rows (or columns) of this matrix provide word vectors based on word-word co-occurrence, but their dimensionality equals the number of distinct words in the corpus. To reduce the dimensionality, we apply Singular Value Decomposition (SVD), which, much like PCA, projects the vectors onto the k most significant dimensions.
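To see what "keeping the top k dimensions" means, here is an illustrative sketch using NumPy's full SVD on a tiny made-up matrix (the assignment itself uses scikit-learn's TruncatedSVD; the matrix values here are arbitrary):

```python
import numpy as np

# A small symmetric matrix standing in for a word co-occurrence matrix.
M = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 1.0]])
k = 2

# Full SVD: M = U @ diag(s) @ Vt, with singular values s sorted descending.
U, s, Vt = np.linalg.svd(M)

# Keep only the k largest singular values/vectors: each row of M_reduced
# is now a k-dimensional word vector.
M_reduced = U[:, :k] * s[:k]

print(M_reduced.shape)  # → (3, 2)
```

Truncating at k gives the best rank-k approximation of M in the least-squares sense, which is the justification for treating these k dimensions as the "most important" ones.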
Having just finished the postgraduate entrance exam, with the basics of linear algebra still fresh in my mind, I went to the book 《矩阵论》 (Matrix Theory) to study SVD, i.e. singular value decomposition. My takeaway: in the spirit of "don't build a tower on shifting sand", I worked through the mathematics behind SVD, but to be honest I still cannot see why, once we get to machine learning, we may simply pick a value of k as the dimensionality of the decomposed matrix. A long road ahead!
Task 1
Implement the distinct_words function, which returns the sorted list of distinct words in the corpus together with their count.
```python
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    # ------------------
    # Write your implementation here.
    # Flatten the corpus, deduplicate with a set comprehension, then sort.
    corpus_words = sorted({y for x in corpus for y in x})
    num_corpus_words = len(corpus_words)
    # ------------------
    return corpus_words, num_corpus_words
```
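A quick sanity check on a toy corpus (the function body is repeated here so the snippet runs on its own; the corpus is made up for illustration):

```python
def distinct_words(corpus):
    # Flatten the corpus, deduplicate with a set comprehension, then sort.
    corpus_words = sorted({w for doc in corpus for w in doc})
    return corpus_words, len(corpus_words)

corpus = [["<START>", "hello", "world", "<END>"],
          ["<START>", "hello", "there", "<END>"]]
words, n = distinct_words(corpus)
print(n)         # → 5
print(words[0])  # → <END>  (uppercase sorts before lowercase in ASCII)
```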
One comprehension to flatten the corpus, one set to deduplicate — quite elegant.
Task 2
Implement the compute_co_occurrence_matrix function, which builds the word co-occurrence matrix.
```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    # ------------------
    # Write your implementation here.
    M = np.zeros((num_words, num_words))
    word2Ind = {word: ix for ix, word in enumerate(words)}
    for sentence in corpus:
        for i, word in enumerate(sentence):
            # Count context words within `window_size` positions to the left...
            for context in sentence[max(0, i - window_size):i]:
                M[word2Ind[word], word2Ind[context]] += 1
            # ...and to the right of the current word (slicing clamps at the end).
            for context in sentence[i + 1:i + window_size + 1]:
                M[word2Ind[word], word2Ind[context]] += 1
    # ------------------
    return M, word2Ind
```
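A quick check on a toy corpus, with distinct_words inlined so the snippet is self-contained (the corpus and window size are illustrative):

```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    words = sorted({w for doc in corpus for w in doc})
    word2Ind = {word: ix for ix, word in enumerate(words)}
    M = np.zeros((len(words), len(words)))
    for sentence in corpus:
        for i, word in enumerate(sentence):
            left = sentence[max(0, i - window_size):i]
            right = sentence[i + 1:i + window_size + 1]
            for context in left + right:
                M[word2Ind[word], word2Ind[context]] += 1
    return M, word2Ind

corpus = [["<START>", "a", "b", "a", "<END>"]]
M, word2Ind = compute_co_occurrence_matrix(corpus, window_size=1)
# "a" and "b" are adjacent twice ("a b" and "b a"), so the count is 2.
print(M[word2Ind["a"], word2Ind["b"]])  # → 2.0
```

Because every co-occurrence is counted from both sides of the window, the resulting matrix is symmetric.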
Task 3
Implement the reduce_to_k_dim function, which reduces the dimensionality of the word vectors.
```python
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)
    # ------------------
    print("Done.")
    return M_reduced
```
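A quick shape check, using a random symmetric matrix as a stand-in for a real co-occurrence matrix (the matrix size and seed are arbitrary):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
A = rng.random((10, 10))
M = A + A.T  # symmetric, like a real co-occurrence matrix

# Reduce each 10-dimensional row vector to 2 dimensions.
svd = TruncatedSVD(n_components=2, n_iter=10)
M_reduced = svd.fit_transform(M)
print(M_reduced.shape)  # → (10, 2)
```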
Task 4
Implement the plot_embeddings function, which plots the 2-D word vectors.
```python
import matplotlib.pyplot as plt

def plot_embeddings(M_reduced, word2Ind, words):
    # ------------------
    # Write your implementation here.
    # Look up the row index of each requested word, then scatter-plot and label the points.
    index = [word2Ind[word] for word in words]
    X = M_reduced[index]
    plt.scatter(X[:, 0], X[:, 1])
    for i, word in enumerate(words):
        plt.text(X[i, 0], X[i, 1], word)
    plt.title("word embeddings")
    plt.show()
    # ------------------
```
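An illustrative usage sketch with made-up 2-D vectors (the words and coordinates are invented; the Agg backend is used so the snippet also runs without a display):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

M_reduced = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]])
word2Ind = {"king": 0, "queen": 1, "man": 2}
words = ["king", "queen"]

# Same logic as plot_embeddings: select the rows for the requested words.
index = [word2Ind[w] for w in words]
X = M_reduced[index]
plt.scatter(X[:, 0], X[:, 1])
for i, word in enumerate(words):
    plt.text(X[i, 0], X[i, 1], word)
plt.title("word embeddings")
plt.close()  # in a notebook you would call plt.show() instead

print(X.shape)  # → (2, 2)
```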