Word vectors are often used as a basic building block for downstream NLP tasks such as question answering, text generation, and machine translation. Here we will study two kinds of word vectors: those built from a word co-occurrence matrix, and those from GloVe.
1. Count-Based Word Vectors
Most approaches to building word vectors rest on the following idea: similar words appear in similar contexts. Consequently, similar words tend to co-occur with a shared set of context words, and by examining these contexts we can construct word embeddings.
A word co-occurrence matrix counts, for each pair of words, how many times they appear together within a given window size, as shown in the figure:
In NLP, we usually mark the beginning and end of each document in the corpus with the tokens <START> and <END>. These tokens are included in the co-occurrence matrix as well, wrapping each document.
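To make the counting concrete, here is a minimal sketch of window-based co-occurrence counting on a toy one-sentence corpus; the corpus, window size, and variable names are illustrative only, not part of the assignment:

```python
from collections import Counter

# Toy corpus: one document, wrapped with <START>/<END> tokens.
corpus = [["<START>", "all", "that", "glitters", "is", "not", "gold", "<END>"]]
window_size = 1

counts = Counter()
for sentence in corpus:
    for i, word in enumerate(sentence):
        # Count every word within `window_size` positions on either side.
        left = sentence[max(0, i - window_size):i]
        right = sentence[i + 1:i + window_size + 1]
        for context in left + right:
            counts[(word, context)] += 1

print(counts[("glitters", "that")])  # → 1
```

Note that the counts are symmetric: ("glitters", "that") and ("that", "glitters") both get incremented, which is why the full matrix below is symmetric too.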
The rows (or columns) of this matrix provide word vectors based on word-word co-occurrence, but their dimensionality equals the number of distinct words in the corpus. To reduce the dimensionality, we apply Singular Value Decomposition (SVD), which, much like PCA, projects the vectors onto the k most significant dimensions.
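To see what "keeping the top k dimensions" means, here is an illustrative sketch using NumPy's full SVD on a tiny made-up matrix (the assignment itself uses scikit-learn's TruncatedSVD; the matrix values here are arbitrary):

```python
import numpy as np

# A small symmetric matrix standing in for a word co-occurrence matrix.
M = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 1.0]])
k = 2

# Full SVD: M = U @ diag(s) @ Vt, with singular values s sorted descending.
U, s, Vt = np.linalg.svd(M)

# Keep only the k largest singular values/vectors: each row of M_reduced
# is now a k-dimensional word vector.
M_reduced = U[:, :k] * s[:k]

print(M_reduced.shape)  # → (3, 2)
```

Truncating at k gives the best rank-k approximation of M in the least-squares sense, which is the justification for treating these k dimensions as the "most important" ones.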
Having just finished the postgraduate entrance exam, with the basics of linear algebra still fresh in my mind, I went to the book 《矩阵论》 (Matrix Theory) to study SVD, i.e. singular value decomposition. My takeaway: in the spirit of "don't build a tower on shifting sand", I worked through the mathematics behind SVD, but to be honest I still cannot see why, once we get to machine learning, we may simply pick a value of k as the dimensionality of the decomposed matrix. A long road ahead!
Task 1
Implement the distinct_words function, which returns the sorted list of distinct words in the corpus together with their count.
```python
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    # ------------------
    # Write your implementation here.
    # Flatten the corpus, deduplicate with a set comprehension, then sort.
    corpus_words = sorted({y for x in corpus for y in x})
    num_corpus_words = len(corpus_words)
    # ------------------
    return corpus_words, num_corpus_words
```
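A quick sanity check on a toy corpus (the function body is repeated here so the snippet runs on its own; the corpus is made up for illustration):

```python
def distinct_words(corpus):
    # Flatten the corpus, deduplicate with a set comprehension, then sort.
    corpus_words = sorted({w for doc in corpus for w in doc})
    return corpus_words, len(corpus_words)

corpus = [["<START>", "hello", "world", "<END>"],
          ["<START>", "hello", "there", "<END>"]]
words, n = distinct_words(corpus)
print(n)         # → 5
print(words[0])  # → <END>  (uppercase sorts before lowercase in ASCII)
```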
One comprehension to flatten the corpus, one set to deduplicate — quite elegant.
Task 2
Implement the compute_co_occurrence_matrix function, which builds the word co-occurrence matrix.
```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    # ------------------
    # Write your implementation here.
    M = np.zeros((num_words, num_words))
    word2Ind = {word: ix for ix, word in enumerate(words)}
    for sentence in corpus:
        for i, word in enumerate(sentence):
            # Count context words within `window_size` positions to the left...
            for context in sentence[max(0, i - window_size):i]:
                M[word2Ind[word], word2Ind[context]] += 1
            # ...and to the right of the current word (slicing clamps at the end).
            for context in sentence[i + 1:i + window_size + 1]:
                M[word2Ind[word], word2Ind[context]] += 1
    # ------------------
    return M, word2Ind
```
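A quick check on a toy corpus, with distinct_words inlined so the snippet is self-contained (the corpus and window size are illustrative):

```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    words = sorted({w for doc in corpus for w in doc})
    word2Ind = {word: ix for ix, word in enumerate(words)}
    M = np.zeros((len(words), len(words)))
    for sentence in corpus:
        for i, word in enumerate(sentence):
            left = sentence[max(0, i - window_size):i]
            right = sentence[i + 1:i + window_size + 1]
            for context in left + right:
                M[word2Ind[word], word2Ind[context]] += 1
    return M, word2Ind

corpus = [["<START>", "a", "b", "a", "<END>"]]
M, word2Ind = compute_co_occurrence_matrix(corpus, window_size=1)
# "a" and "b" are adjacent twice ("a b" and "b a"), so the count is 2.
print(M[word2Ind["a"], word2Ind["b"]])  # → 2.0
```

Because every co-occurrence is counted from both sides of the window, the resulting matrix is symmetric.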
Task 3
Implement the reduce_to_k_dim function, which reduces the dimensionality of the word vectors.
```python
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)
    # ------------------
    print("Done.")
    return M_reduced
```
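A quick shape check, using a random symmetric matrix as a stand-in for a real co-occurrence matrix (the matrix size and seed are arbitrary):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
A = rng.random((10, 10))
M = A + A.T  # symmetric, like a real co-occurrence matrix

# Reduce each 10-dimensional row vector to 2 dimensions.
svd = TruncatedSVD(n_components=2, n_iter=10)
M_reduced = svd.fit_transform(M)
print(M_reduced.shape)  # → (10, 2)
```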
Task 4
Implement the plot_embeddings function, which plots the 2-D word vectors.
```python
import matplotlib.pyplot as plt

def plot_embeddings(M_reduced, word2Ind, words):
    # ------------------
    # Write your implementation here.
    # Look up the row index of each requested word, then scatter-plot and label the points.
    index = [word2Ind[word] for word in words]
    X = M_reduced[index]
    plt.scatter(X[:, 0], X[:, 1])
    for i, word in enumerate(words):
        plt.text(X[i, 0], X[i, 1], word)
    plt.title("word embeddings")
    plt.show()
    # ------------------
```
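An illustrative usage sketch with made-up 2-D vectors (the words and coordinates are invented; the Agg backend is used so the snippet also runs without a display):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

M_reduced = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]])
word2Ind = {"king": 0, "queen": 1, "man": 2}
words = ["king", "queen"]

# Same logic as plot_embeddings: select the rows for the requested words.
index = [word2Ind[w] for w in words]
X = M_reduced[index]
plt.scatter(X[:, 0], X[:, 1])
for i, word in enumerate(words):
    plt.text(X[i, 0], X[i, 1], word)
plt.title("word embeddings")
plt.close()  # in a notebook you would call plt.show() instead

print(X.shape)  # → (2, 2)
```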