CS224N Assignment 1: Exploring Word Vectors

樱吹雪_

Published on 2023-08-15 16:24:57

Article link: https://blog.csdn.net/m0_67146053/article/details/132281133

Exploring Word Vectors

Task Overview

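This post contains my personal solutions to Assignment 1 of Stanford's CS224N (Winter 2023). Most functions carry docstrings to make the code easier to follow. The solutions may not be optimal; if you have a more concise or elegant approach, please share it in the comments.
The assignment is available here: Exploring Word Vectors
The assignment is a first exploration of word vectors, which are commonly used as a basic building block for downstream NLP tasks such as question answering, text generation, and machine translation. We explore two types of word vectors: those derived from co-occurrence matrices and those produced by GloVe.
First, import the packages and libraries the assignment needs.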

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
nltk.download('reuters') #to specify download location, optionally add the argument: download_dir='/specify/desired/path/'
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
from typing import *
START_TOKEN = '<START>'
END_TOKEN = '<END>'
np.random.seed(0)
random.seed(0)  # fix the seeds so the random numbers are reproducible: the same seed yields the same sequence on every run

Part 1: Count-Based Word Vectors

Co-Occurrence Word Embeddings

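Many word vector implementations rest on the idea of word similarity: words that are similar in meaning tend to be used in similar contexts. Here we look in detail at one strategy built on this idea, the co-occurrence matrix.
We distinguish two roles a word can play: the center word and the context word. The more often a context word appears around a given center word in the corpus, the more reason we have to believe the two words are semantically related. "Around" is defined by a fixed window of size n: the n words on each side of the center word $w_{i}$, i.e. $w_{i-n} \dots w_{i-1}$ and $w_{i+1} \dots w_{i+n}$. We then build the co-occurrence matrix $M$, a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window across all documents.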

Example: Co-Occurrence with Fixed Window of n=1:

Document 1: “all that glitters is not gold”

Document 2: “all is well that ends well”

*	`<START>`	all	that	glitters	is	not	gold	well	ends	`<END>`
`<START>`	0	2	0	0	0	0	0	0	0	0
all	2	0	1	0	1	0	0	0	0	0
that	0	1	0	1	0	0	0	1	1	0
glitters	0	0	1	0	1	0	0	0	0	0
is	0	1	0	1	0	1	0	1	0	0
not	0	0	0	0	1	0	1	0	0	0
gold	0	0	0	0	0	1	0	0	0	1
well	0	0	1	0	1	0	0	0	1	1
ends	0	0	1	0	0	0	0	1	0	0
`<END>`	0	0	0	0	0	0	1	1	0	0

Note: We use `<START>` and `<END>` to mark the beginning and end of each sentence, paragraph, or document, and they count as tokens too. For example, "all that glitters is not gold" becomes "<START> all that glitters is not gold <END>".
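As a quick check of the table above, here is a minimal sketch that counts co-occurrences for the two example documents with a window of n=1. The helper name `toy_co_occurrence` and the whitespace tokenization are assumptions made for this illustration; they are not part of the assignment.

```python
from collections import Counter

START_TOKEN = "<START>"
END_TOKEN = "<END>"


def toy_co_occurrence(docs, window_size=1):
    """Count (center, context) pairs inside a fixed window (hypothetical helper)."""
    counts = Counter()
    for doc in docs:
        # Wrap every document in the boundary tokens, as described in the note
        tokens = [START_TOKEN] + doc.lower().split() + [END_TOKEN]
        for i, center in enumerate(tokens):
            lo = max(0, i - window_size)
            hi = min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[(center, tokens[j])] += 1
    return counts


docs = ["all that glitters is not gold", "all is well that ends well"]
counts = toy_co_occurrence(docs)

print(counts[("<START>", "all")])  # 2
print(counts[("that", "well")])    # 1
print(counts[("gold", "well")])    # 0
```

Each pair is counted once from each side, so the counts are symmetric, matching the symmetric matrix in the table.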
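The co-occurrence matrix has $|\mathcal{V}|$ rows and columns, where $\mathcal{V}$ is the vocabulary. This number is usually very large, so we apply dimensionality reduction. In this post we use Truncated SVD from sklearn; principal component analysis (PCA) is an alternative.
The dataset is the Reuters corpus restricted to the "gold" category. The function below reads it: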

def read_corpus(category="gold") -> List[List[str]]:
    """ Read files from the specified Reuter's category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]  # add the start and end tokens

Let's print one sample:

reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[0], compact=True, width=100)
# result
"""['<START>', 'western', 'mining', 'to', 'open', 'new', 'gold', 'mine', 'in', 'australia', 'western',
  'mining', 'corp', 'holdings', 'ltd', '&', 'lt', ';', 'wmng', '.', 's', '>', '(', 'wmc', ')',
  'said', 'it', 'will', 'establish', 'a', 'new', 'joint', 'venture', 'gold', 'mine', 'in', 'the',
  'northern', 'territory', 'at', 'a', 'cost', 'of', 'about', '21', 'mln', 'dlrs', '.', 'the',
  'mine', ',', 'to', 'be', 'known', 'as', 'the', 'goodall', 'project', ',', 'will', 'be', 'owned',
  '60', 'pct', 'by', 'wmc', 'and', '40', 'pct', 'by', 'a', 'local', 'w', '.', 'r', '.', 'grace',
  'and', 'co', '&', 'lt', ';', 'gra', '>', 'unit', '.', 'it', 'is', 'located', '30', 'kms', 'east',
  'of', 'the', 'adelaide', 'river', 'at', 'mt', '.', 'bundey', ',', 'wmc', 'said', 'in', 'a',
  'statement', 'it', 'said', 'the', 'open', '-', 'pit', 'mine', ',', 'with', 'a', 'conventional',
  'leach', 'treatment', 'plant', ',', 'is', 'expected', 'to', 'produce', 'about', '50', ',', '000',
  'ounces', 'of', 'gold', 'in', 'its', 'first', 'year', 'of', 'production', 'from', 'mid', '-',
  '1988', '.', 'annual', 'ore', 'capacity', 'will', 'be', 'about', '750', ',', '000', 'tonnes', '.',
  '<END>']"""

Question 1.1: Implement `distinct_words`

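Here we build the vocabulary $\mathcal{V}$ of the corpus, sorted alphabetically like an English dictionary. `reuters_corpus` above is a list of lists; a plain for loop would work, but here we flatten it with a list comprehension and use a Python set to drop duplicate words.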

def distinct_words(corpus: List[List[str]]) -> Tuple[List[str], int]:
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    ### SOLUTION BEGIN
    corpus_words = [word for text in corpus for word in text]
    corpus_words = sorted(list(set(corpus_words)))
    n_corpus_words = len(corpus_words)
    ### SOLUTION END
    return corpus_words, n_corpus_words
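One detail worth seeing on a small input: Python sorts strings by code point, not by dictionary convention. The sketch below uses a made-up two-document corpus (an assumption for illustration) in which one word keeps its capital letter.

```python
# Made-up toy corpus; "All" is deliberately left capitalized in the first document
toy_corpus = [
    ["<START>", "All", "that", "glitters", "<END>"],
    ["<START>", "all", "is", "well", "<END>"],
]

# Same idea as distinct_words: flatten, deduplicate with a set, then sort
vocab = sorted({word for doc in toy_corpus for word in doc})

print(vocab)
# ['<END>', '<START>', 'All', 'all', 'glitters', 'is', 'that', 'well']
```

The boundary tokens come first because `<` precedes every letter, and "All" and "all" are treated as different words. This is why `read_corpus` lowercases every token before the vocabulary is built.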

Question 1.2: Implement `compute_co_occurrence_matrix`

Here we build the co-occurrence matrix with a fixed window of size n (default 4), stored as a NumPy array.

def compute_co_occurrence_matrix(corpus: List[List[str]], window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): 
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)

    ### SOLUTION BEGIN
    word2ind = dict(zip(words, range(n_words)))
    M = np.zeros((n_words, n_words), dtype=np.int64)
    for text in corpus:
        for i, center in enumerate(text):
            # Clip the window at the document boundaries so it never runs past either end
            lo = max(0, i - window_size)
            hi = min(len(text), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:  # skip the center word itself
                    M[word2ind[center], word2ind[text[j]]] += 1
    ### SOLUTION END
    return M, word2ind

Question 1.3: Implement `reduce_to_k_dim`

Next we use Truncated SVD to reduce the dimensionality of the co-occurrence matrix. The output has shape ($|\mathcal{V}|$, embedding dimension).

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
   
    ### SOLUTION BEGIN
    svd = TruncatedSVD(n_components=k)
    M_reduced = svd.fit_transform(M)
    ### SOLUTION END
    print("Done.")
    return M_reduced
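The docstring says the result is U * S in textbook SVD terms. A minimal sketch of that claim, assuming a small made-up random matrix in place of the real co-occurrence matrix: it compares `TruncatedSVD.fit_transform` with the first k columns of U scaled by the singular values from `np.linalg.svd`. The signs of individual columns can differ between the two routines, so the comparison is made on absolute values.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Made-up stand-in for a co-occurrence matrix (assumption for illustration)
rng = np.random.default_rng(0)
M_toy = rng.random((8, 5))

k = 2
reduced = TruncatedSVD(n_components=k, random_state=0).fit_transform(M_toy)

# Full SVD: M_toy = U @ diag(S) @ Vt, singular values in descending order
U, S, Vt = np.linalg.svd(M_toy, full_matrices=False)
expected = U[:, :k] * S[:k]  # first k columns of U, each scaled by its singular value

# Column signs are arbitrary in an SVD, so compare magnitudes only
same_up_to_sign = np.allclose(np.abs(reduced), np.abs(expected), atol=1e-6)
print(reduced.shape, same_up_to_sign)
```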

Question 1.4: Implement `plot_embeddings`

Here we set the embedding dimension to 2 so that each word can be plotted as a point in the plane.

def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    ### SOLUTION BEGIN
    x_coords = M_reduced[:, 0]
    y_coords = M_reduced[:, 1]
    for word in words:
        x = x_coords[word2ind[word]]
        y = y_coords[word2ind[word]]
        plt.scatter(x, y, marker='x', color='red')
        plt.text(x, y, word, fontsize=9)
    plt.show()
    ### SOLUTION END

Question 1.5: Co-Occurrence Plot Analysis

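Now we combine the functions above to plot some words from the corpus. We normalize each 2-dimensional word vector to unit length, so every point lies on the unit circle and similarity between words shows up as closeness in direction (angle).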

reuters_corpus = read_corpus()
M_co_occurrence, word2ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['value', 'gold', 'platinum', 'reserves', 'silver', 'metals', 'copper', 'belgium', 'australia', 'china', 'grammes', "mine"]

plot_embeddings(M_normalized, word2ind_co_occurrence, words)
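The row normalization relies on NumPy broadcasting. A minimal sketch with three made-up 2-D points (an assumption for illustration) shows what the two normalization lines do:

```python
import numpy as np

# Three made-up 2-D "embeddings" with different lengths
points = np.array([[3.0, 4.0], [0.5, 0.0], [-2.0, 2.0]])

lengths = np.linalg.norm(points, axis=1)  # one length per row, shape (3,)
unit = points / lengths[:, np.newaxis]    # shape (3, 1) broadcasts across both columns

print(unit[0])                       # [0.6 0.8]
print(np.linalg.norm(unit, axis=1))  # every row now has length 1
```

Note that a row of all zeros has length 0, so dividing by it would produce NaN values; the words plotted here are assumed to have non-zero vectors.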

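The output is shown below: [Figure: Question 1.5 result]
The count-based vectors already show some clustering: metals such as copper and platinum sit close together, as do countries such as australia and belgium. On the downside, silver is far from the other metals, and china is far from the other countries.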

Part 2: Prediction-Based Word Vectors

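Prediction-based word vector models include word2vec and GloVe. This part does not reimplement them; it mainly exercises the gensim library.
First we load the GloVe vectors: a vocabulary of 400,000 words with embedding dimension 200.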

def load_embedding_model():
    """ Load GloVe Vectors
        Return:
            wv_from_bin: All 400000 embeddings, each length 200
    """
    import gensim.downloader as api
    wv_from_bin = api.load("glove-wiki-gigaword-200")
    print("Loaded vocab size %i" % len(list(wv_from_bin.index_to_key)))
    return wv_from_bin
    
wv_from_bin = load_embedding_model()

Reducing dimensionality of Word Embeddings

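Here we randomly sample 10,000 words from the GloVe vocabulary, add any of the words we want to plot that were not sampled, and reduce the vectors to 2 dimensions so they can be shown in the plane.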

def get_matrix_of_vectors(wv_from_bin, required_words):
    """ Put the GloVe vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 400000 GloVe vectors loaded from file
        Return:
            M: numpy matrix shape (num words, 200) containing the vectors
            word2ind: dictionary mapping each word to its row number in M
    """
    import random
    words = list(wv_from_bin.index_to_key)  # 400,000
    print("Shuffling words ...")
    random.seed(225)
    random.shuffle(words)
    words = words[:10000]
    print("Putting %i words into word2ind and matrix M..." % len(words))
    word2ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.get_vector(w))  # look up the word vector
            word2ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    for w in required_words:
        if w in words:
            continue
        try:
            M.append(wv_from_bin.get_vector(w))
            word2ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2ind

M, word2ind = get_matrix_of_vectors(wv_from_bin, words)
M_reduced = reduce_to_k_dim(M, k=2)
# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced, axis=1)
M_reduced_normalized = M_reduced / M_lengths[:, np.newaxis] # broadcasting

Cosine Similarity

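In n-dimensional Euclidean space we can use cosine similarity to quantify how similar two words are. The cosine similarity $s$ of two vectors $p$ and $q$ is

$s = \frac{p \cdot q}{\|p\| \|q\|}, \textrm{ where } s \in [-1, 1]$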
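The formula translates directly into NumPy. This is a minimal sketch; the function name `cosine_similarity` is chosen here for illustration and is not a gensim call.

```python
import numpy as np


def cosine_similarity(p, q):
    """Cosine of the angle between p and q (illustrative helper)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))


print(cosine_similarity([1, 0], [2, 0]))   # 1.0  (same direction, different length)
print(cosine_similarity([1, 0], [0, 3]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```

The first call shows that only direction matters: scaling a vector leaves its cosine similarity unchanged.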

Question 2.1: GloVe Plot Analysis

We plot the same 12 words again; the result differs from the one in Part 1.

words = ['value', 'gold', 'platinum', 'reserves', 'silver', 'metals', 'copper', 'belgium', 'australia', 'china', 'grammes', "mine"]
plot_embeddings(M_reduced_normalized, word2ind, words)

Question 2.2: Words with Multiple Meanings

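In English, as in Chinese, many words have more than one meaning; these are called polysemous words (polysemes). Calling `most_similar` returns the words most similar to a given word; `topn` defaults to 10.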

wv_from_bin.most_similar('right')
# result
"""
[('left', 0.716508150100708), ('if', 0.6925000548362732), ("n't", 0.6774845719337463), ('back', 0.6770386099815369), ('just', 0.6740819811820984), ('but', 0.667771577835083), ('out', 0.6671877503395081), ('put', 0.665894091129303), ('hand', 0.6634083390235901), ('want', 0.6615420579910278)] 
"""

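We might have had "right" in the sense of "correct" in mind rather than the direction, yet the most similar word returned is "left". A single vector per word therefore handles polysemy poorly: a polysemous word may need several vector representations, not just one.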

Question 2.3: Synonyms & Antonyms

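The `distance` function returns the cosine distance (cosine distance = 1 - cosine similarity). In the example below, the synonym pair ends up with a larger cosine distance than the antonym pair, which runs against intuition. One plausible reason is that antonyms such as "happy" and "sad" tend to appear in very similar contexts.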

w1 = 'happy'
w2 = 'cheerful'
w3 = 'sad'
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))
# result
"""
Synonyms happy, cheerful have cosine distance: 0.5172466933727264
Antonyms happy, sad have cosine distance: 0.4040136933326721
"""

Question 2.4: Analogies with Word Vectors

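Here we try an analogy task: in "man : grandfather :: woman : x" (man is to grandfather as woman is to x), what is x? We call `most_similar` again and find that the word with the highest similarity is grandmother.
As an equation on word vectors: $\text{man} - \text{grandfather} \approx \text{woman} - \text{grandmother}$.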
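The arithmetic behind the analogy can be sketched with hand-made vectors. Everything here is an assumption for illustration: the 2-D vectors are invented (first coordinate loosely "gender", second "generation"), and `toy_analogy` is a simplified stand-in, not gensim's `most_similar`, which works on the real GloVe vectors.

```python
import numpy as np

# Invented 2-D vectors; not real GloVe embeddings
toy_vectors = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "grandfather": np.array([1.0, 2.0]),
    "grandmother": np.array([-1.0, 2.0]),
    "child": np.array([0.0, -1.0]),
}


def toy_analogy(positive, negative):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    target = sum(toy_vectors[w] for w in positive) - sum(toy_vectors[w] for w in negative)

    def cos(v):
        return float(v @ target / (np.linalg.norm(v) * np.linalg.norm(target)))

    # Exclude the query words themselves, as analogy lookups usually do
    candidates = [w for w in toy_vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cos(toy_vectors[w]))


print(toy_analogy(positive=["woman", "grandfather"], negative=["man"]))  # grandmother
```

Here woman + grandfather - man equals [-1, 2], which is exactly the invented vector for grandmother, so it wins with cosine similarity 1.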

wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man'])[0]
# result
"""
('grandmother', 0.7608445286750793)
"""

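The remaining questions are similar to the ones above; try them yourself.
Going forward, I plan to write a post on word2vec and GloVe word vectors, or walkthroughs of the later CS224N assignments. Corrections are welcome.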
