CS224N Assignment 1: Exploring Word Vectors

Task Introduction

This post contains my personal solutions to Assignment 1, written while working through Stanford's CS224N (Winter 2023). Most functions include docstrings to help with reading and understanding; the solutions may not be optimal, and if you have a more concise or elegant approach, feel free to share it in the comments.
Here is the link to the assignment: Exploring Word Vectors
This assignment is an initial exploration of word vectors. Word vectors are commonly used as a basic building block for downstream NLP tasks such as question answering, text generation, and machine translation. In this assignment we explore two kinds of word vectors: those derived from a co-occurrence matrix and those produced by GloVe.
First, import the packages and libraries needed for this assignment.

   from gensim.models import KeyedVectors
   from gensim.test.utils import datapath
   import pprint
   import matplotlib.pyplot as plt
   plt.rcParams['figure.figsize'] = [10, 5]
   import nltk
   nltk.download('reuters') #to specify download location, optionally add the argument: download_dir='/specify/desired/path/'
   from nltk.corpus import reuters
   import numpy as np
   import random
   import scipy as sp
   from sklearn.decomposition import TruncatedSVD
   from sklearn.decomposition import PCA
   from typing import *
   START_TOKEN = '<START>'
   END_TOKEN = '<END>'
   np.random.seed(0)
   random.seed(0)  # make randomness reproducible: with the same seed, the same sequence is generated each time

Part 1: Count-Based Word Vectors

Co-Occurrence Word Embeddings

Many word-vector methods are built on the idea of word similarity: similar words tend to be spoken or written together. Here we walk through one such strategy in detail, the co-occurrence matrix.
Words play two roles here: center word and context word. In a corpus, the more often a context word appears near a given center word, the more reason we have to believe the two words are semantically related. This "nearby" region is a fixed window of size $n$: the $n$ words on each side of the center word $w_i$, i.e. $w_{i-n}, \dots, w_{i-1}$ and $w_{i+1}, \dots, w_{i+n}$. Below we build the co-occurrence matrix $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window across all documents.

Example: Co-Occurrence with Fixed Window of n=1:

Document 1: “all that glitters is not gold”

Document 2: “all is well that ends well”

| * | <START> | all | that | glitters | is | not | gold | well | ends | <END> |
|---|---------|-----|------|----------|----|-----|------|------|------|-------|
| <START> | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| all | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| that | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| glitters | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| is | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| not | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| gold | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| well | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| ends | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| <END> | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |

Note: We treat <START> and <END> as markers for the beginning and end of a sentence or paragraph, and they also count as tokens. For example, "all that glitters is not gold" is rewritten as "<START> All that glitters is not gold <END>".
The co-occurrence matrix has $|\mathcal{V}|$ rows and columns, where $\mathcal{V}$ is the vocabulary. This number is usually very large, so we need dimensionality reduction. In this post we use sklearn's Truncated SVD; principal component analysis (PCA) is another option.
The dataset used here is the Reuters corpus restricted to the "gold" category. We read it as follows:

def read_corpus(category="gold") -> List[List[str]]:
    """ Read files from the specified Reuter's category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]  # add the START and END tokens

Let's print one sample document:

reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[0], compact=True, width=100)
# result
"""['<START>', 'western', 'mining', 'to', 'open', 'new', 'gold', 'mine', 'in', 'australia', 'western',
  'mining', 'corp', 'holdings', 'ltd', '&', 'lt', ';', 'wmng', '.', 's', '>', '(', 'wmc', ')',
  'said', 'it', 'will', 'establish', 'a', 'new', 'joint', 'venture', 'gold', 'mine', 'in', 'the',
  'northern', 'territory', 'at', 'a', 'cost', 'of', 'about', '21', 'mln', 'dlrs', '.', 'the',
  'mine', ',', 'to', 'be', 'known', 'as', 'the', 'goodall', 'project', ',', 'will', 'be', 'owned',
  '60', 'pct', 'by', 'wmc', 'and', '40', 'pct', 'by', 'a', 'local', 'w', '.', 'r', '.', 'grace',
  'and', 'co', '&', 'lt', ';', 'gra', '>', 'unit', '.', 'it', 'is', 'located', '30', 'kms', 'east',
  'of', 'the', 'adelaide', 'river', 'at', 'mt', '.', 'bundey', ',', 'wmc', 'said', 'in', 'a',
  'statement', 'it', 'said', 'the', 'open', '-', 'pit', 'mine', ',', 'with', 'a', 'conventional',
  'leach', 'treatment', 'plant', ',', 'is', 'expected', 'to', 'produce', 'about', '50', ',', '000',
  'ounces', 'of', 'gold', 'in', 'its', 'first', 'year', 'of', 'production', 'from', 'mid', '-',
  '1988', '.', 'annual', 'ore', 'capacity', 'will', 'be', 'about', '750', ',', '000', 'tonnes', '.',
  '<END>']"""

Question 1.1: Implement distinct_words

In this problem we need to build the vocabulary $\mathcal{V}$ of the corpus, sorted alphabetically like an English dictionary. reuters_corpus above is a list of lists; we could of course use a for loop, but here we use a list comprehension and a Python set to remove duplicate words.

def distinct_words(corpus: List[List[str]]) -> Tuple[List[str], int]:
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    ### SOLUTION BEGIN
    corpus_words = [word for text in corpus for word in text]
    corpus_words = sorted(list(set(corpus_words)))
    n_corpus_words = len(corpus_words)
    ### SOLUTION END
    return corpus_words, n_corpus_words
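
As a quick sanity check (my own, not part of the assignment's test suite), we can run distinct_words on the two toy documents from the co-occurrence example above:

# Sanity check on the toy corpus used in the n=1 example table above.
test_corpus = [
    "<START> all that glitters is not gold <END>".split(" "),
    "<START> all is well that ends well <END>".split(" "),
]
test_words, n_test_words = distinct_words(test_corpus)
print(test_words)    # ['<END>', '<START>', 'all', 'ends', 'glitters', 'gold', 'is', 'not', 'that', 'well']
print(n_test_words)  # 10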

Question 1.2: Implement compute_co_occurrence_matrix

This problem asks us to build the co-occurrence matrix with a window of size n (default 4); here we use a NumPy array for the result.

def compute_co_occurrence_matrix(corpus: List[List[str]], window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): 
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)

    ### SOLUTION BEGIN
    word2ind = dict(zip(words, range(n_words)))
    matrix = [[0 for i in range(n_words)] for j in range(n_words)]
    for text in corpus:
        for i in range(len(text)):  # note that the window around the center word may run past the ends of the list
            if i-window_size >= 0:
                for word in text[(i-window_size):i]:
                    matrix[word2ind[text[i]]][word2ind[word]] += 1
            elif i>0:
                for word in text[:i]:
                    matrix[word2ind[text[i]]][word2ind[word]] += 1
            if i+window_size <= len(text)-1:
                for word in text[(i+1):(i+window_size+1)]:
                    matrix[word2ind[text[i]]][word2ind[word]] += 1
            elif i+1 <= len(text)-1:
                for word in text[(i+1):]:
                    matrix[word2ind[text[i]]][word2ind[word]] += 1
    M = np.array(matrix)
    ### SOLUTION END
    return M, word2ind
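
We can sanity-check this implementation against the same toy corpus with window_size=1 and confirm that the result matches the example table and is symmetric (again, this check is mine, not the assignment's):

# Reproduce the n=1 example matrix and check a few basic properties.
test_corpus = [
    "<START> all that glitters is not gold <END>".split(" "),
    "<START> all is well that ends well <END>".split(" "),
]
M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
assert M_test.shape == (10, 10)          # |V| x |V|
assert np.allclose(M_test, M_test.T)     # co-occurrence counts are symmetric
assert M_test[word2ind_test['<START>'], word2ind_test['all']] == 2  # '<START>' and 'all' co-occur twice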

Question 1.3: Implement reduce_to_k_dim

Next we use Truncated SVD to reduce the dimensionality of the co-occurrence matrix; the output matrix has shape ($|\mathcal{V}|$, embedding dimension).

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
   
    ### SOLUTION BEGIN
    svd = TruncatedSVD(n_components=k)
    M_reduced = svd.fit_transform(M)
    ### SOLUTION END
    print("Done.")
    return M_reduced
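
The docstring notes that fit_transform effectively returns U * S from the SVD of M. As a rough check I added (not part of the assignment), we can compare it with NumPy's full SVD; columns may differ in sign, which is the usual sign ambiguity of singular vectors:

# TruncatedSVD.fit_transform(M) should equal U_k * S_k up to per-column sign flips.
rng = np.random.RandomState(0)
A = rng.rand(6, 6)
A_reduced = reduce_to_k_dim(A, k=2)   # shape (6, 2)
U, S, Vt = np.linalg.svd(A)
US = U[:, :2] * S[:2]                 # first two left singular vectors scaled by singular values
print(np.allclose(np.abs(A_reduced), np.abs(US), atol=1e-4))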

Question 1.4: Implement plot_embeddings

In this problem we set the embedding dimension to 2 so that each word's position can be plotted in a 2-D Cartesian plane.

def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    ### SOLUTION BEGIN
    x_coords = M_reduced[:, 0]
    y_coords = M_reduced[:, 1]
    for word in words:
        x = x_coords[word2ind[word]]
        y = y_coords[word2ind[word]]
        plt.scatter(x, y, marker='x', color='red')
        plt.text(x, y, word, fontsize=9)
    plt.show()
    ### SOLUTION END
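
Before running this on the real embeddings, a tiny hand-made example (the points and words below are made up purely for illustration) can confirm that each label lands next to its marker:

# Eyeball test for plot_embeddings with three fabricated 2-D points.
M_fake = np.array([[1.0, 1.0], [-1.0, 1.0], [0.0, -1.0]])
word2ind_fake = {'apple': 0, 'banana': 1, 'carrot': 2}
plot_embeddings(M_fake, word2ind_fake, ['apple', 'banana', 'carrot'])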

Question 1.5: Co-Occurrence Plot Analysis

Now we put the functions above together to plot the positions of selected words from the corpus. We normalize each 2-D word vector to unit length, so the points all end up on the unit circle and word similarity becomes a matter of direction (angle).

reuters_corpus = read_corpus()
M_co_occurrence, word2ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['value', 'gold', 'platinum', 'reserves', 'silver', 'metals', 'copper', 'belgium', 'australia', 'china', 'grammes', "mine"]

plot_embeddings(M_normalized, word2ind_co_occurrence, words)

The output is shown below: [Figure: Question 1.5 result]
We can see that the count-based word vectors already show some clustering: in the plot, copper, platinum, and other metals are close together, and Australia, Belgium, and other countries are close together. On the downside, silver is far from the other metals, and China is far from the other countries.

Part 2: Prediction-Based Word Vectors

Prediction-based word-vector models include word2vec and GloVe. This part does not re-implement them; it is mainly about calling and experimenting with the gensim library.
First we load the GloVe word vectors: a vocabulary of 400,000 words with an embedding dimension of 200.

def load_embedding_model():
    """ Load GloVe Vectors
        Return:
            wv_from_bin: All 400000 embeddings, each length 200
    """
    import gensim.downloader as api
    wv_from_bin = api.load("glove-wiki-gigaword-200")
    print("Loaded vocab size %i" % len(list(wv_from_bin.index_to_key)))
    return wv_from_bin
    
wv_from_bin = load_embedding_model()
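
A quick check (my own) that the vectors loaded as expected: each word should map to a 200-dimensional vector.

# Each GloVe embedding should be a 200-dimensional vector.
print(wv_from_bin.get_vector('gold').shape)  # (200,)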

Reducing dimensionality of Word Embeddings

Here we randomly pick 10,000 words from the GloVe vocabulary and reduce their vectors to 2 dimensions, mainly so that we can visualize them in a 2-D Cartesian plane.

def get_matrix_of_vectors(wv_from_bin, required_words):
    """ Put the GloVe vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 400000 GloVe vectors loaded from file
        Return:
            M: numpy matrix shape (num words, 200) containing the vectors
            word2ind: dictionary mapping each word to its row number in M
    """
    import random
    words = list(wv_from_bin.index_to_key)  # 400,000
    print("Shuffling words ...")
    random.seed(225)
    random.shuffle(words)
    words = words[:10000]
    print("Putting %i words into word2ind and matrix M..." % len(words))
    word2ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.get_vector(w))  # look up the word's vector
            word2ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    for w in required_words:
        if w in words:
            continue
        try:
            M.append(wv_from_bin.get_vector(w))
            word2ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2ind

M, word2ind = get_matrix_of_vectors(wv_from_bin, words)
M_reduced = reduce_to_k_dim(M, k=2)
# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced, axis=1)
M_reduced_normalized = M_reduced / M_lengths[:, np.newaxis] # broadcasting

Cosine Similarity

In n-dimensional Euclidean space we can use cosine similarity to quantify how similar two words are. For two vectors $p$ and $q$, the cosine similarity $s$ is

$$s = \frac{p \cdot q}{\|p\| \, \|q\|}, \quad \textrm{where } s \in [-1, 1]$$
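
As a small sketch (not required by the assignment), we can compute this directly with NumPy and compare it against gensim's built-in similarity method; the helper function below is my own:

# Cosine similarity by hand vs. gensim's KeyedVectors.similarity.
def cosine_similarity(p: np.ndarray, q: np.ndarray) -> float:
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

p = wv_from_bin.get_vector('gold')
q = wv_from_bin.get_vector('silver')
print(cosine_similarity(p, q))
print(wv_from_bin.similarity('gold', 'silver'))  # should agree up to floating-point precision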

Question 2.1: GloVe Plot Analysis

We plot the same 12 words in this coordinate system; the result differs noticeably from Part 1.

words = ['value', 'gold', 'platinum', 'reserves', 'silver', 'metals', 'copper', 'belgium', 'australia', 'china', 'grammes', "mine"]
plot_embeddings(M_reduced_normalized, word2ind, words)

Question 2.2: Words with Multiple Meanings

In English as well as Chinese, many words have multiple meanings; these are called polysemes. Calling the most_similar function returns the ten words most similar to a given word (topn defaults to 10).

wv_from_bin.most_similar('right')
# result
"""
[('left', 0.716508150100708), ('if', 0.6925000548362732), ("n't", 0.6774845719337463), ('back', 0.6770386099815369), ('just', 0.6740819811820984), ('but', 0.667771577835083), ('out', 0.6671877503395081), ('put', 0.665894091129303), ('hand', 0.6634083390235901), ('want', 0.6615420579910278)] 
"""

In fact, the sense of right we probably have in mind here is "correct" rather than "the right-hand side", yet the most similar word returned is left. This shows that handling polysemous words is problematic: a polysemous word may need several vector representations rather than a single word vector.

Question 2.3: Synonyms & Antonyms

Calling the distance function gives the cosine distance (cosine distance = 1 - cosine similarity). Here we look for a counterintuitive case in which the cosine distance between a synonym pair turns out to be larger than that between an antonym pair.

w1 = 'happy'
w2 = 'cheerful'
w3 = 'sad'
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))
# result
"""
Synonyms happy, cheerful have cosine distance: 0.5172466933727264
Antonyms happy, sad have cosine distance: 0.4040136933326721
"""

Question 2.4: Analogies with Word Vectors

Here we try an analogy task, e.g. "man : grandfather :: woman : x" (man is to grandfather as woman is to x): what is x? Calling most_similar again, we find that the most similar word is grandmother.
Expressed as a vector equation: $\mathrm{man} - \mathrm{grandfather} \approx \mathrm{woman} - \mathrm{grandmother}$

wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man'])[0]
# result
"""
('grandmother', 0.7608445286750793)
"""

The remaining questions are similar to the ones above; feel free to try and solve them on your own.
Going forward, I plan to write a post on word2vec and GloVe word vectors, or walkthroughs of the later CS224N assignments. Corrections and feedback are welcome.
