CS224N Spring 2024: Assignment 1 (a1)

Part 1: Count-Based Word Vectors (10 points)

Question 1.1: Implement distinct_words [code] (2 points)

    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    n_corpus_words = -1
    
    # ------------------
    # Write your implementation here.


    # ------------------

    return corpus_words, n_corpus_words

Solution 1:

    corpus_words = sorted(set(sum(corpus, [])))
    n_corpus_words = len(corpus_words)

Solution 2:

    # Flatten the corpus with a generator expression and deduplicate with set()
    corpus_words = sorted(set(word for document in corpus for word in document))
    # Count the number of distinct words
    n_corpus_words = len(corpus_words)
    

The second solution flattens and deduplicates in a single generator expression, which gives it a slight performance edge: it avoids sum(corpus, []), which concatenates lists repeatedly and becomes expensive (roughly quadratic in total length) on large corpora. The generator expression deduplicates while iterating, with no intermediate merged list.
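As a quick sanity check, either solution should behave like the following minimal sketch (the toy corpus here is made up for illustration):

test_corpus = [["<START>", "all", "that", "glitters", "<END>"],
               ["<START>", "all", "is", "not", "gold", "<END>"]]
words, n_words = distinct_words(test_corpus)
print(words)    # ['<END>', '<START>', 'all', 'glitters', 'gold', 'is', 'not', 'that']
print(n_words)  # 8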

Question 1.2: Implement compute_co_occurrence_matrix [code] (3 points)

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): 
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    M = None
    word2ind = {}
    
    # ------------------
    # Write your implementation here.
    
    # Word-to-index mapping and a matrix of 0s
    

    # ------------------

    return M, word2ind

Solution 1:

    word2ind = dict(zip(words, range(n_words)))
    M = np.zeros((n_words, n_words))

    for document in corpus:
        for i, center in enumerate(document):
            for j, context in enumerate(document):
                if j != i and abs(j - i) <= window_size:
                    # Increment the co-occurrence count for the word pair
                    M[word2ind[center]][word2ind[context]] += 1

Solution 2:

    # Initialize the co-occurrence matrix and the word-to-index dictionary
    M = np.zeros((n_words, n_words))
    word2ind = {word: i for i, word in enumerate(words)}

    # Iterate through each document and each word in the document
    for doc in corpus:
        for i, word in enumerate(doc):
            center_word_idx = word2ind[word]
            # Define the window range considering the start and end of the document
            start = max(i - window_size, 0)
            end = min(i + window_size + 1, len(doc))
            # Iterate through the window and update the co-occurrence matrix
            for j in range(start, end):
                if j != i:  # Avoid self-co-occurrence
                    co_word_idx = word2ind[doc[j]]
                    M[center_word_idx, co_word_idx] += 1

In the first version, the window is handled with two nested passes over the document: the outer loop fixes the center word, and the inner loop scans every other position, keeping only those within window_size of the center. In the second version, the window is handled more directly by computing its start and end positions once, which is arguably easier to follow, especially around document boundaries. The second version also builds word2ind with a dict comprehension, and its window logic is more compact: it touches only the positions inside the window rather than the entire document for each center word.
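Either version can be checked against the example in the docstring (a small sketch; the single test document below is just that example sentence):

test_corpus = [["<START>", "All", "that", "glitters", "is", "not", "gold", "<END>"]]
M, word2ind = compute_co_occurrence_matrix(test_corpus, window_size=4)
# "All" should co-occur once each with "<START>", "that", "glitters", "is", and "not"
row = M[word2ind["All"]]
print({w: int(row[i]) for w, i in word2ind.items() if row[i] > 0})
# M should also be symmetric: (M == M.T).all() is True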

Question 1.3: Implement reduce_to_k_dim [code] (1 point)

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
    # ------------------
    # Write your implementation here.


    # ------------------

    print("Done.")
    return M_reduced

Solution 1:

    M_reduced = TruncatedSVD(n_components=k, n_iter=n_iters).fit_transform(M)

Solution 2:

    svd = TruncatedSVD(n_components=k, n_iter=n_iters, random_state=42)

    # Fit and transform the matrix M to reduce its dimensionality
    M_reduced = svd.fit_transform(M)
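The docstring's remark that this "actually returns U * S" can be illustrated with a small sketch (illustrative only; the random symmetric matrix below just stands in for a co-occurrence matrix). fit_transform projects M onto its top-k singular directions, i.e. approximately U_k * S_k of the full SVD:

import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
A = rng.random((6, 6))
A = A + A.T                                    # symmetric, like a co-occurrence matrix

A_reduced = TruncatedSVD(n_components=2, n_iter=10, random_state=42).fit_transform(A)

U, S, Vt = np.linalg.svd(A)
US = U[:, :2] * S[:2]                          # U_k * S_k
# The columns agree up to sign, so compare absolute values
print(np.allclose(np.abs(A_reduced), np.abs(US)))   # should print True (up to numerical tolerance)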

Question 1.4: Implement plot_embeddings [code] (1 point)

def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.

   

    # ------------------

Solution 1:

    # Use a prettier seaborn grid style
    # (note: on matplotlib >= 3.6 this style is registered as 'seaborn-v0_8-whitegrid')
    plt.style.use('seaborn-whitegrid')
    
    for i, word in enumerate(words):
        # Get the coordinates of the embedding
        [x, y] = M_reduced[word2ind[word], :]

        # Mark the coordinate and assign text to it
        plt.scatter(x, y, marker='x', color="red")
        plt.annotate(word, (x, y), xytext=(x, y+.05))
    
    # Show plot
    plt.show()

Solution 2:

    # Extract the x and y coordinates of word embeddings for the specified words
    x_coords = [M_reduced[word2ind[word]][0] for word in words]
    y_coords = [M_reduced[word2ind[word]][1] for word in words]
    
    # Create a scatter plot
    plt.figure(figsize=(8, 6))
    plt.scatter(x_coords, y_coords)

    # Annotate each point in the scatter plot with its corresponding word
    for word, x, y in zip(words, x_coords, y_coords):
        plt.annotate(word, (x, y), textcoords="offset points", xytext=(0,10), ha='center')
    
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.title('Word Embeddings')
    plt.grid(True)
    plt.show()

Question 1.5: Co-Occurrence Plot Analysis [written] (3 points) 

a. Clusters in the embedding space
Cluster 1: metals and mining
  • Words: "gold", "mine", "metals", "copper", "platinum", "silver"
  • Explanation: this cluster most likely represents vocabulary related to metals and mining. "gold", "copper", "platinum", and "silver" are all common metals that are typically extracted by mining, which explains their tight association in the embedding space. "mine" refers directly to the place or process where these metals are extracted, and "metals" is the general term covering all of these specific instances.
Cluster 2: geographic and political entities
  • Words: "Australia", "Belgium", "China"
  • Explanation: this cluster contains words tied to geographic locations and political entities. "Australia", "Belgium", and "China" are all country names; their grouping suggests the model has captured something these words share in usage, such as frequent co-occurrence in international news, economic, or cultural contexts.

b. Words that do not cluster together but arguably should
Possible example 1: "reserves"
  • Explanation: although "reserves" can relate to metals and resources (as in "gold reserves"), in the plot it does not cluster with "gold" or the other metal words. This may indicate that in the training data "reserves" appears more often in other contexts (such as nature reserves or other kinds of reserves), so its metal-related sense does not show up in the embedding.
Possible example 2: "grammes"
  • Explanation: intuitively, the unit "grammes" could be expected to appear when discussing the weight of specific metals such as gold, and therefore to cluster with "gold" or "silver". In the plot, however, "grammes" does not form an obvious cluster with these metal words, which may reflect that in the training data this unit of measure is not closely tied to metals, or that it is used across much broader contexts.

Part 2: Prediction-Based Word Vectors (15 points) 

 Question 2.1: GloVe Plot Analysis [written] (3 points)

Run the cell below to plot the 2D GloVe embeddings for ['value', 'gold', 'platinum', 'reserves', 'silver', 'metals', 'copper', 'belgium', 'australia', 'china', 'grammes', "mine"].

Why the coordinate values change:

a. What is one way the plot is different from the one generated earlier from the co-occurrence matrix? What is one way it's similar?

One difference from the earlier co-occurrence-matrix plot is that the words here are distributed more tightly in space. The difference comes from the GloVe model: when it processes co-occurrence information, GloVe uses richer statistics, such as ratios of co-occurrence probabilities rather than raw counts, which leads to a different spatial structure for the embeddings.

One similarity is that related words still tend to cluster together. For example, metal- and mining-related words such as "gold", "copper", and "platinum" are relatively close in both plots. This suggests that, whichever technique is used, semantically related words tend to keep similar relative positions in the embedding space.

b. Why might the GloVe plot (question_2.1.png) differ from the plot generated earlier from the co-occurrence matrix (question_1.5.png)?

Why the GloVe plot differs from the co-occurrence-matrix plot:

  • Different information processing: when producing embeddings, GloVe uses not only raw co-occurrence counts but also the ratios of those counts and global corpus statistics. Its strength is that it models co-occurrence probabilities and optimizes the word vectors against them, which lets the vectors capture and express more complex relationships between words, such as analogies.
  • Mathematical basis: a plain co-occurrence matrix simply records how many times words appear within a window of each other, whereas GloVe also weights how important those counts are across different contexts. This makes GloVe embeddings potentially more precise and sensitive in preserving word semantics.
  • Dimensionality reduction and optimization: even for the same co-occurrence matrix, different reduction methods (e.g. PCA versus t-SNE) yield different visualizations. GloVe adjusts the word vectors through an iterative optimization process, while reducing a co-occurrence matrix directly usually applies SVD or another linear method, which does not account for nonlinear relationships.

In short, by integrating more statistical information and optimizing the vector representations directly, GloVe can produce embeddings that reflect more complex semantic relationships, whereas embeddings taken directly from a co-occurrence matrix emphasize raw co-occurrence frequency rather than co-occurrence probability or quality. The two plots can therefore differ noticeably in the distances between words and their overall layout.

Question 2.2: Words with Multiple Meanings (1.5 points) [code + written] 

Note: You should use the wv_from_bin.most_similar(word) function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance, please check the GenSim documentation.

Please state the word you discover and the multiple meanings that occur in the top 10. Why do you think many of the polysemous or homonymic words you tried didn't work (i.e. the top-10 most similar words only contain one of the meanings of the words)?
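For reference, a query of the kind described in the note above looks like this sketch ("bank" is only an illustrative placeholder for a polysemous word, not a recorded answer; wv_from_bin is the loaded GloVe KeyedVectors as elsewhere in the notebook):

# Illustrative only: "bank" is a placeholder polysemous word
for candidate, cosine_sim in wv_from_bin.most_similar("bank", topn=10):
    print(f"{cosine_sim:.3f}  {candidate}")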

 

 

Question 2.3: Synonyms & Antonyms (2 points) [code + written]

# ------------------
# Write your implementation here.
# Define the words
word = 'light'
synonym = 'illuminated'
antonym = 'dark'

# Calculate cosine distances
distance_to_synonym = wv_from_bin.distance(word, synonym)
distance_to_antonym = wv_from_bin.distance(word, antonym)
print(f"Cosine distance from '{word}' to '{synonym}': {distance_to_synonym}")
print(f"Cosine distance from '{word}' to '{antonym}': {distance_to_antonym}")

if distance_to_antonym < distance_to_synonym:
    print(f"Interestingly, '{word}' is closer to its antonym '{antonym}' than to its synonym '{synonym}'.")
else:
    print(f" '{word}' is closer to its synonym '{synonym}' than to its antonym '{antonym}'.")

# ------------------

 

Indeed, "light" turns out to be closer to "dark" than to "illuminated". Possible reasons:

  • Contextual usage: "light" is frequently used in contexts that discuss its absence or its opposite (darkness), especially in literary or idiomatic expressions ("light and dark"), which can lead the model to learn a closer relationship between "light" and "dark".
  • Training data: the nature of the training corpus strongly shapes word associations. If the data contains many discussions of light versus darkness (for example in philosophical or scientific text), the embeddings will reflect that relationship.
  • Polysemy: "light" has multiple senses (e.g. not heavy, or well illuminated); depending on which sense dominates the training data, the embedding may lean toward one meaning over another.

Question 2.4: Analogies with Word Vectors [written] (1.5 points)

Word vectors have been shown to sometimes exhibit the ability to solve analogies.

As an example, for the analogy "man : grandfather :: woman : x" (read: man is to grandfather as woman is to x), what is x?

In the cell below, we show you how to use word vectors to find x using the most_similar function from the GenSim documentation. The function finds words that are most similar to the words in the positive list and most dissimilar from the words in the negative list (while omitting the input words, which are often the most similar; see this paper). The answer to the analogy will have the highest cosine similarity (largest returned numerical value).
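The cell referred to above makes a call of roughly the following form (a sketch, assuming the GloVe vectors are loaded as wv_from_bin as in the rest of the notebook):

import pprint

# man : grandfather :: woman : x  =>  x ≈ grandfather - man + woman
pprint.pprint(wv_from_bin.most_similar(positive=["woman", "grandfather"], negative=["man"]))
# "grandmother" should come back with the highest cosine similarity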

Question 2.5: Finding Analogies [code + written] (1.5 points)

a. For the previous example, it's clear that "grandmother" completes the analogy. But give an intuitive explanation as to why the most_similar function gives us words like "granddaughter", "daughter", or "mother"?

In the high-dimensional word-vector space, the vectors associated with "woman" and "grandfather" can be similar along certain dimensions (for example, ones encoding family relationships), and this closeness can push words like "mother" and "daughter" up in the ranking. When handling complex semantic relations, a word-vector model reflects both the direct semantic ties between words and the statistical co-occurrence patterns of large text corpora. This is a reminder that, when using word vectors for semantic analysis, the model may rank words that are semantically related but not the exact intended counterpart as highly similar; this is both a characteristic of word-vector models and a limitation to keep in mind.

b. Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.

Note: You may have to try many analogies to find one that works!

Result:

During training, the relationships between "dark" and "black" and between "light" and "white" are likely learned because these pairs co-occur frequently in text, so they end up close to each other in the vector space. This analogy (dark : black :: light : white) shows how a word-vector model can capture and reproduce semantic relationships grounded in common perception and linguistic habit.
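The analogy described above can be checked with the same pattern (a sketch; assumes wv_from_bin is the loaded GloVe model):

# dark : black :: light : x  =>  x ≈ black - dark + light; expect "white" at or near the top
print(wv_from_bin.most_similar(positive=["black", "light"], negative=["dark"], topn=5))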

Question 2.6: Incorrect Analogy [code + written] (1.5 points)

  • In the model's training data, "foot" and "square" frequently appear together in contexts about area or real estate ("square feet" as a unit of area), which can pull the meaning of "foot" toward area-related directions.
  • When resolving "foot", the model therefore links it more strongly to area-related contexts than to clothing-related ones (such as socks).
  • Because "foot" co-occurs with "square" far more often than with items of clothing such as "sock", the model, absent explicit context, ties "foot" more tightly to area-related vocabulary. (A reconstruction of the kind of query involved is sketched below.)
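A plausible reconstruction of the kind of query behind this answer, assuming the intended analogy was hand : glove :: foot : sock (that specific analogy is an assumption):

# Assumption: the intended analogy was hand : glove :: foot : x, expecting "sock"
# x ≈ glove - hand + foot
print(wv_from_bin.most_similar(positive=["glove", "foot"], negative=["hand"], topn=10))
# Per the explanation above, area-related words (as in "square feet") can crowd out "sock"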

 

Question 2.7: Guided Analysis of Bias in Word Vectors [written] (1 point) 

Question 2.8: Independent Analysis of Bias in Word Vectors [code + written] (1 point)

Question 2.9: Thinking About Bias [written] (2 points) 

a.

Bias typically enters word vectors through the training data. If certain words frequently co-occur with particular attributes or categories in that data, their vectors will tend to end up close to each other. For example, if in news articles or literature the word "nurse" co-occurs with "she" much more often than with "he", the resulting vector for "nurse" will tend to sit closer to "woman" than to "man", reflecting a bias that links the profession with a gender.
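The nurse/she example can be probed directly (an illustrative sketch; assumes wv_from_bin is the loaded GloVe model, and the exact numbers depend on the vectors used):

# If the bias described above is present, the first value in each pair should be larger
print(wv_from_bin.similarity("nurse", "she"), wv_from_bin.similarity("nurse", "he"))
print(wv_from_bin.similarity("nurse", "woman"), wv_from_bin.similarity("nurse", "man"))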

b.

One way to mitigate bias in word vectors is to apply a debiasing technique such as hard debiasing. The idea is to adjust the vector space to reduce unfair gender, racial, or other biases while preserving the usefulness of the vectors. For gender bias, for instance, one can identify a gender direction and adjust the vectors so that words such as "doctor" and "nurse" end up equidistant from "man" and "woman". A sketch of the core "neutralize" step follows.
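A minimal sketch of the "neutralize" step of hard debiasing (Bolukbasi et al., 2016), assuming the gender direction is estimated from a single he/she pair; the full method uses several definitional pairs and an additional "equalize" step:

import numpy as np

def neutralize(word_vec, direction):
    # Remove the component of word_vec that lies along the (normalized) bias direction
    g = direction / np.linalg.norm(direction)
    return word_vec - np.dot(word_vec, g) * g

# Crude one-pair estimate of a gender direction (assumes wv_from_bin is loaded)
gender_dir = wv_from_bin["he"] - wv_from_bin["she"]
nurse_debiased = neutralize(wv_from_bin["nurse"], gender_dir)
# The debiased vector has zero component along gender_dir, so it no longer
# leans toward "he" or "she" along that direction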
