2022 CS224N Assignment 1: Exploring Word Vectors

Part 1: Count-Based Word Vectors

Question 1.1: Implement distinct_words

Approach:

corpus_words: the sorted list of distinct words in the corpus
1. Following the hint, use a list comprehension to collect every word in the corpus into corpus_words; a double for loop also works, but is slower than a comprehension.
2. Use set() to deduplicate, and wrap the result in list() to convert it back to a list.
3. Sort with sorted().

n_corpus_words: the number of distinct words in the corpus
1. Take the length of corpus_words with len().

code
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    n_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    corpus_words = [word for sen in corpus for word in sen]  # flatten all documents into a single word list
    corpus_words = sorted(list(set(corpus_words)))            # deduplicate with set(), then sort
    n_corpus_words = len(corpus_words)                        # count the distinct words
    # ------------------

    return corpus_words, n_corpus_words
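
As a quick sanity check (a toy example, not the notebook's official test cell):

test_corpus = [["<START>", "all", "that", "glitters", "<END>"],
               ["<START>", "all", "is", "well", "<END>"]]
words, n = distinct_words(test_corpus)
print(words)  # ['<END>', '<START>', 'all', 'glitters', 'is', 'that', 'well']
print(n)      # 7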

Question 1.2: Implement compute_co_occurrence_matrix

Approach:

word2ind: a dictionary mapping each word to its index in matrix M
1. Build it with a dict comprehension in a single pass over words.

M: the co-occurrence matrix
1. Create a zero matrix with numpy.zeros(), shaped according to n_words.
2. Use a double for loop to visit each word position of each sentence in the corpus.
3. Add an inner for loop over the window size; note that range() includes the left endpoint and excludes the right, hence range(1, window_size+1).
4. Use two if checks to make sure the window does not run past the sentence boundaries.
5. Look up the matrix indices of the center word and the context word, and increment the corresponding entry of M by 1.

code
def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): 
                Co-occurrence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    M = None
    word2ind = {}
    
    # ------------------
    # Write your implementation here.
    import numpy as np

    M = np.zeros((n_words, n_words))
    word2ind = {word: i for i, word in enumerate(words)}
    for sen in corpus:
        for center in range(len(sen)):
            center_i = word2ind[sen[center]]
            # Check up to window_size positions on each side of the center word.
            for i in range(1, window_size + 1):
                if center - i >= 0:  # left context: guard against running past the start
                    M[center_i][word2ind[sen[center - i]]] += 1
                if center + i <= len(sen) - 1:  # right context: guard against running past the end
                    M[center_i][word2ind[sen[center + i]]] += 1
    # ------------------

    return M, word2ind
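
On a tiny corpus the counts are easy to verify by hand (a toy check in the same corpus format, not the notebook's official test):

test_corpus = [["<START>", "all", "that", "glitters", "<END>"]]
M, word2ind = compute_co_occurrence_matrix(test_corpus, window_size=1)
# With window_size=1, "all" co-occurs once each with "<START>" and "that".
print(M[word2ind["all"]][word2ind["that"]])      # 1.0
print(M[word2ind["all"]][word2ind["glitters"]])  # 0.0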

Question 1.3: Implement reduce_to_k_dim

Approach:

M_reduced: the k-dimensional word-embedding matrix after dimensionality reduction
1. Following the hint, use sklearn.decomposition.TruncatedSVD; its fit_transform() method directly returns the transformed matrix U * S of shape (n_words, k), which is exactly what the docstring asks for.

code
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
    # ------------------
    # Write your implementation here.
    from sklearn.decomposition import TruncatedSVD

    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)  # fit_transform returns U * S of shape (n_words, k), as the docstring requires
    # ------------------

    print("Done.")
    return M_reduced
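
For example (a toy check; only the output shape is guaranteed, since the embedding values depend on the SVD):

test_corpus = [["<START>", "all", "that", "glitters", "is", "not", "gold", "<END>"]]
M, word2ind = compute_co_occurrence_matrix(test_corpus, window_size=4)
M_reduced = reduce_to_k_dim(M, k=2)
print(M_reduced.shape)  # (8, 2): one 2-D embedding per distinct word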

Question 1.4: Implement plot_embeddings

Approach:

Adapt the example code to draw the plot:
1. From the test case and its expected output, the main job is finding the x and y coordinates of each word to plot.
2. word2ind maps each word to its row in M_reduced; position [0] of that row is the x coordinate and position [1] is the y coordinate.
3. Modify the provided example code accordingly.

def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    import matplotlib.pyplot as plt

    for word in words:
        idx = word2ind[word]  # row of M_reduced holding this word's 2-D embedding
        x, y = M_reduced[idx][0], M_reduced[idx][1]
        plt.scatter(x, y, marker='x', color='red')
        plt.text(x, y, word, fontsize=9)
    plt.show()
    # ------------------
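
Exercising the function on illustrative toy data (the points and labels below are made up for this sketch, not the notebook's test values):

import numpy as np

toy_words = ["good", "bad", "war", "peace"]
toy_M_reduced = np.array([[1.0, 1.0], [-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0]])
toy_word2ind = {w: i for i, w in enumerate(toy_words)}
plot_embeddings(toy_M_reduced, toy_word2ind, toy_words)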

[Figure: scatterplot of the test points, each marked with a red 'x' and labeled]

Question 1.5: Co-Occurrence Plot Analysis [written]

[Figure: 2-D embeddings of the selected words computed from the co-occurrence matrix]

grain and corn cluster together in the 2-D space.

grain and grains do not cluster together.

Part 2: Prediction-Based Word Vectors

Question 2.1: GloVe Plot Analysis

[Figure: 2-D plot of the GloVe embeddings for the same words]

written

Differences:
Judged against a unit circle centered at (0, 0), this plot spreads the words much more evenly, the circle's outline is clearly visible, and some near-synonyms cluster more tightly. In the Part 1 plot generated from the co-occurrence matrix, all the words are bunched on the right side of the unit circle and near-synonyms do not cluster.

Question 2.2: Words with Multiple Meanings

code
# ------------------
# Write your implementation here.
word = "code"
wv_from_bin.most_similar(word)
# ------------------
written

Reason: this is probably corpus-dependent; the familiar sense of a polysemous word appears much more frequently than its rarer senses, so the rarer senses barely shape the vector.

Question 2.3: Synonyms & Antonyms

code
# ------------------
# Write your implementation here.
w1 = "advantage"
w2 = "virtue"
w3 = "disadvantage"

dis_1 = wv_from_bin.distance(w1, w2)
dis_2 = wv_from_bin.distance(w1, w3)

print("Cosine Distance ({},{}): {}".format(w1, w2, dis_1))
print("Cosine Distance ({},{}): {}".format(w1, w3, dis_2))
# ------------------
written

Reason: w1 and w3 probably occur in more similar contexts than w1 and w2 do, and these vectors are built from context, so the antonym ends up closer than the synonym.

Question 2.4: Analogies with Word Vectors

written

The returned vector x maximizes cosine similarity to g + w - m, i.e. grandfather + woman - man.
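
As a quick check of this expression (a sketch assuming wv_from_bin is the GloVe KeyedVectors object loaded earlier in the notebook), gensim's most_similar ranks words by cosine similarity to exactly this combination:

import pprint

# Rank words by cosine similarity to grandfather + woman - man.
pprint.pprint(wv_from_bin.most_similar(positive=['grandfather', 'woman'], negative=['man']))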

Question 2.5: Finding Analogies

code
# ------------------
# Write your implementation here.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))
# ------------------

Question 2.6: Incorrect Analogy

code
# ------------------
# Write your implementation here.
pprint.pprint(wv_from_bin.most_similar(positive=['leaf', 'flower'], negative=['tree']))
# ------------------

Question 2.7: Guided Analysis of Bias in Word Vectors

written

The results for girls' toys feature "doll" heavily.
The results for boys' toys feature "robot" and "manufacture".

Question 2.8: Independent Analysis of Bias in Word Vectors

code
# ------------------
# Write your implementation here.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'engineer'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'engineer'], negative=['woman']))
# ------------------
[('technician', 0.5853330492973328),
 ('engineers', 0.5717717409133911),
 ('educator', 0.5450620055198669),
 ('engineering', 0.48699596524238586),
 ('contractor', 0.4856792092323303),
 ('nurse', 0.48517873883247375),
 ('schoolteacher', 0.4825061857700348),
 ('teacher', 0.47406384348869324),
 ('mechanic', 0.4704253673553467),
 ('married', 0.4676802158355713)]

[('engineers', 0.5697532892227173),
 ('engineering', 0.5532492995262146),
 ('mechanic', 0.537360429763794),
 ('technician', 0.47810807824134827),
 ('officer', 0.4660565257072449),
 ('inventor', 0.46498754620552063),
 ('scientist', 0.46378421783447266),
 ('worked', 0.46068844199180603),
 ('colonel', 0.45147472620010376),
 ('commander', 0.4491448998451233)]
written

Taking woman, man, and engineer as the example: in the results, the woman direction concentrates on education, nursing, and being married, while the man direction concentrates on various technical professions.

Question 2.9: Thinking About Bias

written

1. The corpus may not cover a wide enough range and may contain too much repeated content, but the bias mainly originates from the perceptions of the people who wrote the text.

2. One way to measure it: count the frequency of bias-related words in each document and sort the documents in descending order, as sketched below.
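
A minimal sketch of the measurement idea in point 2, assuming a tokenized corpus (a list of lists of words, as in Part 1); bias_term_frequencies and the toy data are hypothetical illustrations, not part of the assignment:

from collections import Counter

def bias_term_frequencies(corpus, bias_terms):
    """Count bias-related terms in each document and rank the documents
    by their total count, in descending order."""
    bias_terms = set(bias_terms)
    ranked = []
    for doc_id, doc in enumerate(corpus):
        counts = Counter(w for w in doc if w in bias_terms)
        ranked.append((doc_id, sum(counts.values()), counts))
    return sorted(ranked, key=lambda t: t[1], reverse=True)

# Hypothetical usage with a toy corpus and a hand-picked term list:
toy_corpus = [["the", "nurse", "said", "she", "agreed"],
              ["the", "engineer", "said", "he", "agreed"]]
print(bias_term_frequencies(toy_corpus, ["nurse", "engineer", "she", "he"]))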
