2022 CS224N Assignment 1: Exploring Word Vectors

Part 1: Count-Based Word Vectors

Question 1.1: Implement distinct_words

Approach:

corpus_words: the sorted list of distinct words in the corpus
1. Following the hint, use a list comprehension to collect every word in the corpus into corpus_words; a double for loop also works, but is slower than a comprehension.
2. Use set() to deduplicate, and wrap the result in list() to convert it back to a list.
3. Sort with sorted().

n_corpus_words: the number of distinct words in the corpus
1. Take the length of corpus_words with len().

code
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    n_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    corpus_words = [word for sen in corpus for word in sen]  # flatten all documents into a single word list
    corpus_words = sorted(list(set(corpus_words)))            # deduplicate with set(), then sort
    n_corpus_words = len(corpus_words)                        # count the distinct words
    # ------------------

    return corpus_words, n_corpus_words
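
As a quick sanity check (a toy example, not the notebook's official test cell):

test_corpus = [["<START>", "all", "that", "glitters", "<END>"],
               ["<START>", "all", "is", "well", "<END>"]]
words, n = distinct_words(test_corpus)
print(words)  # ['<END>', '<START>', 'all', 'glitters', 'is', 'that', 'well']
print(n)      # 7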

Question 1.2: Implement compute_co_occurrence_matrix

Approach:

word2ind: a dictionary mapping each word to its index in matrix M
1. Build it with a dict comprehension in a single pass over words.

M: the co-occurrence matrix
1. Create a zero matrix with numpy.zeros(), shaped according to n_words.
2. Use a double for loop to visit each word position of each sentence in the corpus.
3. Add an inner for loop over the window size; note that range() includes the left endpoint and excludes the right, hence range(1, window_size+1).
4. Use two if checks to make sure the window does not run past the sentence boundaries.
5. Look up the matrix indices of the center word and the context word, and increment the corresponding entry of M by 1.

code
def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): 
                Co-occurrence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    M = None
    word2ind = {}
    
    # ------------------
    # Write your implementation here.
    import numpy as np

    M = np.zeros((n_words, n_words))
    word2ind = {word: i for i, word in enumerate(words)}
    for sen in corpus:
        for center in range(len(sen)):
            center_i = word2ind[sen[center]]
            # Check up to window_size positions on each side of the center word.
            for i in range(1, window_size + 1):
                if center - i >= 0:  # left context: guard against running past the start
                    M[center_i][word2ind[sen[center - i]]] += 1
                if center + i <= len(sen) - 1:  # right context: guard against running past the end
                    M[center_i][word2ind[sen[center + i]]] += 1
    # ------------------

    return M, word2ind
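
On a tiny corpus the counts are easy to verify by hand (a toy check in the same corpus format, not the notebook's official test):

test_corpus = [["<START>", "all", "that", "glitters", "<END>"]]
M, word2ind = compute_co_occurrence_matrix(test_corpus, window_size=1)
# With window_size=1, "all" co-occurs once each with "<START>" and "that".
print(M[word2ind["all"]][word2ind["that"]])      # 1.0
print(M[word2ind["all"]][word2ind["glitters"]])  # 0.0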

Question 1.3: Implement reduce_to_k_dim

Approach:

M_reduced: the k-dimensional word-embedding matrix after dimensionality reduction
1. Following the hint, use sklearn.decomposition.TruncatedSVD; its fit_transform() method directly returns the transformed matrix U * S of shape (n_words, k), which is exactly what the docstring asks for.

code
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
    # ------------------
    # Write your implementation here.
    from sklearn.decomposition import TruncatedSVD

    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)  # fit_transform returns U * S of shape (n_words, k), as the docstring requires
    # ------------------

    print("Done.")
    return M_reduced
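
For example (a toy check; only the output shape is guaranteed, since the embedding values depend on the SVD):

test_corpus = [["<START>", "all", "that", "glitters", "is", "not", "gold", "<END>"]]
M, word2ind = compute_co_occurrence_matrix(test_corpus, window_size=4)
M_reduced = reduce_to_k_dim(M, k=2)
print(M_reduced.shape)  # (8, 2): one 2-D embedding per distinct word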

Question 1.4: Implement plot_embeddings

Approach:

Adapt the example code to draw the plot:
1. From the test case and its expected output, the main job is finding the x and y coordinates of each word to plot.
2. word2ind maps each word to its row in M_reduced; position [0] of that row is the x coordinate and position [1] is the y coordinate.
3. Modify the provided example code accordingly.

def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    import matplotlib.pyplot as plt

    for word in words:
        idx = word2ind[word]  # row of M_reduced holding this word's 2-D embedding
        x, y = M_reduced[idx][0], M_reduced[idx][1]
        plt.scatter(x, y, marker='x', color='red')
        plt.text(x, y, word, fontsize=9)
    plt.show()
    # ------------------
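
Exercising the function on illustrative toy data (the points and labels below are made up for this sketch, not the notebook's test values):

import numpy as np

toy_words = ["good", "bad", "war", "peace"]
toy_M_reduced = np.array([[1.0, 1.0], [-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0]])
toy_word2ind = {w: i for i, w in enumerate(toy_words)}
plot_embeddings(toy_M_reduced, toy_word2ind, toy_words)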

[Figure: scatterplot of the test points, each marked with a red 'x' and labeled]

Question 1.5: Co-Occurrence Plot Analysis [written]

[Figure: 2-D embeddings of the selected words computed from the co-occurrence matrix]

grain and corn cluster together in the 2-D space.

grain and grains do not cluster together.

Part 2: Prediction-Based Word Vectors

Question 2.1: GloVe Plot Analysis

[Figure: 2-D plot of the GloVe embeddings for the same words]

written

Differences:
Judged against a unit circle centered at (0, 0), this plot spreads the words much more evenly, the circle's outline is clearly visible, and some near-synonyms cluster more tightly. In the Part 1 plot generated from the co-occurrence matrix, all the words are bunched on the right side of the unit circle and near-synonyms do not cluster.

Question 2.2: Words with Multiple Meanings

code
# ------------------
# Write your implementation here.
word = "code"
wv_from_bin.most_similar(word)
# ------------------
written

Reason: this is probably corpus-dependent; the familiar sense of a polysemous word appears much more frequently than its rarer senses, so the rarer senses barely shape the vector.

Question 2.3: Synonyms & Antonyms

code
# ------------------
# Write your implementation here.
w1 = "advantage"
w2 = "virtue"
w3 = "disadvantage"

dis_1 = wv_from_bin.distance(w1, w2)
dis_2 = wv_from_bin.distance(w1, w3)

print("Cosine Distance ({},{}): {}".format(w1, w2, dis_1))
print("Cosine Distance ({},{}): {}".format(w1, w3, dis_2))
# ------------------
written

Reason: w1 and w3 probably occur in more similar contexts than w1 and w2 do, and these vectors are built from context, so the antonym ends up closer than the synonym.

Question 2.4: Analogies with Word Vectors

written

The returned vector x maximizes cosine similarity to g + w - m, i.e. grandfather + woman - man.
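
As a quick check of this expression (a sketch assuming wv_from_bin is the GloVe KeyedVectors object loaded earlier in the notebook), gensim's most_similar ranks words by cosine similarity to exactly this combination:

import pprint

# Rank words by cosine similarity to grandfather + woman - man.
pprint.pprint(wv_from_bin.most_similar(positive=['grandfather', 'woman'], negative=['man']))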

Question 2.5: Finding Analogies

code
# ------------------
# Write your implementation here.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))
# ------------------

Question 2.6: Incorrect Analogy

code
# ------------------
# Write your implementation here.
pprint.pprint(wv_from_bin.most_similar(positive=['leaf', 'flower'], negative=['tree']))
# ------------------

Question 2.7: Guided Analysis of Bias in Word Vectors

written

The results for girls' toys feature "doll" heavily.
The results for boys' toys feature "robot" and "manufacture".

Question 2.8: Independent Analysis of Bias in Word Vectors

code
# ------------------
# Write your implementation here.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'engineer'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'engineer'], negative=['woman']))
# ------------------
[('technician', 0.5853330492973328),
 ('engineers', 0.5717717409133911),
 ('educator', 0.5450620055198669),
 ('engineering', 0.48699596524238586),
 ('contractor', 0.4856792092323303),
 ('nurse', 0.48517873883247375),
 ('schoolteacher', 0.4825061857700348),
 ('teacher', 0.47406384348869324),
 ('mechanic', 0.4704253673553467),
 ('married', 0.4676802158355713)]

[('engineers', 0.5697532892227173),
 ('engineering', 0.5532492995262146),
 ('mechanic', 0.537360429763794),
 ('technician', 0.47810807824134827),
 ('officer', 0.4660565257072449),
 ('inventor', 0.46498754620552063),
 ('scientist', 0.46378421783447266),
 ('worked', 0.46068844199180603),
 ('colonel', 0.45147472620010376),
 ('commander', 0.4491448998451233)]
written

Taking woman, man, and engineer as the example: in the results, the woman direction concentrates on education, nursing, and being married, while the man direction concentrates on various technical professions.

Question 2.9: Thinking About Bias

written

1. The corpus may not cover a wide enough range and may contain too much repeated content, but the bias mainly originates from the perceptions of the people who wrote the text.

2. One way to measure it: count the frequency of bias-related words in each document and sort the documents in descending order, as sketched below.
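
A minimal sketch of the measurement idea in point 2, assuming a tokenized corpus (a list of lists of words, as in Part 1); bias_term_frequencies and the toy data are hypothetical illustrations, not part of the assignment:

from collections import Counter

def bias_term_frequencies(corpus, bias_terms):
    """Count bias-related terms in each document and rank the documents
    by their total count, in descending order."""
    bias_terms = set(bias_terms)
    ranked = []
    for doc_id, doc in enumerate(corpus):
        counts = Counter(w for w in doc if w in bias_terms)
        ranked.append((doc_id, sum(counts.values()), counts))
    return sorted(ranked, key=lambda t: t[1], reverse=True)

# Hypothetical usage with a toy corpus and a hand-picked term list:
toy_corpus = [["the", "nurse", "said", "she", "agreed"],
              ["the", "engineer", "said", "he", "agreed"]]
print(bias_term_frequencies(toy_corpus, ["nurse", "engineer", "she", "he"]))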
