CS224n_2019_Assignment1: Exploring Word Vectors Coding Solution

A1 is not hard; treat it as a warm-up assignment.

Word Vectors
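
The solution cells below assume the assignment notebook's standard setup. A minimal sketch of those imports is shown here for reference (the names np, plt, TruncatedSVD, and pprint match the notebook; wv_from_bin, used in Part 2, holds the pretrained word2vec vectors, loaded roughly as in the commented lines):

import pprint
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# Part 2 additionally loads pretrained word2vec vectors via gensim, roughly:
# import gensim.downloader as api
# wv_from_bin = api.load("word2vec-google-news-300")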

Part 1: Count-Based Word Vectors (10 points)

Co-Occurrence

Question 1.1: Implement distinct_words [code] (2 points)

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    
    corpus_words = [word for document in corpus for word in document]  # flatten the corpus into one word list
    corpus_words = sorted(set(corpus_words))  # a set removes duplicates; sorted() returns a sorted list
    num_corpus_words = len(corpus_words)
    
    # ------------------

    return corpus_words, num_corpus_words
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# ---------------------

# Define toy corpus
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
test_corpus_words, num_corpus_words = distinct_words(test_corpus)

# Correct answers
ans_test_corpus_words = sorted(list(set(["START", "All", "ends", "that", "gold", "All's", "glitters", "isn't", "well", "END"])))
ans_num_corpus_words = len(ans_test_corpus_words)

# Test correct number of words
assert(num_corpus_words == ans_num_corpus_words), "Incorrect number of distinct words. Correct: {}. Yours: {}".format(ans_num_corpus_words, num_corpus_words)

# Test correct words
assert (test_corpus_words == ans_test_corpus_words), "Incorrect corpus_words.\nCorrect: {}\nYours:   {}".format(str(ans_test_corpus_words), str(test_corpus_words))

# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------

Question 1.2: Implement compute_co_occurrence_matrix [code] (3 points)

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): 
                Co-occurrence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    
    # ------------------
    # Write your implementation here.
    
    M = np.zeros((num_words, num_words))
    word2Ind = {word: index for index, word in enumerate(words)}
    
    for document in corpus:
        for i, word in enumerate(document):
            # Count co-occurrences with words in the window to the left of position i...
            for j in range(max(i - window_size, 0), i):
                M[word2Ind[word], word2Ind[document[j]]] += 1
            # ...and with words in the window to the right.
            for j in range(i + 1, min(i + 1 + window_size, len(document))):
                M[word2Ind[word], word2Ind[document[j]]] += 1
                
    # ------------------

    return M, word2Ind
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# ---------------------

# Define toy corpus and get student's co-occurrence matrix
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1) 

# Correct M and word2Ind
M_test_ans = np.array( 
    [[0., 0., 0., 1., 0., 0., 0., 0., 1., 0.,],
     [0., 0., 0., 1., 0., 0., 0., 0., 0., 1.,],
     [0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,],
     [1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,],
     [0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,],
     [0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,],
     [0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,],
     [0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,],
     [1., 0., 0., 0., 1., 1., 0., 0., 0., 1.,],
     [0., 1., 1., 0., 1., 0., 0., 0., 1., 0.,]]
)
word2Ind_ans = {'All': 0, "All's": 1, 'END': 2, 'START': 3, 'ends': 4, 'glitters': 5, 'gold': 6, "isn't": 7, 'that': 8, 'well': 9}

# Test correct word2Ind
assert (word2Ind_ans == word2Ind_test), "Your word2Ind is incorrect:\nCorrect: {}\nYours: {}".format(word2Ind_ans, word2Ind_test)

# Test correct M shape
assert (M_test.shape == M_test_ans.shape), "M matrix has incorrect shape.\nCorrect: {}\nYours: {}".format(M_test.shape, M_test_ans.shape)

# Test correct M values
for w1 in word2Ind_ans.keys():
    idx1 = word2Ind_ans[w1]
    for w2 in word2Ind_ans.keys():
        idx2 = word2Ind_ans[w2]
        student = M_test[idx1, idx2]
        correct = M_test_ans[idx1, idx2]
        if student != correct:
            print("Correct M:")
            print(M_test_ans)
            print("Your M: ")
            print(M_test)
            raise AssertionError("Incorrect count at index ({}, {})=({}, {}) in matrix M. Yours has {} but should have {}.".format(idx1, idx2, w1, w2, student, correct))

# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------
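
To read a single count out of M, index with word2Ind on both axes. For example, with window_size=1, 'that' co-occurs once with 'glitters' in the first test document ("START All that glitters isn't gold END"):

# Assumes M_test and word2Ind_test from the sanity check above.
print(M_test[word2Ind_test["that"], word2Ind_test["glitters"]])  # 1.0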

Question 1.3: Implement reduce_to_k_dim [code] (1 point)

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
    # ------------------
    # Write your implementation here.
    
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)  # randomized SVD keeping the top k components
    M_reduced = svd.fit_transform(M)  # returns U * S, i.e. each word's k-dimensional embedding
    
    # ------------------

    print("Done.")
    return M_reduced
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness 
# In fact we only check that your M_reduced has the right dimensions.
# ---------------------

# Define toy corpus and run student code
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
M_test_reduced = reduce_to_k_dim(M_test, k=2)

# Test proper dimensions
assert (M_test_reduced.shape[0] == 10), "M_reduced has {} rows; should have {}".format(M_test_reduced.shape[0], 10)
assert (M_test_reduced.shape[1] == 2), "M_reduced has {} columns; should have {}".format(M_test_reduced.shape[1], 2)

# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)
Running Truncated SVD over 10 words...
Done.
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------
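
As a side check on the docstring's claim that fit_transform returns U * S: for M = U S V^T, TruncatedSVD's output matches the first k columns of U scaled by the singular values, up to a sign flip per component. A small sketch, assuming only numpy and TruncatedSVD as imported above:

rng = np.random.RandomState(0)
M_demo = rng.rand(10, 10)

svd = TruncatedSVD(n_components=2, n_iter=10, random_state=0)
M_demo_reduced = svd.fit_transform(M_demo)  # shape (10, 2)

# Compare against numpy's exact SVD: columns of U scaled by the singular values.
U, S, Vt = np.linalg.svd(M_demo)
US = U[:, :2] * S[:2]

# The two agree up to a per-column sign ambiguity.
print(np.allclose(np.abs(M_demo_reduced), np.abs(US), atol=1e-6))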

Question 1.4: Implement plot_embeddings [code] (1 point)

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus, k)): matrix of k-dimensional word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    # Plot each requested word at its 2-D coordinates, with a text label next to the point.
    for word in words:
        x = M_reduced[word2Ind[word], 0]
        y = M_reduced[word2Ind[word], 1]
        plt.scatter(x, y, marker='x', color='red')
        plt.text(x, y, word, fontsize=9)
    plt.show()

    # ------------------
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# The plot produced should look like the "test solution plot" depicted below. 
# ---------------------

print ("-" * 80)
print ("Outputted Plot:")

M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])
word2Ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}
words = ['test1', 'test2', 'test3', 'test4', 'test5']
plot_embeddings(M_reduced_plot_test, word2Ind_plot_test, words)

print ("-" * 80)
--------------------------------------------------------------------------------
Outputted Plot:

[Plot: 'test1' through 'test5' at the four corners and the center of a square]

--------------------------------------------------------------------------------

Question 1.5: Co-Occurrence Plot Analysis [written] (3 points)

[Plot: 2-D embeddings of the sample words, computed from the co-occurrence matrix]

Write your answer here.

The points fall roughly along a curve, and the distance between two words is meant to reflect how closely they are related.

- What clusters together in 2-dimensional embedding space?

‘petroleum’ and ‘industry’ cluster together because their meanings are correlated.
‘energy’ and ‘oil’ cluster together because oil is a type of energy.
‘ecuador’, ‘kuwait’ and ‘venezuela’ cluster together because they are all country names.
‘output’, ‘barrels’ and ‘bpd’ cluster together because ‘barrels’ and ‘bpd’ (barrels per day) both measure the quantity of ‘output’.

- What doesn’t cluster together that you might think should have?

The cluster of ‘output’, ‘barrels’ and ‘bpd’ sits relatively far from the clusters of ‘energy’/‘oil’ and ‘petroleum’/‘industry’, even though all of these words are related to oil.

Part 2: Prediction-Based Word Vectors (15 points)

Question 2.1: Word2Vec Plot Analysis [written] (4 points)

[Plot: 2-D embeddings of the same sample words, computed from the pretrained word2vec vectors]

Write your answer here.
- What clusters together in 2-dimensional embedding space?

‘barrels’ and ‘bpd’;
‘energy’, ‘industry’, ‘kuwait’, ‘oil’ and ‘petroleum’;
‘venezuela’, ‘ecuador’ and ‘output’.

- What doesn’t cluster together that you might think should have?

‘oil’ and ‘energy’ with ‘barrels’ and ‘bpd’: the measurement words would be expected to sit near the oil words;
‘venezuela’ and ‘ecuador’ with ‘kuwait’: all three are country names, yet ‘kuwait’ lands in a different cluster.

- How is the plot different from the one generated earlier from the co-occurrence matrix?

Compared with the co-occurrence plot, the distances between words in this plot correspond less well to their actual relatedness. This may be caused by the distortion introduced when compressing the high-dimensional word2vec space down to two dimensions.
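
One way to quantify that distortion (a side note, not required by the assignment): TruncatedSVD exposes explained_variance_ratio_, so you can check how much of the reduced matrix's variance the two kept components actually retain before trusting the 2-D picture.

# A sketch: fit on whatever matrix is being reduced (the co-occurrence counts
# here, or the stacked word2vec vectors in Part 2) and inspect retained variance.
svd = TruncatedSVD(n_components=2, n_iter=10)
svd.fit(M_test)
print(svd.explained_variance_ratio_.sum())  # fraction of variance kept by 2 components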

Cosine Similarity

Question 2.2: Polysemous Words (2 points) [code + written]

# ------------------
# Write your polysemous word exploration code here.
wv_from_bin.most_similar("bank")

# ------------------
[('banks', 0.7440759539604187),
 ('banking', 0.690161406993866),
 ('Bank', 0.6698698997497559),
 ('lender', 0.6342284679412842),
 ('banker', 0.6092953681945801),
 ('depositors', 0.6031531691551208),
 ('mortgage_lender', 0.5797975659370422),
 ('depositor', 0.5716428160667419),
 ('BofA', 0.5714625120162964),
 ('Citibank', 0.5589520335197449)]
wv_from_bin.most_similar("glass")
[('R._Mazzei_fused', 0.6665399074554443),
 ('Christian_Audigier_nightclub', 0.6632694602012634),
 ('copper_alloy_garnets', 0.6343655586242676),
 ('Nelmeus', 0.6274422407150269),
 ('fiber_fusion_splicing', 0.6229820251464844),
 ('Plexiglass', 0.585858941078186),
 ('slashing_Leonardo_DiCaprio', 0.5850011110305786),
 ('plexiglass', 0.5823023319244385),
 ('Plexiglas', 0.5803930759429932),
 ("#Q'##_unaudited", 0.5798528790473938)]
wv_from_bin.most_similar("fall")
[('falling', 0.6371318101882935),
 ('falls', 0.6107184290885925),
 ('drop', 0.5912517309188843),
 ('tumble', 0.569644570350647),
 ('rise', 0.5596301555633545),
 ('plummet', 0.5581283569335938),
 ('fell', 0.5548586845397949),
 ('spring', 0.541506826877594),
 ('Fall', 0.5406967401504517),
 ('sag', 0.5160202383995056)]
wv_from_bin.most_similar("green")
[('wearin_o', 0.57559734582901),
 ('greener', 0.5499471426010132),
 ('workers_differently_Corenthal', 0.5424154996871948),
 ('QEII_TJ_Marta', 0.5406073331832886),
 ('red', 0.5360240936279297),
 ('Leyritz_Ford_Expedition', 0.5289624929428101),
 ('greening', 0.5261998176574707),
 ('Pentwater_Civic', 0.5239853858947754),
 ('Kimmie_Mi_Hyun_Kim', 0.5214661955833435),
 ('eco_friendly', 0.5156992077827454)]
Write your answer here.

- State the polysemous word you discovered and the multiple meanings that occur in the top 10.

Of the words tried, ‘fall’ shows multiple senses in its top 10: ‘falling’, ‘drop’, ‘tumble’ and ‘plummet’ reflect the motion/decline sense, while ‘spring’ and ‘Fall’ reflect the season sense.

- Why do you think many of the polysemous words you tried didn’t work?

The vectors were trained on news text, where one sense of a word usually dominates, so the nearest neighbors mostly reflect that dominant sense rather than the rarer ones.

Question 2.3: Synonyms & Antonyms (2 points) [code + written]

# ------------------
# Write your synonym & antonym exploration code here.

w1 = "true"
w2 = "really"
w3 = "false"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

# ------------------
Synonyms true, really have cosine distance: 0.6886207759380341
Antonyms true, false have cosine distance: 0.6290567219257355
Write your answer here.

Synonyms can differ in intensity, register, or part of speech (‘true’ is an adjective, ‘really’ an adverb), which enlarges their cosine distance, while antonyms such as ‘true’ and ‘false’ appear in very similar contexts and therefore end up close together.
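
For reference, the quantity distance() reports is one minus cosine similarity. A minimal numpy sketch of that computation (v1 and v2 are raw embedding vectors, e.g. wv_from_bin["true"]):

def cosine_distance(v1, v2):
    # Cosine distance = 1 - cosine similarity; matches gensim's distance().
    return 1.0 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# cosine_distance(wv_from_bin["true"], wv_from_bin["false"])  # ~0.629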

Solving Analogies with Word Vectors

# Run this cell to answer the analogy -- man : king :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))
[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411999702454)]
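
Under the hood, most_similar(positive=['woman', 'king'], negative=['man']) searches for the word whose vector is most similar to king - man + woman. A simplified numpy sketch of that idea (gensim's actual implementation works on unit-normalized vectors and also excludes the query words, as mimicked here; embeddings is a plain {word: vector} dict):

def analogy(embeddings, a, b, c, topn=3):
    # Solve a : b :: c : x via the nearest cosine neighbor to b - a + c.
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    scores = []
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue  # exclude the query words, as gensim does
        scores.append((float(np.dot(vec, target) / np.linalg.norm(vec)), word))
    return sorted(scores, reverse=True)[:topn]

# e.g. analogy(embs, 'man', 'king', 'woman') should rank 'queen' first.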

Question 2.4: Finding Analogies [code + written] (2 Points)

# ------------------
# Write your analogy exploration code here.

# the analogy -- gymnastics : gymnast :: piano : x
pprint.pprint(wv_from_bin.most_similar(positive=['gymnast', 'piano'], negative=['gymnastics']))

# ------------------
[('pianist', 0.7164468169212341),
 ('violin', 0.7020024061203003),
 ('violinist', 0.681442379951477),
 ('cello', 0.6716278791427612),
 ('cellist', 0.6693906784057617),
 ('saxophone', 0.6536856889724731),
 ('clarinet', 0.6517087817192078),
 ('alto_saxophone', 0.6199961304664612),
 ('alto_sax', 0.6194193363189697),
 ('trombone', 0.6166050434112549)]
Write your answer here.

The analogy gymnastics : gymnast :: piano : pianist holds: ‘pianist’ is the top result.

Question 2.5: Incorrect Analogy [code + written] (1 point)

# ------------------
# Write your incorrect analogy exploration code here.

# the analogy -- tomato : vegetable :: grape : x
pprint.pprint(wv_from_bin.most_similar(positive=['grape', 'vegetable'], negative=['tomato']))

# ------------------
[('grapes', 0.5778293013572693),
 ('Thompson_seedless_grapes', 0.5553233027458191),
 ('grape_cultivation', 0.5311552286148071),
 ('winegrapes', 0.5253068208694458),
 ('wine_grapes', 0.5252600908279419),
 ('winegrape', 0.5084421038627625),
 ('varietal', 0.5068702697753906),
 ('Bordeaux_grape_varieties', 0.5039265751838684),
 ('almond', 0.4998564124107361),
 ('Cabernet_Sauvignon_grapes', 0.49503305554389954)]
Write your answer here.

The intended analogy is tomato : vegetable :: grape : fruit, but ‘fruit’ does not appear in the top 10; the model instead returns ‘grapes’ and various grape varieties.

Question 2.6: Guided Analysis of Bias in Word Vectors [written] (1 point)

# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
# most dissimilar from.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'boss'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'boss'], negative=['woman']))
[('bosses', 0.5522644519805908),
 ('manageress', 0.49151360988616943),
 ('exec', 0.459408164024353),
 ('Manageress', 0.45598435401916504),
 ('receptionist', 0.4474116861820221),
 ('Jane_Danson', 0.44480547308921814),
 ('Fiz_Jennie_McAlpine', 0.44275766611099243),
 ('Coronation_Street_actress', 0.44275569915771484),
 ('supremo', 0.4409852921962738),
 ('coworker', 0.4398624897003174)]

[('supremo', 0.6097397804260254),
 ('MOTHERWELL_boss', 0.5489562153816223),
 ('CARETAKER_boss', 0.5375303626060486),
 ('Bully_Wee_boss', 0.5333974361419678),
 ('YEOVIL_Town_boss', 0.5321705341339111),
 ('head_honcho', 0.5281980037689209),
 ('manager_Stan_Ternent', 0.525971531867981),
 ('Viv_Busby', 0.5256163477897644),
 ('striker_Gabby_Agbonlahor', 0.5250812768936157),
 ('BARNSLEY_boss', 0.5238943099975586)]
Write your answer here.

Gender bias does appear in the word embeddings:
(a), the ‘woman’-shifted list, contains jobs such as ‘receptionist’ and ‘coworker’ that are not at the level of the ‘boss’ title, while most words in (b), the ‘man’-shifted list, are variants of ‘boss’, ‘head honcho’ and ‘manager’.

Question 2.7: Independent Analysis of Bias in Word Vectors [code + written] (2 points)

# ------------------
# Write your bias exploration code here.

pprint.pprint(wv_from_bin.most_similar(positive=['american', 'manager'], negative=['chinese']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['chinese', 'manager'], negative=['american']))

# ------------------
[('vice_president', 0.5888485908508301),
 ('vp', 0.5321600437164307),
 ('director', 0.5288139581680298),
 ('manger', 0.5164791941642761),
 ('mananger', 0.4895482063293457),
 ('vicepresident', 0.4752706289291382),
 ('coordinator', 0.4739943742752075),
 ('administrator', 0.4649930000305176),
 ('senior_vp', 0.46086063981056213),
 ('svp', 0.4597167670726776)]

[('manger', 0.5492449998855591),
 ('managing_director', 0.514661431312561),
 ('Manager', 0.509232759475708),
 ('General_Manager', 0.4886166453361511),
 ('supervisor', 0.48655033111572266),
 ('mananger', 0.47807249426841736),
 ('director', 0.4770866930484772),
 ('vice_president', 0.4578412175178528),
 ('General_Manger', 0.4420962929725647),
 ('Jialin', 0.42849200963974)]
Write your answer here.

Racial bias does appear in the word embeddings:
(a), the ‘american’-shifted list, contains five ‘vice_president’-style titles and three ‘manager’/‘supervisor’/‘director’-style titles, while (b), the ‘chinese’-shifted list, contains seven ‘manager’/‘supervisor’/‘director’-style titles and only one ‘vice_president’. This suggests that, in this embedding, American managers are more readily associated with high-level titles such as vice president than Chinese managers are.

Question 2.8: Thinking About Bias [written] (1 point)

What might be the cause of these biases in the word vectors?

Write your answer here.