Exploring Word Vectors
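Task Overview
This post collects my personal solutions to Assignment 1 of Stanford CS224N (Winter 2023). Most functions carry docstrings to make them easier to follow. The solutions are not necessarily optimal; if you have a cleaner or more elegant approach, please share it in the comments.
The assignment itself is available here: Exploring Word Vectors
This assignment is a first look at word vectors, which are often used as a basic building block for downstream NLP tasks such as question answering, text generation, and machine translation. We will explore two kinds of word vectors: those derived from a co-occurrence matrix and those obtained with GloVe.
We start by importing the packages and libraries the assignment needs.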
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
nltk.download('reuters') #to specify download location, optionally add the argument: download_dir='/specify/desired/path/'
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
from typing import *
START_TOKEN = '<START>'
END_TOKEN = '<END>'
np.random.seed(0)
random.seed(0) # Fix the seeds so the random numbers are reproducible: the same seed yields the same sequence on every run
Part 1: Count-Based Word Vectors
Co-Occurrence Word Embeddings
Many word vector implementations are driven by the idea of word similarity: similar words tend to be used in similar contexts. Here we look in detail at one such strategy, the co-occurrence matrix.
We distinguish two roles for a word: the center word and the context word. The more often a context word appears near a given center word in the corpus, the more reason we have to believe the two are semantically related. "Near" is made precise by a fixed window of size $n$: the $n$ words on each side of the center word $w_i$, that is, $w_{i-n} \dots w_{i-1}$ and $w_{i+1} \dots w_{i+n}$. We then build the co-occurrence matrix $M$, a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window among all documents.
Example: Co-Occurrence with Fixed Window of n=1:
Document 1: “all that glitters is not gold”
Document 2: “all is well that ends well”
* | <START> | all | that | glitters | is | not | gold | well | ends | <END> |
---|---|---|---|---|---|---|---|---|---|---|
<START> | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
all | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
that | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
glitters | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
is | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
not | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
gold | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
well | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
ends | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
<END> | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
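Note: We add <START> and <END> tokens to mark the beginning and end of each sentence, paragraph, or document, and they count as tokens too. For example, "all that glitters is not gold" is rewritten as "<START> All that glitters is not gold <END>".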
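The counts in the table above can be reproduced with a few lines of plain Python. This is only a minimal sketch for the two example documents: the helper name `window_pair_counts` is made up for illustration, and it stores counts in a `Counter` keyed by (center, context) pairs rather than in a matrix.

```python
from collections import Counter

START_TOKEN, END_TOKEN = "<START>", "<END>"

def window_pair_counts(docs, n=1):
    """Count (center, context) pairs for a fixed window of n tokens per side.

    Hypothetical helper for illustration; not part of the assignment code.
    """
    counts = Counter()
    for doc in docs:
        tokens = [START_TOKEN] + doc.lower().split() + [END_TOKEN]
        for i, center in enumerate(tokens):
            # Slicing clamps at the list boundaries, so words near an edge
            # simply get fewer neighbors.
            context = tokens[max(0, i - n):i] + tokens[i + 1:i + n + 1]
            for word in context:
                counts[(center, word)] += 1
    return counts

docs = ["all that glitters is not gold", "all is well that ends well"]
counts = window_pair_counts(docs, n=1)
print(counts[("<START>", "all")])                          # row <START>, column all
print(counts[("that", "well")], counts[("well", "that")])  # symmetric entries
```

Because every pair is counted once from each side, the resulting counts are symmetric, which matches the symmetry of $M$.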
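The co-occurrence matrix has $|\mathcal{V}|$ rows and columns, where $\mathcal{V}$ is the vocabulary of the corpus. This number is usually very large, so we need dimensionality reduction. In this post we use the Truncated SVD implementation in sklearn; principal component analysis (PCA) is an alternative.
The dataset used here is the Reuters corpus restricted to the "gold" category. We read it as follows: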
def read_corpus(category="gold") -> List[List[str]]:
    """ Read files from the specified Reuter's category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    # Lowercase every token and wrap each document in the start and end tokens
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]
Let's print one sample:
reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[0], compact=True, width=100)
# result
"""['<START>', 'western', 'mining', 'to', 'open', 'new', 'gold', 'mine', 'in', 'australia', 'western',
'mining', 'corp', 'holdings', 'ltd', '&', 'lt', ';', 'wmng', '.', 's', '>', '(', 'wmc', ')',
'said', 'it', 'will', 'establish', 'a', 'new', 'joint', 'venture', 'gold', 'mine', 'in', 'the',
'northern', 'territory', 'at', 'a', 'cost', 'of', 'about', '21', 'mln', 'dlrs', '.', 'the',
'mine', ',', 'to', 'be', 'known', 'as', 'the', 'goodall', 'project', ',', 'will', 'be', 'owned',
'60', 'pct', 'by', 'wmc', 'and', '40', 'pct', 'by', 'a', 'local', 'w', '.', 'r', '.', 'grace',
'and', 'co', '&', 'lt', ';', 'gra', '>', 'unit', '.', 'it', 'is', 'located', '30', 'kms', 'east',
'of', 'the', 'adelaide', 'river', 'at', 'mt', '.', 'bundey', ',', 'wmc', 'said', 'in', 'a',
'statement', 'it', 'said', 'the', 'open', '-', 'pit', 'mine', ',', 'with', 'a', 'conventional',
'leach', 'treatment', 'plant', ',', 'is', 'expected', 'to', 'produce', 'about', '50', ',', '000',
'ounces', 'of', 'gold', 'in', 'its', 'first', 'year', 'of', 'production', 'from', 'mid', '-',
'1988', '.', 'annual', 'ore', 'capacity', 'will', 'be', 'about', '750', ',', '000', 'tonnes', '.',
'<END>']"""
Question 1.1: Implement distinct_words
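In this question we build the vocabulary $\mathcal{V}$ of the corpus, sorted alphabetically like an English dictionary. The reuters_corpus above is a list of lists; a nested for loop would work, but here we flatten it with a list comprehension and use a Python set to remove duplicates.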
def distinct_words(corpus: List[List[str]]) -> Tuple[List[str], int]:
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    ### SOLUTION BEGIN
    corpus_words = [word for text in corpus for word in text]
    corpus_words = sorted(set(corpus_words))
    n_corpus_words = len(corpus_words)
    ### SOLUTION END
    return corpus_words, n_corpus_words
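As a quick check, the same idea can be run on the two example documents from Part 1. This is a small self-contained sketch (the name `distinct_words_sketch` and the toy corpus are illustrative, not part of the assignment):

```python
def distinct_words_sketch(corpus):
    """Return the sorted list of distinct tokens and its length."""
    words = sorted({word for doc in corpus for word in doc})
    return words, len(words)

toy_corpus = [
    "<START> all that glitters is not gold <END>".split(),
    "<START> all is well that ends well <END>".split(),
]
toy_words, n_toy_words = distinct_words_sketch(toy_corpus)
print(n_toy_words)
print(toy_words)
```

Note that `sorted` orders strings by code point, so `<END>` and `<START>` come before the lowercase words. The row order in the example table of Part 1 is therefore only illustrative and is not the order this function produces.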
Question 1.2: Implement compute_co_occurrence_matrix
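This question asks us to build the co-occurrence matrix with a fixed window of size n (4 by default). We accumulate the counts in nested lists and convert them to a NumPy array at the end.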
def compute_co_occurrence_matrix(corpus: List[List[str]], window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    ### SOLUTION BEGIN
    word2ind = dict(zip(words, range(n_words)))
    matrix = [[0 for i in range(n_words)] for j in range(n_words)]
    for text in corpus:
        for i in range(len(text)):  # Careful: the window around the center word may run past either end of the list
            # Left context: up to window_size words before position i
            if i-window_size >= 0:
                for word in text[(i-window_size):i]:
                    matrix[word2ind[text[i]]][word2ind[word]] += 1
            elif i>0:
                for word in text[:i]:
                    matrix[word2ind[text[i]]][word2ind[word]] += 1
            # Right context: up to window_size words after position i
            if i+window_size <= len(text)-1:
                for word in text[(i+1):(i+window_size+1)]:
                    matrix[word2ind[text[i]]][word2ind[word]] += 1
            elif i+1 <= len(text)-1:
                for word in text[(i+1):]:
                    matrix[word2ind[text[i]]][word2ind[word]] += 1
    M = np.array(matrix)
    ### SOLUTION END
    return M, word2ind
Question 1.3: Implement reduce_to_k_dim
Next we reduce the dimensionality of the co-occurrence matrix with Truncated SVD. The output matrix has shape ($|\mathcal{V}|$, embedding dimension).
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
        Params:
            M (numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    ### SOLUTION BEGIN
    svd = TruncatedSVD(n_components=k)
    M_reduced = svd.fit_transform(M)
    ### SOLUTION END
    print("Done.")
    return M_reduced
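The docstring's remark that the result is "U * S" can be illustrated with an exact SVD from NumPy. The sketch below uses a small random symmetric matrix as a stand-in for a co-occurrence matrix (the data and variable names are made up). It is not what `TruncatedSVD` runs internally: scikit-learn uses an approximate solver by default, so its output should agree with this only up to numerical error and the sign of each column.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(6, 6)).astype(float)
M_toy = A + A.T  # symmetric, like a co-occurrence matrix

k = 2
U, S, Vt = np.linalg.svd(M_toy, full_matrices=False)

# Keep the top-k singular directions: each row is a k-dimensional embedding.
M_reduced_us = U[:, :k] * S[:k]

# Equivalent view: project M onto the top-k right singular vectors.
M_reduced_proj = M_toy @ Vt[:k].T

print(M_reduced_us.shape)
print(np.allclose(M_reduced_us, M_reduced_proj))
```

The two expressions agree because $M V_k = U S V^\top V_k = U_k S_k$.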
Question 1.4: Implement plot_embeddings
In this question we set the embedding dimension to 2, so that each word can be plotted as a point in the plane.
def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    ### SOLUTION BEGIN
    x_coords = M_reduced[:, 0]
    y_coords = M_reduced[:, 1]
    for word in words:
        x = x_coords[word2ind[word]]
        y = y_coords[word2ind[word]]
        plt.scatter(x, y, marker='x', color='red')
        plt.text(x, y, word, fontsize=9)
    plt.show()
    ### SOLUTION END
Question 1.5: Co-Occurrence Plot Analysis
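Next we combine the functions above to plot the positions of selected words from the corpus. Each 2-D word vector is normalized to unit length, so the points should end up on the unit circle and similarity between words becomes closeness in direction (angle).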
reuters_corpus = read_corpus()
M_co_occurrence, word2ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)
# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting
words = ['value', 'gold', 'platinum', 'reserves', 'silver', 'metals', 'copper', 'belgium', 'australia', 'china', 'grammes', "mine"]
plot_embeddings(M_normalized, word2ind_co_occurrence, words)
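The output is shown below:
We can see that count-based word vectors already show some clustering: in the plot above, metals such as copper and platinum lie close together, as do countries such as Australia and Belgium. On the downside, silver is far from the other metals, and China is far from the other countries.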
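The row normalization used before plotting relies on NumPy broadcasting. Here is a minimal sketch on a made-up 3-by-2 matrix (not the Reuters embeddings) that shows the shapes involved and the angle each point ends up at on the unit circle:

```python
import numpy as np

# Three toy 2-D "embeddings"; the values are arbitrary.
M_toy_2d = np.array([[3.0, 4.0], [0.5, 0.0], [-2.0, 2.0]])

lengths = np.linalg.norm(M_toy_2d, axis=1)   # shape (3,): one length per row
M_unit = M_toy_2d / lengths[:, np.newaxis]   # (3, 2) / (3, 1) divides row by row

# After normalization only the direction of each vector is left.
angles = np.degrees(np.arctan2(M_unit[:, 1], M_unit[:, 0]))
print(np.linalg.norm(M_unit, axis=1))
print(angles)
```

One caveat: a row of all zeros has length 0, so the division would produce NaN values for that row.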
Part 2: Prediction-Based Word Vectors
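Prediction-based word vector models include word2vec and GloVe. This part does not reimplement them; it mainly calls the gensim library and explores the results.
First we load the GloVe word vectors: a vocabulary of 400,000 words with an embedding dimension of 200.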
def load_embedding_model():
    """ Load GloVe Vectors
        Return:
            wv_from_bin: All 400000 embeddings, each length 200
    """
    import gensim.downloader as api
    wv_from_bin = api.load("glove-wiki-gigaword-200")
    print("Loaded vocab size %i" % len(list(wv_from_bin.index_to_key)))
    return wv_from_bin
wv_from_bin = load_embedding_model()
Reducing dimensionality of Word Embeddings
Here we randomly pick 10,000 words from the GloVe vocabulary and reduce their vectors to 2 dimensions, mainly so that they can be drawn in the plane.
def get_matrix_of_vectors(wv_from_bin, required_words):
    """ Put the GloVe vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 400000 GloVe vectors loaded from file
        Return:
            M: numpy matrix shape (num words, 200) containing the vectors
            word2ind: dictionary mapping each word to its row number in M
    """
    import random
    words = list(wv_from_bin.index_to_key)  # 400,000 words
    print("Shuffling words ...")
    random.seed(225)
    random.shuffle(words)
    words = words[:10000]
    print("Putting %i words into word2ind and matrix M..." % len(words))
    word2ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.get_vector(w))  # look up the word vector
            word2ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    # Make sure the words we want to plot are included even if they were not sampled
    for w in required_words:
        if w in words:
            continue
        try:
            M.append(wv_from_bin.get_vector(w))
            word2ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2ind
M, word2ind = get_matrix_of_vectors(wv_from_bin, words)
M_reduced = reduce_to_k_dim(M, k=2)
# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced, axis=1)
M_reduced_normalized = M_reduced / M_lengths[:, np.newaxis] # broadcasting
Cosine Similarity
In $n$-dimensional Euclidean space we can use cosine similarity to quantify how similar two words are. The cosine similarity $s$ of two vectors $p$ and $q$ is
$$s = \frac{p \cdot q}{||p|| \, ||q||}, \textrm{ where } s \in [-1, 1]$$
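The formula translates directly into NumPy. The sketch below is a stand-alone illustration with hypothetical helper names (`cosine_similarity`, `cosine_distance`), not the gensim implementation; the distance is defined as 1 minus the similarity, the convention used later in Question 2.3.

```python
import numpy as np

def cosine_similarity(p, q):
    """s = (p . q) / (||p|| ||q||); undefined if either vector is all zeros."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def cosine_distance(p, q):
    """Cosine distance, taken here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(p, q)

print(cosine_similarity([1, 0], [0, 1]))    # orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))    # same direction, different length
print(cosine_similarity([1, 2], [-1, -2]))  # opposite direction
```

Only the direction of the vectors matters: rescaling either vector by a positive factor leaves the similarity unchanged.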
Question 2.1: GloVe Plot Analysis
We plot the same 12 words in the plane; the result differs from the one in Part 1.
words = ['value', 'gold', 'platinum', 'reserves', 'silver', 'metals', 'copper', 'belgium', 'australia', 'china', 'grammes', "mine"]
plot_embeddings(M_reduced_normalized, word2ind, words)
Question 2.2: Words with Multiple Meanings
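In English as in Chinese, many words have more than one meaning; such words are called polysemous. Calling most_similar returns the words most similar to a given word (ten of them by default, since topn defaults to 10).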
wv_from_bin.most_similar('right')
# result
"""
[('left', 0.716508150100708), ('if', 0.6925000548362732), ("n't", 0.6774845719337463), ('back', 0.6770386099815369), ('just', 0.6740819811820984), ('but', 0.667771577835083), ('out', 0.6671877503395081), ('put', 0.665894091129303), ('hand', 0.6634083390235901), ('want', 0.6615420579910278)]
"""
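In fact, the sense of "right" we had in mind may well be "correct" rather than "the opposite of left", yet the most similar word returned is "left". This shows a limitation in handling polysemous words: such a word may need several vector representations rather than a single one.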
Question 2.3: Synonyms & Antonyms
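The distance function returns the cosine distance (cosine distance = 1 - cosine similarity). Intuitively, a pair of synonyms should have a smaller cosine distance than a pair of antonyms, but in the example below the opposite holds: the antonyms "happy" and "sad" are closer than the synonyms "happy" and "cheerful". A likely reason is that antonyms often occur in very similar contexts.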
w1 = 'happy'
w2 = 'cheerful'
w3 = 'sad'
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)
print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))
# result
"""
Synonyms happy, cheerful have cosine distance: 0.5172466933727264
Antonyms happy, sad have cosine distance: 0.4040136933326721
"""
Question 2.4: Analogies with Word Vectors
Here we try an analogy task. For example, in "man : grandfather :: woman : x" (man is to grandfather as woman is to x), what is x? Calling most_similar again, we find that the highest-scoring word is grandmother.
As an equation:
$$\text{man} - \text{grandfather} = \text{woman} - \text{grandmother}$$
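The arithmetic behind this equation can be tried out on hand-made vectors. Everything in the sketch below is an assumption for illustration: the 2-D vectors are invented (first coordinate loosely "gender", second "generation"), and `solve_analogy` is a hypothetical helper that ranks candidates by cosine similarity to grandfather - man + woman. It is a simplified version of the idea; gensim's most_similar also normalizes the input vectors before combining them.

```python
import numpy as np

# Invented toy vectors, not GloVe embeddings.
toy_vectors = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "grandfather": np.array([1.0, 2.0]),
    "grandmother": np.array([-1.0, 2.0]),
    "child": np.array([0.0, -1.0]),
}

def solve_analogy(a, b, c, vectors):
    """Return (x, score) for 'a : b :: c : x' using the target b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):  # skip the query words themselves
            continue
        score = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if score > best_score:
            best_word, best_score = word, float(score)
    return best_word, best_score

answer, score = solve_analogy("man", "grandfather", "woman", toy_vectors)
print(answer, score)
```

With real embeddings the match is never exact, which is why the call below reports a similarity score rather than an equality.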
wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man'])[0]
# result
"""
('grandmother', 0.7608445286750793)
"""
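The remaining questions are similar to the ones above; feel free to try them yourself.
Going forward, I plan to write a post on word2vec and GloVe word vectors, or walkthroughs of the later CS224N assignments. Corrections are welcome.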