Task 1: Getting to know word vectors
1. Building the vocabulary of the corpus
Tokenize the sentences in the corpus into words, deduplicate them with a set(), then sort the result; the length of the deduplicated vocabulary is the number of distinct words.
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    # ------------------
    # Write your implementation here.
    # Flatten the corpus into one list of tokens, then deduplicate and sort.
    for document in corpus:
        corpus_words.extend(document)
    corpus_words = sorted(set(corpus_words))
    num_corpus_words = len(corpus_words)
    # ------------------
    return corpus_words, num_corpus_words
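As a quick sanity check, the function can be run on a toy corpus (the two "documents" below are made up for illustration and are not part of the assignment data; the function body is restated in condensed form so the snippet runs on its own):

```python
def distinct_words(corpus):
    # Same logic as above, condensed: flatten, deduplicate, sort.
    corpus_words = sorted({w for doc in corpus for w in doc})
    return corpus_words, len(corpus_words)

# Toy corpus of two tokenized documents (illustrative only).
toy_corpus = [["START", "all", "that", "glitters", "END"],
              ["START", "all", "is", "well", "END"]]

words, num_words = distinct_words(toy_corpus)
print(words)      # sorted, deduplicated vocabulary
print(num_words)  # 7
```

Note that Python's sorted() orders uppercase tokens like "START" and "END" before lowercase ones, which is why the sentinel tokens appear first in the vocabulary.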
2. Word vector representations
Word vectors are often used as a fundamental building block for downstream NLP tasks such as question answering, text generation, and translation, so it is important to build some intuition about their strengths and weaknesses. Here you will explore two kinds of word vectors: those derived from a co-occurrence matrix, and those produced by word2vec.
# 1. First, derive word vectors from a co-occurrence matrix. Co-occurrence matrix: fix a window size window_size; treating each word in turn as the center word, walk through every sentence in the corpus that contains it and count how often each word appears within window_size positions to its left or right, recording the counts in the matrix.
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
        number of co-occurring words.
        For example, if we take the document "START All that glitters is not gold END" with window size of 4,
        "All" will co-occur with "START", "that", "glitters", "is", and "not".
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    # ------------------
    # Write your implementation here.
    word2Ind = {word: i for i, word in enumerate(words)}
    M = np.zeros((num_words, num_words))
    # Treat every position in every document as the center word and count the
    # words that fall within window_size positions to its left and right.
    for document in corpus:
        for center, center_word in enumerate(document):
            lo = max(0, center - window_size)
            hi = min(len(document), center + window_size + 1)
            for context in range(lo, hi):
                if context != center:
                    M[word2Ind[center_word], word2Ind[document[context]]] += 1
    # ------------------
    return M, word2Ind
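A useful property to check is that the count matrix is symmetric: if word a appears in word b's window, then b appears in a's. The snippet below runs the computation on a one-sentence toy corpus (both functions are restated in condensed form so it is self-contained; the corpus is illustrative, not assignment data):

```python
import numpy as np

# Condensed restatements of the two functions above.
def distinct_words(corpus):
    corpus_words = sorted({w for doc in corpus for w in doc})
    return corpus_words, len(corpus_words)

def compute_co_occurrence_matrix(corpus, window_size=4):
    words, num_words = distinct_words(corpus)
    word2Ind = {w: i for i, w in enumerate(words)}
    M = np.zeros((num_words, num_words))
    for doc in corpus:
        for center, w in enumerate(doc):
            for ctx in range(max(0, center - window_size),
                            min(len(doc), center + window_size + 1)):
                if ctx != center:
                    M[word2Ind[w], word2Ind[doc[ctx]]] += 1
    return M, word2Ind

toy_corpus = [["START", "all", "that", "glitters", "END"]]
M, word2Ind = compute_co_occurrence_matrix(toy_corpus, window_size=1)
# With window_size=1, "that" co-occurs once with "all" and once with "glitters".
print(M[word2Ind["that"], word2Ind["all"]])  # 1.0
print(np.array_equal(M, M.T))                # True: counts are symmetric
```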
# Apply truncated SVD to the resulting co-occurrence matrix to reduce its dimensionality from num_words down to k, using the sklearn.decomposition.TruncatedSVD module.
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters, random_state=42)
    M_reduced = svd.fit_transform(M)  # returns U * Sigma, shape (num_words, k)
    # ------------------
    print("Done.")
    return M_reduced
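To see what TruncatedSVD returns, it can be applied directly to a small symmetric count matrix (the random matrix below is illustrative; each row of the output is one k-dimensional embedding):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy symmetric count matrix standing in for a co-occurrence matrix.
rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(10, 10)).astype(float)
counts = counts + counts.T  # symmetric, like real co-occurrence counts

svd = TruncatedSVD(n_components=2, n_iter=10, random_state=42)
reduced = svd.fit_transform(counts)  # U * Sigma: one 2-d embedding per word
print(reduced.shape)                 # (10, 2)
print(svd.singular_values_.shape)    # (2,): the k retained singular values
```

Note that fit_transform returns U * Sigma rather than U alone, so the embedding dimensions are scaled by the singular values; for visualization it is common to normalize each row afterwards.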
# 2. Load the pretrained word2vec vectors.
def load_word2vec():
    """ Load Word2Vec Vectors
        Return:
            wv_from_bin: All 3 million embeddings, each of length 300
    """
    import gensim.downloader as api
    wv_from_bin = api.load("word2vec-google-news-300")
    # Note: .vocab works in gensim < 4.0; in gensim >= 4.0 use wv_from_bin.index_to_key instead.
    vocab = list(wv_from_bin.vocab.keys())
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin
# Instantiate the word-vector model.
wv_from_bin = load_word2vec()
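Downloading the full 3-million-word GoogleNews model is heavy, but the similarity queries it supports (e.g. most_similar) reduce to cosine similarity between vectors, which can be sketched with plain numpy. The 4-dimensional "embeddings" below are made-up toy vectors (real word2vec vectors are 300-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical toy embeddings, chosen so "king" and "queen" point in
# similar directions while "apple" points elsewhere.
king  = np.array([0.8, 0.6, 0.1, 0.0])
queen = np.array([0.7, 0.7, 0.1, 0.1])
apple = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(king, queen))  # close to 1: similar directions
print(cosine_similarity(king, apple))  # much smaller
```

This is the metric to keep in mind when interpreting the similarity scores returned by the loaded word2vec model.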