Task 1: Understanding Word Vectors

1. Tokenizing the corpus

Tokenize each sentence in the corpus, deduplicate the tokens with set(), then sort them to obtain the distinct vocabulary and its size.

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    # Flatten the documents into one list, deduplicate with set(), and sort.
    corpus_words = sorted(set(word for document in corpus for word in document))
    num_corpus_words = len(corpus_words)

    return corpus_words, num_corpus_words
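As a quick sanity check, here is a minimal sketch of calling distinct_words on a toy two-sentence corpus (toy_corpus is a made-up example, not part of the original assignment):

toy_corpus = [["START", "all", "that", "glitters", "END"],
              ["START", "all", "is", "well", "END"]]
words, num_words = distinct_words(toy_corpus)
print(words)      # ['END', 'START', 'all', 'glitters', 'is', 'that', 'well']
print(num_words)  # 7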

2. Word vector representations

Word vectors are often used as the basic building blocks of downstream NLP tasks such as question answering, text generation, and translation, so it is important to build some intuition about their strengths and weaknesses. Here you will explore two kinds of word vectors: those derived from a co-occurrence matrix, and those derived via word2vec.

import numpy as np

# 1. First, derive word vectors from a co-occurrence matrix. Co-occurrence matrix: fix a window size window_size; treating each word in turn as the center word, walk through every sentence that contains it and count the words that appear within window_size positions to its left and right, recording the counts in the matrix.
def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): 
                Co-occurrence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    
    M = np.zeros((num_words, num_words))
    word2Ind = {word: i for i, word in enumerate(words)}
    # Walk each document once; for every center position, count the words that
    # fall within window_size positions on either side of the center word.
    for document in corpus:
        for center, word in enumerate(document):
            left = max(center - window_size, 0)
            right = min(center + window_size + 1, len(document))
            for context in range(left, right):
                if context != center:
                    M[word2Ind[word], word2Ind[document[context]]] += 1

    return M, word2Ind
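Continuing the toy example, a small sketch of how the matrix could be inspected (window_size=1, so each word co-occurs only with its immediate neighbors):

M_test, word2Ind_test = compute_co_occurrence_matrix(toy_corpus, window_size=1)
# "glitters" is adjacent to "that" once and to "END" once in the first sentence.
print(M_test[word2Ind_test["glitters"], word2Ind_test["that"]])  # 1.0
print(M_test.shape)  # (7, 7)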

from sklearn.decomposition import TruncatedSVD

# Apply truncated SVD to the co-occurrence matrix to reduce its dimensionality from num_words down to k, using the sklearn.decomposition.TruncatedSVD module.
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
    svd = TruncatedSVD(n_components=k, n_iter=n_iters, random_state=42)
    M_reduced = svd.fit_transform(M)

    print("Done.")
    return M_reduced
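For example, a minimal sketch continuing the toy example, reducing the 7x7 matrix to one 2-dimensional embedding per word:

M_test_reduced = reduce_to_k_dim(M_test, k=2)
print(M_test_reduced.shape)  # (7, 2)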

# 2. Load the pretrained word2vec vectors
def load_word2vec():
    """ Load Word2Vec Vectors
        Return:
            wv_from_bin: All 3 million embeddings, each of length 300
    """
    import gensim.downloader as api
    wv_from_bin = api.load("word2vec-google-news-300")
    vocab = list(wv_from_bin.vocab.keys())  # gensim < 4.0 API; in gensim >= 4.0 use wv_from_bin.key_to_index
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin

# Load the pretrained vectors into memory
wv_from_bin = load_word2vec()
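Once loaded, the embeddings can be queried through gensim's KeyedVectors interface; a small sketch (the query words are arbitrary examples):

# Nearest neighbors of a word by cosine similarity.
print(wv_from_bin.most_similar("king", topn=3))
# Cosine similarity between two words.
print(wv_from_bin.similarity("coffee", "tea"))
# The raw 300-dimensional vector for a word.
print(wv_from_bin["king"].shape)  # (300,)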