This note mainly covers the differences among the word-embedding methods word2vec, LSA, and GloVe.
In traditional natural language processing, words are treated as discrete symbols, and a word's representation is a localist (one-hot) representation. For example, when the words motel and hotel appear in a document, they are represented as:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
Vector dimension = number of words in vocabulary (e.g., 500,000)
Drawbacks: 1) any two one-hot vectors are orthogonal; 2) they carry no natural notion of similarity.
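As a quick illustration (a minimal sketch, not part of the original notes), the dot product of any two distinct one-hot vectors is zero, so motel and hotel look completely unrelated even though they are near-synonyms:

import numpy as np

# One-hot vectors for motel (index 10) and hotel (index 7) in a 15-word vocabulary, as above
motel = np.zeros(15); motel[10] = 1
hotel = np.zeros(15); hotel[7] = 1

print(np.dot(motel, hotel))  # 0.0 -> orthogonal, no similarity signal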
Instead: learn to encode similarity in the vectors themselves, i.e., encode word-to-word similarity.
- word2vec
- With a fixed-size sliding window, iterate over positions 1~T of the corpus, taking the word at each position as the center word and predicting the probability of its context words. The two typical model variants are Skip-gram and CBOW; extracting the windows themselves is essentially n-gram extraction, as in the snippet below (a sketch of generating the (center, context) training pairs follows it).
# Given a list of words and a number n, return a list of n-grams.
def getNGrams(wordlist, n):
    return [wordlist[i:i+n] for i in range(len(wordlist) - (n - 1))]
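A minimal sketch of turning a tokenized sentence into Skip-gram training pairs (the helper skipgram_pairs and the window size m are illustrative, not from the original notes):

# Generate (center, context) pairs from a tokenized sentence with window size m.
def skipgram_pairs(words, m=2):
    pairs = []
    for t, center in enumerate(words):
        for j in range(max(0, t - m), min(len(words), t + m + 1)):
            if j != t:
                pairs.append((center, words[j]))
    return pairs

print(skipgram_pairs(['i', 'like', 'deep', 'learning'], m=1))
# [('i', 'like'), ('like', 'i'), ('like', 'deep'), ('deep', 'like'), ('deep', 'learning'), ('learning', 'deep')]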
Build the likelihood function: maximize the probability of the context words (to the left and right of the center word) occurring.
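Written out in the standard Skip-gram form (window size m, all vector parameters collected in \theta), the likelihood and the average negative log-likelihood actually minimized are:

L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} P(w_{t+j} \mid w_t; \theta)

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log P(w_{t+j} \mid w_t; \theta)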
- prediction function
Given the center word c, maximize the probability of the outside word o. The prediction function is defined with a softmax; since the normalization term sums over the whole vocabulary and the corpus is huge, negative sampling is usually used in practice to simplify the computation. Optimization uses stochastic gradient descent (SGD), with a decaying learning rate for the parameter updates.
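Concretely, with u_o the "outside" vector of word o and v_c the "center" vector of word c, the softmax prediction function is

P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}

and negative sampling replaces the full sum over V by K sampled negative words, giving the per-pair loss

J(o, c) = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^\top v_c)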
Arguably the biggest drawback of word2vec is that it does not use global co-occurrence statistics; LSA and GloVe both have an advantage on this point.
- LSA (Latent Semantic Analysis)
A method that builds a word-word co-occurrence matrix and obtains vector representations from its SVD (singular value decomposition). Raw co-occurrence counts tend to give very large weight to high-frequency words.
# corpus = "i like deep learning, i like nlp, i enjoy flying"
import numpy as np
import matplotlib.pyplot as plt

la = np.linalg
words = ['i', 'like', 'enjoy', 'deep', 'learning', 'nlp', 'flying', '.']
# Word-word co-occurrence counts for the toy corpus above
X = np.array([[0, 2, 1, 0, 0, 0, 0, 0],
              [2, 0, 0, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1, 1, 0]])

U, s, Vh = la.svd(X, full_matrices=False)

# Plot each word at its coordinates in the first two singular directions
plt.axis([-1, 1, -1, 1])
for i in range(len(words)):
    plt.text(U[i, 0], U[i, 1], words[i])
plt.show()
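To read off low-dimensional word vectors from the decomposition, keep only the top-k singular directions (a minimal sketch continuing the variables above; k = 2 is just for illustration):

# Rank-k truncation: rows of word_vectors are the k-dimensional LSA embeddings for `words`
k = 2
word_vectors = U[:, :k] * s[:k]
print(dict(zip(words, word_vectors.round(2).tolist())))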
- GloVe (Global Vectors for Word Representation)
GloVe is also based on a co-occurrence matrix: from the corpus it builds a co-occurrence matrix X whose element X_ij is the number of times word i and context word j co-occur within a context window of a given size. Its improvement over LSA is that GloVe weights each co-occurrence by a decreasing function of the distance d between the two words in the window, decay = 1/d, so word pairs that are farther apart contribute less to the total count.
- X_ij: the number of times word j appears in the context of word i
- X_i = \sum_k X_ik: the total number of occurrences of all words in the context of word i
- P_ij = P(j \mid i) = X_ij / X_i: the probability that word j appears in the context of word i
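For reference, the weighted least-squares objective that GloVe minimizes over these counts (from the GloVe paper; w_i is the word vector, \tilde{w}_j the context vector, b_i and \tilde{b}_j biases, and f a weighting function that caps the influence of very frequent pairs):

J = \sum_{i,j=1}^{|V|} f(X_ij) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_ij \right)^2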
- Hyperparameter analysis
  - Embedding dimension: around 300 works well
  - Context on both sides is best (using only the preceding context may not work as well)
  - For GloVe, window_size = 8 tends to work well
- A word2vec hands-on example on user reviews
# Chinese corpus: https://github.com/SophonPlus/ChineseNlpCorpus
import logging
import gensim
import jieba

class Gensim_embedding:
    def __init__(self, data_path):
        # Read the CSV, keeping the review text keyed by an integer id
        self.data_raw = {}
        doc_id = 0
        with open(data_path, 'r', encoding='utf-8') as f:
            for line in f:
                lines = line.strip().split(',', 1)
                if lines[0] == 'label':  # skip the header row
                    continue
                self.data_raw[doc_id] = lines[1]
                doc_id += 1

    def cut(self):
        data_cut = {}
        all_data = []
        # Punctuation and stop words to drop from the tokenized reviews
        ignore_flag = ['.........', '(', '"', "(", ',', '?', '-', '=', '"', "'",
                       '<<', '>>', '...', '。', ':', '!', '!', '(', ')']
        ignore_word = ['我', '我们', '他', '她', '如', '如果', '着', '喔', '的', '还']
        for row in self.data_raw:
            rows = jieba.cut(self.data_raw[row], cut_all=False)  # tokenize with jieba
            filter_word = []
            for field in list(rows):
                if field in ignore_flag or field in ignore_word:
                    continue
                filter_word.append(field)
            all_data += filter_word
            if len(filter_word) < 200:  # keep only reviews with at least 200 tokens
                continue
            data_cut[row] = filter_word
        return data_cut, list(set(all_data))

    def run(self):
        save_model_file = 'fudan_embedding'
        self.id_data, all_data = self.cut()
        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
        # Train word2vec (gensim defaults to CBOW; pass sg=1 for Skip-gram).
        # Default window=5, embedding dimension = 4; in gensim >= 4.0 `size` is called `vector_size`.
        model = gensim.models.Word2Vec([all_data], min_count=1, size=4)
        model.save(save_model_file)
        model.wv.save_word2vec_format(save_model_file + ".bin", binary=True)  # save in binary format for reuse

if __name__ == "__main__":
    work = Gensim_embedding('../chinese_data/ChnSentiCorp_htl_all.csv')
    work.run()
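Once training has run, the saved vectors can be inspected by querying nearest neighbours (a minimal sketch using the standard gensim KeyedVectors API; the query word '酒店' ("hotel") is just an example from this hotel-review corpus):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('fudan_embedding.bin', binary=True)
print(wv.most_similar('酒店', topn=5))  # five closest words in the learned embedding space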
Summary:
This note introduced word-embedding methods and the respective strengths and weaknesses of word2vec, LSA, and GloVe.
References:
- CS224n: Natural Language Processing with Deep Learning