This note mainly covers the differences among the word-embedding methods word2vec, LSA, and GloVe.
In traditional natural language processing, words are treated as discrete symbols, and a word's representation is a localist (one-hot) representation. For example, when the words motel and hotel appear in a document, they are represented as:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
Vector dimension = number of words in vocabulary (e.g., 500,000)
Drawbacks: 1) any two one-hot vectors are orthogonal; 2) they carry no natural notion of similarity.
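As a quick illustration (a minimal sketch, not part of the original notes), the dot product of any two distinct one-hot vectors is zero, so motel and hotel look completely unrelated even though they are near-synonyms:

import numpy as np

# One-hot vectors for motel (index 10) and hotel (index 7) in a 15-word vocabulary, as above
motel = np.zeros(15); motel[10] = 1
hotel = np.zeros(15); hotel[7] = 1

print(np.dot(motel, hotel))  # 0.0 -> orthogonal, no similarity signal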
Instead: learn to encode similarity in the vectors themselves, i.e., encode word-to-word similarity.
- word2vec
- With a fixed-size sliding window, iterate over positions 1~T of the corpus, taking the word at each position as the center word and predicting the probability of its context words. The two typical model variants are Skip-gram and CBOW; extracting the windows themselves is essentially n-gram extraction, as in the snippet below (a sketch of generating the (center, context) training pairs follows it).
# Given a list of words and a number n, return a list of n-grams.
def getNGrams(wordlist, n):
    return [wordlist[i:i+n] for i in range(len(wordlist) - (n - 1))]
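A minimal sketch of turning a tokenized sentence into Skip-gram training pairs (the helper skipgram_pairs and the window size m are illustrative, not from the original notes):

# Generate (center, context) pairs from a tokenized sentence with window size m.
def skipgram_pairs(words, m=2):
    pairs = []
    for t, center in enumerate(words):
        for j in range(max(0, t - m), min(len(words), t + m + 1)):
            if j != t:
                pairs.append((center, words[j]))
    return pairs

print(skipgram_pairs(['i', 'like', 'deep', 'learning'], m=1))
# [('i', 'like'), ('like', 'i'), ('like', 'deep'), ('deep', 'like'), ('deep', 'learning'), ('learning', 'deep')]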
Build the likelihood function: maximize the probability of the context words (to the left and right of the center word) occurring.
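Written out in the standard Skip-gram form (window size m, all vector parameters collected in \theta), the likelihood and the average negative log-likelihood actually minimized are:

L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} P(w_{t+j} \mid w_t; \theta)

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log P(w_{t+j} \mid w_t; \theta)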
- prediction function
Given the center word c, maximize the probability of the outside word o. The prediction function is defined with a softmax; since the normalization term sums over the whole vocabulary and the corpus is huge, negative sampling is usually used in practice to simplify the computation. Optimization uses stochastic gradient descent (SGD), with a decaying learning rate for the parameter updates.
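Concretely, with u_o the "outside" vector of word o and v_c the "center" vector of word c, the softmax prediction function is

P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}

and negative sampling replaces the full sum over V by K sampled negative words, giving the per-pair loss

J(o, c) = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^\top v_c)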
Arguably the biggest drawback of word2vec is that it does not use global co-occurrence statistics; LSA and GloVe both have an advantage on this point.
- LSA (Latent Semantic Analysis)
A method that builds a word-word co-occurrence matrix and obtains vector representations from its SVD (singular value decomposition). Raw co-occurrence counts tend to give very large weight to high-frequency words.
# corpus = "i like deep learning, i like nlp, i enjoy flying"
import numpy as np
import matplotlib.pyplot as plt

la = np.linalg
words = ['i', 'like', 'enjoy', 'deep', 'learning', 'nlp', 'flying', '.']
# Word-word co-occurrence counts for the toy corpus above
X = np.array([[0, 2, 1, 0, 0, 0, 0, 0],
              [2, 0, 0, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1, 1, 0]])

U, s, Vh = la.svd(X, full_matrices=False)

# Plot each word at its coordinates in the first two singular directions
plt.axis([-1, 1, -1, 1])
for i in range(len(words)):
    plt.text(U[i, 0], U[i, 1], words[i])
plt.show()
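To read off low-dimensional word vectors from the decomposition, keep only the top-k singular directions (a minimal sketch continuing the variables above; k = 2 is just for illustration):

# Rank-k truncation: rows of word_vectors are the k-dimensional LSA embeddings for `words`
k = 2
word_vectors = U[:, :k] * s[:k]
print(dict(zip(words, word_vectors.round(2).tolist())))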
- GloVe (Global Vectors for Word Representation)
GloVe is also based on a co-occurrence matrix: from the corpus it builds a co-occurrence matrix X whose element X_ij is the number of times word i and context word j co-occur within a context window of a given size. Its improvement over LSA is that GloVe weights each co-occurrence by a decreasing function of the distance d between the two words in the window, decay = 1/d, so word pairs that are farther apart contribute less to the total count.
- X_ij: the number of times word j appears in the context of word i
- X_i = \sum_k X_ik: the total number of occurrences of all words in the context of word i
- P_ij = P(j \mid i) = X_ij / X_i: the probability that word j appears in the context of word i
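For reference, the weighted least-squares objective that GloVe minimizes over these counts (from the GloVe paper; w_i is the word vector, \tilde{w}_j the context vector, b_i and \tilde{b}_j biases, and f a weighting function that caps the influence of very frequent pairs):

J = \sum_{i,j=1}^{|V|} f(X_ij) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_ij \right)^2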
- Hyperparameter analysis
  - Embedding dimension: around 300 works well
  - Context on both sides is best (using only the preceding context may not work as well)
  - For GloVe, window_size = 8 tends to work well
- A word2vec hands-on example on user reviews
# Chinese corpus: https://github.com/SophonPlus/ChineseNlpCorpus
import logging
import gensim
import jieba

class Gensim_embedding:
    def __init__(self, data_path):
        # Read the CSV, keeping the review text keyed by an integer id
        self.data_raw = {}
        doc_id = 0
        with open(data_path, 'r', encoding='utf-8') as f:
            for line in f:
                lines = line.strip().split(',', 1)
                if lines[0] == 'label':  # skip the header row
                    continue
                self.data_raw[doc_id] = lines[1]
                doc_id += 1

    def cut(self):
        data_cut = {}
        all_data = []
        # Punctuation and stop words to drop from the tokenized reviews
        ignore_flag = ['.........', '(', '"', "(", ',', '?', '-', '=', '"', "'",
                       '<<', '>>', '...', '。', ':', '!', '!', '(', ')']
        ignore_word = ['我', '我们', '他', '她', '如', '如果', '着', '喔', '的', '还']
        for row in self.data_raw:
            rows = jieba.cut(self.data_raw[row], cut_all=False)  # tokenize with jieba
            filter_word = []
            for field in list(rows):
                if field in ignore_flag or field in ignore_word:
                    continue
                filter_word.append(field)
            all_data += filter_word
            if len(filter_word) < 200:  # keep only reviews with at least 200 tokens
                continue
            data_cut[row] = filter_word
        return data_cut, list(set(all_data))

    def run(self):
        save_model_file = 'fudan_embedding'
        self.id_data, all_data = self.cut()
        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
        # Train word2vec (gensim defaults to CBOW; pass sg=1 for Skip-gram).
        # Default window=5, embedding dimension = 4; in gensim >= 4.0 `size` is called `vector_size`.
        model = gensim.models.Word2Vec([all_data], min_count=1, size=4)
        model.save(save_model_file)
        model.wv.save_word2vec_format(save_model_file + ".bin", binary=True)  # save in binary format for reuse

if __name__ == "__main__":
    work = Gensim_embedding('../chinese_data/ChnSentiCorp_htl_all.csv')
    work.run()
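Once training has run, the saved vectors can be inspected by querying nearest neighbours (a minimal sketch using the standard gensim KeyedVectors API; the query word '酒店' ("hotel") is just an example from this hotel-review corpus):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('fudan_embedding.bin', binary=True)
print(wv.most_similar('酒店', topn=5))  # five closest words in the learned embedding space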
Summary:
This note introduced word-embedding methods and the respective strengths and weaknesses of word2vec, LSA, and GloVe.
References:
- CS224n: Natural Language Processing with Deep Learning