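Word2Vec: Usage and Basic Principles
The basic idea behind the Word2Vec model is to predict words that appear in a given context. For each input text, we pick a context window and a center word, and then predict the probability of the other words in the window given that center word. Because of this, a Word2Vec model can readily learn vectors for new words from newly added corpora, which makes it an efficient online learning method.
This post introduces the usage and principles of Word2Vec mainly through code.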
Importing the required modules
from gensim.models.word2vec import Word2Vec
import logging  # provides log output
import numpy as np
import random
import pandas as pd
import torch
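Installing torch took a while here: the installation kept failing with errors. In the end I entered my machine configuration and environment on the official website to get the install command, then ran it in a terminal. The command was: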
pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
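The main idea of Word2Vec
Words and their contexts are used to predict each other. The two corresponding algorithms are:
- Skip-gram (SG): predicts the context words from the center word
- Continuous Bag of Words (CBOW): predicts the target (center) word from its context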
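To make the difference between the two algorithms concrete, here is a minimal sketch of how training examples could be derived from one tokenized sentence. The function names `skipgram_pairs` and `cbow_examples` are hypothetical, and a fixed window is assumed for simplicity; real implementations typically shrink the window at random for each position.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples


tokens = "the quick brown fox".split()
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
print(sg)
print(cbow)
```

Skip-gram produces more, smaller examples from the same sentence, while CBOW folds each window into a single example.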
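A Word2Vec model really has two parts: the first is building the model, and the second is obtaining the word embeddings from it:
- Building the model: construct a neural network from the training data
- Obtaining the embeddings: once the model is trained, take the parameters learned from the training data, such as the weight matrix of the hidden layer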
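The Skip-gram (SG) process
Trained on the data, the neural network outputs a probability distribution. Each probability is the likelihood that a word in the vocabulary is the output word for the given input word.
In other words, the output probabilities say how likely each vocabulary word is to co-occur with the input word.
Both the input word and the output word are one-hot encoded, giving a sparse vector in which exactly one position is 1.
To save computation, the model does not carry out the full matrix multiplication; it simply selects the row of the weight matrix whose index matches the position of the 1.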
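The row-lookup shortcut can be checked with a small sketch. Plain Python lists are used so the example has no dependencies; the matrix `W`, the helper names and the values are made up for illustration.

```python
def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(cols)]


# Hypothetical hidden-layer weights: 4 vocabulary words, 3 dimensions.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
word_index = 2

full_product = vec_mat(one_hot(word_index, len(W)), W)  # full multiplication
row_lookup = W[word_index]                              # direct row selection
print(full_product, row_lookup)
```

Both routes give the same vector, which is why the hidden-layer weight matrix can be read as a lookup table of word vectors.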
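Training Skip-gram
A Word2Vec model is a very large neural network (its weight matrices are huge).
Weight matrices with millions of entries, combined with training samples numbering in the hundreds of millions, make naive training impractical.
Solutions:
1. Treat common word combinations or phrases as single 'words'.
2. Subsample high-frequency words to reduce the number of training samples.
3. Optimize with 'negative sampling', so that each training sample updates only a small part of the model's weights, which lowers the computational load.
3.1. During negative sampling, a small number of negative words are chosen at random and their weights are updated, together with the weights of the positive word.
3.2. Negative words are drawn from the 'unigram distribution': the probability that a word is chosen as a negative sample depends on how often it occurs, and more frequent words are more likely to be chosen.
3.3. The negative sampling code keeps an array of 100 million elements, the 'unigram table', filled with the indices of the vocabulary words. The number of times a word appears in the table equals its negative sampling probability × 100 million. To draw a negative sample, generate a random integer between 0 and 100 million and take the word whose index is stored at that position in the table. The higher a word's negative sampling probability, the more often it appears in the table and the more likely it is to be picked.
4. Huffman tree: the input is n nodes with weights (w1, w2, …, wn) and the output is the corresponding Huffman tree. Once the tree is built, the leaves are usually given Huffman codes. Leaves with high weights sit close to the root and leaves with low weights sit far from it, so high-weight nodes get short codes and low-weight nodes get long ones. This minimizes the weighted path length of the tree and agrees with information theory: frequent words get shorter codes.
4.1. Treat (w1, w2, …, wn) as a forest of n trees, each consisting of a single node.
4.2. Pick the two trees in the forest whose roots have the smallest weights and merge them into a new tree. The two trees become the left and right subtrees of the new tree, and the weight of the new root is the sum of the two root weights.
4.3. Remove those two trees from the forest and add the merged tree.
4.4. Repeat 4.2 and 4.3 until only one tree is left in the forest.
4.5. Word2Vec uses the convention that the left subtree is coded 1 and the right subtree is coded 0, and that the weight of the left subtree is no smaller than that of the right subtree.
5. Hierarchical softmax: to avoid computing softmax probabilities over every word, Word2Vec replaces the mapping from the hidden layer to the softmax output layer with a Huffman tree. The tree is set up as follows:
5.1. Build the Huffman tree from the labels and their frequencies (the more frequent a label, the shorter its path in the tree).
5.2. Each leaf of the Huffman tree represents one label. For a word w:
5.2.1. p: the path from the root to the leaf corresponding to w
5.2.2. l: the number of nodes on path p
5.2.3. p1, p2, …, pl: the l nodes on path p, where p1 is the root, p2 is the second node on the path to w, and pl is the leaf for w
5.2.4. d2, d3, …, dl ∈ {0, 1}: the Huffman code of w, made up of l-1 bits, where dj is the code of the j-th node on path p (the root has no code)
5.2.5. θ1, θ2, …, θ(l-1) ∈ R: the vectors of the non-leaf nodes on path p, where θj is the vector of the j-th non-leaf node
5.3. A Huffman tree is a binary tree, so each step down the tree is a binary classification. In Word2Vec, 1 denotes the negative class and 0 the positive class, and the classification is done with the sigmoid function.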
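The Huffman construction and the hierarchical softmax probability described above can be sketched together. This is a simplified illustration under stated assumptions, not the word2vec source: `build_huffman` and `path_probability` are hypothetical names, the word counts are made up, and each inner node gets an arbitrary score standing in for the dot product between the hidden vector and that node's θ. The conventions follow the text: the heavier subtree goes left with code 1, and code 0 is the positive class.

```python
import heapq
import itertools
import math


def build_huffman(freqs):
    """Return {word: (code, path)}: code is the list of 0/1 bits from the
    root, path is the list of inner-node ids visited on the way down."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(counter), {"word": w}) for w, f in freqs.items()]
    heapq.heapify(heap)
    next_id = 0
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)  # smallest weight
        f2, _, n2 = heapq.heappop(heap)  # second smallest weight
        # Heavier subtree on the left (code 1), lighter on the right (code 0).
        node = {"id": next_id, "left": n2, "right": n1}
        next_id += 1
        heapq.heappush(heap, (f1 + f2, next(counter), node))
    root = heap[0][2]
    result = {}
    stack = [(root, [], [])]
    while stack:
        node, code, path = stack.pop()
        if "word" in node:
            result[node["word"]] = (code, path)
        else:
            stack.append((node["left"], code + [1], path + [node["id"]]))
            stack.append((node["right"], code + [0], path + [node["id"]]))
    return result


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def path_probability(code, path, scores):
    """Multiply the binary decisions along the path to a leaf."""
    prob = 1.0
    for bit, node_id in zip(code, path):
        s = sigmoid(scores[node_id])
        prob *= s if bit == 0 else 1.0 - s  # 0 = positive, 1 = negative
    return prob


freqs = {"the": 50, "of": 30, "cat": 12, "sat": 6, "mat": 2}
tree = build_huffman(freqs)
# Arbitrary per-node scores; in a real model these come from x . theta.
scores = {i: 0.3 * i - 0.5 for i in range(len(freqs) - 1)}
total = sum(path_probability(c, p, scores) for c, p in tree.values())
for word, (code, _) in sorted(tree.items(), key=lambda kv: len(kv[1][0])):
    print(word, "".join(map(str, code)))
print("sum of leaf probabilities:", total)
```

Because every inner node splits its probability mass between its two children, the leaf probabilities add up to 1 without ever normalizing over the whole vocabulary, and a word with a short code needs only a few sigmoid evaluations.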
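Training word vectors with Word2Vec
'''
model = Word2Vec(sentences, workers=num_workers, size=num_features)
Parameters:
sentences - the corpus; can be a list. For large corpora, building it with BrownCorpus, Text8Corpus or LineSentence is recommended
sg - selects the training algorithm; the default 0 is CBOW, sg=1 uses skip-gram
size - dimensionality of the feature vectors, default 100 (renamed vector_size in gensim 4.0). A larger size needs more training data but can give better results
window - maximum distance between the current word and the predicted word within a sentence
alpha - initial learning rate
seed - random seed
min_count - truncates the vocabulary: words that occur fewer than min_count times are dropped, default 5
max_vocab_size - limits RAM use while the vocabulary is being built.
If the number of unique words exceeds the limit, the least frequent ones are pruned. Every 10 million word types need about 1GB of RAM
sample - threshold for randomly downsampling high-frequency words, default 1e-3; the useful range is (0, 1e-5)
workers - number of worker threads used for training
hs - hs=1 uses hierarchical softmax; hs=0 uses negative sampling (as long as negative is non-zero)
iter - number of iterations (epochs) over the corpus, default 5 (renamed epochs in gensim 4.0)
'''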
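For the `hs=0` case listed above, a scaled-down sketch of the unigram table from the negative sampling notes may help. It shrinks the table from 100 million slots to a hypothetical 1,000. The names `build_unigram_table` and `sample_negatives` and the toy counts are illustrative and are not gensim's API. The 0.75 exponent mirrors the original word2vec C implementation and is an assumption here, since the text above only says that the probability depends on frequency.

```python
import random


def build_unigram_table(counts, table_size=1000, power=0.75):
    """Fill a table with word indices in proportion to count ** power."""
    words = list(counts)
    weights = [counts[w] ** power for w in words]
    total = sum(weights)
    table = []
    for idx, weight in enumerate(weights):
        slots = round(weight / total * table_size)
        table.extend([idx] * slots)
    return words, table


def sample_negatives(words, table, positive, k, rng):
    """Draw k negative words by picking random table positions."""
    negatives = []
    while len(negatives) < k:
        candidate = words[table[rng.randrange(len(table))]]
        if candidate != positive:  # never use the positive word as a negative
            negatives.append(candidate)
    return negatives


counts = {"the": 500, "cat": 60, "sat": 30, "mat": 10}  # made-up counts
words, table = build_unigram_table(counts)
rng = random.Random(2020)
negatives = sample_negatives(words, table, positive="cat", k=5, rng=rng)
print({w: table.count(i) for i, w in enumerate(words)})
print(negatives)
```

Drawing a negative sample is then a single random index into the table, which is why the table trades memory for speed.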
Configuring the log output
logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')
'''
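level - sets the logging level
format - specifies the output format
%(asctime)s - time at which the record was logged
%(levelname)s - name of the log level
%(levelno)s - numeric value of the log level
%(message)s - the log message
%(funcName)s - name of the function that issued the log call
%(lineno)d - line number of the log call
%(thread)d - thread ID
%(process)d - process ID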
'''
Setting the random seeds
seed = 2020
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
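As a small illustration of why the seeds are fixed, here is a sketch that uses only Python's standard `random` module (the helper `shuffled_indices` is hypothetical). The same seed reproduces the same shuffle, which keeps a random fold split repeatable between runs.

```python
import random


def shuffled_indices(seed, n=10):
    """Shuffle range(n) with a private generator seeded by `seed`."""
    rng = random.Random(seed)
    items = list(range(n))
    rng.shuffle(items)
    return items


first = shuffled_indices(2020)
second = shuffled_indices(2020)
print(first)
print(first == second)
```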
10-fold cross-validation
fold_num = 10
data_file = r'D:\Users\Felixteng\Documents\Pycharm Files\Nlp\data\train_set.csv'
Defining the fold-splitting function
def all_data2fold(fold_num, num=10000):
    fold_data = []
    f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
    # tolist() converts the column to a list; only the first `num` samples are kept (10000 by default)
    texts = f['text'].tolist()[:num]
    labels = f['label'].tolist()[:num]
    # number of labelled samples, which should equal `num`
    total = len(labels)
    # build an index list from 0 to total - 1
    index = list(range(total))
    # shuffle the index list
    np.random.shuffle(index)
    all_texts = []
    all_labels = []
    # reorder texts and labels to follow the shuffled index list
    for i in index:
        all_texts.append(texts[i])
        all_labels.append(labels[i])
    label2id = {}
    # range(total) runs from 0 to total - 1
    # group the sample positions by label in a dict
    for i in range(total):
        label = str(all_labels[i])
        if label not in label2id:
            label2id[label] = [i]
        else:
            label2id[label].append(i)
    # create fold_num empty lists to hold the indices of each fold
    all_index = [[] for _ in range(fold_num)]
    for label, data in label2id.items():
        # printing the counts shows that the classes are unevenly distributed
        print(label, len(data))
        # split each class into fold_num parts, rounding down
        batch_size = int(len(data) / fold_num)
        # other is the remainder left over after the even split
        other = len(data) - batch_size * fold_num
        for i in range(fold_num):
            # the first `other` folds get batch_size + 1 items, the rest get batch_size
            cur_batch_size = batch_size + 1 if i < other else batch_size
            print(cur_batch_size)
            # shift by the extras already handed out so slices neither overlap nor skip items
            offset = i * batch_size + min(i, other)
            batch_data = [data[offset + b] for b in range(cur_batch_size)]
            all_index[i].extend(batch_data)
    batch_size = int(total / fold_num)
    other_texts = []
    other_labels = []
    other_num = 0
    start = 0
    for fold in range(fold_num):
        num = len(all_index[fold])
        # collect the texts and labels that belong to this fold
        texts = [all_texts[i] for i in all_index[fold]]
        labels = [all_labels[i] for i in all_index[fold]]
        if num > batch_size:
            # fold is too large: keep batch_size items and move the surplus to the spare pool
            fold_texts = texts[:batch_size]
            other_texts.extend(texts[batch_size:])
            fold_labels = labels[:batch_size]
            other_labels.extend(labels[batch_size:])
            other_num += num - batch_size
        elif num < batch_size:
            # fold is too small: top it up from the spare pool
            end = start + batch_size - num
            fold_texts = texts + other_texts[start: end]
            fold_labels = labels + other_labels[start: end]
            start = end
        else:
            fold_texts = texts
            fold_labels = labels
        assert batch_size == len(fold_labels)
        # shuffle the samples inside each finished fold
        index = list(range(batch_size))
        np.random.shuffle(index)
        shuffle_fold_texts = []
        shuffle_fold_labels = []
        for i in index:
            shuffle_fold_texts.append(fold_texts[i])
            shuffle_fold_labels.append(fold_labels[i])
        data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
        fold_data.append(data)
    logging.info('Fold lens %s', str([len(data['label']) for data in fold_data]))
    return fold_data
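The per-class step inside `all_data2fold` hands the remainder of each class to the first few folds. The following standalone sketch isolates that idea; `fold_sizes` and `split_contiguous` are hypothetical helper names, and a running offset is used so that consecutive slices never share an item.

```python
def fold_sizes(n_items, fold_num):
    """Sizes of fold_num folds; the first (n_items % fold_num) get one extra."""
    base, extra = divmod(n_items, fold_num)
    return [base + 1 if i < extra else base for i in range(fold_num)]


def split_contiguous(items, fold_num):
    """Cut items into consecutive, non-overlapping folds of those sizes."""
    folds, start = [], 0
    for size in fold_sizes(len(items), fold_num):
        folds.append(items[start:start + size])
        start += size
    return folds


class_indices = list(range(23))  # e.g. positions of 23 samples of one class
folds = split_contiguous(class_indices, fold_num=10)
print([len(f) for f in folds])
```

Applying this per class keeps the class proportions roughly equal across folds, which is the point of a stratified split.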
Splitting the dataset into 10 folds
fold_data = all_data2fold(fold_num=10, num=200000)
Building the training set for Word2Vec
fold_id = 9
train_texts = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])
logging.info('Total %d docs.' % len(train_texts))
2020-07-31 22:47:16,645 INFO: Fold lens [20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000]
2020-07-31 22:48:19,883 INFO: Total 180000 docs.
As the log shows, the training set contains 180,000 documents (9 folds of 20,000 each).
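Starting training
logging.info('Starting training...')
num_features = 100  # word vector dimensionality
num_works = 8  # number of worker threads
train_texts = list(map(lambda x: list(x.split()), train_texts))
'''
split() with no arguments splits on any run of whitespace, including spaces, newlines and tabs
map() paired with a lambda applies the function to every element of the iterable passed in and returns the results as a new iterable (an iterator in Python 3, hence the list() call)
'''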
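A tiny example of the tokenization step, using made-up strings rather than the actual dataset: `split()` without arguments handles mixed whitespace, and a list comprehension gives the same result as the `map` and `lambda` combination.

```python
docs = ["hello  world\nfoo", "  bar\tbaz  "]  # hypothetical raw texts

tokenized = list(map(lambda x: x.split(), docs))
same_result = [doc.split() for doc in docs]

print(tokenized)
print(tokenized == same_result)
```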
2020-07-31 22:49:01,006 INFO: Starting training...
Creating the model instance
model = Word2Vec(train_texts, workers=num_works, size=num_features)
2020-07-31 22:54:54,262 INFO: collecting all words and their counts
2020-07-31 22:54:54,270 INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-07-31 22:54:59,512 INFO: PROGRESS: at sentence #10000, processed 9148165 words, keeping 5311 word types
2020-07-31 22:55:04,945 INFO: PROGRESS: at sentence #20000, processed 18193591 words, keeping 5704 word types
2020-07-31 22:55:10,346 INFO: PROGRESS: at sentence #30000, processed 27309645 words, keeping 5908 word types
2020-07-31 22:55:15,596 INFO: PROGRESS: at sentence #40000, processed 36283313 words, keeping 6087 word types
2020-07-31 22:55:20,184 INFO: PROGRESS: at sentence #50000, processed 45410969 words, keeping 6225 word types
2020-07-31 22:55:26,618 INFO: PROGRESS: at sentence #60000, processed 54548917 words, keeping 6324 word types
2020-07-31 22:55:33,732 INFO: PROGRESS: at sentence #70000, processed 63656832 words, keeping 6391 word types
2020-07-31 22:55:39,213 INFO: PROGRESS: at sentence #80000, processed 72844840 words, keeping 6435 word types
2020-07-31 22:55:45,167 INFO: PROGRESS: at sentence #90000, processed 81847771 words, keeping 6481 word types
2020-07-31 22:55:50,762 INFO: PROGRESS: at sentence #100000, processed 91001853 words, keeping 6541 word types
2020-07-31 22:55:55,067 INFO: PROGRESS: at sentence #110000, processed 100092931 words, keeping 6586 word types
2020-07-31 22:56:00,453 INFO: PROGRESS: at sentence #120000, processed 109101979 words, keeping 6621 word types
2020-07-31 22:56:04,678 INFO: PROGRESS: at sentence #130000, processed 118112658 words, keeping 6672 word types
2020-07-31 22:56:10,503 INFO: PROGRESS: at sentence #140000, processed 126928025 words, keeping 6689 word types
2020-07-31 22:56:15,953 INFO: PROGRESS: at sentence #150000, processed 136177978 words, keeping 6718 word types
2020-07-31 22:56:20,138 INFO: PROGRESS: at sentence #160000, processed 145204090 words, keeping 6759 word types
2020-07-31 22:56:25,443 INFO: PROGRESS: at sentence #170000, processed 154259340 words, keeping 6788 word types
2020-07-31 22:56:30,140 INFO: collected 6826 word types from a corpus of 163331797 raw words and 180000 sentences
2020-07-31 22:56:30,148 INFO: Loading a fresh vocabulary
2020-07-31 22:56:30,588 INFO: effective_min_count=5 retains 5978 unique words (87% of original 6826, drops 848)
2020-07-31 22:56:30,588 INFO: effective_min_count=5 leaves 163330134 word corpus (99% of original 163331797, drops 1663)
2020-07-31 22:56:30,644 INFO: deleting the raw counts dictionary of 6826 items
2020-07-31 22:56:30,700 INFO: sample=0.001 downsamples 61 most-common words
2020-07-31 22:56:30,700 INFO: downsampling leaves estimated 140986777 word corpus (86.3% of prior 163330134)
2020-07-31 22:56:30,740 INFO: estimated required memory for 5978 words and 100 dimensions: 7771400 bytes
2020-07-31 22:56:30,740 INFO: resetting layer weights
2020-07-31 22:56:32,372 INFO: training model with 8 workers on 5978 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-07-31 22:56:33,476 INFO: EPOCH 1 - PROGRESS: at 0.71% examples, 965562 words/s, in_qsize 14, out_qsize 1
2020-07-31 22:56:34,493 INFO: EPOCH 1 - PROGRESS: at 1.35% examples, 908630 words/s, in_qsize 13, out_qsize 2
2020-07-31 22:56:35,501 INFO: EPOCH 1 - PROGRESS: at 2.03% examples, 927393 words/s, in_qsize 14, out_qsize 2
2020-07-31 22:56:36,513 INFO: EPOCH 1 - PROGRESS: at 2.66% examples, 923767 words/s, in_qsize 14, out_qsize 1
......
......
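This call L2-normalizes the trained word vectors and treats them as final. With replace=True the original vectors are discarded, which saves memory, but the model can no longer be trained further.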
model.init_sims(replace=True)
2020-07-31 23:13:39,065 INFO: precomputing L2-norms of word weight vectors
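The "precomputing L2-norms" step in the log amounts to scaling each vector to unit length. A minimal sketch in plain Python, with hypothetical helper names and toy vectors, shows why this is convenient: once vectors are normalized, cosine similarity reduces to a dot product.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine(u, v):
    """Cosine similarity computed from the raw vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u, v = [3.0, 4.0], [4.0, 3.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)
dot_of_units = sum(a * b for a, b in zip(u_hat, v_hat))
print(cosine(u, v), dot_of_units)
```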
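Saving the model
'''
A model saved with save() can be loaded later and trained further from that point
A model saved with save_word2vec_format() cannot, but with binary=False the saved file is plain text that can be opened and inspected directly
'''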
model.save(r'D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin')
2020-07-31 23:14:08,564 INFO: saving Word2Vec object under D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin, separately None
2020-07-31 23:14:08,588 INFO: not storing attribute vectors_norm
2020-07-31 23:14:08,588 INFO: not storing attribute cum_table
2020-07-31 23:14:08,733 INFO: saved D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin
Loading the model
model = Word2Vec.load(r'D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin')
2020-07-31 23:17:06,610 INFO: loading Word2Vec object from D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin
2020-07-31 23:17:06,682 INFO: loading wv recursively from D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin.wv.* with mmap=None
2020-07-31 23:17:06,682 INFO: setting ignored attribute vectors_norm to None
2020-07-31 23:17:06,682 INFO: loading vocabulary recursively from D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin.vocabulary.* with mmap=None
2020-07-31 23:17:06,682 INFO: loading trainables recursively from D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin.trainables.* with mmap=None
2020-07-31 23:17:06,682 INFO: setting ignored attribute cum_table to None
2020-07-31 23:17:06,682 INFO: loaded D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.bin
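Converting the format
'''
A model saved this way can be opened as plain text; the binary argument controls whether it is written as a binary file instead
However, this format drops the tree structure when saving (see how word2vec is built: the words are stored in a Huffman-tree-like form)
so the model cannot be trained any further afterwards
'''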
model.wv.save_word2vec_format(r'D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.txt', binary=False)
2020-07-31 23:17:44,648 INFO: storing 5978x100 projection weights into D:\Users\Felixteng\Documents\Pycharm Files\Nlp\word2vec.txt
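With binary=False the file is readable text: a header line with the vocabulary size and the vector dimension (5978 and 100 in the log above), followed by one line per word with the word and its space-separated values. A minimal reader sketch follows, using an in-memory sample instead of the real file; `read_word2vec_text` is a hypothetical helper, not a gensim function, and the sample values are made up.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format into a {word: [floats]} dict."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        line = line.rstrip()
        if not line:
            continue  # tolerate blank lines
        parts = line.split(" ")
        if len(parts) != dim + 1:
            raise ValueError("unexpected number of fields: %r" % line)
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header promised %d words, found %d"
                         % (vocab_size, len(vectors)))
    return vectors


sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample)
print(vectors)
```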
The next task will cover the use of BERT, including pretraining and fine-tuning.