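Introduction
Train Chinese word vectors with TensorFlow and use them for a few simple semantic tasks
Review
The full-stack course covered how to use gensim to train Chinese word vectors, i.e. word embeddings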
- http://study.163.com/course/courseLearn.htm?courseId=1003520028&lessonId=1004013764
- https://note.youdao.com/share/?id=2155cf875395e84d92ef80baeae7c3c0&type=notebook#/WEB5c4ea5da9d01cbfa7cd3e5bd9a748ac9
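Install gensim if it is not already installed
pip install gensim
Prepare a corpus, for example the word-segmented Chinese Wikipedia corpus
Load the libraries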
# -*- coding: utf-8 -*-
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import time
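Train the model and save it; on my laptop training took 1403 seconds in total
t0 = int(time.time())
sentences = LineSentence('wiki.zh.word.text')
# Note: gensim 4.0+ renamed the `size` argument to `vector_size`
model = Word2Vec(sentences, size=128, window=5, min_count=5, workers=4)
print('Training took %d s' % (int(time.time()) - t0))
model.save('gensim_128')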
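Load the model and use it
model = Word2Vec.load('gensim_128')
# Related words ('数学' = mathematics)
items = model.wv.most_similar('数学')
for i, item in enumerate(items):
    print(i, item[0], item[1])
# Semantic analogy ('纽约' = New York, '中国' = China, '北京' = Beijing)
print('=' * 20)
items = model.wv.most_similar(positive=['纽约', '中国'], negative=['北京'])
for i, item in enumerate(items):
    print(i, item[0], item[1])
# Odd one out ('早餐' = breakfast, '午餐' = lunch, '晚餐' = dinner, '手机' = mobile phone)
print('=' * 20)
print(model.wv.doesnt_match(['早餐', '午餐', '晚餐', '手机']))
# Similarity score ('男人' = man, '女人' = woman)
print('=' * 20)
print(model.wv.similarity('男人', '女人'))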
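How it works
A word vector is a representation of a word
- With word vectors, a sentence can be represented as a sequence of vectors, i.e. a 2-D Tensor
- Several sentences of equal length can be represented as a 3-D Tensor
Put simply, the word vectors as a whole form a 2-D matrix of shape V × d, where V is the vocabulary size and d is the dimension of each word vector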
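The shapes described above can be checked with a small NumPy sketch. The sizes and the random matrix are toy assumptions made only for illustration; they are not the values used later in this post.

```python
import numpy as np

# Toy sizes, chosen only for illustration
V, d = 6, 4
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(V, d))  # one row per word id

# One sentence as a sequence of word ids -> 2-D tensor [sentence_length, d]
sentence = np.array([2, 0, 5])
sentence_vectors = embedding_matrix[sentence]
print(sentence_vectors.shape)  # (3, 4)

# Several equal-length sentences -> 3-D tensor [num_sentences, sentence_length, d]
batch = np.array([[2, 0, 5], [1, 3, 4]])
batch_vectors = embedding_matrix[batch]
print(batch_vectors.shape)  # (2, 3, 4)
```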
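One-Hot encoding represents each word as a V-dimensional vector in which only the dimension corresponding to that word is 1 and all other dimensions are 0
Word embedding maps this high-dimensional, sparse One-Hot vector to the low-dimensional, dense, real-valued word vector of that word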
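One way to see this mapping: multiplying a One-Hot vector by the V × d matrix selects a single row, which is why an embedding layer can be implemented as a plain row lookup. The sketch below uses a toy matrix (an assumption for illustration) to show that both routes give the same vector.

```python
import numpy as np

V, d = 5, 3  # toy sizes, for illustration only
embedding_matrix = np.arange(V * d, dtype=np.float64).reshape(V, d)

word_id = 3
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

# Sparse V-dimensional input times the matrix ...
via_matmul = one_hot @ embedding_matrix
# ... equals simply reading row `word_id`
via_lookup = embedding_matrix[word_id]
print(via_matmul, via_lookup)
```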
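There are two main methods for training word vectors
- CBOW (Continuous Bag-of-Words): predict the current word from its context words
- Skip-Gram: predict the context words from the current word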
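The difference between the two methods is easiest to see in the training samples each one produces. The helper functions and the three-word sentence below are hypothetical and exist only for this illustration.

```python
def skip_gram_pairs(tokens, window):
    """(center, context) pairs: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_samples(tokens, window):
    """(context_list, center) samples: predict the center word from its context."""
    samples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        samples.append((context, center))
    return samples


# Toy sentence: 'I' / 'like' / 'mathematics'
tokens = ['我', '喜欢', '数学']
print(skip_gram_pairs(tokens, window=1))
print(cbow_samples(tokens, window=1))
```

Skip-Gram yields one sample per (center, context) pair, while CBOW yields one sample per position with the whole context as input.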
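Here we focus on how Skip-Gram works
The input is the integer id or One-Hot representation of a word. The Embedding layer turns it into the corresponding word vector, and a projection layer followed by softmax then produces an output probability for every word in the vocabulary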
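As a rough sketch of this forward pass, the NumPy code below runs one word id through an embedding lookup, a linear projection, and a softmax over the full vocabulary. All sizes, the random weights, and the function name are assumptions for illustration, not the model trained later in this post.

```python
import numpy as np

V, d = 8, 4  # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(V, d))   # Embedding layer: one row per word
out_weights = rng.normal(size=(V, d))  # output projection
out_biases = np.zeros(V)


def skip_gram_probs(word_id):
    """Return, for every word in the vocabulary, the probability of being a context word."""
    hidden = embeddings[word_id]                # [d]
    logits = out_weights @ hidden + out_biases  # [V]
    logits = logits - logits.max()              # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


probs = skip_gram_probs(3)
print(probs.shape, probs.sum())
```

Note that the projection touches all V rows of the output weights, which is the cost that grows with the vocabulary size.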
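Because the vocabulary is usually very large (tens of thousands, hundreds of thousands, or even millions of words), doing multi-class classification directly over the whole vocabulary is computationally very expensive
An effective solution is Negative Sampling, i.e. randomly sampling some negative examples at each step
Suppose the vocabulary size is 50,000. For a given input word the correct output word is known; if we then pick N words at random from the vocabulary, the probability that any of them happens to be the correct output word is very low, so they can be treated as negative examples
By analogy, compare the following two tasks: the first is multi-class classification, the second is a set of binary classifications
- Given one picture of a dog, name its breed
- Given five pictures of dogs, decide for each one whether it is a husky
In this way, one 50,000-class classification problem becomes N binary classification problems, which still provide a learnable gradient while greatly reducing the computation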
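A minimal sketch of the idea, assuming a sigmoid-based binary loss with one positive word and N sampled negatives. Uniform sampling is an assumption made here for simplicity; real implementations typically sample according to word frequency, and this is not the exact loss computed by the TensorFlow code later in this post.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(center_vec, out_weights, positive_id, negative_ids):
    """Binary cross-entropy over 1 positive word and N sampled negative words."""
    pos_score = out_weights[positive_id] @ center_vec    # scalar
    neg_scores = out_weights[negative_ids] @ center_vec  # [N]
    return -np.log(sigmoid(pos_score)) - np.sum(np.log(sigmoid(-neg_scores)))


V, d, N = 50000, 16, 5  # toy embedding size and number of negatives
rng = np.random.default_rng(0)
center_vec = rng.normal(scale=0.1, size=d)
out_weights = rng.normal(scale=0.1, size=(V, d))
positive_id = 42
# Assumption: negatives drawn uniformly from the vocabulary
negative_ids = rng.choice(V, size=N, replace=False)

loss = negative_sampling_loss(center_vec, out_weights, positive_id, negative_ids)
print(loss)
```

Only N + 1 rows of the output weights are read per example, instead of all 50,000.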
In practice, Noise-Contrastive Estimation (NCE) can be used as the loss function; in TensorFlow this is available as tf.nn.nce_loss()
Implementation
Load the libraries and the corpus, 254419 lines in total
# -*- coding: utf-8 -*-
import pickle
import numpy as np
import tensorflow as tf
import collections
from tqdm import tqdm
with open('wiki.zh.word.text', 'rb') as fr:
    lines = fr.readlines()
print('%d lines in total' % len(lines))
print(lines[0].decode('utf-8'))
There are 148134974 words in total
lines = [line.decode('utf-8') for line in lines]
words = ' '.join(lines)
words = words.replace('\n', '').split(' ')
print('%d words in total' % len(words))
Define the vocabulary
vocab_size = 50000
vocab = collections.Counter(words).most_common(vocab_size - 1)
Word frequency counts
count = [['UNK', 0]]
count.extend(vocab)
print(count[:10])
Mappings between words and ids
word2id = {}
id2word = {}
for i, w in enumerate(count):
    word2id[w[0]] = i
    id2word[i] = w[0]
print(id2word[100], word2id['数学'])
Convert the corpus to id sequences; there are 22385926 UNKs in total
data = []
for i in tqdm(range(len(lines))):
    line = lines[i].strip('\n').split(' ')
    d = []
    for word in line:
        if word in word2id:
            d.append(word2id[word])
        else:
            # Out-of-vocabulary words map to id 0 (UNK)
            d.append(0)
            count[0][1] += 1
    data.append(d)
print('Number of UNKs: %d' % count[0][1])
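The vocabulary steps above can be tried on a toy corpus to see what each structure holds. The `toy_` names and the seven-word corpus are hypothetical and exist only for this sketch.

```python
import collections

toy_words = ['a', 'b', 'a', 'c', 'a', 'b', 'd']
toy_vocab_size = 3  # includes the slot reserved for UNK

# Slot 0 is UNK, followed by the (vocab_size - 1) most frequent words
toy_count = [['UNK', 0]]
toy_count.extend(collections.Counter(toy_words).most_common(toy_vocab_size - 1))

toy_word2id = {w: i for i, (w, _) in enumerate(toy_count)}
toy_id2word = {i: w for w, i in toy_word2id.items()}

# Words outside the vocabulary ('c' and 'd') fall back to id 0
toy_ids = [toy_word2id.get(w, 0) for w in toy_words]
print(toy_count)  # [['UNK', 0], ('a', 3), ('b', 2)]
print(toy_ids)    # [1, 2, 1, 0, 1, 2, 0]
```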
Prepare the training data
X_train = []
Y_train = []
window = 3
for i in tqdm(range(len(data))):
    d = data[i]
    for j in range(len(d)):
        # Context window [j - window, j + window], clipped to the sentence
        start = j - window
        end = j + window
        if start < 0:
            start = 0
        if end >= len(d):
            end = len(d) - 1
        while start <= end:
            if start == j:
                # Skip the center word itself
                start += 1
                continue
            else:
                X_train.append(d[j])
                Y_train.append(d[start])
                start += 1
X_train = np.squeeze(np.array(X_train))
Y_train = np.squeeze(np.array(Y_train))
# tf.nn.nce_loss expects labels of shape [batch_size, 1]
Y_train = np.expand_dims(Y_train, -1)
print(X_train.shape, Y_train.shape)
Define the model parameters
batch_size = 128
embedding_size = 128
valid_size = 16
valid_range = 100
valid_examples = np.random.choice(valid_range, valid_size, replace=False)
num_negative_samples = 64
Define the model
X = tf.placeholder(tf.int32, shape=[batch_size], name='X')
Y = tf.placeholder(tf.int32, shape=[batch_size, 1], name='Y')
valid = tf.placeholder(tf.int32, shape=[None], name='valid')
embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, X)
nce_weights = tf.Variable(tf.truncated_normal([vocab_size, embedding_size], stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocab_size]))
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights, biases=nce_biases, labels=Y, inputs=embed, num_sampled=num_negative_samples, num_classes=vocab_size))
optimizer = tf.train.AdamOptimizer().minimize(loss)
Normalize the word vectors and compute their similarity to the given words
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), axis=1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
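The same normalize-then-multiply computation can be sketched in NumPy, which makes it easy to see why the nearest-neighbor lookups later skip the first result. The function name and the toy vectors are assumptions for illustration.

```python
import numpy as np


def nearest_ids(embeddings, query_id, top_k):
    """Ids of the top_k rows most similar (by cosine) to row query_id."""
    norms = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
    normalized = embeddings / norms
    # Dot product of unit vectors = cosine similarity to every word
    sim = normalized @ normalized[query_id]
    # Position 0 after sorting is the query itself (similarity 1), so skip it
    return (-sim).argsort()[1: top_k + 1]


toy_embeddings = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 3.0], [-1.0, 0.0]])
print(nearest_ids(toy_embeddings, query_id=0, top_k=2))
```

Row 1 points in almost the same direction as row 0 despite its larger length, so it ranks first; normalization makes the comparison depend on direction only.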
Train the model
sess = tf.Session()
sess.run(tf.global_variables_initializer())
offset = 0
losses = []
for i in tqdm(range(1000000)):
    # Wrap around when fewer than batch_size examples remain
    if offset + batch_size >= X_train.shape[0]:
        offset = (offset + batch_size) % X_train.shape[0]
    X_batch = X_train[offset: offset + batch_size]
    Y_batch = Y_train[offset: offset + batch_size]
    _, loss_ = sess.run([optimizer, loss], feed_dict={X: X_batch, Y: Y_batch})
    losses.append(loss_)
    # Report the average loss every 2000 iterations
    if i % 2000 == 0 and i > 0:
        print('Iteration %d Average Loss %f' % (i, np.mean(losses)))
        losses = []
    # Print the nearest neighbors of the validation words every 10000 iterations
    if i % 10000 == 0:
        sim = sess.run(similarity, feed_dict={valid: valid_examples})
        for j in range(valid_size):
            valid_word = id2word[valid_examples[j]]
            top_k = 5
            # Index 0 is the word itself, so take positions 1..top_k
            nearests = (-sim[j, :]).argsort()[1: top_k + 1]
            s = 'Nearest to %s:' % valid_word
            for k in range(top_k):
                s += ' ' + id2word[nearests[k]]
            print(s)
    offset += batch_size
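The offset handling in the loop above matters because the X placeholder has a fixed shape of [batch_size], so every batch must be full. The sketch below isolates that wrap-around logic in a hypothetical generator and runs it on toy sizes.

```python
def batch_offsets(num_examples, batch_size, num_steps):
    """Yield the start offset of each batch, wrapping around like the training loop."""
    offset = 0
    for _ in range(num_steps):
        # Not enough examples left for a full batch: wrap to the beginning
        if offset + batch_size >= num_examples:
            offset = (offset + batch_size) % num_examples
        yield offset
        offset += batch_size


print(list(batch_offsets(num_examples=10, batch_size=4, num_steps=5)))
# [0, 4, 2, 0, 4]
```

With 10 examples and a batch size of 4, the third batch starts at offset 2 rather than 8, so it still contains 4 examples; the trailing examples of a pass are picked up on later passes depending on where the offset lands.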
Save the model, the final word vectors, and the mapping dictionaries
saver = tf.train.Saver()
saver.save(sess, './tf_128')
final_embeddings = sess.run(normalized_embeddings)
with open('tf_128.pkl', 'wb') as fw:
    pickle.dump({'embeddings': final_embeddings, 'word2id': word2id, 'id2word': id2word}, fw, protocol=4)
Use the trained model and word vectors on a single machine
Load the libraries, the saved word vectors, and the mapping dictionaries
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import pickle
with open('tf_128.pkl', 'rb') as fr:
    data = pickle.load(fr)
    final_embeddings = data['embeddings']
    word2id = data['word2id']
    id2word = data['id2word']
Take the 200 most frequent words that are longer than one character and visualize their word vectors in 2-D with t-SNE
word_indexs = []
count = 0
plot_only = 200
for i in range(1, len(id2word)):
    if len(id2word[i]) > 1:
        word_indexs.append(i)
        count += 1
        if count == plot_only:
            break
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[word_indexs, :])
labels = [id2word[i] for i in word_indexs]
plt.figure(figsize=(15, 12))
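for i, label in enumerate(labels):
    x, y = two_d_embeddings[i, :]
    plt.scatter(x, y)
    plt.annotate(label, (x, y), ha='center', va='top', fontproperties='Microsoft YaHei')
plt.savefig('词向量降维可视化.png')
As the plot shows, semantically related words do end up close to each other
You can also load the TensorFlow model and feed the valid placeholder the ids of some words to get their most similar words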
sess = tf.Session()
sess.run(tf.global_variables_initializer())
saver = tf.train.import_meta_graph('tf_128.meta')
saver.restore(sess, tf.train.latest_checkpoint('.'))
graph = tf.get_default_graph()
valid = graph.get_tensor_by_name('valid:0')
similarity = graph.get_tensor_by_name('MatMul_1:0')
word = '数学'
sim = sess.run(similarity, feed_dict={valid: [word2id[word]]})
top_k = 10
nearests = (-sim[0, :]).argsort()[1: top_k + 1]
s = 'Nearest to %s:' % word
for k in range(top_k):
    s += ' ' + id2word[nearests[k]]
print(s)
The 10 words most related to 数学 (mathematics)
Nearest to 数学: 理论 物理学 应用 物理 科学 化学 定义 哲学 生物学 天文学
Use the word vectors for other semantic tasks
# Similarity score ('男人' = man, '女人' = woman)
def cal_sim(w1, w2):
    # The saved embeddings are already L2-normalized, so the dot product is the cosine similarity
    return np.dot(final_embeddings[word2id[w1]], final_embeddings[word2id[w2]])
print(cal_sim('男人', '女人'))
# Related words
word = '数学'
sim = [[id2word[i], cal_sim(word, id2word[i])] for i in range(len(id2word))]
sim.sort(key=lambda x: x[1], reverse=True)
top_k = 10
for i in range(top_k):
    # Skip index 0, which is the query word itself
    print(sim[i + 1])
# Odd one out: the word with the lowest average similarity to the group
def find_mismatch(words):
    scores = {word: np.mean([cal_sim(word, w) for w in words]) for word in words}
    scores = sorted(scores.items(), key=lambda x: x[1])
    return scores[0][0]
print(find_mismatch(['早餐', '午餐', '晚餐', '手机']))
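The gensim section also showed a semantic analogy query, which the tasks above do not cover. A minimal sketch of how it could be done on raw vectors follows; the `analogy` function and the five-word toy vocabulary are hypothetical, and the ranking rule (cosine similarity to the sum of positives minus the sum of negatives) is an assumption about a common approach rather than a reproduction of gensim's implementation.

```python
import numpy as np


def analogy(embeddings, word2id, id2word, positive, negative, top_k=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = np.zeros(embeddings.shape[1])
    for w in positive:
        query += embeddings[word2id[w]]
    for w in negative:
        query -= embeddings[word2id[w]]
    query /= np.linalg.norm(query)
    sim = embeddings @ query  # rows are assumed to be unit-length
    exclude = set(positive) | set(negative)
    ranked = [id2word[int(i)] for i in (-sim).argsort() if id2word[int(i)] not in exclude]
    return ranked[:top_k]


# Toy unit-length vectors: axis 0 ~ "male", axis 1 ~ "female", axis 2 ~ "royal"
s = 1.0 / np.sqrt(2.0)
toy_words = ['man', 'woman', 'king', 'queen', 'apple']
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [s, 0.0, s],
    [0.0, s, s],
    [0.6, 0.0, -0.8],
])
toy_word2id = {w: i for i, w in enumerate(toy_words)}
toy_id2word = {i: w for i, w in enumerate(toy_words)}

print(analogy(toy_embeddings, toy_word2id, toy_id2word,
              positive=['king', 'woman'], negative=['man']))
```

With the real data, the same function could be called with final_embeddings, word2id, and id2word, for example with positive=['纽约', '中国'] and negative=['北京'].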
References
- Efficient Estimation of Word Representations in Vector Space:https://arxiv.org/abs/1301.3781
- Distributed Representations of Words and Phrases and their Compositionality:https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
- Vector Representations of Words:https://www.tensorflow.org/tutorials/word2vec