博客内容将首发在微信公众号"跟我一起读论文啦啦",上面会定期分享机器学习、深度学习、数据挖掘、自然语言处理等高质量论文,欢迎关注!
一直以来,对
w
o
r
d
2
v
e
c
word2vec
word2vec,以及对
t
e
n
s
o
r
f
l
o
w
tensorflow
tensorflow 里面的
w
o
r
d
E
m
b
e
d
d
i
n
g
wordEmbedding
wordEmbedding底层实现原理一直模糊不清,由此决心阅读
w
o
r
d
2
V
e
c
word2Vec
word2Vec的两篇原始论文,
E
f
f
i
c
i
e
n
t
E
s
t
i
m
a
t
i
o
n
o
f
W
o
r
d
R
e
p
r
e
s
e
n
t
a
t
i
o
n
s
i
n
V
e
c
t
o
r
S
p
a
c
e
Efficient\ Estimation\ of\ Word\ Representations\ in\ Vector\ Space
Efficient Estimation of Word Representations in Vector Space,
D
i
s
t
r
i
b
u
t
e
d
R
e
p
r
e
s
e
n
t
a
t
i
o
n
s
o
f
W
o
r
d
s
a
n
d
P
h
r
a
s
e
s
a
n
d
t
h
e
i
r
C
o
m
p
o
s
i
t
i
o
n
a
l
i
t
y
Distributed\ Representations\ of\ Words\ and\ Phrases\ and\ their\ Compositionality
Distributed Representations of Words and Phrases and their Compositionality,看完以后还是有点半懂半不懂的感觉,于是又结合网上的一些比较好的讲解(Word2Vec Tutorial - The Skip-Gram Model),以及开源的实现代码理解了一遍,在此总结一下。
下面主要以 s k i p − g r a m skip-gram skip−gram 模型来介绍 w o r d 2 V e c word2Vec word2Vec。
word2vec工作流程
- w o r d 2 V e c word2Vec word2Vec只是一个 三层 的神经网络。
- 喂给模型一个 w o r d word word,然后用来预测它周边的词。
- 然后去掉最后一层,只保存 i n p u t _ l a y e r input\_layer input_layer 和 h i d d e n _ l a y e r hidden\_layer hidden_layer。
- 从词表中选取一个词,喂给模型,在 h i d d e n _ l a y e r hidden\_layer hidden_layer 将会给出该词的 e m b e d d i n g r e p e s e n t a t i o n embedding\ repesentation embedding repesentation。
import numpy as np
import tensorflow as tf
corpus_raw = 'He is the king . The king is royal . She is the royal queen '
# convert to lower case
corpus_raw = corpus_raw.lower()
上述代码非常简单和易懂,现在我们需要获取 i n p u t o u t p u t p a i r input\ output\ pair input output pair,假设我们现在有这样一个任务,喂给模型一个词,我们需要获取它周边的词,举例来说,就是获取该词前 n n n个和后 n n n个词,那么这个 n n n就是代码中的 w i n d o w _ s i z e window\_size window_size,例如下图:
注意:如果这个词是一个句子的开头或结尾, w i n d o w window window 忽略窗外的词。
我们需要对文本数据进行一个简单的预处理,创建一个 w o r d 2 i n t word2int word2int的字典和 i n t 2 w o r d int2word int2word的字典。
words = []
for word in corpus_raw.split():
if word != '.': # because we don't want to treat . as a word
words.append(word)
words = set(words) # so that all duplicate words are removed
word2int = {}
int2word = {}
vocab_size = len(words) # gives the total number of unique words
for i,word in enumerate(words):
word2int[word] = i
int2word[i] = word
来看看这个字典有啥效果:
print(word2int['queen'])
-> 42 (say)
print(int2word[42])
-> 'queen'
好,现在可以获取训练数据啦
data = []
WINDOW_SIZE = 2
for sentence in sentences:
for word_index, word in enumerate(sentence):
for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1] :
if nb_word != word:
data.append([word, nb_word])
上述代码就是切句子,然后切词,得出的一个个训练样本 [ w o r d , n b _ w o r d ] [word,\ nb\_word] [word, nb_word],其中 w o r d word word就是模型输入, n b _ w o r d nb\_word nb_word就是该词周边的某个单词。
把 d a t a data data打印出来看看?
print(data)
[['he', 'is'],
['he', 'the'],
['is', 'he'],
['is', 'the'],
['is', 'king'],
['the', 'he'],
['the', 'is'],
['the', 'king'],
.
.
.
]
现在我们有了训练数据了,但是需要将它转成模型可读可理解的形式,这时,上面的 w o r d 2 i n t word2int word2int字典的作用就来了。
来,我们更进一步的对 w o r d word word进行处理,并使其转成 o n e − h o t one-hot one−hot向量
i.e.,
say we have a vocabulary of 3 words : pen, pineapple, apple
where
word2int['pen'] -> 0 -> [1 0 0]
word2int['pineapple'] -> 1 -> [0 1 0]
word2int['apple'] -> 2 -> [0 0 1]
那么为啥是 o n e − h o t one-hot one−hot特征呢?稍后将解释。
# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
temp = np.zeros(vocab_size)
temp[data_point_index] = 1
return temp
x_train = [] # input word
y_train = [] # output word
for data_word in data:
x_train.append(to_one_hot(word2int[ data_word[0] ], vocab_size))
y_train.append(to_one_hot(word2int[ data_word[1] ], vocab_size))
# convert them to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
利用 t e n s o r f l o w tensorflow tensorflow建立模型
# making placeholders for x_train and y_train
x = tf.placeholder(tf.float32, shape=(None, vocab_size))
y_label = tf.placeholder(tf.float32, shape=(None, vocab_size))
由上图,我们可以看出,我们将 i n p u t input input转换成 e m b e d d i n g _ r e p r e s e n t a t i o n embedding\_representation embedding_representation,并且将 v o c a b S i z e vocabSize vocabSize维度降低到设定的 e m b e d d i n g _ d i m embedding\_dim embedding_dim。
EMBEDDING_DIM = 5 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
b1 = tf.Variable(tf.random_normal([EMBEDDING_DIM])) #bias
hidden_representation = tf.add(tf.matmul(x,W1), b1)
接下来,我们需要使用 s o f t m a x softmax softmax函数来预测该 w o r d word word周边的词。
W2 = tf.Variable(tf.random_normal([EMBEDDING_DIM, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_representation, W2), b2))
所以整体的过程如下:
input_one_hot ---> embedded repr. ---> predicted_neighbour_prob
predicted_prob will be compared against a one hot vector to correct it.
好了,来看看怎么训这个模型
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init) #make sure you do this!
# define the loss function:
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), reduction_indices=[1]))
# define the training step:
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)
n_iters = 10000
# train for n_iter iterations
for _ in range(n_iters):
sess.run(train_step, feed_dict={x: x_train, y_label: y_train})
print('loss is : ', sess.run(cross_entropy_loss, feed_dict={x: x_train, y_label: y_train}))
在训的过程中,你可以看到 l o s s loss loss的变化:
loss is : 2.73213
loss is : 2.30519
loss is : 2.11106
loss is : 1.9916
loss is : 1.90923
loss is : 1.84837
loss is : 1.80133
loss is : 1.76381
loss is : 1.73312
loss is : 1.70745
loss is : 1.68556
loss is : 1.66654
loss is : 1.64975
loss is : 1.63472
loss is : 1.62112
loss is : 1.6087
loss is : 1.59725
loss is : 1.58664
loss is : 1.57676
loss is : 1.56751
loss is : 1.55882
loss is : 1.55064
loss is : 1.54291
loss is : 1.53559
loss is : 1.52865
loss is : 1.52206
loss is : 1.51578
loss is : 1.50979
loss is : 1.50408
loss is : 1.49861
.
.
.
最终 l o s s loss loss会收敛,即使其 a c c u r a c y accuracy accuracy不能达到很高的水平,我们并不 c a r e care care这点,我们最终的目的是获取较好的 W 1 W1 W1和 b 1 b1 b1,也就是 h i d d e n _ r e p e s e n t a t i o n hidden\_repesentation hidden_repesentation。
为什么是 o n e − h o t one-hot one−hot?
当我们用 o n e − h o t one-hot one−hot向量乘以 W 1 W1 W1时,获取的是 W 1 W1 W1矩阵的某一行,所以 W 1 W1 W1扮演的是一个 l o o k u p t a b l e look\ up\ table look up table。
在我们这个代码例子中,可以看看 " q u e e n " "queen" "queen"在 W 1 W1 W1中的 r e p e s e t a t i o n repesetation repesetation。
print(vectors[ word2int['queen'] ])
# say here word2int['queen'] is 2
->
[-0.69424796 -1.67628145 3.07313657 -1.14802659 -1.2207377 ]
给定一个向量,我们可以获取与其最近的向量
def euclidean_dist(vec1, vec2):
return np.sqrt(np.sum((vec1-vec2)**2))
def find_closest(word_index, vectors):
min_dist = 10000 # to act like positive infinity
min_index = -1
query_vector = vectors[word_index]
for index, vector in enumerate(vectors):
if euclidean_dist(vector, query_vector) < min_dist and not np.array_equal(vector, query_vector):
min_dist = euclidean_dist(vector, query_vector)
min_index = index
return min_index
我们来看看,与 " k i n g " 、 " q u e e n " 、 " r o y a l " "king"、"queen"、"royal" "king"、"queen"、"royal"最近的词:
print(int2word[find_closest(word2int['king'], vectors)])
print(int2word[find_closest(word2int['queen'], vectors)])
print(int2word[find_closest(word2int['royal'], vectors)])
->
queen
king
he
进阶
上面总结的主要是第一篇论文 E f f i c i e n t E s t i m a t i o n o f W o r d R e p r e s e n t a t i o n s i n V e c t o r S p a c e Efficient\ Estimation\ of\ Word\ Representations\ in\ Vector\ Space Efficient Estimation of Word Representations in Vector Space内的内容,虽然只是一个三层的神经网络,但是在海量训练数据的情况下,需要极大的计算资源来支撑整个过程,举例来说,我们设定的 e m b e d d i n g _ s i z e = 300 embedding\_size=300 embedding_size=300时,而 v o c a b _ s i z e = 10 , 000 vocab\_size=10,000 vocab_size=10,000时,这时 W 1 W1 W1矩阵的维度就达到了 10 , 000 ∗ 300 = 3 m i l l i o n 10,000*300=3 million 10,000∗300=3million!!,这个时候再用 S G D SGD SGD来优化训练过程就显得十分缓慢,但是有时候你必须使用大量的数据来训练模型来避免过拟合。论文 D i s t r i b u t e d R e p r e s e n t a t i o n s o f W o r d s a n d P h r a s e s a n d t h e i r C o m p o s i t i o n a l i t y Distributed\ Representations\ of\ Words\ and\ Phrases\ and\ their\ Compositionality Distributed Representations of Words and Phrases and their Compositionality介绍了几种解决办法。
-
采用下采样来降低训练样本数量
在 t e n s o r f l o w tensorflow tensorflow里面实现的 w o r d 2 V e c word2Vec word2Vec, v o c a b _ s z i e vocab\_szie vocab_szie并不是所有的 w o r d word word的数量,而且先统计了所有 w o r d word word的出现频次,然后选取出现频次最高的前 50000 50000 50000的词作为词袋。具体操作请看代码 tensorflow/examples/tutorials/word2vec/word2vec_basic.py,其余的词用 u n k unk unk代替。 -
采用一种所谓的"负采样"的操作,这种操作每次可以让一个样本只更新权重矩阵中一小部分,减小训练过程中的计算压力。
举例来说:一个 i n p u t o u t p u t p a i r input\ output\ pair input output pair 如: ( “ f o x ” , “ q u i c k ” ) (“fox”, “quick”) (“fox”,“quick”),由上面的分析可知,其 t r u e l a b e l true\ label true label为一个 o n e − h o t one-hot one−hot向量,并且该向量只是在 q u i c k quick quick的位置为1,其余的位置均为0,并且该向量的长度为 v o c a b s i z e vocab\ size vocab size,由此每个样本都缓慢能更新权重矩阵,而"负采样"操作只是随机选择其余的部分 w o r d word word,使得其在 t r u e l a b e l true\ label true label的位置为0,那么我们只更新对应位置的权重。例如我们如果选择负采样数量为5,则选取5个其余的 w o r d word word,使其对应的 o u t p u t output output为0,这个时候 o u t p u t output output只是6个神经元,本来我们一次需要更新 300 ∗ 10 , 000 300*10,000 300∗10,000参数,进行负采样操作以后只需要更新 300 ∗ 6 = 1800 300*6=1800 300∗6=1800个参数。 -
Hierarchical Softmax 是NLP中常用方法,详情可以查看Hierarchical Softmax 。其主要思想是以词频构建Huffman树,树的叶子节点为词表中的词,相应的高频词距离根结点更近。当需要计算生成某个词的概率时,不需要对所有词进行概率计算,而是选择在Huffman树中从根结点到该词所在结点的路径进行计算,得到生成该词的概率,时间复杂度从 O(N) 降低到 O(logN)(N个结点,则树的深度logN)
个人总结
- seq2seq模型,输入处都会乘以 e m b e d d i n g _ m a t r i x embedding\_matrix embedding_matrix,输出处都会乘以 e m b e d d i n g _ m a t r i x T embedding\_matrix^T embedding_matrixT,这两个embedding矩阵有时会共享,有时则不会。我认为 w o r d 2 V e c word2Vec word2Vec 其实就是 s e q 2 s e q seq2seq seq2seq 模型的原型,只不过应用到了不同的复杂场景中,根据场景需要,在内部加了 A t t e n t i o n Attention Attention 等机制,大致框架依然是 w o r d 2 V e c word2Vec word2Vec。
- w o r d 2 V e c word2Vec word2Vec 是当前自然语言处理领域的最基础知识,深刻理解 w o r d 2 v e c word2vec word2vec 原理非常重要。
个人感觉 w o r d 2 V e c word2Vec word2Vec了解到这个程度差不多了。
完整代码:
import tensorflow as tf
import numpy as np
corpus_raw = 'He is the king . The king is royal . She is the royal queen '
# convert to lower case
corpus_raw = corpus_raw.lower()
words = []
for word in corpus_raw.split():
if word != '.': # because we don't want to treat . as a word
words.append(word)
words = set(words) # so that all duplicate words are removed
word2int = {}
int2word = {}
vocab_size = len(words) # gives the total number of unique words
for i,word in enumerate(words):
word2int[word] = i
int2word[i] = word
# raw sentences is a list of sentences.
raw_sentences = corpus_raw.split('.')
sentences = []
for sentence in raw_sentences:
sentences.append(sentence.split())
WINDOW_SIZE = 2
data = []
for sentence in sentences:
for word_index, word in enumerate(sentence):
for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1] :
if nb_word != word:
data.append([word, nb_word])
# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
temp = np.zeros(vocab_size)
temp[data_point_index] = 1
return temp
x_train = [] # input word
y_train = [] # output word
for data_word in data:
x_train.append(to_one_hot(word2int[ data_word[0] ], vocab_size))
y_train.append(to_one_hot(word2int[ data_word[1] ], vocab_size))
# convert them to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
# making placeholders for x_train and y_train
x = tf.placeholder(tf.float32, shape=(None, vocab_size))
y_label = tf.placeholder(tf.float32, shape=(None, vocab_size))
EMBEDDING_DIM = 5 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
b1 = tf.Variable(tf.random_normal([EMBEDDING_DIM])) #bias
hidden_representation = tf.add(tf.matmul(x,W1), b1)
W2 = tf.Variable(tf.random_normal([EMBEDDING_DIM, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_representation, W2), b2))
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init) #make sure you do this!
# define the loss function:
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), reduction_indices=[1]))
# define the training step:
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)
n_iters = 10000
# train for n_iter iterations
for _ in range(n_iters):
sess.run(train_step, feed_dict={x: x_train, y_label: y_train})
print('loss is : ', sess.run(cross_entropy_loss, feed_dict={x: x_train, y_label: y_train}))
vectors = sess.run(W1 + b1)
def euclidean_dist(vec1, vec2):
return np.sqrt(np.sum((vec1-vec2)**2))
def find_closest(word_index, vectors):
min_dist = 10000 # to act like positive infinity
min_index = -1
query_vector = vectors[word_index]
for index, vector in enumerate(vectors):
if euclidean_dist(vector, query_vector) < min_dist and not np.array_equal(vector, query_vector):
min_dist = euclidean_dist(vector, query_vector)
min_index = index
return min_index
from sklearn.manifold import TSNE
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
vectors = model.fit_transform(vectors)
from sklearn import preprocessing
normalizer = preprocessing.Normalizer()
vectors = normalizer.fit_transform(vectors, 'l2')
print(vectors)
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
print(words)
for word in words:
print(word, vectors[word2int[word]][1])
ax.annotate(word, (vectors[word2int[word]][0],vectors[word2int[word]][1] ))
plt.show()