
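This article takes a close look at ways of representing word meaning, including WordNet, One Hot Encoding, Bag of Words, and Word2vec, and then covers the word2vec loss function, its training methods (the skip-gram model and the continuous bag-of-words model), and approximate training (hierarchical softmax and negative sampling). It also introduces the two doc2vec training algorithms, Distributed Memory (PV-DM) and Distributed Bag of Words (PV-DBOW), and discusses the practical strengths and weaknesses of doc2vec along with tuning tips.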




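In NLP, the most basic problem is how to represent the meaning of a word or a sentence. Each of the methods introduced below has its own strengths and weaknesses, and each one improves on those that came before it.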


WordNet

WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.


Advantages:

  1. Makes it easy to find synonyms.


Disadvantages:

  1. Misses new words (impossible to keep up to date).
  2. Subjective.
  3. Requires a lot of human labor to create and adapt.

One Hot Encoding

Discrete representation.


Disadvantages:

  1. The dimension is extremely high (one dimension per vocabulary word).
  2. Word similarity cannot be computed, because all vectors are orthogonal.
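To make the orthogonality problem concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the helper names `one_hot` and `dot` are assumptions chosen for illustration, not part of the original notes.

```python
def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vocabulary, assumed for illustration only.
vocab = ["hotel", "motel", "banana"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# A related pair and an unrelated pair get exactly the same score,
# because any two distinct one-hot vectors are orthogonal.
related = dot(vectors["hotel"], vectors["motel"])
unrelated = dot(vectors["hotel"], vectors["banana"])
print(related, unrelated)
```

Each vector is as long as the vocabulary, so the representation also grows by one dimension for every new word.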

Bag of Words

Co-occurrence of words with variable window size.


Disadvantages:

  1. The dimension is extremely high and keeps growing as the dictionary grows, which affects downstream ML models.
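As a rough sketch of window-based co-occurrence counting, the snippet below builds a small count matrix in plain Python. The function name `cooccurrence_counts`, the sample sentence, and the window size of 1 are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=1):
    """Count how often each (center, neighbor) pair appears within `window` positions."""
    counts = Counter()
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                counts[(center, tokens[j])] += 1
    return counts


# Toy corpus, assumed for illustration only.
tokens = "i like deep learning i like nlp".split()
counts = cooccurrence_counts(tokens, window=1)

# One row and one column per vocabulary word, so the matrix is |V| x |V|.
vocab = sorted(set(tokens))
matrix = [[counts[(a, b)] for b in vocab] for a in vocab]
for word, row in zip(vocab, matrix):
    print(word, row)
```

Every new word in the dictionary adds both a row and a column, which is the growth problem described above.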


Word2vec

A neural probabilistic language model.

Distributional similarity based representations. Represent a word by means of its neighbors: the context is enough to help us understand what a word means.

We will build a dense vector for each word type, chosen so that it is good at predicting other words appearing in its context.

Distributional similarity & Distributed representation (dense vector)

The two are related but not the same. Distributional similarity emphasizes that the meaning of a word can be inferred from its context. Distributed representation is the opposite of One Hot Encoding and emphasizes that the vector representation is non-sparse (dense).


Advantages:

  1. Word similarity can be computed accurately.
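The sketch below shows how similarity is usually measured between dense vectors, using cosine similarity in plain Python. The three-dimensional embedding values are made up for illustration and are not the output of any trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical dense vectors, hand-picked for illustration only.
embeddings = {
    "hotel": [0.8, 0.1, 0.3],
    "motel": [0.7, 0.2, 0.3],
    "banana": [-0.2, 0.9, -0.4],
}

sim_related = cosine(embeddings["hotel"], embeddings["motel"])
sim_unrelated = cosine(embeddings["hotel"], embeddings["banana"])
print(sim_related, sim_unrelated)
```

Unlike one-hot vectors, dense vectors can point in similar directions, so a related pair can score higher than an unrelated one.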


Disadvantages:

  1. What is learned is a relatedness-based word vector rather than a semantic one, so polysemy cannot be handled (there is one vector per word rather than one per meaning).

Loss Function

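Softmax function: the standard way to map from $\mathbb{R}^V$ to a probability distribution. The numerator of the formula turns each value into a positive number, and the denominator ensures that all probabilities sum to 1.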

$$p_i = \frac{\exp(u_i)}{\sum_{j} \exp(u_j)}$$
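A minimal sketch of the softmax formula in plain Python follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not in the original notes; it leaves the result unchanged because the shift cancels between numerator and denominator. The input scores are arbitrary.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to a probability distribution."""
    shift = max(scores)  # stability trick (an addition here); cancels out in the ratio
    exps = [math.exp(s - shift) for s in scores]  # numerator: always positive
    total = sum(exps)  # denominator: normalizes the sum to 1
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```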

Once we have the probability distribution over center/context words, we still need cross-entropy to turn it into a loss.

$$L(\hat y, y) = -\sum_{j=1}^V y_j \log(\hat y_j)$$

According to this formula, the loss is 0 when the prediction is perfect.
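The cross-entropy formula can be sketched as below, assuming `y_true` is a one-hot label and `y_pred` is a probability distribution such as a softmax output. Terms with $y_j = 0$ are skipped, following the usual convention that $0 \cdot \log 0 = 0$; the sample values are made up.

```python
import math


def cross_entropy(y_true, y_pred):
    """Cross-entropy between a target distribution and a predicted distribution."""
    # Skip zero-weight terms so a predicted probability of 0 on a wrong class is harmless.
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


y_true = [0, 1, 0]  # the true word is at index 1

perfect = cross_entropy(y_true, [0.0, 1.0, 0.0])
uncertain = cross_entropy(y_true, [0.25, 0.5, 0.25])
print(perfect, uncertain)
```

With a one-hot label the sum collapses to $-\log(\hat y_k)$ for the true index $k$, so the loss only looks at the probability given to the correct word.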


$$J = 1 - p(w_{-t} \mid w_t)$$

$w_{-t}$ denotes the context of $w_t$ (the minus sign means every word except $w_t$ itself).

$$p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)}$$

$o$ is the outside (or output) word index and $c$ is the center word index. $v_c$ and $u_o$ are the center and outside vectors of indices $c$ and $o$. The softmax uses word $c$ to obtain the probability of word $o$.

According to this formula, every word in the text is represented by two vectors: one for when it is the center word and another for when it is a context word.
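The two-vector setup can be sketched as follows: one table `V` of center vectors and one table `U` of outside vectors, with $p(o \mid c)$ computed as a softmax over dot products. The function name, the three-word vocabulary, and all vector values are assumptions made up for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def p_outside_given_center(o, c, U, V):
    """p(o|c): softmax over u_w^T v_c for every word w, evaluated at word o."""
    scores = [dot(u_w, V[c]) for u_w in U]
    shift = max(scores)  # numerical-stability shift; does not change the ratio
    exps = [math.exp(s - shift) for s in scores]
    return exps[o] / sum(exps)


# Every word index has two vectors: V[i] as a center word, U[i] as an outside word.
V = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4]]   # hypothetical center vectors
U = [[0.3, -0.1], [0.2, 0.2], [-0.4, 0.1]]  # hypothetical outside vectors

# Distribution over all outside words for center word 0.
dist = [p_outside_given_center(o, 0, U, V) for o in range(len(U))]
print(dist)
```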

$$
\begin{aligned}
\frac{\partial}{\partial v_c} \log p(o \mid c)
&= \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)} \\
&= \underbrace{\frac{\partial}{\partial v_c} \log \exp(u_o^T v_c)}_{\text{①}} - \underbrace{\frac{\partial}{\partial v_c} \log \sum_{w=1}^V \exp(u_w^T v_c)}_{\text{②}} \\
&= u_o - \sum_{x=1}^V p(x \mid c)\, u_x
\end{aligned}
$$
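
① is the observation: what the context word actually is (the true label). ② is the expectation: the average of all outside vectors $u_x$, weighted by the probability $p(x \mid c)$ that the model assigns to each word (the prediction). So in effect we want to minimize the gap between what is observed and what the model predicts.
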
$$
\begin{aligned}
\text{②} &= \frac{\partial}{\partial v_c} \log \sum_{w=1}^V \exp(u_w^T v_c) \\
&= \frac{1}{\sum_{w=1}^V \exp(u_w^T v_c)} \frac{\partial}{\partial v_c} \sum_{x=1}^V \exp(u_x^T v_c) \\
&= \frac{1}{\sum_{w=1}^V \exp(u_w^T v_c)} \sum_{x=1}^V \frac{\partial}{\partial v_c} \exp(u_x^T v_c) \\
&= \frac{1}{\sum_{w=1}^V \exp(u_w^T v_c)} \sum_{x=1}^V \exp(u_x^T v_c) \frac{\partial}{\partial v_c} (u_x^T v_c) \\
&= \frac{1}{\sum_{w=1}^V \exp(u_w^T v_c)} \sum_{x=1}^V \exp(u_x^T v_c)\, u_x \\
&= \sum_{x=1}^V \frac{\exp(u_x^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)}\, u_x \\
&= \sum_{x=1}^V p(x \mid c)\, u_x
\end{aligned}
$$
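As a sanity check on the derivation, the sketch below compares the closed-form gradient of $\log p(o \mid c)$ with respect to $v_c$, namely $u_o - \sum_x p(x \mid c)\, u_x$, against a central finite-difference estimate. The function names, the step size `h`, and the toy vectors are assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_p(o, v_c, U):
    """log p(o|c) = u_o^T v_c - log(sum_w exp(u_w^T v_c))."""
    return dot(U[o], v_c) - math.log(sum(math.exp(dot(u, v_c)) for u in U))


def analytic_grad(o, v_c, U):
    """Closed form from the derivation: u_o - sum_x p(x|c) u_x."""
    exps = [math.exp(dot(u, v_c)) for u in U]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Expected outside vector under the model's distribution (term 2).
    expected = [sum(p * u[k] for p, u in zip(probs, U)) for k in range(len(v_c))]
    return [U[o][k] - expected[k] for k in range(len(v_c))]


def numeric_grad(o, v_c, U, h=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for k in range(len(v_c)):
        plus, minus = list(v_c), list(v_c)
        plus[k] += h
        minus[k] -= h
        grad.append((log_p(o, plus, U) - log_p(o, minus, U)) / (2 * h))
    return grad


# Hypothetical outside vectors and center vector, for illustration only.
U = [[0.3, -0.1], [0.2, 0.2], [-0.4, 0.1]]
v_c = [0.1, 0.2]

exact = analytic_grad(1, v_c, U)
approx = numeric_grad(1, v_c, U)
max_gap = max(abs(a - b) for a, b in zip(exact, approx))
print(exact, approx, max_gap)
```

If the two gradients agree to within a small tolerance, the closed form matches the function it was derived from.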

