Updates:
9/4/20
Added some notes on the final loss computation of the skip-gram (SG) model. Corrected errors in the doc2vec loss-calculation section.
11/17/20
Added some content on approximate training.
Word Meaning Representation
In NLP, the most fundamental problem is how to represent the meaning of a word or a sentence. The approaches introduced below each have their strengths and weaknesses, and together they trace a path of steady improvement.
WordNet
WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
pros
- Can find synonyms.
cons
- Missing new words (impossible to keep up to date).
- Subjective.
- Requires significant human labor to create and maintain.
One Hot Encoding
Discrete representation.
cons
- Dimension is extremely high (one dimension per vocabulary word).
- Hard to compute accurate word similarity, since all one-hot vectors are orthogonal (see the sketch below).
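A minimal sketch (with a hypothetical three-word vocabulary) of why one-hot vectors cannot express similarity: the dot product between any two distinct vectors is 0, so "hotel" and "motel" look completely unrelated.

```python
import numpy as np

# Hypothetical toy vocabulary; any vocabulary behaves the same way.
vocab = ["hotel", "motel", "cat"]

def one_hot(word, vocab):
    """Return the one-hot vector for `word` over `vocab`."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

hotel, motel = one_hot("hotel", vocab), one_hot("motel", vocab)

# Distinct one-hot vectors are orthogonal, so their dot product
# (and cosine similarity) is always 0.
print(hotel @ motel)  # 0.0
```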
Bag of Words
Co-occurrence of words with variable window size.
cons
- Dimension is extremely high and grows with the size of the vocabulary, which hurts downstream ML models (see the sketch after this list).
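A minimal sketch of a window-based co-occurrence count (the window size of 1 and the toy corpus are assumptions). The resulting matrix is |V| × |V|, so it grows quadratically with the vocabulary.

```python
from collections import defaultdict

# Hypothetical toy corpus; real corpora are far larger.
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"]]
window = 1  # number of neighbors counted on each side

counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[w][sent[j]] += 1

vocab = sorted({w for sent in corpus for w in sent})
# One row and one column per vocabulary word: the matrix is |V| x |V|.
matrix = [[counts[r][c] for c in vocab] for r in vocab]
print(vocab)
for row in matrix:
    print(row)
```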
Word2vec
A neural probabilistic language model.
Distributional similarity based representations: represent a word by means of its neighbors. The context is sufficient to understand the meaning of a word.
We will build a dense vector for each word type, chosen so that it is good at predicting other words appearing in its context.
Distributional similarity & Distributed representation (dense vector)
There are certain differences between the two. Distributional similarity emphasizes that the meaning of a word should be inferred from its context. Distributed representation is the opposite of one-hot encoding: the vector representation is dense rather than sparse.
pros
- Can compute accurate word similarity.
cons
- What it learns are association-based word vectors rather than sense-level semantic vectors, so polysemy cannot be handled (there is one vector per word, not one per word sense).
Loss Function
Softmax function: the standard way to map a vector in $\mathbb{R}^V$ to a probability distribution. The numerator exponentiates a score into a positive number, and the denominator normalizes so that all probabilities sum to 1.
$$p_i = \frac{\exp(u_i)}{\sum_{j} \exp(u_j)}$$
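A minimal numpy sketch of this mapping (subtracting the max is only for numerical stability and does not change the result):

```python
import numpy as np

def softmax(u):
    """Map a real-valued score vector u to a probability distribution."""
    exp_u = np.exp(u - np.max(u))  # shift by the max for numerical stability
    return exp_u / exp_u.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())  # all entries positive, sum equals 1
```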
After obtaining the probability distribution over center/context words, we still need cross-entropy to turn it into a loss.
$$L(\hat y, y) = -\sum_{j=1}^{V} y_j \log(\hat y_j)$$ According to this formula, the loss is 0 when the prediction is perfect.
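A quick sketch of this loss with a one-hot true label $y$; when the predicted distribution puts all probability on the correct word, the loss is 0.

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Cross-entropy between a predicted distribution y_hat and a one-hot label y."""
    return -np.sum(y * np.log(y_hat + 1e-12))  # small epsilon avoids log(0)

y = np.array([0.0, 1.0, 0.0])                       # the true word is index 1
print(cross_entropy(np.array([0.2, 0.7, 0.1]), y))  # positive loss
print(cross_entropy(np.array([0.0, 1.0, 0.0]), y))  # ~0 for a perfect prediction
```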
Here we take skip-gram training as the example.
$$J = 1 - p(w_{-t} \mid w_t)$$
$w_{-t}$ denotes the context of $w_t$ (the minus sign means every word other than $w_t$).
$$p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}$$
o is the outside (or output) word index and c is the center word index. $v_c$ and $u_o$ are the center and outside vectors for indices c and o. The softmax uses the center word c to obtain the probability of the outside word o.
According to this formula, each word in the text is represented by two vectors: one when it appears as a center word and another when it appears as a context word.
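A minimal sketch of how $p(o \mid c)$ could be computed with these two embedding matrices (the vocabulary size, embedding dimension, and random initialization are assumptions for illustration):

```python
import numpy as np

V, d = 5, 3                           # hypothetical vocabulary size and embedding dimension
rng = np.random.default_rng(0)
V_center = rng.normal(size=(V, d))    # v vectors: used when a word is the center word
U_outside = rng.normal(size=(V, d))   # u vectors: used when a word is in the context

def p_o_given_c(o, c):
    """Softmax probability of outside word o given center word c."""
    scores = U_outside @ V_center[c]            # u_w^T v_c for every word w
    exp_scores = np.exp(scores - scores.max())  # stabilized softmax
    return exp_scores[o] / exp_scores.sum()

print(p_o_given_c(o=2, c=0))
```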
The derivation of the loss gradient mainly uses the softmax probability formula and the chain rule from calculus.
$$
\begin{aligned}
\frac{\partial}{\partial v_c} \log p(o \mid c)
&= \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \\
&= \underbrace{\frac{\partial}{\partial v_c} \log \exp(u_o^T v_c)}_{①} - \underbrace{\frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^T v_c)}_{②} \\
&= u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x
\end{aligned}
$$
① is the observation, i.e. what the context word actually is (the true label). ② is the expectation, i.e. which word the model believes is most likely (the predicted label). So what we are really doing is minimizing the gap between the observation and the model's expectation.
$$
\begin{aligned}
② &= \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^T v_c) \\
&= \frac{1}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \frac{\partial}{\partial v_c} \sum_{x=1}^{V} \exp(u_x^T v_c) \\
&= \frac{1}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \sum_{x=1}^{V} \frac{\partial}{\partial v_c} \exp(u_x^T v_c) \\
&= \frac{1}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \sum_{x=1}^{V} \exp(u_x^T v_c)\, \frac{\partial}{\partial v_c} (u_x^T v_c) \\
&= \frac{1}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \sum_{x=1}^{V} \exp(u_x^T v_c)\, u_x \\
&= \sum_{x=1}^{V} \frac{\exp(u_x^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}\, u_x \\
&= \sum_{x=1}^{V} p(x \mid c)\, u_x
\end{aligned}
$$
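The result $u_o - \sum_{x} p(x \mid c)\, u_x$ can be verified numerically. A small sketch (the toy matrices below are assumptions, redefined so the snippet stands alone) compares the analytic gradient of $\log p(o \mid c)$ with a finite-difference estimate:

```python
import numpy as np

V, d = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))   # outside vectors u_w, one row per word
v_c = rng.normal(size=d)      # center vector v_c
o = 2                         # index of the observed outside word

def log_p(v):
    """log p(o | c) as a function of the center vector v."""
    scores = U @ v
    return scores[o] - np.log(np.sum(np.exp(scores)))

# Analytic gradient: u_o - sum_x p(x|c) u_x
probs = np.exp(U @ v_c) / np.sum(np.exp(U @ v_c))
grad_analytic = U[o] - probs @ U

# Finite-difference estimate of d log p(o|c) / d v_c
eps = 1e-6
grad_numeric = np.array([
    (log_p(v_c + eps * np.eye(d)[i]) - log_p(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```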