Word2vec
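one-hot representation $\to$ local generalization
distributed representation $\to$ global generalization
"Distributed" means that the information about a word is spread across many dimensions.
Assumption behind word-vector models: similar words appear in similar contexts. "You shall know a word by the company it keeps."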
Skip-Gram Model
🎥266
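Use the center word to predict the surrounding words (two on each side in the example below). CBOW does the reverse: it uses the surrounding words to predict the center word.
We are working on NLP project, it is interesting
_ _ working _ _ project, it is interesting
==> maximize P(we|working)*P(are|working)*P(on|working)*P(NLP|working)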
We _ _ on _ _, it is interesting
We are _ _ NLP _, _ is interesting
We are working _ _ project, _ _ interesting
We are working on _ _, it _ _
$$
\begin{aligned}
&\arg\max_\Theta\prod_{w\in Text}\prod_{c\in Context(w)}P(c|w;\Theta)\\
\to&\arg\max_\Theta\log\prod_{w\in Text}\prod_{c\in Context(w)}P(c|w;\Theta)\\
=&\arg\max_\Theta\sum_{w\in Text}\sum_{c\in Context(w)}\log P(c|w;\Theta)
\end{aligned}
$$
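The model parameters are $\Theta=[U,V]$. $U$ and $V$ are both vocab_size × embed_size matrices: $U$ is the context-word matrix and $V$ is the center-word matrix. $w$ ranges over the center words and $c$ ranges over the words surrounding $w$.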
$$
P(c|w;\Theta)=\cfrac{\exp(U_c\cdot V_w)}{\sum\limits_{c'\in Vocab}\exp(U_{c'}\cdot V_w)}
$$
$$
\begin{aligned}
&\arg\max_\Theta\sum_{w\in Text}\sum_{c\in Context(w)}\log \cfrac{\exp(U_c\cdot V_w)}{\sum\limits_{c'\in Vocab}\exp(U_{c'}\cdot V_w)}\\
=&\arg\max_\Theta\sum_{w\in Text}\sum_{c\in Context(w)}\left[U_c\cdot V_w-\log\sum_{c'\in Vocab}\exp(U_{c'}\cdot V_w)\right]
\end{aligned}
$$
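To make the cost of the normalizing sum concrete, here is a minimal pure-Python sketch of the softmax probability and the log-likelihood above. The toy matrices `U` and `V`, the integer word indices, and the helper names `dot`, `softmax_prob`, and `log_likelihood` are illustrative assumptions, not part of the original notes.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax_prob(c, w, U, V):
    """P(c | w; Theta) with a full softmax over the vocabulary."""
    scores = [dot(u, V[w]) for u in U]        # one dot product per vocabulary word
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[c] / sum(exps)

def log_likelihood(pairs, U, V):
    """Sum of log P(c | w) over (center, context) index pairs."""
    return sum(math.log(softmax_prob(c, w, U, V)) for w, c in pairs)

# Toy setup (assumed): vocabulary of 4 words, embedding size 2.
U = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1], [-0.2, 0.4]]   # context vectors
V = [[0.2, 0.1], [-0.1, 0.3], [0.0, 0.0], [0.1, -0.2]]   # center vectors
pairs = [(0, 1), (1, 0), (1, 2), (2, 1)]                 # (center, context)
print(log_likelihood(pairs, U, V))
```

Every call to `softmax_prob` touches all rows of `U`, which is exactly the per-pair cost that grows with the vocabulary size.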
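Summing over the whole $Vocab$ in the denominator is the performance bottleneck: every training pair costs $O(|Vocab|)$ dot products. Two methods were proposed to address this:
- negative sampling
python word2vec.py -negative 5
- hierarchical softmax
Training word vectors requires several GB of data.
https://github.com/dav/word2vec/blob/master/src/word2vec.c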
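Negative Sampling
Introduce a new conditional probability $P(D=1|w_1,w_3)$: the probability that $w_1$ and $w_3$ form a genuine pair. It is sometimes described as the probability that both occur as context words, but more precisely one is the center word and the other is its context word.
Example
Today's weather is great.
Positive samples D=[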
(Today's weather)
(weather Today's)
(weather is)
(is weather)
(is great)
(great is)
]
Negative samples ~D=[
(Today's is)
(Today's great)
(weather great)
(is Today's)
(great Today's)
(great weather)
]
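The two lists above can be reproduced with a short sketch. It assumes a window of one word on each side (which is what the listed pairs correspond to) and treats every ordered pair of distinct words that is not a positive pair as negative; the function name `build_samples` is hypothetical.

```python
from itertools import permutations

def build_samples(tokens, window=1):
    """Return (positives, negatives) as lists of (center, context) pairs."""
    positives = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                positives.append((center, tokens[j]))
    seen = set(positives)
    vocab = list(dict.fromkeys(tokens))       # unique words, original order
    negatives = [p for p in permutations(vocab, 2) if p not in seen]
    return positives, negatives

tokens = "Today's weather is great".split()
D, D_neg = build_samples(tokens)
print(D)
print(D_neg)
```

With only four words the two sets have the same size, but the number of ordered word pairs grows quadratically with the vocabulary while the positives grow only linearly with the text length.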
Objective:
$$
\begin{aligned}
&\arg\max_\Theta \prod_{(w,c)\in D}P(D=1|w,c;\Theta)\prod_{(w,c)\in \tilde D}P(D=0|w,c;\Theta)\\
=&\arg\max_\Theta \prod_{(w,c)\in D}\cfrac{1}{1+\exp(-U_c\cdot V_w)}\prod_{(w,c)\in\tilde D}\left[1-\cfrac{1}{1+\exp(-U_c\cdot V_w)}\right]\\
\to&\arg\max_\Theta\sum_{(w,c)\in D}\log\sigma(U_c\cdot V_w)+\sum_{(w,c)\in\tilde D}\log\sigma(-U_c\cdot V_w)
\end{aligned}
$$
Background:
$$\sigma(x)={1\over 1+\exp(-x)}$$
$$\sigma(-x)=1-\sigma(x)$$
$$\sigma'(x)=\sigma(x)\sigma(-x)$$
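For completeness, a short derivation of the two identities, starting only from the definition of $\sigma$. The last line is the form that is convenient when differentiating a $\log\sigma(\cdot)$ term.

$$
\begin{aligned}
\sigma(-x)&=\cfrac{1}{1+\exp(x)}=\cfrac{\exp(-x)}{1+\exp(-x)}=1-\cfrac{1}{1+\exp(-x)}=1-\sigma(x)\\
\sigma'(x)&=\cfrac{\exp(-x)}{[1+\exp(-x)]^2}=\cfrac{1}{1+\exp(-x)}\cdot\cfrac{\exp(-x)}{1+\exp(-x)}=\sigma(x)\sigma(-x)\\
\cfrac{d}{dx}\log\sigma(x)&=\cfrac{\sigma'(x)}{\sigma(x)}=\sigma(-x)=1-\sigma(x)
\end{aligned}
$$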
As the corpus and vocabulary grow, negative pairs outnumber positive pairs by several orders of magnitude, so the negatives have to be sampled:
$$
\begin{aligned}
&\to\arg\max_\Theta\sum_{(w,c)\in D}\left[\log\sigma(U_c\cdot V_w)+\sum_{c'\in N(w)}\log\sigma(-U_{c'}\cdot V_w)\right]\\
&=:\arg\max_\Theta\sum_{(w,c)\in D}\ell(\Theta)
\end{aligned}
$$
$N(w)$ is the set of negative samples drawn for the center word $w$. For example:
S = I like NLP, it is interesting.
Vocab = ...
Group 1: positive (NLP like), negatives (NLP I) (NLP but)
Group 2: positive (NLP it), negatives (NLP hard) (NLP I)
Group 3: positive (it is), negatives (it interesting) (it hard)
Group 4: positive (it NLP), negatives (it hard) (it I)
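The grouping above can be sketched as a small sampler. This is only an illustration: the vocabulary list below is hypothetical (the notes leave Vocab unspecified), the function name `sample_negatives` is made up, and negatives are drawn uniformly for brevity, whereas the reference word2vec implementation draws them from a smoothed unigram distribution.

```python
import random

def sample_negatives(center, true_contexts, vocab, k=2, rng=random):
    """Draw k distinct negative (center, word) pairs, skipping the center
    word itself and the words actually observed in its context."""
    candidates = [v for v in vocab if v != center and v not in true_contexts]
    return [(center, word) for word in rng.sample(candidates, k)]

# Hypothetical vocabulary: the words of S plus "but" and "hard".
vocab = ["I", "like", "NLP", "it", "is", "interesting", "but", "hard"]
rng = random.Random(0)   # fixed seed so repeated runs give the same draw
print(sample_negatives("NLP", {"like", "it"}, vocab, k=2, rng=rng))
```

The same `k` negatives are drawn once per positive pair, so the cost per training pair is proportional to `k` instead of the vocabulary size.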
$$
\begin{aligned}
\cfrac{\partial \ell(\Theta)}{\partial U_c}&=\cfrac{\sigma(U_c\cdot V_w)[1-\sigma(U_c\cdot V_w)]V_w}{\sigma(U_c\cdot V_w)}=[1-\sigma(U_c\cdot V_w)]V_w\\
\cfrac{\partial \ell(\Theta)}{\partial U_{c'}}&=\cfrac{\sigma(-U_{c'}\cdot V_w)[1-\sigma(-U_{c'}\cdot V_w)](-V_w)}{\sigma(-U_{c'}\cdot V_w)}=[\sigma(-U_{c'}\cdot V_w)-1]V_w\\
\cfrac{\partial \ell(\Theta)}{\partial V_w}&=\cfrac{\sigma(U_c\cdot V_w)[1-\sigma(U_c\cdot V_w)]U_c}{\sigma(U_c\cdot V_w)}+\sum_{c'\in N(w)}\cfrac{\sigma(-U_{c'}\cdot V_w)[1-\sigma(-U_{c'}\cdot V_w)](-U_{c'})}{\sigma(-U_{c'}\cdot V_w)}\\
&=[1-\sigma(U_c\cdot V_w)]U_c+\sum_{c'\in N(w)}[\sigma(-U_{c'}\cdot V_w)-1]U_{c'}
\end{aligned}
$$
Then update the parameters with stochastic gradient steps. Because $\ell(\Theta)$ is being maximized, each step moves along the gradient (gradient ascent):
$$
\begin{aligned}
U_c&\leftarrow U_c+\eta\cfrac{\partial \ell(\Theta)}{\partial U_c}\\
U_{c'}&\leftarrow U_{c'}+\eta\cfrac{\partial \ell(\Theta)}{\partial U_{c'}}\\
V_w&\leftarrow V_w+\eta\cfrac{\partial \ell(\Theta)}{\partial V_w}
\end{aligned}
$$
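A single update for one positive pair and its sampled negatives might look like the following pure-Python sketch. The toy matrices, the word indices, the learning rate, and the names `loss` and `sgd_step` are assumptions for illustration. The code uses $\sigma(-x)-1=-\sigma(x)$, so the positive and negative cases share one expression, `target - sigmoid(score)`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(U, V, w, c, negs):
    """l(Theta) for one positive pair (w, c) and its sampled negatives."""
    positive = math.log(sigmoid(dot(U[c], V[w])))
    negative = sum(math.log(sigmoid(-dot(U[n], V[w]))) for n in negs)
    return positive + negative

def sgd_step(U, V, w, c, negs, eta=0.1):
    """One in-place gradient-ascent step on l(Theta)."""
    v_w = list(V[w])                 # snapshot: all gradients use the old V_w
    grad_v = [0.0] * len(v_w)
    # The positive context word has target 1, each sampled negative target 0.
    for ctx, target in [(c, 1.0)] + [(n, 0.0) for n in negs]:
        g = target - sigmoid(dot(U[ctx], v_w))
        for i in range(len(v_w)):
            grad_v[i] += g * U[ctx][i]       # accumulate dl/dV_w with the old U
            U[ctx][i] += eta * g * v_w[i]    # U <- U + eta * dl/dU
    for i in range(len(v_w)):
        V[w][i] += eta * grad_v[i]           # V_w <- V_w + eta * dl/dV_w

# Toy setup (assumed): 4 words, embedding size 2.
U = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1], [-0.2, 0.4]]
V = [[0.2, 0.1], [-0.1, 0.3], [0.0, 0.0], [0.1, -0.2]]
before = loss(U, V, w=0, c=1, negs=[2, 3])
sgd_step(U, V, w=0, c=1, negs=[2, 3])
print(before, loss(U, V, w=0, c=1, negs=[2, 3]))
```

Only the rows of `U` for the positive word and the sampled negatives, plus one row of `V`, are touched per step.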
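Evaluating Word Vectors
Visualization: the t-SNE algorithm
(also available in sklearn)
Similar word vectors end up clustered together.
Similarity comparison
Given a word pair such as (football, basketball), compute their relatedness.
This requires human-provided reference samples.
Analogy
woman : man = girl : ?
Beijing : Shanghai = Washington D.C. : ?
Method: woman - man should equal girl - ?, so look for the word whose vector is closest to girl - woman + man.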
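The analogy test can be sketched with cosine similarity. The 2-d vectors below are made up for illustration (first axis roughly "gender", second roughly "age"), and the function names are hypothetical; real evaluations use trained embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Solve a : b = c : ? by finding the word closest to c - a + b."""
    target = [vc - va + vb
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Made-up toy vectors, not trained embeddings.
vectors = {
    "woman": [1.0, 1.0],
    "man": [-1.0, 1.0],
    "girl": [1.0, -1.0],
    "boy": [-1.0, -1.0],
    "football": [0.1, 3.0],
}
print(analogy("woman", "man", "girl", vectors))   # boy
```

The query words themselves are excluded from the candidates, since the nearest vector to the target is often one of the inputs.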
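Applications
The word-vector training procedure can be reused for product recommendation:
treat a user's browsing history as a sequence of words and learn a vector for each product.
Some further modifications are needed; see 🎥273.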
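Subword (FastText)
For low-frequency words and unseen words there are two options:
① ignore them
② use subword information
For example, reading clearly splits into read and ing: read carries the meaning "to read" and ing marks the progressive form.
So words are split into smaller units,
which amounts to character n-grams (here with window size n=4):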
reader => ^rea, read, eade, ader, der$
studying => ^stu, stud, tudy, udyi, dyin, ying, ing$
going => ^goi, goin, oing, ing$
Train vectors for the character n-grams; the vector of a word is then the sum of its subword vectors:
[reader] = [^rea] + [read] + [eade] + [ader] + [der$] + [reader]
$$
P(reader|studying)=\cfrac{\exp\big([reader]\cdot[studying]\big)}{\sum_{c'}\exp\big([c']\cdot[studying]\big)}
$$
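The decomposition and the summation above can be sketched as follows, using `^` and `$` as boundary markers as in these notes. The lookup table `unit_vectors` and both function names are hypothetical; units missing from the table are simply skipped.

```python
def char_ngrams(word, n=4):
    """Character n-grams of a word padded with ^ and $ boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_vector(word, unit_vectors, dim, n=4):
    """Sum the vectors of the word's n-grams and of the whole word."""
    total = [0.0] * dim
    for unit in char_ngrams(word, n) + [word]:
        vec = unit_vectors.get(unit)      # unknown units contribute nothing
        if vec is not None:
            for i, x in enumerate(vec):
                total[i] += x
    return total

print(char_ngrams("reader"))
print(char_ngrams("studying"))
print(char_ngrams("going"))
```

Because an unseen word still shares n-grams with known words, `word_vector` can return a non-zero vector for it even when the whole-word entry is missing.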
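① This method suits languages with morphological structure, such as English. It is less practical for Chinese, where one could instead consider decomposing characters into radicals.
② Some subwords are rarely used, e.g. the [eade] part of [reader]; they can be filtered out in a way similar to stop-word removal.
③ The recommended n-gram size n is 3 to 6.
④ Worth trying when training data is scarce.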
ELMo
🎥281
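A language model built from a multi-layer recurrent network (deep Bi-LSTM)
It uses a deep learning model:
upper LSTM layers: semantic features
middle LSTM layers: syntactic features
LSTM input: word-level features
The plain language model is $P(w_2|w_1)P(w_3|w_1w_2)P(w_4|w_1\ldots w_3)$
The word features are obtained during training.
The upper-layer outputs serve as dynamic (context-dependent) word features; the input word vectors are static features.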
Word Representation Summary
① Matrix factorization
I back my car
I back the car
Back my car
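The vocabulary above has size |V|=5 (treating Back and back as the same word). Build a 5×5 matrix whose entry (row, column) counts how many times the row word is immediately followed by the column word:
        I   back  my   car  the
I       0   2     0    0    0
back    0   0     2    0    1
my      0   0     0    2    0
car     0   0     0    0    0
the     0   0     0    1    0
Then factorize this matrix, for example with SVD, $A = U\Sigma V^T$. The matrices $U$ and $V$ both hold features of the entities (here, the words).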
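As a sketch of how such a count matrix can be built, the function below counts ordered adjacent pairs: how often the row word is immediately followed by the column word. Lowercasing, whitespace tokenization, and the name `cooccurrence` are assumptions made for this example.

```python
def cooccurrence(sentences, vocab):
    """Count matrix A where A[i][j] is the number of times vocab[i]
    is immediately followed by vocab[j] (case-insensitive)."""
    index = {word.lower(): i for i, word in enumerate(vocab)}
    A = [[0] * len(vocab) for _ in vocab]
    for sentence in sentences:
        tokens = sentence.lower().split()
        for left, right in zip(tokens, tokens[1:]):
            A[index[left]][index[right]] += 1
    return A

sentences = ["I back my car", "I back the car", "Back my car"]
vocab = ["I", "back", "my", "car", "the"]
for word, row in zip(vocab, cooccurrence(sentences, vocab)):
    print(f"{word:5s}", row)
```

The resulting nested list can be passed to a linear-algebra routine such as `numpy.linalg.svd` for the factorization step; that call is left out here to keep the sketch dependency-free.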
② GloVe
Global methods (matrix factorization) have to be recomputed from scratch whenever new documents are added, while local methods (Skip-Gram) cannot capture global information. GloVe therefore combines MF with Skip-Gram.
③ Gaussian embedding
Each word vector is a distribution rather than a single point, e.g.
$$
e\sim N\left((0.1,0.2,0.1),\begin{bmatrix}0.1&0&0\\0&0.05&0\\0&0&0.1\end{bmatrix}\right)
$$
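To illustrate what it means for an embedding to be a distribution, the sketch below draws samples from the Gaussian above. It assumes a diagonal covariance matrix, as in the example, so each coordinate can be sampled independently; the function name `sample_embedding` is hypothetical.

```python
import math
import random

def sample_embedding(mean, variances, rng=random):
    """Draw one vector from N(mean, diag(variances))."""
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, variances)]

mean = [0.1, 0.2, 0.1]
variances = [0.1, 0.05, 0.1]      # diagonal of the covariance matrix
rng = random.Random(0)            # fixed seed for repeatable draws
for _ in range(3):
    print(sample_embedding(mean, variances, rng))
```

The variances express uncertainty about the word: a larger variance along a dimension means the word's position along that dimension is less sharply determined.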