The previous post covered the Hierarchical Softmax model; this one explains how CBOW and Skip-gram work under the Negative Sampling model. Compared with Hierarchical Softmax, Negative Sampling no longer uses a Huffman tree, which substantially improves performance.
1. Negative Sampling
In negative sampling, given a word $w$, how do we generate its negative-sample set $NEG(w)$? For a word $w$ with context $context(w)$, the word $w$ itself is a positive example and every other word is a negative example. But there are far too many negative examples — how do we choose among them? In the corpus $\mathcal{C}$, different words occur with different frequencies, and when sampling we want high-frequency words to be selected with higher probability and low-frequency words with lower probability. This is a weighted sampling problem.
Assign to each word $w$ in the vocabulary $\mathcal{D}$ a line segment of length:

$$len(w) = \frac{counter(w)}{\sum_{u \in \mathcal{D}} counter(u)} \tag{1}$$
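As a quick sketch of Eq. (1) — with a hypothetical toy word-count table of my own — the segment lengths are just normalized counts:

```python
# Hypothetical corpus counts; Eq. (1): len(w) = counter(w) / sum of all counts.
counter = {"the": 50, "cat": 10, "sat": 5, "mat": 5}
total = sum(counter.values())
length = {w: c / total for w, c in counter.items()}

# The normalized lengths tile the interval [0, 1].
assert abs(sum(length.values()) - 1.0) < 1e-9
print(length["the"])  # 50/70
```

Note that the actual word2vec C implementation raises each count to the 0.75 power before normalizing, which slightly flattens the distribution; the formula above is the unsmoothed version described here.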
The denominator in Eq. (1) is for normalization. Word2Vec implements the sampling as follows: let

$$l_0 = 0, \quad l_k = \sum_{j=1}^{k} len(w_j), \quad k = 1, 2, \dots, N$$

where $w_j$ is the $j$-th word in the vocabulary $\mathcal{D}$. The points $\{l_j\}_{j=0}^{N}$ then form a non-uniform partition of the interval $[0, 1]$, with $I_j = (l_{j-1}, l_j]$ denoting its subintervals. On top of this, add a uniform partition: Word2Vec picks $M = 10^8$ and distributes $M$ points evenly over $[0, 1]$, giving a mapping from the $M$ points to the intervals $I$, as shown in the figure below:
For the figure, see http://www.cnblogs.com/neopenx/p/4571996.html — an excellent article, highly recommended.
To draw a negative sample, take a random one of the $M$ points $[m_0, m_{M-1}]$ and map it to the corresponding interval in $I$, which identifies a word. If, for word $w_i$, the draw happens to hit $w_i$ itself, skip it. The size of the negative-sample set $NEG(w)$ defaults to 5 in the Word2Vec source code.
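The table-based sampler above can be sketched as follows (variable names are my own; a smaller $M$ is used here than word2vec's $10^8$, and the 0.75-power smoothing used by the real C code is omitted to match the formulas in this post):

```python
import random

def build_table(counter, M=10**5):
    """Map each of M equispaced points in [0, 1] to the word whose segment covers it."""
    words = list(counter)
    total = sum(counter.values())
    # Cumulative boundaries l_1 .. l_N of the non-uniform partition.
    bounds, cum = [], 0.0
    for w in words:
        cum += counter[w] / total
        bounds.append(cum)
    table, j = [], 0
    for i in range(M):
        point = (i + 0.5) / M          # midpoint of the i-th equal slice
        while point > bounds[j]:       # advance to the interval containing the point
            j += 1
        table.append(words[j])
    return table

def sample_negatives(table, w, k=5):
    """Draw k negative samples for word w, skipping w itself (NEG(w), default size 5)."""
    neg = []
    while len(neg) < k:
        u = random.choice(table)       # uniform over the M points = weighted over words
        if u != w:
            neg.append(u)
    return neg

counter = {"the": 50, "cat": 10, "sat": 5, "mat": 5}
table = build_table(counter)
print(sample_negatives(table, "cat"))
```

Because high-frequency words own more of the table's slots, a uniform draw over slots realizes exactly the weighted sampling of Eq. (1).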
2. CBOW
Assume the negative samples $NEG(w)$ for word $w$ have already been drawn. Define a label $L$ as follows: for $\forall \widetilde{w} \in \mathcal{D}$,
$$L^w(\widetilde{w}) = \begin{cases} 1, & \widetilde{w} = w; \\ 0, & \widetilde{w} \ne w. \end{cases}$$
For a given positive example $(context(w), w)$, we require:
$$\max g(w) = \max \prod_{u \in \{w\} \cup NEG(w)} p(u \mid context(w))$$
where

$$p(u \mid context(w)) = \begin{cases} \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u), & L^w(u) = 1 \\ 1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u), & L^w(u) = 0 \end{cases}$$
Written as a single expression:

$$p(u \mid context(w)) = \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)^{L^w(u)} \cdot (1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u))^{1 - L^w(u)}$$
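A quick numerical check — with a toy score value of my own choosing standing in for $\boldsymbol{x}_w^T \boldsymbol{\theta}^u$ — that the single expression reproduces both branches of the piecewise definition:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_combined(score, label):
    """sigma(score)^L * (1 - sigma(score))^(1-L) for a label L in {0, 1}."""
    s = sigmoid(score)
    return s ** label * (1 - s) ** (1 - label)

score = 0.8  # toy value of x_w^T theta^u
assert p_combined(score, 1) == sigmoid(score)        # L^w(u) = 1 branch
assert p_combined(score, 0) == 1 - sigmoid(score)    # L^w(u) = 0 branch
```

The exponent simply switches each factor on or off, which is what lets the product over $\{w\} \cup NEG(w)$ below collapse so cleanly.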
Now let's explain why we maximize $g(w)$:

$$\begin{aligned} g(w) &= \prod_{u \in \{w\} \cup NEG(w)} p(u \mid context(w)) \\ &= \prod_{u \in \{w\} \cup NEG(w)} \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)^{L^w(u)} (1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u))^{1 - L^w(u)} \\ &= \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^w) \prod_{u \in NEG(w)} (1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)) \end{aligned}$$
The factor in front of the product sign maximizes the probability of the positive example, while the product over the negatives minimizes the probability of the negative examples.
Likewise, for the whole corpus, let:

$$\mathcal{G} = \prod_{w \in \mathcal{C}} g(w)$$
Taking this as the overall objective function, we maximize its log-likelihood:

$$\begin{aligned} \mathcal{L} = \log \mathcal{G} &= \sum_{w \in \mathcal{C}} \log g(w) \\ &= \sum_{w \in \mathcal{C}} \sum_{u \in \{w\} \cup NEG(w)} \Big( L^w(u) \log[\sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)] + [1 - L^w(u)] \log[1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)] \Big) \end{aligned}$$
As in the earlier derivation, write:

$$L(w, u) = L^w(u) \log[\sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)] + [1 - L^w(u)] \log[1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)]$$
Then compute $\frac{\partial L(w,u)}{\partial \boldsymbol{x}_w}$ and $\frac{\partial L(w,u)}{\partial \boldsymbol{\theta}^u}$ separately; skipping the intermediate steps:
$$\begin{aligned} \frac{\partial L(w,u)}{\partial \boldsymbol{x}_w} &= [L^w(u) - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)] \, \boldsymbol{\theta}^u \\ \frac{\partial L(w,u)}{\partial \boldsymbol{\theta}^u} &= [L^w(u) - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)] \, \boldsymbol{x}_w \end{aligned}$$
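The skipped derivation can be sanity-checked numerically with finite differences — a small sketch using toy vectors of my own, without any ML library:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def L_wu(x, theta, label):
    """L(w,u) = L log sigma(x . theta) + (1 - L) log(1 - sigma(x . theta))."""
    s = sigmoid(sum(a * b for a, b in zip(x, theta)))
    return label * math.log(s) + (1 - label) * math.log(1 - s)

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(4)]      # toy x_w
theta = [random.uniform(-1, 1) for _ in range(4)]  # toy theta^u
label = 1

# Analytic gradient: dL/dx = (L^w(u) - sigma(x . theta)) * theta
s = sigmoid(sum(a * b for a, b in zip(x, theta)))
grad_x = [(label - s) * t for t in theta]

# Central finite difference on the first coordinate of x.
eps = 1e-6
x_plus = [x[0] + eps] + x[1:]
x_minus = [x[0] - eps] + x[1:]
numeric = (L_wu(x_plus, theta, label) - L_wu(x_minus, theta, label)) / (2 * eps)
assert abs(numeric - grad_x[0]) < 1e-6
print("gradient check passed")
```

By the symmetry of $\boldsymbol{x}_w^T \boldsymbol{\theta}^u$ in $\boldsymbol{x}_w$ and $\boldsymbol{\theta}^u$, the same check applies to the gradient with respect to $\boldsymbol{\theta}^u$.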
This yields the following update rules (with learning rate $\eta$):

$$\begin{aligned} \boldsymbol{\theta}^u &:= \boldsymbol{\theta}^u + \eta \, [L^w(u) - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)] \, \boldsymbol{x}_w \\ \boldsymbol{v}(\widetilde{w}) &:= \boldsymbol{v}(\widetilde{w}) + \eta \sum_{u \in \{w\} \cup NEG(w)} [L^w(u) - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)] \, \boldsymbol{\theta}^u \end{aligned}$$
where $\widetilde{w} \in context(w)$.
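Putting the pieces together, one SGD step for a single $(context(w), w)$ pair might look like the following sketch — toy data and function names are my own, and this is an illustration of the update rules above, not the original C implementation:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cbow_ns_step(v, theta, context, w, negatives, eta=0.025):
    """One CBOW negative-sampling update.
    v:     dict word -> input vector v(w~)
    theta: dict word -> output parameter theta^u
    """
    dim = len(next(iter(v.values())))
    # x_w: sum of the context word vectors.
    x = [sum(v[c][i] for c in context) for i in range(dim)]
    e = [0.0] * dim  # accumulates eta * (L^w(u) - sigma) * theta^u over all u
    for u, label in [(w, 1)] + [(u, 0) for u in negatives]:
        s = sigmoid(sum(a * b for a, b in zip(x, theta[u])))
        g = eta * (label - s)
        for i in range(dim):
            e[i] += g * theta[u][i]   # contribution to the v(w~) update
            theta[u][i] += g * x[i]   # theta^u := theta^u + eta (L - sigma) x_w
    for c in context:                 # v(w~) := v(w~) + e, for each w~ in context(w)
        for i in range(dim):
            v[c][i] += e[i]

random.seed(1)
vocab = ["the", "cat", "sat", "mat"]
v = {w: [random.uniform(-0.5, 0.5) for _ in range(8)] for w in vocab}
theta = {w: [0.0] * 8 for w in vocab}
cbow_ns_step(v, theta, context=["the", "sat"], w="cat", negatives=["mat"])
```

Note the accumulator `e`: each context vector receives the same summed correction, mirroring the $\boldsymbol{v}(\widetilde{w})$ update above, and $\boldsymbol{\theta}^u$ is updated inside the loop exactly as in the first rule. After one step, the model's score for the positive word rises and the score for the negative word falls.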