A Brief History of Word Vectors
In the beginning, a word vector simply represented a word (or a Chinese character) as a vector, usually a one-hot vector. People later noticed this has the following drawbacks (a quick sketch of both follows the list):
- Any two words are orthogonal (the inner product of any two distinct words is 0), so similarity cannot be computed.
- With 10,000 characters, each character needs a 10,000-dimensional vector (a single 1 and 9,999 zeros), which wastes an enormous amount of space.
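A minimal sketch of both problems (a hypothetical 10,000-word vocabulary, numpy only):

import numpy as np

vocab_size = 10000

def one_hot(idx, size=vocab_size):
    v = np.zeros(size)  # one 1 and size-1 zeros
    v[idx] = 1.0
    return v

cat, dog = one_hot(17), one_hot(42)
print(cat @ dog)   # 0.0 -- any two distinct words are orthogonal
print(cat.nbytes)  # 80000 bytes for a single word's vector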
Next came the approach of first building a co-occurrence matrix and then applying SVD to obtain a dense matrix. It has one advantage and two drawbacks (see the sketch after the list):
- Advantage: it uses information from the whole corpus, not just local context.
- Drawback 1: when new samples arrive, everything must be recomputed from scratch; the existing factorization cannot be updated incrementally.
- Drawback 2: SVD has high computational complexity, so this approach cannot handle large corpora.
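A minimal sketch of the count-then-SVD approach on a toy corpus (numpy's full SVD stands in for the truncated SVD a real system would use):

import numpy as np

corpus = [["I", "like", "NLP"], ["I", "like", "cats"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# window-1 co-occurrence counts over the whole corpus (global statistics)
X = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    for i, w in enumerate(s):
        for j in (i - 1, i + 1):
            if 0 <= j < len(s):
                X[idx[w], idx[s[j]]] += 1

U, S, Vt = np.linalg.svd(X)  # roughly cubic cost: prohibitive for a real vocabulary
dense = U[:, :2] * S[:2]     # keep the top-2 dimensions as dense word vectors
# drawback 1 in action: a new sentence changes X, so the SVD must be redone from scratch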
Then word2vec appeared. word2vec is a major milestone in the development of NLP: it expresses word vectors through a very simple idea (the word vectors were originally an intermediate by-product rather than the goal, but word2vec is now used mainly to obtain them), and its significance for NLP is no less than AlexNet's for computer vision.
word2vec scans the entire corpus with a sliding window; within each window there is one center word and up to 2m surrounding words, where m is the window size. The word vectors are obtained by maximizing a likelihood function.
Later still came GloVe and others...
word2vec has two variants:
- CBOW: predict the center word from the surrounding words.
- Skip-gram: predict the surrounding words from the center word.
Clearly, on the same text skip-gram performs more likelihood-maximization computations. For example, take the sentence 我今天没吃饭:
- CBOW (window=1): P(我|今), P(今|我,天), P(天|今,没), P(没|天,吃), P(吃|没,饭), P(饭|吃); len(sentence) computations in total.
- Skip-gram (window=1): P(今|我), P(我|今), P(天|今), P(今|天), P(没|天), …; about 2·len(sentence)·window − window·(window+1) computations in total (10 here versus 6 for CBOW; see the counting sketch below).
So on large corpora skip-gram is usually preferred, and the resulting word vectors tend to be better; skip-gram is what we focus on today.
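A sketch that just enumerates the training instances each variant extracts from the example sentence (window m=1):

sentence = list("我今天没吃饭")
m = 1

# CBOW: one prediction per position -> len(sentence) instances
cbow = [(sentence[i],
         [sentence[j] for j in range(max(0, i - m), min(len(sentence), i + m + 1)) if j != i])
        for i in range(len(sentence))]

# skip-gram: one prediction per (center, context) pair -> 2*m*len(sentence) - m*(m+1) instances
skipgram = [(sentence[i], sentence[j])
            for i in range(len(sentence))
            for j in range(max(0, i - m), min(len(sentence), i + m + 1))
            if j != i]

print(len(cbow), len(skipgram))  # 6 vs 10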
Theoretical Foundations
(applies to both CBOW and skip-gram)
Objective function
\text{Likelihood} = L(\theta)=\prod_{t=1}^{T} \prod_{-m \leq j \leq m \atop j \neq 0} P\left(w_{t+j} \mid w_{t} ; \theta\right)
where $\theta$ stands for all the parameters to be optimized (the word vectors, i.e., the two embedding matrices). Two problems remain:
- The product has millions of factors: the complexity is too high, and multiplying that many probabilities underflows floating point, so it cannot be evaluated directly (a two-line demonstration follows the next formula).
- $P(w_{t+j} \mid w_{t} ; \theta)$ is still abstract; we need to turn it into a concrete formula.
Cost function: overview
Applying the log function turns the product into a sum:
J(\theta)=-\frac{1}{T} \log L(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m \atop j \neq 0} \log P\left(w_{t+j} \mid w_{t} ; \theta\right)
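A two-line demonstration of why the log (and the 1/T average) matters, assuming a corpus of a million windows with typical per-word probabilities around 1e-4:

import numpy as np

p = np.full(1_000_000, 1e-4)  # a million small probabilities
print(np.prod(p))             # 0.0 -- the raw product underflows almost immediately
print(-np.mean(np.log(p)))    # ~9.21 -- the average negative log-likelihood is stable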
Cost function: details
To write the cost function concretely, we first need to understand what the overview above is saying: given a center word, the model should be able to pick out its surrounding words (in a global sense, not a specific word from one particular training step). In other words, we take the inner product of the center word with every word in the vocabulary, and judge how large the inner product with the true context word is relative to all the rest.
We adopt the notation: $v_{w}$ is the vector of $w$ when $w$ is a center word, and $u_{w}$ is its vector when $w$ is a context word. Then
P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}
The more similar two vectors are, the larger their dot product, and hence the larger the normalized probability. Training drives words with similar contexts toward similar vectors; the denominator exists only for normalization (see the sketch below).
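A direct transcription of P(o|c), with random vectors standing in for trained embeddings (sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
V, d = 10000, 100
U = rng.normal(size=(V, d))  # u_w: one context vector per word
v_c = rng.normal(size=d)     # v_c: the center word's vector

scores = U @ v_c                           # inner product with every word in V
scores -= scores.max()                     # standard trick to keep exp() from overflowing
P = np.exp(scores) / np.exp(scores).sum()  # softmax over the whole vocabulary
print(P[42], P.sum())                      # P(o=42 | c); the distribution sums to 1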
Now let's differentiate the cost function and see what we end up with.
Gradient with respect to the center word
\begin{aligned} \frac{\partial}{\partial v_{c}} \log P(o \mid c) &=\frac{\partial}{\partial v_{c}} \log \frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} \\ &=\frac{\partial}{\partial v_{c}}\left(\log \exp \left(u_{o}^{T} v_{c}\right)-\log \sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)\right) \\ &=\frac{\partial}{\partial v_{c}}\left(u_{o}^{T} v_{c}-\log \sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)\right) \\ &=u_{o}-\frac{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right) u_{w}}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} \\ &=u_{o}-\sum_{w \in V} \frac{\exp \left(u_{w}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} u_{w} \\ &=u_{o}-\sum_{w \in V} P(w \mid c) u_{w} \end{aligned}
We have just differentiated with respect to the center word. Note that at this point we perform gradient ascent, not descent, because we want to maximize the probability (equivalently, gradient descent on the negative log-likelihood $J$).
Put simply: we are using the center word to predict a context word. The first term is the observed context word's vector $u_{o}$; the second term is the model's expected context vector, the probability-weighted average of every word's vector. So the update adds $u_{o}$, pulling $v_{c}$ toward $u_{o}$ and making them more similar, while subtracting the probability-weighted sum over all words, pushing the center word away from everything else. It is just like $a/(a+b)$: increase $a$ and decrease $b$, and the value grows. Iterating this makes the center word's vector more and more accurate (a numerical check follows).
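The last line of the derivation is easy to verify numerically: a finite-difference gradient of log P(o|c) with respect to v_c should match u_o − Σ_w P(w|c) u_w. A self-contained check with toy sizes and random vectors:

import numpy as np

rng = np.random.default_rng(1)
V, d, o = 50, 8, 3
U = rng.normal(size=(V, d))
v_c = rng.normal(size=d)

def log_p(v):
    s = U @ v
    return s[o] - np.log(np.exp(s).sum())

P = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
analytic = U[o] - P @ U  # u_o - sum_w P(w|c) u_w

eps = 1e-6
numeric = np.array([(log_p(v_c + eps * e) - log_p(v_c - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.allclose(analytic, numeric, atol=1e-5))  # True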
Gradient with respect to the context word
\begin{aligned} \frac{\partial}{\partial u_{o}} \log P(o \mid c) &=\frac{\partial}{\partial u_{o}} \log \frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} \\ &=\frac{\partial}{\partial u_{o}}\left(\log \exp \left(u_{o}^{T} v_{c}\right)-\log \sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)\right) \\ &=\frac{\partial}{\partial u_{o}}\left(u_{o}^{T} v_{c}-\log \sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)\right) \\ &=v_{c}-\frac{\sum_{w \in V} \frac{\partial}{\partial u_{o}} \exp \left(u_{w}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} \\ &=v_{c}-\frac{\exp \left(u_{o}^{T} v_{c}\right) v_{c}}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} \\ &=v_{c}-\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} v_{c} \\ &=v_{c}-P(o \mid c) v_{c} \\ &=(1-P(o \mid c)) v_{c} \end{aligned}
In other words, once the context word's and the center word's vectors are good enough that $P(o \mid c) \rightarrow 1$ (which, of course, never quite happens), no adjustment is needed; otherwise we adjust step by step toward the optimum. The same kind of finite-difference check verifies this gradient too (see the sketch below).
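The same check confirms (1 − P(o|c)) v_c for the context-word gradient (a self-contained toy setup as before):

import numpy as np

rng = np.random.default_rng(1)
V, d, o = 50, 8, 3
U = rng.normal(size=(V, d))
v_c = rng.normal(size=d)

def log_p(U_mat):
    s = U_mat @ v_c
    return s[o] - np.log(np.exp(s).sum())

P_oc = np.exp(U @ v_c)[o] / np.exp(U @ v_c).sum()
analytic = (1 - P_oc) * v_c

eps = 1e-6
numeric = np.zeros(d)
for k in range(d):
    Up, Um = U.copy(), U.copy()
    Up[o, k] += eps
    Um[o, k] -= eps
    numeric[k] = (log_p(Up) - log_p(Um)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))  # True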
The concrete algorithm and process
Now for the key part. As mentioned earlier, skip-gram and CBOW are just two variants of the same idea, and here we discuss skip-gram. Notice also that handling a single center word requires inner products and exponentials against the entire vocabulary, which is far too expensive. Mikolov (the author of word2vec) therefore proposed a number of optimizations, the two most important being negative sampling and hierarchical softmax.
Setting: a text of 1024 words, window size 2, skip-gram.
The naive approach: one pass computes roughly 1024·(2·2) values of P(o|c) (minus a few boundary terms), and each P(o|c) requires inner products between the center word and the whole vocabulary. The complexity is therefore on the order of corpus size × vocabulary size. The first factor cannot be reduced, since we always have to scan the whole corpus; the following two methods attack the second factor.
Negative sampling
We take the center word together with one of its context words as a positive sample, and pick K (center word, non-context word) pairs as negative samples. The complexity then becomes n·(1+K), a huge reduction. Therefore:
\begin{aligned} \theta &=\underset{\theta}{\operatorname{argmax}} \prod_{(w, c) \in D} P(D=1 \mid w, c, \theta) \prod_{(w, c) \in \tilde{D}} P(D=0 \mid w, c, \theta) \\ &=\underset{\theta}{\operatorname{argmax}} \prod_{(w, c) \in D} P(D=1 \mid w, c, \theta) \prod_{(w, c) \in \tilde{D}}(1-P(D=1 \mid w, c, \theta)) \\ &=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log P(D=1 \mid w, c, \theta)+\sum_{(w, c) \in \tilde{D}} \log (1-P(D=1 \mid w, c, \theta)) \\ &=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log \frac{1}{1+\exp \left(-u_{w}^{T} v_{c}\right)}+\sum_{(w, c) \in \widetilde{D}} \log \left(1-\frac{1}{1+\exp \left(-u_{w}^{T} v_{c}\right)}\right) \\ &=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log \frac{1}{1+\exp \left(-u_{w}^{T} v_{c}\right)}+\sum_{(w, c) \in \widetilde{D}} \log \left(\frac{1}{1+\exp \left(u_{w}^{T} v_{c}\right)}\right) \end{aligned}
Here $D$ is the set of observed (center, context) pairs and $\widetilde{D}$ is the set of sampled negative pairs, e.g.

D = \{(v_0, u_1),\,(v_1, u_0),\,(v_1, u_2),\,\ldots\}, \qquad \widetilde{D} = \{(v_0, u_{100}),\,(v_1, u_{278}),\,(v_2, u_{975}),\,\ldots\}
Note that maximizing the likelihood is equivalent to minimizing the negative log-likelihood:
J=-\sum_{(w, c) \in D} \log \frac{1}{1+\exp \left(-u_{w}^{T} v_{c}\right)}-\sum_{(w, c) \in \widetilde{D}} \log \left(\frac{1}{1+\exp \left(u_{w}^{T} v_{c}\right)}\right)
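A sketch of this loss for one center word, one true context word, and K sampled negatives, in the same PyTorch style as the training code below (all names are illustrative):

import torch
import torch.nn.functional as F

d, K = 100, 5
v_c = torch.randn(d, requires_grad=True)  # center word vector
u_o = torch.randn(d)                      # the observed context word
u_neg = torch.randn(K, d)                 # K negative samples from the noise distribution

# -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)
loss = -F.logsigmoid(u_o @ v_c) - F.logsigmoid(-(u_neg @ v_c)).sum()
loss.backward()  # gradients flow into v_c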
For the skip-gram model, the negative-sampling objective for observing a context word $w_{c-m+j}$ given the center word $c$ is
-\log \sigma\left(u_{c-m+j}^{T} \cdot v_{c}\right)-\sum_{k=1}^{K} \log \sigma\left(-\tilde{u}_{k}^{T} \cdot v_{c}\right)
For the CBOW model, given the averaged context vector
\hat{v}=\frac{v_{c-m}+v_{c-m+1}+\ldots+v_{c+m}}{2 m}
the objective for observing the center word $u_{c}$ is
-\log \sigma\left(u_{c}^{T} \cdot \hat{v}\right)-\sum_{k=1}^{K} \log \sigma\left(-\widetilde{u}_{k}^{T} \cdot \hat{v}\right)
In the formulas above, $\left\{\widetilde{u}_{k} \mid k=1 \ldots K\right\}$ are sampled from a noise distribution $P_{n}(w)$, i.e., a word's chance of being drawn depends on how often it occurs. Raw frequencies are not ideal, though: some words occur far too often (the, a), while other words appear rarely yet carry important meaning. So we raise the original probabilities to the 3/4 power:
\begin{aligned} \text{is}: 0.9^{3/4} &= 0.92 \\ \text{Constitution}: 0.09^{3/4} &= 0.16 \\ \text{bombastic}: 0.01^{3/4} &= 0.032 \end{aligned}
"Bombastic"现在被抽样的概率是之前的三倍, 而“is”只比之前的才提高了一点点。
Hierarchical Softmax
Take a look at the example figure (a binary tree whose leaves are vocabulary words): $w_2$ is one word, and the path from the root down to $w_2$ is drawn as a bold line. Keep that path in mind.
The hierarchical softmax probability is computed as
p\left(w_{i} \mid w\right)=\prod_{j=1}^{L(w_{i})-1} \sigma\left(\left[n(w_{i}, j+1)=\operatorname{ch}(n(w_{i}, j))\right] \cdot v_{n(w_{i}, j)}^{T} v_{w}\right)
Compare this with the ordinary softmax:
P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}
Let's explain what each piece of the hierarchical softmax means. We are still computing $p(w_{i} \mid w)$, the probability that $w$ and $w_{i}$ appear together, i.e., occur in the same window.
$L(w_{i})$ is the number of nodes on the bold path we traced above; for $w_2$ it is 4. We take our center word's vector $v_{w}$ and compute inner products with the nodes along a context word's path.
$[x]$ equals 1 if the node is a left child and −1 if it is a right child (the convention could be flipped). Since $\sigma(-x) = 1-\sigma(x)$, the probabilities of turning left and turning right at any given node (a word vector dotted with that node's vector) sum to 1, which guarantees the probabilities over all leaves sum to 1.
One more thing to make explicit: when we compute the probability $p(w_{2} \mid w)$ of reaching $w_2$ from the center word $w$, what we actually compute is
\begin{aligned} p\left(w_{2} \mid w_{i}\right) &=p\left(n\left(w_{2}, 1\right), \text{ left }\right) \cdot p\left(n\left(w_{2}, 2\right), \text{ left }\right) \cdot p\left(n\left(w_{2}, 3\right), \text{ right }\right) \\ &=\sigma\left(v_{n\left(w_{2}, 1\right)}^{T} v_{w_{i}}\right) \cdot \sigma\left(v_{n\left(w_{2}, 2\right)}^{T} v_{w_{i}}\right) \cdot \sigma\left(-v_{n\left(w_{2}, 3\right)}^{T} v_{w_{i}}\right) \end{aligned}
Notice that the whole computation never uses $w_2$'s own word vector; it uses the "word vectors" of the inner nodes on the path from $w_2$'s leaf up to the root (those path vectors represent nothing by themselves) together with $w$'s word vector.
This reduces the cost of scoring two words co-occurring in a window from n terms to log n: n softmax computations become log n sigmoid computations. The trick is quite clever, and a little hard to grasp at first (see the sketch below).
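A minimal sketch of scoring one leaf under hierarchical softmax: walk an assumed root-to-leaf path, taking sigmoid(+x) for a left turn and sigmoid(−x) for a right turn, so only about log2 |V| sigmoids are needed instead of a |V|-way softmax:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8
v_w = rng.normal(size=d)              # the center word's vector
path_nodes = rng.normal(size=(3, d))  # vectors of the 3 inner nodes on w2's path
turns = [+1, +1, -1]                  # left, left, right -- as in p(w2 | w_i) above

p = 1.0
for node, sign in zip(path_nodes, turns):
    p *= sigmoid(sign * (node @ v_w))  # sigmoid(-x) = 1 - sigmoid(x)
print(p)  # p(w2 | w): 3 sigmoids instead of summing over the whole vocabulary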
Code
First, let's look at skip-gram with negative sampling.
For the complete code, see the reference at the end.
Data loading
import numpy as np
from collections import deque
class InputData:
def __init__(self,input_file_name,min_count):
self.input_file_name = input_file_name
self.index = 0
self.input_file = open(self.input_file_name,"r",encoding="utf-8")
self.min_count = min_count
self.wordid_frequency_dict = dict()
self.word_count = 0
self.word_count_sum = 0
self.sentence_count = 0
self.id2word_dict = dict()
self.word2id_dict = dict()
        self._init_dict()  # initialize the vocabulary dicts
self.sample_table = []
        self._init_sample_table()  # initialize the negative-sampling table
self.get_wordId_list()
self.word_pairs_queue = deque()
        # print a summary
print('Word Count is:', self.word_count)
print('Word Count Sum is', self.word_count_sum)
print('Sentence Count is:', self.sentence_count)
def _init_dict(self):
word_freq = dict()
for line in self.input_file:
line = line.strip().split()
self.word_count_sum +=len(line)
self.sentence_count +=1
for i,word in enumerate(line):
if i%1000000==0:
print (i,len(line))
if word_freq.get(word)==None:
word_freq[word] = 1
else:
word_freq[word] += 1
for i,word in enumerate(word_freq):
if i % 100000 == 0:
print(i, len(word_freq))
if word_freq[word]<self.min_count:
self.word_count_sum -= word_freq[word]
continue
self.word2id_dict[word] = len(self.word2id_dict)
self.id2word_dict[len(self.id2word_dict)] = word
self.wordid_frequency_dict[len(self.word2id_dict)-1] = word_freq[word]
self.word_count =len(self.word2id_dict)
def _init_sample_table(self):
sample_table_size = 1e8
pow_frequency = np.array(list(self.wordid_frequency_dict.values())) ** 0.75
word_pow_sum = sum(pow_frequency)
ratio_array = pow_frequency / word_pow_sum
word_count_list = np.round(ratio_array * sample_table_size)
for word_index, word_freq in enumerate(word_count_list):
self.sample_table += [word_index] * int(word_freq)
self.sample_table = np.array(self.sample_table)
np.random.shuffle(self.sample_table)
def get_wordId_list(self):
self.input_file = open(self.input_file_name, encoding="utf-8")
sentence = self.input_file.readline()
        wordId_list = []  # ids of all words in the sentence
sentence = sentence.strip().split(' ')
for i,word in enumerate(sentence):
if i%1000000==0:
print (i,len(sentence))
try:
word_id = self.word2id_dict[word]
wordId_list.append(word_id)
except:
continue
self.wordId_list = wordId_list
def get_batch_pairs(self,batch_size,window_size):
while len(self.word_pairs_queue) < batch_size:
for _ in range(1000):
if self.index == len(self.wordId_list):
self.index = 0
wordId_w = self.wordId_list[self.index]
for i in range(max(self.index - window_size, 0),
min(self.index + window_size + 1,len(self.wordId_list))):
wordId_v = self.wordId_list[i]
                    if self.index == i:  # skip when the context position equals the center word
continue
self.word_pairs_queue.append((wordId_w, wordId_v))
self.index+=1
        result_pairs = []  # return a mini-batch of positive pairs
for _ in range(batch_size):
result_pairs.append(self.word_pairs_queue.popleft())
return result_pairs
    # Negative sampling: given the positive pairs and the number of negatives per pair
    # (neg_count), draw negative word ids from the pre-built sample table.
    # (Assuming the data is large enough, we ignore the small chance that a negative equals a positive.)
def get_negative_sampling(self, positive_pairs, neg_count):
neg_v = np.random.choice(self.sample_table, size=(len(positive_pairs), neg_count)).tolist()
return neg_v
    # estimate the number of positive pairs in the data, used to set the batch count
def evaluate_pairs_count(self, window_size):
return self.word_count_sum * (2 * window_size) - self.sentence_count * (
1 + window_size) * window_size
Model code
import torch
import torch.nn as nn
import torch.nn.functional as F
class SkipGramModel(nn.Module):
def __init__(self,vocab_size,embed_size):
super(SkipGramModel,self).__init__()
self.vocab_size = vocab_size
self.embed_size = embed_size
self.w_embeddings = nn.Embedding(vocab_size,embed_size)
self.v_embeddings = nn.Embedding(vocab_size, embed_size)
self._init_emb()
def _init_emb(self):
initrange = 0.5 / self.embed_size
self.w_embeddings.weight.data.uniform_(-initrange, initrange)
self.v_embeddings.weight.data.uniform_(-0, 0)
def forward(self, pos_w, pos_v, neg_v):
        emb_w = self.w_embeddings(torch.LongTensor(pos_w))  # to tensor, shape [batch_size, emb_dim]
emb_v = self.v_embeddings(torch.LongTensor(pos_v))
        neg_emb_v = self.v_embeddings(torch.LongTensor(neg_v))  # shape [batch_size, neg_count, emb_dim]
score = torch.mul(emb_w, emb_v)
score = torch.sum(score, dim=1)
score = torch.clamp(score, max=10, min=-10)
score = F.logsigmoid(score)
neg_score = torch.bmm(neg_emb_v, emb_w.unsqueeze(2))
neg_score = torch.clamp(neg_score, max=10, min=-10)
neg_score = F.logsigmoid(-1 * neg_score)
# L = log sigmoid (Xw.T * θv) + ∑neg(v) [log sigmoid (-Xw.T * θneg(v))]
loss = - torch.sum(score) - torch.sum(neg_score)
return loss
def save_embedding(self, id2word, file_name):
embedding_1 = self.w_embeddings.weight.data.cpu().numpy()
embedding_2 = self.v_embeddings.weight.data.cpu().numpy()
embedding = (embedding_1+embedding_2)/2
fout = open(file_name, 'w')
fout.write('%d %d\n' % (len(id2word), self.embed_size))
for wid, w in id2word.items():
e = embedding[wid]
e = ' '.join(map(lambda x: str(x), e))
fout.write('%s %s\n' % (w, e))
Training code
from skip_gram_nge_model import SkipGramModel
from input_data import InputData
import torch.optim as optim
from tqdm import tqdm
import argparse
def ArgumentParser():
parser = argparse.ArgumentParser()
parser.add_argument('--model_name', type=str, default="skip-gram", help="skip-gram or cbow")
parser.add_argument("--window_size",type=int,default=3,help="window size in word2vec")
parser.add_argument("--batch_size",type=int,default=256,help="batch size during training phase")
parser.add_argument("--min_count",type=int,default=3,help="min count of training word")
parser.add_argument("--embed_dimension",type=int,default=100,help="embedding dimension of word embedding")
parser.add_argument("--learning_rate",type=float,default=0.02,help="learning rate during training phase")
parser.add_argument("--neg_count",type=int,default=5,help="neg count of skip-gram")
return parser.parse_args()
args = ArgumentParser()
WINDOW_SIZE = args.window_size  # context window c
BATCH_SIZE = args.batch_size  # mini-batch size
MIN_COUNT = args.min_count  # frequency threshold below which low-frequency words are discarded
EMB_DIMENSION = args.embed_dimension  # embedding dimension
LR = args.learning_rate  # learning rate
NEG_COUNT = args.neg_count  # number of negative samples
class Word2Vec:
def __init__(self, input_file_name, output_file_name):
self.output_file_name = output_file_name
self.data = InputData(input_file_name, MIN_COUNT)
self.model = SkipGramModel(self.data.word_count, EMB_DIMENSION)
self.lr = LR
self.optimizer = optim.SGD(self.model.parameters(), lr=self.lr)
def train(self):
print("SkipGram Training......")
pairs_count = self.data.evaluate_pairs_count(WINDOW_SIZE)
print("pairs_count", pairs_count)
batch_count = pairs_count / BATCH_SIZE
print("batch_count", batch_count)
process_bar = tqdm(range(int(batch_count)))
for i in process_bar:
pos_pairs = self.data.get_batch_pairs(BATCH_SIZE, WINDOW_SIZE)
pos_w = [int(pair[0]) for pair in pos_pairs]
pos_v = [int(pair[1]) for pair in pos_pairs]
neg_v = self.data.get_negative_sampling(pos_pairs, NEG_COUNT)
self.optimizer.zero_grad()
loss = self.model.forward(pos_w, pos_v, neg_v)
loss.backward()
self.optimizer.step()
if i * BATCH_SIZE % 100000 == 0:
self.lr = self.lr * (1.0 - 1.0 * i / batch_count)
for param_group in self.optimizer.param_groups:
param_group['lr'] = self.lr
self.model.save_embedding(self.data.id2word_dict, self.output_file_name)
if __name__ == '__main__':
w2v = Word2Vec(input_file_name='../data/lxc.txt', output_file_name="skip_gram_neg.txt")
w2v.train()
Hierarchical softmax code
Building the Huffman tree
class HuffmanNode:
def __init__(self, word_id, frequency):
self.word_id = word_id
self.frequency = frequency
self.left_child = None
self.right_child = None
self.father = None
self.Huffman_code = []
self.path = []
class HuffmanTree:
def __init__(self, wordid_frequency_dict):
self.word_count = len(wordid_frequency_dict)
self.wordid_code = dict()
self.wordid_path = dict()
self.root = None
unmerge_node_list = [HuffmanNode(wordid, frequency) for wordid, frequency in wordid_frequency_dict.items()]
self.huffman = [HuffmanNode(wordid, frequency) for wordid, frequency in wordid_frequency_dict.items()]
print("Building huffman tree...")
self.build_tree(unmerge_node_list)
print("Building tree finished")
        # generate Huffman codes and paths
print("Generating huffman path...")
self.generate_huffman_code_and_path()
print("Generating huffman path finished")
def merge_node(self, node1, node2):
sum_frequency = node1.frequency + node2.frequency
mid_node_id = len(self.huffman)
father_node = HuffmanNode(mid_node_id, sum_frequency)
if node1.frequency >= node2.frequency:
father_node.left_child = node1
father_node.right_child = node2
else:
father_node.left_child = node2
father_node.right_child = node1
self.huffman.append(father_node)
return father_node
def build_tree(self, node_list):
while len(node_list) > 1:
node_list = sorted(node_list, key=lambda x: x.frequency)
i1 = node_list[0]
i2 = node_list[1]
node_list.remove(i1)
node_list.remove(i2)
            father_node = self.merge_node(i1, i2)  # merge the two lowest-frequency nodes
            node_list.append(father_node)  # insert the new parent node
self.root = node_list[0]
def generate_huffman_code_and_path(self):
stack = [self.root]
while len(stack) > 0:
node = stack.pop()
            # walk down along the left subtree
while node.left_child or node.right_child:
code = node.Huffman_code
path = node.path
node.left_child.Huffman_code = code + [1]
node.right_child.Huffman_code = code + [0]
node.left_child.path = path + [node.word_id]
node.right_child.path = path + [node.word_id]
                # push the not-yet-visited right subtree onto the stack
stack.append(node.right_child)
node = node.left_child
word_id = node.word_id
word_code = node.Huffman_code
word_path = node.path
self.huffman[word_id].Huffman_code = word_code
self.huffman[word_id].path = word_path
            # record this leaf's Huffman code and path in the lookup dicts
self.wordid_code[word_id] = word_code
self.wordid_path[word_id] = word_path
    # collect, for every word, the ids of the positive and negative nodes on its path
def get_all_pos_and_neg_path(self):
        positive = []  # positive-node paths for all words
        negative = []  # negative-node paths for all words
        for word_id in range(self.word_count):
            pos_id = []  # positive node ids on this word's path
            neg_id = []  # negative node ids on this word's path
for i, code in enumerate(self.huffman[word_id].Huffman_code):
if code == 1:
pos_id.append(self.huffman[word_id].path[i])
else:
neg_id.append(self.huffman[word_id].path[i])
positive.append(pos_id)
negative.append(neg_id)
return positive, negative
if __name__ == "__main__":
word_frequency = {0: 7, 1: 8, 2: 3, 3: 2, 4: 2}
print(word_frequency)
tree = HuffmanTree(word_frequency)
print(tree.wordid_code)
print(tree.wordid_path)
for i in range(len(word_frequency)):
print(tree.huffman[i].path)
print(tree.get_all_pos_and_neg_path())
Input data
import numpy as np
import sys
sys.path.append("../Skip_Gram_HS")
from collections import deque
from huffman_tree import HuffmanTree
class InputData:
def __init__(self, input_file_name, min_count):
self.input_file_name = input_file_name
self.index = 0
        self.input_file = open(self.input_file_name)  # data file
        self.min_count = min_count  # frequency threshold below which words are discarded
        self.wordId_frequency_dict = dict()  # word id -> occurrence count
        self.word_count = 0  # number of distinct words
        self.word_count_sum = 0  # total number of tokens (duplicates counted)
        self.sentence_count = 0  # number of sentences
        self.id2word_dict = dict()  # word id -> word
        self.word2id_dict = dict()  # word -> word id
        self._init_dict()  # initialize the vocabulary dicts
        self.huffman_tree = HuffmanTree(self.wordId_frequency_dict)  # Huffman tree
self.huffman_pos_path, self.huffman_neg_path = self.huffman_tree.get_all_pos_and_neg_path()
self.word_pairs_queue = deque()
        # print a summary
self.get_wordId_list()
print('Word Count is:', self.word_count)
print('Word Count Sum is', self.word_count_sum)
print('Sentence Count is:', self.sentence_count)
print('Tree Node is:', len(self.huffman_tree.huffman))
def _init_dict(self):
word_freq = dict()
        # count word frequencies
for line in self.input_file:
            line = line.strip().split(' ')  # strip surrounding whitespace, then tokenize
self.word_count_sum += len(line)
self.sentence_count += 1
for i, word in enumerate(line):
if i % 1000000 == 0:
print(i, len(line))
try:
word_freq[word] += 1
except:
word_freq[word] = 1
word_id = 0
        # initialize word2id_dict, id2word_dict and wordId_frequency_dict
for per_word, per_count in word_freq.items():
            if per_count < self.min_count:  # drop low-frequency words
self.word_count_sum -= per_count
continue
self.id2word_dict[word_id] = per_word
self.word2id_dict[per_word] = word_id
self.wordId_frequency_dict[word_id] = per_count
word_id += 1
self.word_count = len(self.word2id_dict)
def get_wordId_list(self):
self.input_file = open(self.input_file_name, encoding="utf-8")
sentence = self.input_file.readline()
        wordId_list = []  # ids of all words in the sentence
sentence = sentence.strip().split(' ')
for i, word in enumerate(sentence):
if i % 1000000 == 0:
print(i, len(sentence))
try:
word_id = self.word2id_dict[word]
wordId_list.append(word_id)
except:
continue
self.wordId_list = wordId_list
    # Get a mini-batch of positive pairs (w, v): w is the center word id, v is the id of
    # one context word. The context spans window_size on each side, i.e. 2c = 2*window_size.
def get_batch_pairs(self, batch_size, window_size):
while len(self.word_pairs_queue) < batch_size:
for _ in range(1000):
if self.index == len(self.wordId_list):
self.index = 0
for i in range(max(self.index - window_size, 0),
min(self.index + window_size + 1, len(self.wordId_list))):
wordId_w = self.wordId_list[self.index]
wordId_v = self.wordId_list[i]
                    if self.index == i:  # skip when the context position equals the center word
continue
self.word_pairs_queue.append((wordId_w, wordId_v))
self.index += 1
        result_pairs = []  # return a mini-batch of positive pairs
for _ in range(batch_size):
result_pairs.append(self.word_pairs_queue.popleft())
return result_pairs
def get_pairs(self, pos_pairs):
neg_word_pair = []
pos_word_pair = []
for pair in pos_pairs:
pos_word_pair += zip([pair[0]] * len(self.huffman_pos_path[pair[1]]), self.huffman_pos_path[pair[1]])
neg_word_pair += zip([pair[0]] * len(self.huffman_neg_path[pair[1]]), self.huffman_neg_path[pair[1]])
return pos_word_pair, neg_word_pair
    # estimate the number of positive pairs in the data, used to set the batch count
def evaluate_pairs_count(self, window_size):
return self.word_count_sum * (2 * window_size - 1) - (self.sentence_count - 1) * (1 + window_size) * window_size
Model
import torch
import torch.nn as nn
import torch.nn.functional as F
class SkipGramModel(nn.Module):
def __init__(self, vocab_size, emb_size):
super(SkipGramModel, self).__init__()
self.vocab_size = vocab_size
self.emb_size = emb_size
self.w_embeddings = nn.Embedding(2*vocab_size-1, emb_size, sparse=True)
self.v_embeddings = nn.Embedding(2*vocab_size-1, emb_size, sparse=True)
self._init_emb()
def _init_emb(self):
initrange = 0.5 / self.emb_size
self.w_embeddings.weight.data.uniform_(-initrange, initrange)
self.v_embeddings.weight.data.uniform_(-0, 0)
def forward(self, pos_w, pos_v,neg_w, neg_v):
        emb_w = self.w_embeddings(torch.LongTensor(pos_w))  # to tensor, shape [batch_size, emb_dim]
neg_emb_w = self.w_embeddings(torch.LongTensor(neg_w))
emb_v = self.v_embeddings(torch.LongTensor(pos_v))
        neg_emb_v = self.v_embeddings(torch.LongTensor(neg_v))  # shape [num_neg_pairs, emb_dim]
score = torch.mul(emb_w, emb_v).squeeze()
score = torch.sum(score, dim=1)
score = torch.clamp(score, max=10, min=-10)
score = F.logsigmoid(score)
neg_score = torch.mul(neg_emb_w, neg_emb_v).squeeze()
neg_score = torch.sum(neg_score, dim=1)
neg_score = torch.clamp(neg_score, max=10, min=-10)
neg_score = F.logsigmoid(-neg_score)
# L = log sigmoid (Xw.T * θv) + [log sigmoid (-Xw.T * θv)]
loss = -1 * (torch.sum(score) + torch.sum(neg_score))
return loss
def save_embedding(self, id2word, file_name):
embedding = self.w_embeddings.weight.data.cpu().numpy()
fout = open(file_name, 'w')
fout.write('%d %d\n' % (len(id2word), self.emb_size))
for wid, w in id2word.items():
e = embedding[wid]
e = ' '.join(map(lambda x: str(x), e))
fout.write('%s %s\n' % (w, e))
Training code
import sys
sys.path.append("../Skip_Gram_HS")
from skip_gram_hs_model import SkipGramModel
from input_data import InputData
import torch.optim as optim
from tqdm import tqdm
import argparse
def ArgumentParser():
parser = argparse.ArgumentParser()
parser.add_argument('--model_name', type=str, default="skip-gram", help="skip-gram or cbow")
parser.add_argument("--window_size",type=int,default=3,help="window size in word2vec")
parser.add_argument("--batch_size",type=int,default=256,help="batch size during training phase")
parser.add_argument("--min_count",type=int,default=3,help="min count of training word")
parser.add_argument("--embed_dimension",type=int,default=100,help="embedding dimension of word embedding")
parser.add_argument("--learning_rate",type=float,default=0.02,help="learning rate during training phase")
return parser.parse_args()
args = ArgumentParser()
WINDOW_SIZE = args.window_size  # context window c
BATCH_SIZE = args.batch_size  # mini-batch size
MIN_COUNT = args.min_count  # frequency threshold below which low-frequency words are discarded
EMB_DIMENSION = args.embed_dimension  # embedding dimension
LR = args.learning_rate  # learning rate
class Word2Vec:
def __init__(self, input_file_name, output_file_name):
self.output_file_name = output_file_name
self.data = InputData(input_file_name, MIN_COUNT)
self.model = SkipGramModel(self.data.word_count, EMB_DIMENSION)
self.lr = LR
self.optimizer = optim.SGD(self.model.parameters(), lr=self.lr)
def train(self):
print("SkipGram Training......")
pairs_count = self.data.evaluate_pairs_count(WINDOW_SIZE)
print("pairs_count", pairs_count)
batch_count = pairs_count / BATCH_SIZE
print("batch_count", batch_count)
process_bar = tqdm(range(int(batch_count)))
for i in process_bar:
pos_pairs = self.data.get_batch_pairs(BATCH_SIZE, WINDOW_SIZE)
pos_pairs,neg_pairs = self.data.get_pairs(pos_pairs)
pos_u = [pair[0] for pair in pos_pairs]
pos_v = [int(pair[1]) for pair in pos_pairs]
neg_u = [pair[0] for pair in neg_pairs]
neg_v = [int(pair[1]) for pair in neg_pairs]
self.optimizer.zero_grad()
loss = self.model.forward(pos_u, pos_v, neg_u,neg_v)
loss.backward()
self.optimizer.step()
if i * BATCH_SIZE % 100000 == 0:
self.lr = self.lr * (1.0 - 1.0 * i / batch_count)
for param_group in self.optimizer.param_groups:
param_group['lr'] = self.lr
self.model.save_embedding(self.data.id2word_dict, self.output_file_name)
if __name__ == '__main__':
w2v = Word2Vec(input_file_name='../data/lxc.txt', output_file_name="word_embedding.txt")
w2v.train()
ref: the derivation follows CS224n; the code is adapted from 深度课堂.