word2vec: Complete Theory and Code

This article traces the development of the word2vec word-embedding model, from one-hot vectors to word2vec, and covers its two architectures, CBOW and Skip-Gram. It focuses on the two optimization techniques used in word2vec, negative sampling and hierarchical softmax, which reduce computational complexity and speed up training: negative sampling cuts the number of computations by random sampling, while hierarchical softmax uses a Huffman tree to accelerate the softmax. Code implementations are provided to illustrate how both optimizations work.

A Brief History of Word Vectors

In the beginning, a word vector represented a word (or Chinese character) as a vector, and the earliest choice was the one-hot vector. This turned out to have the following drawbacks:

  • Any two words are orthogonal (their inner product is always 0, so similarity cannot be measured).
  • With a vocabulary of 10,000 characters, each character needs a 10,000-dimensional vector (a single 1 and 9,999 zeros), which wastes an enormous amount of space.

Next came methods that first build a co-occurrence matrix and then apply SVD to obtain a dense matrix. This approach has one advantage and two disadvantages:

  • Advantage 1: it uses global information from the whole corpus, not just local context.

  • Disadvantage 1: when new data arrives, everything must be recomputed from scratch; the existing factorization cannot be updated incrementally.

  • Disadvantage 2: SVD is computationally expensive, so this approach does not scale to large corpora.

Then word2vec appeared. word2vec is a major milestone in NLP: it expresses word vectors with a very simple idea (although word vectors were originally a by-product of the training objective, word2vec is now mostly used precisely to obtain them), and its impact on NLP is comparable to that of AlexNet on computer vision.
word2vec slides a window over the entire corpus; within each window there is one center word and 2m surrounding (context) words, where m is the window size. The word vectors are obtained by maximizing a likelihood function.
Later still came GloVe and other models.

The two variants of word2vec

  1. CBOW: use the surrounding words to predict the center word.
    [figure: CBOW architecture]

  2. Skip-Gram: use the center word to predict the surrounding words.
    [figure: Skip-Gram architecture]
    Clearly, for the same stretch of text, skip-gram performs more likelihood-maximization computations than CBOW.

For example, take the sentence 我今天没吃饭 ("I didn't eat today"):
CBOW (window = 1): P(我|今), P(今|我,天), P(天|今,没), P(没|天,吃), P(吃|没,饭), P(饭|吃) — len(sentence) terms in total.
Skip-gram (window = 1): P(今|我), P(我|今), P(天|今), P(今|天), P(没|天), … — roughly 2 × window × len(sentence) terms in total, minus the edge terms (10 here).
So on large corpora skip-gram is usually the preferred choice for training and tends to produce better word vectors, which is why the rest of this article focuses on skip-gram.
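
To double-check the counts, here is a minimal Python sketch (character-level tokens and window = 1 are assumptions purely for illustration) that enumerates the skip-gram training pairs for the example sentence:

# Enumerate skip-gram (center, context) training pairs for a toy sentence.
sentence = list("我今天没吃饭")   # character-level tokens
window = 1
pairs = [(sentence[t], sentence[t + j])
         for t in range(len(sentence))
         for j in range(-window, window + 1)
         if j != 0 and 0 <= t + j < len(sentence)]
print(len(pairs), pairs)   # 10 pairs: (我,今), (今,我), (今,天), (天,今), (天,没), ...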

Theoretical Foundation

The following applies to both CBOW and skip-gram.
Objective function
$$
\text{Likelihood} = L(\theta)=\prod_{t=1}^{T} \prod_{\substack{-m \leq j \leq m \\ j \neq 0}} P\left(w_{t+j} \mid w_{t}; \theta\right)
$$
where $\theta$ denotes all the parameters to be optimized (the word vectors, i.e. the two embedding matrices). Two problems remain to be solved:

  1. This is a product over an enormous number of terms: the complexity is far too high, and multiplying that many probabilities underflows floating point, so it cannot be computed directly.
  2. $P(w_{t+j} \mid w_t; \theta)$ still needs a concrete formula; as written it is too abstract.

Cost function: overview
Take the logarithm, which turns the product into a sum:
$$
J(\theta)=-\frac{1}{T} \log L(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P\left(w_{t+j} \mid w_{t}; \theta\right)
$$
Cost function: the concrete form
To write the cost function out concretely, we need to understand what the overview above means: given the center word, we should be able to pick out the surrounding words (relative to the whole vocabulary, not just one particular word in a single training step). In other words, we take the inner product of the center word with every word in the vocabulary and see how large the inner product with the true context word is compared to all the others.
We define:
$v_w$ — the vector of $w$ when $w$ is the center word;
$u_w$ — the vector of $w$ when $w$ is a context word.

$$
P(o \mid c)=\frac{\exp\left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)}
$$
The more similar two vectors are, the larger their inner product, and hence the larger the normalized probability. Training drives words with similar contexts toward similar vectors; the denominator is only there for normalization.
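
To make the formula concrete, here is a minimal NumPy sketch with a made-up toy vocabulary and random vectors (the sizes are assumptions for illustration, not anything from the original model):

# Compute P(o|c) as the softmax of inner products between the center vector and all context vectors.
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 5
U = rng.normal(size=(V, d))      # context ("output") vectors u_w, one row per vocabulary word
v_c = rng.normal(size=d)         # center-word vector v_c

scores = U @ v_c                              # u_w^T v_c for every word w
p = np.exp(scores) / np.exp(scores).sum()     # P(w | c): larger inner product -> larger probability
print(p, p.sum())                             # the probabilities sum to 1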
Now let us differentiate the cost function and see what comes out.
Gradient with respect to the center word

$$
\begin{aligned}
\frac{\partial}{\partial v_{c}} \log P(o \mid c)
&=\frac{\partial}{\partial v_{c}} \log \frac{\exp\left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)} \\
&=\frac{\partial}{\partial v_{c}}\left(\log \exp\left(u_{o}^{T} v_{c}\right)-\log \sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)\right) \\
&=\frac{\partial}{\partial v_{c}}\left(u_{o}^{T} v_{c}-\log \sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)\right) \\
&=u_{o}-\frac{\sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right) u_{w}}{\sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)} \\
&=u_{o}-\sum_{w \in V} \frac{\exp\left(u_{w}^{T} v_{c}\right)}{\sum_{w' \in V} \exp\left(u_{w'}^{T} v_{c}\right)}\, u_{w} \\
&=u_{o}-\sum_{w \in V} P(w \mid c)\, u_{w}
\end{aligned}
$$
We differentiate with respect to the center word. Note that here we use gradient ascent rather than descent, because we want to maximize the current probability.
Put simply, we are using the center word to predict a context word: the first term is the vector of the observed context word, and the second term is the expectation of all context vectors weighted by the model's predicted probabilities.
So we add $u_o$, pushing $v_c$ to become more similar to $u_o$, and at the same time we subtract the probability-weighted sum over all words, pushing the center word away from the words that are not its context.
It is like $a/(a+b)$: increase $a$ while decreasing $b$ and the value grows. Iterating this makes the center-word vector more and more accurate.

Gradient with respect to the context word
$$
\begin{aligned}
\frac{\partial}{\partial u_{o}} \log P(o \mid c)
&=\frac{\partial}{\partial u_{o}} \log \frac{\exp\left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)} \\
&=\frac{\partial}{\partial u_{o}}\left(\log \exp\left(u_{o}^{T} v_{c}\right)-\log \sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)\right) \\
&=\frac{\partial}{\partial u_{o}}\left(u_{o}^{T} v_{c}-\log \sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)\right) \\
&=v_{c}-\frac{\sum_{w \in V} \frac{\partial}{\partial u_{o}} \exp\left(u_{w}^{T} v_{c}\right)}{\sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)} \\
&=v_{c}-\frac{\exp\left(u_{o}^{T} v_{c}\right) v_{c}}{\sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)} \\
&=v_{c}-P(o \mid c)\, v_{c} \\
&=(1-P(o \mid c))\, v_{c}
\end{aligned}
$$
In other words, when the context and center vectors are already good enough that $P(o \mid c) \to 1$ (which never quite happens), no further adjustment is needed; otherwise we keep nudging the vectors toward the optimum.
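
Both analytic gradients can be verified numerically. Below is a small NumPy sketch; the vocabulary size, dimension, and observed index are made up purely for illustration:

# Numerical check of the two gradients derived above.
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 5                        # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))        # context vectors u_w
v_c = rng.normal(size=d)           # center-word vector v_c
o = 3                              # index of the observed context word

def log_p(o, v_c, U):
    scores = U @ v_c               # u_w^T v_c for all w
    return scores[o] - np.log(np.sum(np.exp(scores)))

# analytic gradients from the derivation above
p = np.exp(U @ v_c)
p /= p.sum()
grad_vc = U[o] - p @ U             # u_o - sum_w P(w|c) u_w
grad_uo = (1 - p[o]) * v_c         # (1 - P(o|c)) v_c

# numerical gradient with respect to v_c
eps = 1e-6
num_grad_vc = np.array([(log_p(o, v_c + eps * np.eye(d)[i], U) -
                         log_p(o, v_c - eps * np.eye(d)[i], U)) / (2 * eps) for i in range(d)])
print(np.allclose(grad_vc, num_grad_vc, atol=1e-5))   # True

# numerical gradient with respect to u_o
num_grad_uo = np.zeros(d)
for i in range(d):
    Up, Um = U.copy(), U.copy()
    Up[o, i] += eps
    Um[o, i] -= eps
    num_grad_uo[i] = (log_p(o, v_c, Up) - log_p(o, v_c, Um)) / (2 * eps)
print(np.allclose(grad_uo, num_grad_uo, atol=1e-5))   # True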

Algorithms and Training in Detail

Now the key part. As mentioned above, skip-gram and CBOW are just two different framings, and here we discuss skip-gram. Notice that for one center word we have to take the inner product with every word in the vocabulary and exponentiate. That is far too expensive, so Mikolov (the author of word2vec) proposed several optimizations, the two most important being negative sampling and hierarchical softmax.
Setting: a corpus of 1024 words, window size 2, skip-gram model.
The naive approach: one pass over the corpus evaluates roughly 1024 × (2 × 2) probabilities P(o|c), minus a few at the boundaries, and each P(o|c) requires inner products between the center word and every word in the vocabulary. The complexity is therefore on the order of n × |V|. The first factor cannot be avoided, since we must scan the whole corpus; the two methods below attack the second factor.

Negative sampling

We take the (center word, context word) pairs as positive samples and K pairs of (center word, randomly chosen non-context word) as negative samples. The complexity then drops to n × (1 + K), a huge reduction.
Therefore:
$$
\begin{aligned}
\theta &=\underset{\theta}{\operatorname{argmax}} \prod_{(w, c) \in D} P(D=1 \mid w, c, \theta) \prod_{(w, c) \in \tilde{D}} P(D=0 \mid w, c, \theta) \\
&=\underset{\theta}{\operatorname{argmax}} \prod_{(w, c) \in D} P(D=1 \mid w, c, \theta) \prod_{(w, c) \in \tilde{D}}\left(1-P(D=1 \mid w, c, \theta)\right) \\
&=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log P(D=1 \mid w, c, \theta)+\sum_{(w, c) \in \tilde{D}} \log \left(1-P(D=1 \mid w, c, \theta)\right) \\
&=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log \frac{1}{1+\exp\left(-u_{w}^{T} v_{c}\right)}+\sum_{(w, c) \in \tilde{D}} \log \left(1-\frac{1}{1+\exp\left(-u_{w}^{T} v_{c}\right)}\right) \\
&=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log \frac{1}{1+\exp\left(-u_{w}^{T} v_{c}\right)}+\sum_{(w, c) \in \tilde{D}} \log \frac{1}{1+\exp\left(u_{w}^{T} v_{c}\right)}
\end{aligned}
$$
where $D$ is the set of observed (center, context) pairs, e.g. $D=\{(v_0,u_1),\ (v_1,u_0),\ (v_1,u_2),\ \ldots\}$, and $\tilde{D}$ is the set of sampled negative pairs, e.g. $\tilde{D}=\{(v_0,u_{100}),\ (v_1,u_{278}),\ (v_2,u_{975}),\ \ldots\}$.
Note that maximizing the likelihood is equivalent to minimizing the negative log-likelihood:
$$
J=-\sum_{(w, c) \in D} \log \frac{1}{1+\exp\left(-u_{w}^{T} v_{c}\right)}-\sum_{(w, c) \in \tilde{D}} \log \frac{1}{1+\exp\left(u_{w}^{T} v_{c}\right)}
$$
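
For a single (center, context) pair with K sampled negatives, this loss is only a few lines of code. The sketch below uses made-up toy vectors (the sizes and values are assumptions for illustration):

# Negative-sampling loss for one (center, context) pair with K negatives.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, K = 5, 4
v_c = rng.normal(size=d)              # center-word vector
u_o = rng.normal(size=d)              # observed (positive) context vector
u_neg = rng.normal(size=(K, d))       # K sampled negative context vectors

# J = -log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c)
loss = -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-u_neg @ v_c)))
print(loss)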

For the Skip-Gram model, the objective for observing the context word at position $c-m+j$ given the center word $c$ is
$$
-\log \sigma\left(u_{c-m+j}^{T} \cdot v_{c}\right)-\sum_{k=1}^{K} \log \sigma\left(-\tilde{u}_{k}^{T} \cdot v_{c}\right)
$$
For the CBOW model, the objective for observing the center word $u_c$ given the averaged context vector $\hat{v}=\frac{v_{c-m}+v_{c-m+1}+\ldots+v_{c+m}}{2m}$ is
$$
-\log \sigma\left(u_{c}^{T} \cdot \hat{v}\right)-\sum_{k=1}^{K} \log \sigma\left(-\tilde{u}_{k}^{T} \cdot \hat{v}\right)
$$
In the formulas above, $\left\{\tilde{u}_{k} \mid k=1 \ldots K\right\}$ are sampled from a noise distribution $P_{n}(w)$. The simplest choice is to sample a word in proportion to its frequency, but that is not ideal: some words occur extremely often (the, a), while others occur rarely yet carry important meaning. So we raise the unigram probabilities to the power 3/4:
$$
\begin{aligned}
\text{is}: 0.9^{3/4} &= 0.92 \\
\text{Constitution}: 0.09^{3/4} &= 0.16 \\
\text{bombastic}: 0.01^{3/4} &= 0.032
\end{aligned}
$$
"Bombastic"现在被抽样的概率是之前的三倍, 而“is”只比之前的才提高了一点点。

Hierarchical Softmax

Look at the example figure: w2 is a word, and the bold line is the path from the root down to w2. Keep that path in mind.
[figure: binary tree over the vocabulary, with the path from the root to w2 shown in bold]
The hierarchical-softmax computation is
$$
p\left(w_{i} \mid w\right)=\prod_{j=1}^{L(w_{i})-1} \sigma\left(\left[n(w_{i}, j+1)=\operatorname{ch}(n(w_{i}, j))\right] \cdot v_{n(w_{i}, j)}^{T} v_{w}\right)
$$
compared with the ordinary softmax
$$
P(o \mid c)=\frac{\exp\left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)}
$$
Let us unpack the pieces of the hierarchical softmax. What we have been computing all along is $p(w_i \mid w)$, the probability that $w$ and $w_i$ appear together, i.e. in the same window.
$L(w_i)$ is the number of nodes on that bold path; for w2 it is four. We take the center word's vector $v_w$ and compute inner products with the vectors of the nodes on the path to the context word.
$[x]$ is $1$ if the next node is the left child and $-1$ if it is the right child (the convention can also be flipped). Because $\sigma(-x)=1-\sigma(x)$, the probabilities of going left and of going right at each node sum to 1, which guarantees that the probabilities over all leaves sum to 1.


One more thing to make explicit: when we compute the probability $p(w_2 \mid w)$ of reaching w2 from the center word $w$, what we actually compute is
$$
\begin{aligned}
p\left(w_{2} \mid w_{i}\right) &=p\left(n\left(w_{2}, 1\right), \text{left}\right) \cdot p\left(n\left(w_{2}, 2\right), \text{left}\right) \cdot p\left(n\left(w_{2}, 3\right), \text{right}\right) \\
&=\sigma\left(v_{n\left(w_{2}, 1\right)}^{T} v_{w_{i}}\right) \cdot \sigma\left(v_{n\left(w_{2}, 2\right)}^{T} v_{w_{i}}\right) \cdot \sigma\left(-v_{n\left(w_{2}, 3\right)}^{T} v_{w_{i}}\right)
\end{aligned}
$$

Note that the word vector of w2 itself is never used in this computation; what we use are the "vectors" of the internal nodes on the path from the root to the w2 leaf (these internal-node vectors do not represent any word by themselves), together with the vector of w.

This reduces the cost of computing the probability that two words co-occur in a window from n operations to about log n:
n softmax terms become roughly log n sigmoid evaluations. It is a clever trick, and a slightly counter-intuitive one.
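
A small sketch of this path product, with a made-up path of three internal nodes matching the w2 example (left, left, right); the vectors are random and purely for illustration:

# Hierarchical-softmax probability of one leaf: a product of sigmoids along its path.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 5
v_w = rng.normal(size=d)                  # center-word vector
path_nodes = rng.normal(size=(3, d))      # vectors of the 3 internal nodes on the path to w2
signs = np.array([1, 1, -1])              # +1 = left child, -1 = right child (left, left, right)

# p(w2 | w) = sigma(v_n1^T v_w) * sigma(v_n2^T v_w) * sigma(-v_n3^T v_w)
p = np.prod(sigmoid(signs * (path_nodes @ v_w)))
print(p)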

Code

Let's look first at skip-gram with negative sampling.
For the full code, see:

Gitee code link

Data loading


import numpy as np
from collections import deque

class InputData:
    def __init__(self,input_file_name,min_count):
        self.input_file_name = input_file_name
        self.index = 0
        self.input_file = open(self.input_file_name,"r",encoding="utf-8")
        self.min_count = min_count
        self.wordid_frequency_dict = dict()
        self.word_count = 0
        self.word_count_sum = 0
        self.sentence_count = 0
        self.id2word_dict = dict()
        self.word2id_dict = dict()
        self._init_dict()  # initialize the dictionaries
        self.sample_table = []
        self._init_sample_table()  # initialize the negative-sampling table
        self.get_wordId_list()
        self.word_pairs_queue = deque()
        # print summary statistics
        print('Word Count is:', self.word_count)
        print('Word Count Sum is', self.word_count_sum)
        print('Sentence Count is:', self.sentence_count)
    def _init_dict(self):
        word_freq = dict()
        for line in self.input_file:
            line = line.strip().split()
            self.word_count_sum +=len(line)
            self.sentence_count +=1
            for i,word in enumerate(line):
                if i%1000000==0:
                    print (i,len(line))
                if word_freq.get(word)==None:
                    word_freq[word] = 1
                else:
                    word_freq[word] += 1
        for i,word in enumerate(word_freq):
            if i % 100000 == 0:
                print(i, len(word_freq))
            if word_freq[word]<self.min_count:
                self.word_count_sum -= word_freq[word]
                continue
            self.word2id_dict[word] = len(self.word2id_dict)
            self.id2word_dict[len(self.id2word_dict)] = word
            self.wordid_frequency_dict[len(self.word2id_dict)-1] = word_freq[word]
        self.word_count =len(self.word2id_dict)
    def _init_sample_table(self):
        sample_table_size = 1e8
        pow_frequency = np.array(list(self.wordid_frequency_dict.values())) ** 0.75
        word_pow_sum = sum(pow_frequency)
        ratio_array = pow_frequency / word_pow_sum
        word_count_list = np.round(ratio_array * sample_table_size)
        for word_index, word_freq in enumerate(word_count_list):
            self.sample_table += [word_index] * int(word_freq)
        self.sample_table = np.array(self.sample_table)
        np.random.shuffle(self.sample_table)
    def get_wordId_list(self):
        self.input_file = open(self.input_file_name, encoding="utf-8")
        sentence = self.input_file.readline()
        wordId_list = []  # ids of all words in the sentence
        sentence = sentence.strip().split(' ')
        for i,word in enumerate(sentence):
            if i%1000000==0:
                print (i,len(sentence))
            try:
                word_id = self.word2id_dict[word]
                wordId_list.append(word_id)
            except:
                continue
        self.wordId_list = wordId_list
    def get_batch_pairs(self,batch_size,window_size):
        while len(self.word_pairs_queue) < batch_size:
            for _ in range(1000):
                if self.index == len(self.wordId_list):
                    self.index = 0
                wordId_w = self.wordId_list[self.index]
                for i in range(max(self.index - window_size, 0),
                                         min(self.index + window_size + 1,len(self.wordId_list))):

                    wordId_v = self.wordId_list[i]
                    if self.index == i:  # skip when the context position equals the center word
                        continue
                    self.word_pairs_queue.append((wordId_w, wordId_v))
                self.index+=1
        result_pairs = []  # return a mini-batch of positive pairs
        for _ in range(batch_size):
            result_pairs.append(self.word_pairs_queue.popleft())
        return result_pairs


    # Negative sampling: given the positive pairs positive_pairs and the number of negatives per
    # positive pair neg_count, draw negative word ids from the sampling table
    # (assuming the corpus is large enough that a negative equal to a positive is negligible)
    def get_negative_sampling(self, positive_pairs, neg_count):
        neg_v = np.random.choice(self.sample_table, size=(len(positive_pairs), neg_count)).tolist()
        return neg_v

    # estimate the number of positive pairs in the data, used to set the number of batches
    def evaluate_pairs_count(self, window_size):
        return self.word_count_sum * (2 * window_size) - self.sentence_count * (
                    1 + window_size) * window_size
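
A hypothetical usage sketch of this class; the file name and the numbers are made up, and it assumes a whitespace-tokenized corpus on a single line (which is what get_wordId_list reads):

# Hypothetical usage of InputData; "corpus.txt" and all numbers are made up.
data = InputData("corpus.txt", min_count=3)
pos_pairs = data.get_batch_pairs(batch_size=8, window_size=2)   # list of (center_id, context_id)
neg_v = data.get_negative_sampling(pos_pairs, neg_count=5)      # 5 negative context ids per positive pair
print(pos_pairs[:2], neg_v[0])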

Model code

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramModel(nn.Module):
    def __init__(self,vocab_size,embed_size):
        super(SkipGramModel,self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.w_embeddings = nn.Embedding(vocab_size,embed_size)
        self.v_embeddings = nn.Embedding(vocab_size, embed_size)
        self._init_emb()

    def _init_emb(self):
        initrange = 0.5 / self.embed_size
        self.w_embeddings.weight.data.uniform_(-initrange, initrange)
        self.v_embeddings.weight.data.uniform_(-0, 0)

    def forward(self, pos_w, pos_v, neg_v):
        emb_w = self.w_embeddings(torch.LongTensor(pos_w))  # to tensor, shape [mini_batch_size, emb_dimension]
        emb_v = self.v_embeddings(torch.LongTensor(pos_v))
        neg_emb_v = self.v_embeddings(torch.LongTensor(neg_v))  # shape [mini_batch_size, negative_sampling_number, emb_dimension]
        score = torch.mul(emb_w, emb_v)

        score = torch.sum(score, dim=1)
        score = torch.clamp(score, max=10, min=-10)
        score = F.logsigmoid(score)

        neg_score = torch.bmm(neg_emb_v, emb_w.unsqueeze(2))
        neg_score = torch.clamp(neg_score, max=10, min=-10)
        neg_score = F.logsigmoid(-1 * neg_score)
        # L = log sigmoid (Xw.T * θv) + ∑neg(v) [log sigmoid (-Xw.T * θneg(v))]
        loss = - torch.sum(score) - torch.sum(neg_score)
        return loss


    def save_embedding(self, id2word, file_name):
        embedding_1 = self.w_embeddings.weight.data.cpu().numpy()
        embedding_2 = self.v_embeddings.weight.data.cpu().numpy()
        embedding = (embedding_1+embedding_2)/2
        fout = open(file_name, 'w')
        fout.write('%d %d\n' % (len(id2word), self.embed_size))
        for wid, w in id2word.items():
            e = embedding[wid]
            e = ' '.join(map(lambda x: str(x), e))
            fout.write('%s %s\n' % (w, e))
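
A hypothetical smoke test for this model on made-up ids (a toy vocabulary of 10 words, 4 positive pairs, 3 negatives per pair), assuming the class defined above is in scope:

# Smoke test with made-up ids; not real training data.
model = SkipGramModel(vocab_size=10, embed_size=8)
pos_w = [1, 2, 3, 4]                                    # center word ids
pos_v = [2, 3, 4, 5]                                    # observed context word ids
neg_v = [[6, 7, 8], [0, 9, 6], [7, 8, 9], [0, 1, 6]]    # negative context ids for each pair
loss = model(pos_w, pos_v, neg_v)
print(loss.item())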

Training code

from skip_gram_nge_model import SkipGramModel
from input_data import InputData
import torch.optim as optim
from tqdm import tqdm


import argparse

def ArgumentParser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name', type=str, default="skip-gram", help="skip-gram or cbow")
    parser.add_argument("--window_size",type=int,default=3,help="window size in word2vec")
    parser.add_argument("--batch_size",type=int,default=256,help="batch size during training phase")
    parser.add_argument("--min_count",type=int,default=3,help="min count of training word")
    parser.add_argument("--embed_dimension",type=int,default=100,help="embedding dimension of word embedding")
    parser.add_argument("--learning_rate",type=float,default=0.02,help="learning rate during training phase")
    parser.add_argument("--neg_count",type=int,default=5,help="neg count of skip-gram")
    return parser.parse_args()

args = ArgumentParser()

WINDOW_SIZE = args.window_size  # context window size
BATCH_SIZE = args.batch_size  # mini-batch size
MIN_COUNT = args.min_count  # frequency threshold below which words are discarded
EMB_DIMENSION = args.embed_dimension  # embedding dimension
LR = args.learning_rate  # learning rate
NEG_COUNT = args.neg_count  # number of negative samples


class Word2Vec:
    def __init__(self, input_file_name, output_file_name):
        self.output_file_name = output_file_name
        self.data = InputData(input_file_name, MIN_COUNT)
        self.model = SkipGramModel(self.data.word_count, EMB_DIMENSION)
        self.lr = LR
        self.optimizer = optim.SGD(self.model.parameters(), lr=self.lr)

    def train(self):
        print("SkipGram Training......")
        pairs_count = self.data.evaluate_pairs_count(WINDOW_SIZE)
        print("pairs_count", pairs_count)
        batch_count = pairs_count / BATCH_SIZE
        print("batch_count", batch_count)
        process_bar = tqdm(range(int(batch_count)))
        for i in process_bar:
            pos_pairs = self.data.get_batch_pairs(BATCH_SIZE, WINDOW_SIZE)
            pos_w = [int(pair[0]) for pair in pos_pairs]
            pos_v = [int(pair[1]) for pair in pos_pairs]
            neg_v = self.data.get_negative_sampling(pos_pairs, NEG_COUNT)

            self.optimizer.zero_grad()
            loss = self.model.forward(pos_w, pos_v, neg_v)
            loss.backward()
            self.optimizer.step()

            if i * BATCH_SIZE % 100000 == 0:
                self.lr = self.lr * (1.0 - 1.0 * i / batch_count)
                for param_group in self.optimizer.param_groups:
                    param_group['lr'] = self.lr

        self.model.save_embedding(self.data.id2word_dict, self.output_file_name)


if __name__ == '__main__':
    w2v = Word2Vec(input_file_name='../data/lxc.txt', output_file_name="skip_gram_neg.txt")
    w2v.train()

Hierarchical softmax code

Building the Huffman tree

class HuffmanNode:
    def __init__(self, word_id, frequency):
        self.word_id = word_id
        self.frequency = frequency
        self.left_child = None
        self.right_child = None
        self.father = None
        self.Huffman_code = []
        self.path = []



class HuffmanTree:
    def __init__(self, wordid_frequency_dict):
        self.word_count = len(wordid_frequency_dict)
        self.wordid_code = dict()
        self.wordid_path = dict()
        self.root = None
        unmerge_node_list = [HuffmanNode(wordid, frequency) for wordid, frequency in wordid_frequency_dict.items()]
        self.huffman = [HuffmanNode(wordid, frequency) for wordid, frequency in wordid_frequency_dict.items()]
        print("Building huffman tree...")
        self.build_tree(unmerge_node_list)
        print("Building tree finished")
        # generate Huffman codes and paths
        print("Generating huffman path...")
        self.generate_huffman_code_and_path()
        print("Generating huffman path finished")

    def merge_node(self, node1, node2):
        sum_frequency = node1.frequency + node2.frequency
        mid_node_id = len(self.huffman)
        father_node = HuffmanNode(mid_node_id, sum_frequency)
        if node1.frequency >= node2.frequency:
            father_node.left_child = node1
            father_node.right_child = node2
        else:
            father_node.left_child = node2
            father_node.right_child = node1
        self.huffman.append(father_node)
        return father_node

    def build_tree(self, node_list):

        while len(node_list) > 1:
            node_list = sorted(node_list, key=lambda x: x.frequency)
            i1 = node_list[0]
            i2 = node_list[1]
            node_list.remove(i1)
            node_list.remove(i2)
            father_node = self.merge_node(i1, i2)  # merge the two nodes with the smallest frequencies
            node_list.append(father_node)  # insert the new node

        self.root = node_list[0]

    def generate_huffman_code_and_path(self):
        stack = [self.root]
        while len(stack) > 0:
            node = stack.pop()
            # walk down the left subtree
            while node.left_child or node.right_child:
                code = node.Huffman_code
                path = node.path
                node.left_child.Huffman_code = code + [1]
                node.right_child.Huffman_code = code + [0]
                node.left_child.path = path + [node.word_id]
                node.right_child.path = path + [node.word_id]
                # push the unvisited right subtree onto the stack
                stack.append(node.right_child)
                node = node.left_child
            word_id = node.word_id
            word_code = node.Huffman_code
            word_path = node.path
            self.huffman[word_id].Huffman_code = word_code
            self.huffman[word_id].path = word_path
            # record the node's Huffman code and path in the dictionaries
            self.wordid_code[word_id] = word_code
            self.wordid_path[word_id] = word_path

    # collect, for every word, the ids of the positive and negative nodes on its path
    def get_all_pos_and_neg_path(self):
        positive = []  # positive-path node ids for every word
        negative = []  # negative-path node ids for every word
        for word_id in range(self.word_count):
            pos_id = []  # positive node ids on this word's path
            neg_id = []  # negative node ids on this word's path
            for i, code in enumerate(self.huffman[word_id].Huffman_code):
                if code == 1:
                    pos_id.append(self.huffman[word_id].path[i])
                else:
                    neg_id.append(self.huffman[word_id].path[i])
            positive.append(pos_id)
            negative.append(neg_id)
        return positive, negative


if __name__ == "__main__":
    word_frequency = {0: 7, 1: 8, 2: 3, 3: 2, 4: 2}
    print(word_frequency)
    tree = HuffmanTree(word_frequency)
    print(tree.wordid_code)
    print(tree.wordid_path)
    for i in range(len(word_frequency)):
        print(tree.huffman[i].path)
    print(tree.get_all_pos_and_neg_path())

Input data

import numpy as np
import sys

sys.path.append("../Skip_Gram_HS")
from collections import deque
from huffman_tree import HuffmanTree


class InputData:
    def __init__(self, input_file_name, min_count):
        self.input_file_name = input_file_name
        self.index = 0
        self.input_file = open(self.input_file_name)  # data file
        self.min_count = min_count  # frequency threshold below which words are discarded
        self.wordId_frequency_dict = dict()  # word id -> frequency
        self.word_count = 0  # number of distinct words
        self.word_count_sum = 0  # total number of tokens (repetitions counted)
        self.sentence_count = 0  # number of sentences
        self.id2word_dict = dict()  # id -> word
        self.word2id_dict = dict()  # word -> id
        self._init_dict()  # initialize the dictionaries
        self.huffman_tree = HuffmanTree(self.wordId_frequency_dict)  # Huffman tree
        self.huffman_pos_path, self.huffman_neg_path = self.huffman_tree.get_all_pos_and_neg_path()
        self.word_pairs_queue = deque()
        # print summary statistics
        self.get_wordId_list()
        print('Word Count is:', self.word_count)
        print('Word Count Sum is', self.word_count_sum)
        print('Sentence Count is:', self.sentence_count)
        print('Tree Node is:', len(self.huffman_tree.huffman))

    def _init_dict(self):
        word_freq = dict()
        # count word frequencies
        for line in self.input_file:
            line = line.strip().split(' ')  # strip whitespace and split into words
            self.word_count_sum += len(line)
            self.sentence_count += 1
            for i, word in enumerate(line):
                if i % 1000000 == 0:
                    print(i, len(line))
                try:
                    word_freq[word] += 1
                except:
                    word_freq[word] = 1
        word_id = 0
        # build the word2id_dict, id2word_dict and wordId_frequency_dict dictionaries
        for per_word, per_count in word_freq.items():
            if per_count < self.min_count:  # drop low-frequency words
                self.word_count_sum -= per_count
                continue
            self.id2word_dict[word_id] = per_word
            self.word2id_dict[per_word] = word_id
            self.wordId_frequency_dict[word_id] = per_count
            word_id += 1
        self.word_count = len(self.word2id_dict)

    def get_wordId_list(self):
        self.input_file = open(self.input_file_name, encoding="utf-8")
        sentence = self.input_file.readline()
        wordId_list = []  # ids of all words in the sentence
        sentence = sentence.strip().split(' ')
        for i, word in enumerate(sentence):
            if i % 1000000 == 0:
                print(i, len(sentence))
            try:
                word_id = self.word2id_dict[word]
                wordId_list.append(word_id)
            except:
                continue
        self.wordId_list = wordId_list

    # get a mini-batch of positive pairs (w, v): w is the center word id, v is the id of one context word; the context spans window_size on each side, i.e. 2c = 2 * window_size
    def get_batch_pairs(self, batch_size, window_size):
        while len(self.word_pairs_queue) < batch_size:
            for _ in range(1000):
                if self.index == len(self.wordId_list):
                    self.index = 0
                for i in range(max(self.index - window_size, 0),
                               min(self.index + window_size + 1, len(self.wordId_list))):
                    wordId_w = self.wordId_list[self.index]
                    wordId_v = self.wordId_list[i]
                    if self.index == i:  # skip when the context position equals the center word
                        continue
                    self.word_pairs_queue.append((wordId_w, wordId_v))
                self.index += 1
        result_pairs = []  # return a mini-batch of positive pairs
        for _ in range(batch_size):
            result_pairs.append(self.word_pairs_queue.popleft())
        return result_pairs

    def get_pairs(self, pos_pairs):
        neg_word_pair = []
        pos_word_pair = []
        for pair in pos_pairs:
            pos_word_pair += zip([pair[0]] * len(self.huffman_pos_path[pair[1]]), self.huffman_pos_path[pair[1]])
            neg_word_pair += zip([pair[0]] * len(self.huffman_neg_path[pair[1]]), self.huffman_neg_path[pair[1]])
        return pos_word_pair, neg_word_pair

    # estimate the number of positive pairs in the data, used to set the number of batches
    def evaluate_pairs_count(self, window_size):
        return self.word_count_sum * (2 * window_size - 1) - (self.sentence_count - 1) * (1 + window_size) * window_size

Model

import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super(SkipGramModel, self).__init__()
        self.vocab_size = vocab_size
        self.emb_size = emb_size
        self.w_embeddings = nn.Embedding(2*vocab_size-1, emb_size, sparse=True)
        self.v_embeddings = nn.Embedding(2*vocab_size-1, emb_size, sparse=True)
        self._init_emb()

    def _init_emb(self):
        initrange = 0.5 / self.emb_size
        self.w_embeddings.weight.data.uniform_(-initrange, initrange)
        self.v_embeddings.weight.data.uniform_(-0, 0)

    def forward(self, pos_w, pos_v,neg_w, neg_v):
        emb_w = self.w_embeddings(torch.LongTensor(pos_w))  # to tensor, shape [num_pairs, emb_dimension]
        neg_emb_w = self.w_embeddings(torch.LongTensor(neg_w))
        emb_v = self.v_embeddings(torch.LongTensor(pos_v))
        neg_emb_v = self.v_embeddings(torch.LongTensor(neg_v))  # shape [num_negative_pairs, emb_dimension]
        score = torch.mul(emb_w, emb_v).squeeze()
        score = torch.sum(score, dim=1)
        score = torch.clamp(score, max=10, min=-10)
        score = F.logsigmoid(score)
        neg_score = torch.mul(neg_emb_w, neg_emb_v).squeeze()
        neg_score = torch.sum(neg_score, dim=1)
        neg_score = torch.clamp(neg_score, max=10, min=-10)
        neg_score = F.logsigmoid(-neg_score)
        # L = log sigmoid (Xw.T * θv) + [log sigmoid (-Xw.T * θv)]
        loss = -1 * (torch.sum(score) + torch.sum(neg_score))
        return loss

    def save_embedding(self, id2word, file_name):
        embedding = self.w_embeddings.weight.data.cpu().numpy()
        fout = open(file_name, 'w')
        fout.write('%d %d\n' % (len(id2word), self.emb_size))
        for wid, w in id2word.items():
            e = embedding[wid]
            e = ' '.join(map(lambda x: str(x), e))
            fout.write('%s %s\n' % (w, e))
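
A hypothetical smoke test for the hierarchical-softmax model on made-up ids; in real training the (center word, node) pairs come from InputData.get_pairs, and with a toy vocabulary of 10 words the internal node ids run from 10 to 18:

# Smoke test with made-up ids; not real training data.
model = SkipGramModel(vocab_size=10, emb_size=8)
pos_u = [1, 1, 2]        # center word ids, one entry per positive node on the context word's path
pos_v = [12, 15, 13]     # internal-node ids taken as "positive" (code = 1) steps
neg_u = [1, 2, 2]        # center word ids for the "negative" (code = 0) steps
neg_v = [10, 11, 16]     # internal-node ids taken as negative steps
loss = model(pos_u, pos_v, neg_u, neg_v)
print(loss.item())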


Training code

import sys
sys.path.append("../Skip_Gram_HS")
from skip_gram_hs_model import SkipGramModel
from input_data import InputData
import torch.optim as optim
from tqdm import tqdm

import argparse

def ArgumentParser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name', type=str, default="skip-gram", help="skip-gram or cbow")
    parser.add_argument("--window_size",type=int,default=3,help="window size in word2vec")
    parser.add_argument("--batch_size",type=int,default=256,help="batch size during training phase")
    parser.add_argument("--min_count",type=int,default=3,help="min count of training word")
    parser.add_argument("--embed_dimension",type=int,default=100,help="embedding dimension of word embedding")
    parser.add_argument("--learning_rate",type=float,default=0.02,help="learning rate during training phase")
    return parser.parse_args()

args = ArgumentParser()
WINDOW_SIZE = args.window_size  # context window size
BATCH_SIZE = args.batch_size  # mini-batch size
MIN_COUNT = args.min_count  # frequency threshold below which words are discarded
EMB_DIMENSION = args.embed_dimension  # embedding dimension
LR = args.learning_rate  # learning rate


class Word2Vec:
    def __init__(self, input_file_name, output_file_name):
        self.output_file_name = output_file_name
        self.data = InputData(input_file_name, MIN_COUNT)
        self.model = SkipGramModel(self.data.word_count, EMB_DIMENSION)
        self.lr = LR
        self.optimizer = optim.SGD(self.model.parameters(), lr=self.lr)

    def train(self):
        print("SkipGram Training......")
        pairs_count = self.data.evaluate_pairs_count(WINDOW_SIZE)
        print("pairs_count", pairs_count)
        batch_count = pairs_count / BATCH_SIZE
        print("batch_count", batch_count)
        process_bar = tqdm(range(int(batch_count)))
        for i in process_bar:
            pos_pairs = self.data.get_batch_pairs(BATCH_SIZE, WINDOW_SIZE)
            pos_pairs,neg_pairs = self.data.get_pairs(pos_pairs)
            pos_u = [pair[0] for pair in pos_pairs]
            pos_v = [int(pair[1]) for pair in pos_pairs]
            neg_u = [pair[0] for pair in neg_pairs]
            neg_v = [int(pair[1]) for pair in neg_pairs]
            self.optimizer.zero_grad()
            loss = self.model.forward(pos_u, pos_v, neg_u,neg_v)
            loss.backward()
            self.optimizer.step()

            if i * BATCH_SIZE % 100000 == 0:
                self.lr = self.lr * (1.0 - 1.0 * i / batch_count)
                for param_group in self.optimizer.param_groups:
                    param_group['lr'] = self.lr

        self.model.save_embedding(self.data.id2word_dict, self.output_file_name)


if __name__ == '__main__':
    w2v = Word2Vec(input_file_name='../data/lxc.txt', output_file_name="word_embedding.txt")
    w2v.train()

Ref: the derivation follows Stanford CS224n; the code follows 深度课堂.
