Word2Vec Model Summary

1. Constructing the Huffman Tree
Explanation: Given $n$ weights as $n$ leaf nodes, construct a binary tree; if its weighted path length is minimal, the tree is called an optimal binary tree, also known as a Huffman tree. The weighted path length of a tree is defined as the sum of the weighted path lengths of all its leaf nodes. The Huffman tree is constructed as follows (a Python sketch is given after this list):
[1] Regard $\{w_1, w_2, \dots, w_n\}$ as a forest of $n$ single-node trees;
[2] Select the two trees whose roots have the smallest weights and merge them as the left and right subtrees of a new tree, whose root weight is the sum of the weights of its two subtree roots;
[3] Delete the two selected trees from the forest and add the new tree to it;
[4] Repeat steps [2] and [3] until only one tree remains in the forest; that tree is the desired Huffman tree.
Note: The binary prefix code derived from a Huffman tree is called the Huffman code; it satisfies the prefix-code condition while minimizing the total length of the encoded message.
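To make the construction concrete, here is a minimal Python sketch that builds Huffman codes from a word-count dictionary using a heap; `build_huffman_codes` and its data layout are illustrative choices, not the word2vec implementation.

```python
import heapq
import itertools

def build_huffman_codes(word_counts):
    """Build a Huffman tree from {word: count} and return {word: code},
    where each code is a string of '0'/'1' read from the root to the leaf."""
    tie = itertools.count()  # tie-breaker so heapq never compares dicts
    # Each heap entry: (weight, tie_breaker, subtree); a leaf is {'word': w}.
    heap = [(cnt, next(tie), {'word': w}) for w, cnt in word_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two lightest roots
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), {'left': left, 'right': right}))
    codes = {}
    def assign(node, prefix):
        if 'word' in node:
            codes[node['word']] = prefix
        else:
            assign(node['left'], prefix + '0')
            assign(node['right'], prefix + '1')
    assign(heap[0][2], '')
    return codes

print(build_huffman_codes({'the': 50, 'of': 30, 'cat': 10, 'sat': 8, 'mat': 2}))
```

Because the two lightest roots are merged first, frequent words end up near the root with short codes; this is exactly why Hierarchical Softmax pairs well with a Huffman tree: the expected path length, and hence the per-word training cost, is minimized.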

2. The Hierarchical Softmax Model [CBOW]
Explanation:
[Figure: network structure of the CBOW model under Hierarchical Softmax; the input layer holds the $2c$ context word vectors, the projection layer sums them into $\mathbf{x}_w$, and the output layer is the Huffman tree over the vocabulary.]
The parameters have the following meanings:
[1] $\mathbf{x}_w = \sum\limits_{i = 1}^{2c} \mathbf{v}\left( Context(w)_i \right) \in \mathbb{R}^m$ is the sum of the word vectors of the $2c$ context words of $w$;
[2] $d_j^w$ denotes the Huffman code of the $j$-th node on the path $p^w$ (the root node carries no code);
[3] $\theta_j^w$ denotes the vector attached to the $j$-th non-leaf node on the path $p^w$;
[4] $p^w$ denotes the path from the root to the leaf node corresponding to $w$;
[5] $l^w$ denotes the number of nodes on the path $p^w$.
The basic idea of Hierarchical Softmax is as follows:
$$p\left( w \mid Context(w) \right) = \prod\limits_{j = 2}^{l^w} p\left( d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w \right)$$
$$p\left( d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w \right) = \left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right]^{1 - d_j^w} \cdot \left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right]^{d_j^w}$$
For the CBOW model with Hierarchical Softmax in word2vec, the objective function to be optimized is:
$$L = \sum\limits_{w \in C} \log p\left( w \mid Context(w) \right)$$
Plugging in the decomposition above yields the log-likelihood function:
$$\begin{aligned} L &= \sum\limits_{w \in C} \log \prod\limits_{j = 2}^{l^w} \left\{ \left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right]^{1 - d_j^w} \cdot \left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right]^{d_j^w} \right\} \\ &= \sum\limits_{w \in C} \sum\limits_{j = 2}^{l^w} \left\{ \left( 1 - d_j^w \right) \cdot \log\left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right] + d_j^w \cdot \log\left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right] \right\} \end{aligned}$$
Denote the content of the braces by $L\left( w, j \right)$:
$$L\left( w, j \right) = \left( 1 - d_j^w \right) \cdot \log\left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right] + d_j^w \cdot \log\left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right]$$
Using stochastic gradient ascent, take the partial derivative with respect to $\theta_{j-1}^w$:
$$\begin{aligned} \frac{\partial L\left( w, j \right)}{\partial \theta_{j-1}^w} &= \frac{\partial}{\partial \theta_{j-1}^w} \left\{ \left( 1 - d_j^w \right) \cdot \log\left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right] + d_j^w \cdot \log\left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right] \right\} \\ &= \left( 1 - d_j^w \right) \cdot \left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right] \mathbf{x}_w - d_j^w \cdot \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \mathbf{x}_w \\ &= \left\{ \left( 1 - d_j^w \right) \cdot \left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right] - d_j^w \cdot \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right\} \mathbf{x}_w \\ &= \left[ 1 - d_j^w - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right] \mathbf{x}_w \end{aligned}$$
The update equation for $\theta_{j-1}^w$ is therefore:
$$\theta_{j-1}^w := \theta_{j-1}^w + \eta \left[ 1 - d_j^w - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right] \mathbf{x}_w$$
Similarly, take the partial derivative with respect to $\mathbf{x}_w$:
$$\frac{\partial L\left( w, j \right)}{\partial \mathbf{x}_w} = \left[ 1 - d_j^w - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta_{j-1}^w \right) \right] \theta_{j-1}^w$$
The update equation for the word vector $\mathbf{v}\left( \tilde w \right)$ of each word in the context is:
$$\mathbf{v}\left( \tilde w \right) := \mathbf{v}\left( \tilde w \right) + \eta \sum\limits_{j = 2}^{l^w} \frac{\partial L\left( w, j \right)}{\partial \mathbf{x}_w}, \quad \tilde w \in Context(w)$$
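The two update equations translate directly into one stochastic-gradient-ascent step. Below is a minimal NumPy sketch, assuming the Huffman codes $d_j^w$ and the indices of the inner-node vectors $\theta_{j-1}^w$ along the path have already been computed; all identifiers (`V`, `Theta`, `cbow_hs_step`, `eta`) are illustrative, not word2vec's actual names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_hs_step(V, Theta, context_ids, codes, node_ids, eta=0.025):
    """One update for a single (Context(w), w) pair.
    V           : (vocab_size, m) word vectors v(.)
    Theta       : (num_inner_nodes, m) vectors theta of the non-leaf nodes
    context_ids : indices of the 2c context words of w
    codes       : list of d_j^w in {0, 1}, j = 2..l^w (root excluded)
    node_ids    : indices of theta_{j-1}^w along the path, same length as codes
    """
    x_w = V[context_ids].sum(axis=0)      # x_w = sum of context word vectors
    e = np.zeros_like(x_w)                # accumulates eta * dL(w, j)/dx_w
    for d, node in zip(codes, node_ids):
        q = sigmoid(x_w @ Theta[node])    # sigma(x_w^T theta_{j-1}^w)
        g = eta * (1 - d - q)             # eta * [1 - d_j^w - sigma(...)]
        e += g * Theta[node]              # gradient contribution w.r.t. x_w
        Theta[node] += g * x_w            # update theta_{j-1}^w
    V[context_ids] += e                   # same correction for every v(w~)
```

Note that the correction `e` for $\mathbf{x}_w$ is accumulated over the path and then added to every context-word vector, mirroring the sum $\sum_{j=2}^{l^w} \partial L(w,j)/\partial \mathbf{x}_w$ in the update of $\mathbf{v}(\tilde w)$.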

3. The Hierarchical Softmax Model [Skip-Gram]
Explanation:
[Figure: network structure of the Skip-Gram model under Hierarchical Softmax; the input layer holds the center-word vector, the projection layer passes it through unchanged, and the output layer is the Huffman tree over the vocabulary.]
Here $\mathbf{v}\left( w \right) \in \mathbb{R}^m$ denotes the word vector of the center word $w$ of the current sample.
For the Skip-Gram model with Hierarchical Softmax in word2vec, the objective function to be optimized is:
$$L = \sum\limits_{w \in C} \log p\left( Context(w) \mid w \right)$$
In the Skip-Gram model the conditional probability $p\left( Context(w) \mid w \right)$ is defined as:
$$p\left( Context(w) \mid w \right) = \prod\limits_{u \in Context(w)} p\left( u \mid w \right)$$
$$p\left( u \mid w \right) = \prod\limits_{j = 2}^{l^u} p\left( d_j^u \mid \mathbf{v}(w), \theta_{j-1}^u \right)$$
$$p\left( d_j^u \mid \mathbf{v}(w), \theta_{j-1}^u \right) = \left[ \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta_{j-1}^u \right) \right]^{1 - d_j^u} \cdot \left[ 1 - \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta_{j-1}^u \right) \right]^{d_j^u}$$
This yields the log-likelihood function:
$$\begin{aligned} L &= \sum\limits_{w \in C} \log \prod\limits_{u \in Context(w)} \prod\limits_{j = 2}^{l^u} \left\{ \left[ \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta_{j-1}^u \right) \right]^{1 - d_j^u} \cdot \left[ 1 - \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta_{j-1}^u \right) \right]^{d_j^u} \right\} \\ &= \sum\limits_{w \in C} \sum\limits_{u \in Context(w)} \sum\limits_{j = 2}^{l^u} \left\{ \left( 1 - d_j^u \right) \cdot \log\left[ \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta_{j-1}^u \right) \right] + d_j^u \cdot \log\left[ 1 - \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta_{j-1}^u \right) \right] \right\} \end{aligned}$$
Denote the content of the braces by $L\left( w, u, j \right)$:
$$L\left( w, u, j \right) = \left( 1 - d_j^u \right) \cdot \log\left[ \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta_{j-1}^u \right) \right] + d_j^u \cdot \log\left[ 1 - \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta_{j-1}^u \right) \right]$$
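The gradients and updates follow exactly as in the CBOW case, with $\mathbf{x}_w$ replaced by $\mathbf{v}(w)$ and the Huffman path taken for each context word $u$. A minimal NumPy sketch of one training step following the formulas above (the identifiers are illustrative, and the word2vec source organizes this loop somewhat differently):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def skipgram_hs_step(V, Theta, w_id, context_paths, eta=0.025):
    """One update for a single (w, Context(w)) pair under Hierarchical Softmax.
    context_paths: for every context word u, a pair (codes_u, node_ids_u)
    giving d_j^u and the indices of theta_{j-1}^u along u's Huffman path."""
    for codes_u, node_ids_u in context_paths:
        vw = V[w_id]                        # v(w), the center-word vector
        e = np.zeros_like(vw)
        for d, node in zip(codes_u, node_ids_u):
            q = sigmoid(vw @ Theta[node])   # sigma(v(w)^T theta_{j-1}^u)
            g = eta * (1 - d - q)           # eta * [1 - d_j^u - sigma(...)]
            e += g * Theta[node]            # gradient w.r.t. v(w)
            Theta[node] += g * vw           # update theta_{j-1}^u
        V[w_id] += e                        # update v(w) once per context word
```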

4. The Negative Sampling Model [CBOW]
Negative Sampling no longer uses a Huffman tree; instead it draws random negative samples, which improves performance considerably. Assume a non-empty set of negative samples $NEG(w) \ne \emptyset$ has already been drawn for $w$, and define the label of a word $\tilde w$ (1 for the positive sample, 0 for negative samples) as:
$$L^w\left( \tilde w \right) = \begin{cases} 1, & \tilde w = w \\ 0, & \tilde w \ne w \end{cases}$$
For a given positive sample $\left( Context(w), w \right)$, we maximize $g\left( w \right)$, defined as:
$$g\left( w \right) = \prod\limits_{u \in \{w\} \cup NEG(w)} p\left( u \mid Context(w) \right)$$
$$p\left( u \mid Context(w) \right) = \left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right]^{L^w\left( u \right)} \cdot \left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right]^{1 - L^w\left( u \right)}$$
Here $\mathbf{x}_w$ denotes the sum of the word vectors of the words in $Context(w)$, and $\theta^u \in \mathbb{R}^m$ is an auxiliary vector associated with the word $u$, a parameter to be trained. Substituting the labels, $g\left( w \right)$ simplifies to:
$$g\left( w \right) = \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^w \right) \prod\limits_{u \in NEG(w)} \left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right]$$
Here $\sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^w \right)$ is the probability of predicting the center word $w$ given the context $Context(w)$, and likewise $\sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right)$, $u \in NEG(w)$, is the probability of predicting the center word $u$ given the context $Context(w)$; maximizing $g\left( w \right)$ therefore raises the former while pushing the latter down.
For a given corpus $C$, the objective function is:
$$\begin{aligned} L &= \log G = \log \prod\limits_{w \in C} g\left( w \right) = \sum\limits_{w \in C} \log g\left( w \right) \\ &= \sum\limits_{w \in C} \log \prod\limits_{u \in \{w\} \cup NEG(w)} \left\{ \left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right]^{L^w\left( u \right)} \cdot \left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right]^{1 - L^w\left( u \right)} \right\} \\ &= \sum\limits_{w \in C} \sum\limits_{u \in \{w\} \cup NEG(w)} \left\{ L^w\left( u \right) \cdot \log\left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] + \left[ 1 - L^w\left( u \right) \right] \cdot \log\left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] \right\} \\ &= \sum\limits_{w \in C} \left\{ \log\left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^w \right) \right] + \sum\limits_{u \in NEG(w)} \log\left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] \right\} \\ &= \sum\limits_{w \in C} \left\{ \log\left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^w \right) \right] + \sum\limits_{u \in NEG(w)} \log\left[ \sigma\left( -\mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] \right\} \end{aligned}$$
Let $L\left( w, u \right) = L^w\left( u \right) \cdot \log\left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] + \left[ 1 - L^w\left( u \right) \right] \cdot \log\left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right]$. Using stochastic gradient ascent, take the partial derivative with respect to $\theta^u$:
$$\begin{aligned} \frac{\partial L\left( w, u \right)}{\partial \theta^u} &= \frac{\partial}{\partial \theta^u} \left\{ L^w\left( u \right) \cdot \log\left[ \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] + \left[ 1 - L^w\left( u \right) \right] \cdot \log\left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] \right\} \\ &= L^w\left( u \right) \left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] \mathbf{x}_w - \left[ 1 - L^w\left( u \right) \right] \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \mathbf{x}_w \\ &= \left\{ L^w\left( u \right) \left[ 1 - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] - \left[ 1 - L^w\left( u \right) \right] \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right\} \mathbf{x}_w \\ &= \left[ L^w\left( u \right) - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] \mathbf{x}_w \end{aligned}$$
The update equation for the parameter $\theta^u$ is:
$$\theta^u := \theta^u + \eta \left[ L^w\left( u \right) - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] \mathbf{x}_w$$
Similarly, take the partial derivative with respect to $\mathbf{x}_w$:
$$\frac{\partial L\left( w, u \right)}{\partial \mathbf{x}_w} = \left[ L^w\left( u \right) - \sigma\left( \mathbf{x}_w^{\mathrm{T}} \theta^u \right) \right] \theta^u$$
The update equation for the parameters $\mathbf{v}\left( \tilde w \right)$, $\tilde w \in Context(w)$, is:
$$\mathbf{v}\left( \tilde w \right) := \mathbf{v}\left( \tilde w \right) + \eta \sum\limits_{u \in \{w\} \cup NEG(w)} \frac{\partial L\left( w, u \right)}{\partial \mathbf{x}_w}, \quad \tilde w \in Context(w)$$
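Combining the two updates gives one training step for CBOW with Negative Sampling. A minimal NumPy sketch, assuming the negative indices $NEG(w)$ have already been drawn; the identifiers are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_ns_step(V, Theta, context_ids, w_id, neg_ids, eta=0.025):
    """One update for a positive pair (Context(w), w) plus its negatives NEG(w).
    V       : (vocab_size, m) word vectors v(.)
    Theta   : (vocab_size, m) auxiliary vectors theta^u
    neg_ids : sampled negative word indices (label 0); w_id carries label 1."""
    x_w = V[context_ids].sum(axis=0)      # x_w = sum of context word vectors
    e = np.zeros_like(x_w)                # accumulates eta * dL(w, u)/dx_w
    for u, label in [(w_id, 1)] + [(u, 0) for u in neg_ids]:
        q = sigmoid(x_w @ Theta[u])       # sigma(x_w^T theta^u)
        g = eta * (label - q)             # eta * [L^w(u) - sigma(...)]
        e += g * Theta[u]                 # gradient w.r.t. x_w
        Theta[u] += g * x_w               # update theta^u
    V[context_ids] += e                   # update each v(w~), w~ in Context(w)
```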

5. The Negative Sampling Model [Skip-Gram]
For a given corpus $C$, the objective function is built from:
$$G = \prod\limits_{w \in C} \prod\limits_{u \in Context(w)} g\left( u \right)$$
$$g\left( u \right) = \prod\limits_{z \in \{u\} \cup NEG(u)} p\left( z \mid w \right)$$
$$p\left( z \mid w \right) = \left[ \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta^z \right) \right]^{L^u\left( z \right)} \cdot \left[ 1 - \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta^z \right) \right]^{1 - L^u\left( z \right)}$$
$$\begin{aligned} L &= \log G = \log \prod\limits_{w \in C} \prod\limits_{u \in Context(w)} g\left( u \right) = \sum\limits_{w \in C} \sum\limits_{u \in Context(w)} \log g\left( u \right) \\ &= \sum\limits_{w \in C} \sum\limits_{u \in Context(w)} \log \prod\limits_{z \in \{u\} \cup NEG(u)} p\left( z \mid w \right) \\ &= \sum\limits_{w \in C} \sum\limits_{u \in Context(w)} \sum\limits_{z \in \{u\} \cup NEG(u)} \log \left\{ \left[ \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta^z \right) \right]^{L^u\left( z \right)} \cdot \left[ 1 - \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta^z \right) \right]^{1 - L^u\left( z \right)} \right\} \\ &= \sum\limits_{w \in C} \sum\limits_{u \in Context(w)} \sum\limits_{z \in \{u\} \cup NEG(u)} \left\{ L^u\left( z \right) \cdot \log\left[ \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta^z \right) \right] + \left[ 1 - L^u\left( z \right) \right] \cdot \log\left[ 1 - \sigma\left( \mathbf{v}(w)^{\mathrm{T}} \theta^z \right) \right] \right\} \end{aligned}$$
For each sample $\left( w, Context(w) \right)$, negatives should in principle be drawn for every word in $Context(w)$, but the word2vec source code only performs $\left| Context(w) \right|$ rounds of negative sampling for $w$ itself. In essence it still uses the CBOW formulation, except that the context $Context(w)$, which CBOW aggregates by summation, is now handled one word at a time. For a given corpus $C$, the objective function is therefore:
$$g\left( w \right) = \prod\limits_{\tilde w \in Context(w)} \prod\limits_{u \in \{w\} \cup NEG^{\tilde w}(w)} p\left( u \mid \tilde w \right)$$
$$p\left( u \mid \tilde w \right) = \left[ \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right]^{L^w\left( u \right)} \cdot \left[ 1 - \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right]^{1 - L^w\left( u \right)}$$
$$\begin{aligned} L &= \log G = \log \prod\limits_{w \in C} g\left( w \right) = \sum\limits_{w \in C} \log g\left( w \right) \\ &= \sum\limits_{w \in C} \log \prod\limits_{\tilde w \in Context(w)} \prod\limits_{u \in \{w\} \cup NEG^{\tilde w}(w)} \left\{ \left[ \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right]^{L^w\left( u \right)} \cdot \left[ 1 - \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right]^{1 - L^w\left( u \right)} \right\} \\ &= \sum\limits_{w \in C} \sum\limits_{\tilde w \in Context(w)} \sum\limits_{u \in \{w\} \cup NEG^{\tilde w}(w)} \left\{ L^w\left( u \right) \cdot \log\left[ \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right] + \left[ 1 - L^w\left( u \right) \right] \cdot \log\left[ 1 - \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right] \right\} \end{aligned}$$
Let $L\left( w, \tilde w, u \right) = L^w\left( u \right) \cdot \log\left[ \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right] + \left[ 1 - L^w\left( u \right) \right] \cdot \log\left[ 1 - \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right]$. Using stochastic gradient ascent, take the partial derivative with respect to $\theta^u$:
$$\begin{aligned} \frac{\partial L\left( w, \tilde w, u \right)}{\partial \theta^u} &= \frac{\partial}{\partial \theta^u} \left\{ L^w\left( u \right) \cdot \log\left[ \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right] + \left[ 1 - L^w\left( u \right) \right] \cdot \log\left[ 1 - \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right] \right\} \\ &= L^w\left( u \right) \left[ 1 - \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right] \mathbf{v}\left( \tilde w \right) - \left[ 1 - L^w\left( u \right) \right] \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \mathbf{v}\left( \tilde w \right) \\ &= \left\{ L^w\left( u \right) \left[ 1 - \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right] - \left[ 1 - L^w\left( u \right) \right] \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right\} \mathbf{v}\left( \tilde w \right) \\ &= \left[ L^w\left( u \right) - \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right] \mathbf{v}\left( \tilde w \right) \end{aligned}$$
The update equation for $\theta^u$ is:
$$\theta^u := \theta^u + \eta \left[ L^w\left( u \right) - \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right] \mathbf{v}\left( \tilde w \right)$$
Using stochastic gradient ascent, take the partial derivative with respect to $\mathbf{v}\left( \tilde w \right)$:
$$\frac{\partial L\left( w, \tilde w, u \right)}{\partial \mathbf{v}\left( \tilde w \right)} = \left[ L^w\left( u \right) - \sigma\left( \mathbf{v}(\tilde w)^{\mathrm{T}} \theta^u \right) \right] \theta^u$$
The update for the parameter $\mathbf{v}\left( \tilde w \right)$ is:
$$\mathbf{v}\left( \tilde w \right) := \mathbf{v}\left( \tilde w \right) + \eta \sum\limits_{u \in \{w\} \cup NEG^{\tilde w}(w)} \frac{\partial L\left( w, \tilde w, u \right)}{\partial \mathbf{v}\left( \tilde w \right)}$$
Here $NEG^{\tilde w}\left( w \right)$ denotes the set of negative samples generated when processing the context word $\tilde w$.
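A minimal NumPy sketch of the per-context-word scheme described above, where each $\tilde w \in Context(w)$ serves as the input and the negatives $NEG^{\tilde w}(w)$ are drawn for $w$; `sample_neg` stands in for whatever negative sampler is used (for example the table of section 6), and all identifiers are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def skipgram_ns_step(V, Theta, w_id, context_ids, sample_neg, eta=0.025):
    """Skip-Gram with Negative Sampling in the per-context-word form used by the
    word2vec source: v(w~) is the input and negatives are drawn for w.
    sample_neg(w_id) should return the indices NEG^{w~}(w) for one context word."""
    for c in context_ids:                       # w~ in Context(w)
        v_c = V[c]
        e = np.zeros_like(v_c)
        for u, label in [(w_id, 1)] + [(u, 0) for u in sample_neg(w_id)]:
            q = sigmoid(v_c @ Theta[u])         # sigma(v(w~)^T theta^u)
            g = eta * (label - q)               # eta * [L^w(u) - sigma(...)]
            e += g * Theta[u]                   # gradient w.r.t. v(w~)
            Theta[u] += g * v_c                 # update theta^u
        V[c] += e                               # update v(w~)
```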

6. The Negative Sampling Algorithm
[1] The weighted sampling principle
Let each word $w$ in the dictionary $D$ correspond to a line segment $l\left( w \right)$ whose length is:
$$len\left( w \right) = \frac{\mathrm{counter}\left( w \right)}{\sum\limits_{u \in D} \mathrm{counter}\left( u \right)}$$
Here $\mathrm{counter}\left( \cdot \right)$ denotes how many times a word occurs in the corpus $C$. Now join these segments end to end into a single segment of unit length. If points are thrown onto this unit segment uniformly at random, longer segments (corresponding to high-frequency words) are hit with higher probability.
[2] Negative sampling in word2vec
Let $l_0 = 0$ and $l_k = \sum\limits_{j = 1}^{k} len\left( w_j \right)$, $k = 1, 2, \cdots, N$, where $w_j$ denotes the $j$-th word in the dictionary $D$. Then $\left\{ l_j \right\}_{j = 0}^{N}$ are the cut points of a non-equidistant partition of the interval $\left[ 0, 1 \right]$, with subintervals $I_i = \left( l_{i-1}, l_i \right]$, $i = 1, 2, \cdots, N$. Further introduce an equidistant partition of $\left[ 0, 1 \right]$ with cut points $\left\{ m_j \right\}_{j = 0}^{M}$, where $M \gg N$, as illustrated below:
[Figure: the equidistant cut points $m_j$ projected onto the non-equidistant partition $\{I_i\}$, defining the mapping from table slots to words.]
Projecting the interior cut points $\left\{ m_j \right\}_{j = 1}^{M-1}$ onto the non-equidistant partition establishes a mapping from $\left\{ m_j \right\}_{j = 1}^{M-1}$ to the subintervals $\left\{ I_j \right\}_{j = 1}^{N}$ (equivalently, to the words $\left\{ w_j \right\}_{j = 1}^{N}$):
$$\mathrm{Table}\left( i \right) = w_k, \quad m_i \in I_k, \quad i = 1, 2, \cdots, M - 1$$
With this mapping, each draw generates a random integer $r$ in $\left[ 1, M - 1 \right]$ and takes $\mathrm{Table}\left( r \right)$ as the sample. When drawing negative samples for $w_i$, if the sample happens to be $w_i$ itself, it is simply skipped.
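A minimal Python sketch of the table construction and of sampling with the "skip the positive word" rule; `build_unigram_table` and `negative_sample` are illustrative names. The $len(w)$ above uses raw counts (power = 1.0); as an aside, the word2vec source raises the counts to the power 0.75 before normalizing.

```python
import random

def build_unigram_table(word_counts, table_size=10**6, power=1.0):
    """Fill a table of words so that word w occupies a share of slots
    proportional to counter(w)**power (the word2vec source uses power=0.75)."""
    words = list(word_counts)
    weights = [word_counts[w] ** power for w in words]
    total = sum(weights)
    table, cum, i = [], 0.0, 0
    for idx, w in enumerate(words):
        cum += weights[idx] / total             # right end l_k of interval I_k
        while i < table_size and i / table_size < cum:
            table.append(w)                     # Table(i) = w_k for m_i in I_k
            i += 1
    while len(table) < table_size:              # guard against rounding gaps
        table.append(words[-1])
    return table

def negative_sample(table, positive_word, k=5):
    """Draw k negatives by uniform indexing into the table, skipping any hit
    on the positive word itself."""
    negatives = []
    while len(negatives) < k:
        w = table[random.randrange(len(table))]
        if w != positive_word:
            negatives.append(w)
    return negatives

table = build_unigram_table({'the': 50, 'of': 30, 'cat': 10, 'sat': 8, 'mat': 2},
                            table_size=1000)
print(negative_sample(table, 'cat', k=5))
```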

