The word2vec Model Based on Negative Sampling

Latest recommended article published on 2022-03-09 07:37:38

zhong_ddbb

Category: Natural Language Processing. Tags: Natural Language Processing

Article link: https://blog.csdn.net/zhong_ddbb/article/details/106504586

Contents

Negative Sampling Algorithm
CBOW Model
Skip-gram Model

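Before discussing the word2vec model based on Negative Sampling, let us first look at the shortcoming of Hierarchical Softmax. Using a Huffman tree in place of the traditional neural network does improve training efficiency. But if the center word $w$ in a training sample is a very rare word, we have to walk a long way down the Huffman tree. Can we do without such a complicated Huffman tree and make the model simpler?
Negative Sampling is one such way of training the word2vec model: it abandons the Huffman tree and works with sampled negative examples instead.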

Negative Sampling Algorithm

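In the CBOW model, the context $context(w)$ of a word $w$ is known and $w$ has to be predicted. So, for a given $context(w)$, the word $w$ is a positive sample and every other word is a negative sample. The same distinction between positive and negative samples exists in Skip-gram. With so many negative samples, how should we choose among them? This is the Negative Sampling problem: for a given word $w$, how do we generate its subset of negative samples $NEG(w)$?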

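The basic requirement for the sampling is this: words in the dictionary $D$ occur in the corpus $C$ with different frequencies, so high-frequency words should have a larger probability of being chosen as negative samples, and low-frequency words a smaller one. In essence, this is a weighted sampling problem.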

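The negative sampling method used by word2vec is as follows:

(1) First, divide a line segment of length 1 into $V$ parts of unequal length ($V$ is the vocabulary size), one part per word in the vocabulary. High-frequency words get long segments and low-frequency words get short ones. The length of each word's segment is given by:
$len(w) = \frac{count(w)}{\sum\limits_{u \in D} count(u)}$
In word2vec, both the numerator and the denominator are raised to the 3/4 power:
$len(w) = \frac{count(w)^{3/4}}{\sum\limits_{u \in D} count(u)^{3/4}}$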
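(2) Next, take another line segment of length 1 and divide it into $M$ equal parts, where $M \gg V$, as illustrated in the original figure:
[Figure: the equally spaced points $m_j$ mapped onto the word intervals $I_i$]

As the figure shows, each of the $M$ parts falls on the segment of some word.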

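(3) To sample, first draw neg positions from the $M$ positions; the words that these neg positions map to are the negative words. For example, if we draw $m_3$, which corresponds to $I_2$, then the word corresponding to $I_2$ is a negative word.

Note: in word2vec, $M$ defaults to $10^8$.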
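
Steps (1) to (3) can be sketched in a few lines of NumPy. This is only a minimal illustration, not the actual word2vec source: the function names `build_unigram_table` and `sample_negatives` are hypothetical, the default table size is reduced to $10^6$ to keep memory use small, and skipping the positive word during sampling is an assumption of this sketch.

```python
import numpy as np

def build_unigram_table(counts, table_size=10**6, power=0.75):
    """Map each of `table_size` equally spaced slots to a word index."""
    weights = np.asarray(counts, dtype=np.float64) ** power
    lengths = weights / weights.sum()   # len(w) for every word, step (1)
    bounds = np.cumsum(lengths)         # right endpoint of each word's interval
    bounds[-1] = 1.0                    # guard against floating-point rounding
    # Step (2): midpoints of the M equal cells of the unit segment
    points = (np.arange(table_size) + 0.5) / table_size
    # Index of the first interval whose right endpoint is >= the point
    return np.searchsorted(bounds, points, side="left")

def sample_negatives(table, target, neg, rng):
    """Step (3): draw `neg` word indices, skipping the positive word.

    Assumes the vocabulary contains at least one word other than `target`.
    """
    negatives = []
    while len(negatives) < neg:
        word = int(table[rng.integers(len(table))])
        if word != target:
            negatives.append(word)
    return negatives

if __name__ == "__main__":
    counts = [1000, 100, 10, 1]         # made-up word counts
    table = build_unigram_table(counts, table_size=10_000)
    rng = np.random.default_rng(0)
    print(np.bincount(table) / len(table))   # share of slots per word
    print(sample_negatives(table, target=0, neg=5, rng=rng))
```

Because of the 3/4 power, rare words receive a larger share of the slots than their raw frequency would give them, while frequent words still dominate.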

CBOW Model

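Suppose a non-empty subset of negative samples for $w$ has already been drawn, $NEG(w) \ne \emptyset$, and for every $\tilde w \in D$ define:
$L^w(\tilde w ) = \begin{cases} 1, \quad \tilde w = w \\ 0, \quad \tilde w \ne w \end{cases}$
as the label of the word $\tilde w$. That is, the positive sample has label 1 and the negative samples have label 0.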

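For a given positive sample $(context(w), w)$, we want to maximize:
$g(w) = \prod_{u \in \{w\} \cup NEG(w) }P(u|context(w))$
where:
$P(u|context(w)) = \begin{cases} \sigma(\mathbf x_w^T\theta^u), \qquad L^w( u )=1 \\ 1-\sigma(\mathbf x_w^T\theta^u), \quad L^w( u )=0 \end{cases}$
Written as a single expression:
$P(u|context(w)) = [\sigma(\mathbf x_w^T\theta^u)]^{L^w( u )} \cdot [1-\sigma(\mathbf x_w^T\theta^u)]^{1- L^w( u )}$
Here $\mathbf x_w$ is the sum of the word vectors of the words in $context(w)$, and $\theta^u$ is a vector associated with the word $u$; it is a parameter to be trained.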

So the final expression for $g(w)$ is:
$g(w) = \sigma(\mathbf x_w^T\theta^w) \prod_{u \in NEG(w) } [1-\sigma(\mathbf x_w^T\theta^u) ]$
Here $\sigma(\mathbf x_w^T\theta^w)$ is the probability of predicting the center word $w$ when the context is $context(w)$, while $\sigma(\mathbf x_w^T\theta^u)$ for $u \in NEG(w)$ is the probability of predicting the center word $u$. In form, maximizing $g(w)$ amounts to increasing the probability of the positive sample while decreasing the probabilities of the negative samples. So, given the corpus $C$, the function:
$G = \prod_{w \in C} g(w)$
can serve as the overall optimization objective. For ease of computation, we take the logarithm of $G$:
$\begin{aligned} \mathcal L &= \log G = \log \prod_{w \in C} g(w) = \sum_{w \in C} \log g(w) \\ &=\sum_{w \in C} \log \prod_{u \in \{w\} \cup NEG(w) } \left\{ [\sigma(\mathbf x_w^T\theta^u)]^{L^w( u )} \cdot [1-\sigma(\mathbf x_w^T\theta^u)]^{1- L^w( u )} \right\} \\ & = \sum_{w \in C} \sum_{u \in \{w\} \cup NEG(w) } \left\{ {L^w( u )} \log [\sigma(\mathbf x_w^T\theta^u)]+ (1- L^w( u ))\log [1-\sigma(\mathbf x_w^T\theta^u)] \right\} \\ &= \sum_{w \in C} \sum_{u \in \{w\} \cup NEG(w) } \mathcal L(w,u) \end{aligned}$
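
To make the objective concrete, the contribution $\log g(w)$ of a single training sample can be computed as below. This is a sketch under assumptions: the function name `log_g`, the array `theta` (one row per word, holding $\theta^u$) and the index arguments are hypothetical names chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_g(x_w, theta, target, negatives):
    """log g(w) = log sigma(x_w . theta^w) + sum over NEG(w) of log(1 - sigma(x_w . theta^u))."""
    pos = np.log(sigmoid(x_w @ theta[target]))                 # label 1 term
    neg = np.log(1.0 - sigmoid(theta[negatives] @ x_w)).sum()  # label 0 terms
    return pos + neg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V = rng.normal(scale=0.1, size=(6, 4))      # word vectors v(w), made-up values
    theta = rng.normal(scale=0.1, size=(6, 4))  # auxiliary vectors theta^u
    context = [1, 2, 3]
    x_w = V[context].sum(axis=0)                # x_w: sum of the context word vectors
    print(log_g(x_w, theta, target=0, negatives=[4, 5]))
```

Summing this quantity over all samples in the corpus gives $\mathcal L$.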
Next, stochastic gradient ascent is used to optimize the parameters.

(1) Updating $\theta^u$:

Since:
$\begin{aligned}\frac{\partial \mathcal L(w,u)}{\partial \theta^u} &= \frac{\partial \{ L^w( u ) \log [\sigma(\mathbf x_w^T\theta^u)]+ [1- L^w( u )]\log [1-\sigma(\mathbf x_w^T\theta^u)]\} }{\partial \theta^u} \\&= L^w( u ) [1-\sigma(\mathbf x_w^T\theta^u)]\mathbf x_w - [1- L^w( u )] \sigma(\mathbf x_w^T\theta^u) \mathbf x_w \\ &=[L^w( u )-\sigma(\mathbf x_w^T\theta^u)] \mathbf x_w \end{aligned}$
the update rule for $\theta^u$ is:
$\theta^u:=\theta^u + \eta [L^w( u )-\sigma(\mathbf x_w^T\theta^u)] \mathbf x_w$
where $\eta$ is the learning rate.
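
The closed-form gradient above can be checked numerically. The sketch below compares $[L^w(u)-\sigma(\mathbf x_w^T\theta^u)]\mathbf x_w$ with a central finite-difference estimate; all function names are hypothetical and the vectors are random test values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_objective(x_w, theta_u, label):
    """L(w,u) for a single word u with label L^w(u)."""
    s = sigmoid(x_w @ theta_u)
    return label * np.log(s) + (1 - label) * np.log(1 - s)

def grad_theta(x_w, theta_u, label):
    """Analytic gradient of L(w,u) with respect to theta^u."""
    return (label - sigmoid(x_w @ theta_u)) * x_w

def numeric_grad_theta(x_w, theta_u, label, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(theta_u)
    for i in range(theta_u.size):
        step = np.zeros_like(theta_u)
        step[i] = eps
        grad[i] = (pair_objective(x_w, theta_u + step, label)
                   - pair_objective(x_w, theta_u - step, label)) / (2 * eps)
    return grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_w, theta_u = rng.normal(size=5), rng.normal(size=5)
    for label in (0, 1):
        diff = grad_theta(x_w, theta_u, label) - numeric_grad_theta(x_w, theta_u, label)
        print(label, np.abs(diff).max())   # should be close to zero
```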

(2) Updating $\mathbf x_w$:

Because $\mathbf x_w$ and $\theta^u$ play symmetric roles in $\mathcal L(w,u)$, we have:
$\begin{aligned} \frac{\partial \mathcal L(w,u)}{\partial \mathbf x_w} = [L^w( u )-\sigma(\mathbf x_w^T\theta^u)] \theta^u \end{aligned}$
So the word vector $\mathbf v(\tilde w)$ of each context word is updated as:
$\mathbf v( \tilde w) := \mathbf v( \tilde w) + \eta \sum_{u \in \{w\} \cup NEG(w) } \frac{\partial \mathcal L(w,u)}{\partial \mathbf x_w},\quad \tilde w \in context(w)$
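
Putting the two update rules together, one stochastic gradient ascent step for a single sample $(context(w), w)$ might look like the following. This is a minimal sketch, not the reference implementation: the name `cbow_neg_step`, the arrays `V` (word vectors) and `theta` (auxiliary vectors), and the default learning rate are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_neg_step(V, theta, context, target, negatives, eta=0.025):
    """One gradient ascent step; updates V and theta in place.

    Assumes `negatives` does not contain `target`.
    """
    x_w = V[context].sum(axis=0)      # x_w: sum of the context word vectors
    e = np.zeros_like(x_w)            # accumulates eta * sum_u dL(w,u)/dx_w
    for u in [target] + list(negatives):
        label = 1.0 if u == target else 0.0
        g = eta * (label - sigmoid(x_w @ theta[u]))
        e += g * theta[u]             # uses theta^u before it is updated
        theta[u] += g * x_w           # update rule for theta^u
    for w_tilde in context:
        V[w_tilde] += e               # update rule for v(w~)
    return V, theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V = rng.normal(scale=0.1, size=(6, 4))
    theta = np.zeros((6, 4))
    cbow_neg_step(V, theta, context=[1, 2, 3], target=0, negatives=[4, 5])
    print(theta[0], theta[4])
```

The accumulator `e` is built from the old value of each $\theta^u$, so both updates use gradients evaluated at the same point.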
The pseudocode is as follows: