NCE Loss (Negative Sampling)

The DSSM loss function: first, 1 positive example and 5 negative examples are scored and passed through a softmax:
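
The formula image from the original post is missing here. As a sketch of the usual DSSM formulation (my notation: R(Q, D) is the cosine similarity between query Q and document D, \gamma a smoothing factor, and \mathbf{D} the set containing the clicked document D^+ plus the sampled negatives), the posterior is

P(D^+ \mid Q) = \frac{\exp\left(\gamma \, R(Q, D^+)\right)}{\sum_{D' \in \mathbf{D}} \exp\left(\gamma \, R(Q, D')\right)}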

Finally, the cross-entropy loss is applied:
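
The corresponding image is also missing; under the same assumptions as above, the loss is the negative log-likelihood of the clicked documents over all training pairs:

L(\Lambda) = -\log \prod_{(Q, D^+)} P(D^+ \mid Q) = -\sum_{(Q, D^+)} \log P(D^+ \mid Q)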

The Word2Vec loss function: the input word's vector is dot-multiplied with the output ("decision-boundary") vector of the predicted word (or of a negative sample), the result is passed through a sigmoid, and then through the cross-entropy loss.
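
Written out as a sketch (this follows the negative-sampling objective in the word2vec paper; v_{w_I} is the input word's vector, v'_{w} the output or "decision-boundary" vector, \sigma the sigmoid, and the k negatives w_i are drawn from a noise distribution P_n(w)), the quantity maximized per training pair is

\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]

and its negative is the binary cross-entropy over one positive pair (label 1) and k negative pairs (label 0).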

 

During word-vector training, the loss function used is NCE or negative sampling rather than the regular softmax. In the book Learning TensorFlow, the author writes: "but it is sufficient to think of it (NCE) as a sort of efficient approximation to the ordinary softmax function used in classification tasks". So NCE is an approximation of softmax, but why make this approximation instead of using softmax directly? An answer from a Stack Exchange user addresses this; it is quoted below.

The answer: because when there are too many classes, computing the softmax is too expensive.

Original link: https://stats.stackexchange.com/questions/244616/how-sampling-works-in-word2vec-can-someone-please-make-me-understand-nce-and-ne/245452#245452

There are some issues with learning the word vectors using a "standard" neural network. In this approach, the word vectors are learned while the network learns to predict the next word given a window of words (the input of the network).

Predicting the next word is like predicting a class. That is, such a network is just a "standard" multinomial (multi-class) classifier, and it must have as many output neurons as there are classes. When the classes are actual words, the number of neurons is, well, huge.

A "standard" neural network is usually trained with a cross-entropy cost function which requires the values of the output neurons to represent probabilities - which means that the output "scores" computed by the network for each class have to be normalized, converted into actual probabilities for each class. This normalization step is achieved by means of the softmax function. Softmax is very costly when applied to a huge output layer.
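
To make that cost concrete, here is a minimal NumPy sketch (the vocabulary size, embedding dimension, and number of negatives are illustrative assumptions, not values from the post): the full softmax has to score every word in the vocabulary just to normalize, while a sampled objective scores only the positive word plus a handful of negatives.

import numpy as np

V, d, k = 100_000, 300, 5        # vocab size, embedding dim, number of negatives (illustrative)
h = np.random.randn(d)           # hidden/context representation for one training example
W_out = np.random.randn(V, d)    # one output vector per vocabulary word

# Full softmax: V dot products plus a normalization over all V scores -> O(V * d) per example.
logits = W_out @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Sampled (NCE / negative-sampling) style: 1 positive + k negatives -> O(k * d) per example.
sampled_ids = np.random.choice(V, size=1 + k, replace=False)
sampled_logits = W_out[sampled_ids] @ h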

The (a) solution

In order to deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced by [A] (reformulated by [B]) and then used in [C], [D], and [E] to learn word embeddings from unlabelled natural language text.

The basic idea is to convert a multinomial classification problem (as is the problem of predicting the next word) into a binary classification problem. That is, instead of using softmax to estimate a true probability distribution of the output word, binary logistic regression (binary classification) is used.

For each training sample, the enhanced (optimized) classifier is fed a true pair (a center word and another word that appears in its context) and k randomly corrupted pairs (consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from corrupted ones, the classifier will ultimately learn the word vectors.

This is important: instead of predicting the next word (the "standard" training technique), the optimized classifier simply predicts whether a pair of words is good or bad.
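
A minimal NumPy sketch of this binary "good pair vs. corrupted pair" update (the function name, learning rate, and toy sizes below are my own illustrative choices, not the actual word2vec implementation):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(W_in, W_out, center, context, neg_ids, lr=0.025):
    """One SGD step: the (center, context) pair gets label 1, the corrupted pairs label 0."""
    v = W_in[center]                        # input vector of the center word
    grad_v = np.zeros_like(v)
    for wid, label in [(context, 1.0)] + [(n, 0.0) for n in neg_ids]:
        u = W_out[wid]                      # output vector of the paired word
        g = sigmoid(v @ u) - label          # gradient of binary cross-entropy w.r.t. the score
        grad_v += g * u
        W_out[wid] -= lr * g * v
    W_in[center] -= lr * grad_v

# Toy usage with random data (all sizes are illustrative).
rng = np.random.default_rng(0)
V, d = 1000, 50
W_in = rng.normal(scale=0.1, size=(V, d))
W_out = rng.normal(scale=0.1, size=(V, d))
negative_sampling_step(W_in, W_out, center=3, context=17, neg_ids=rng.integers(0, V, size=5))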

Word2Vec slightly customizes the process and calls it negative sampling. In Word2Vec, the words for the negative samples (used for the corrupted pairs) are drawn from a specially designed distribution, which favours less frequent words to be drawn more often.
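
Concretely, in the released word2vec implementation the noise distribution is the unigram distribution raised to the 3/4 power, which gives rare words a larger share than their raw frequency would; a small sketch (the toy counts are illustrative):

import numpy as np

def noise_distribution(word_counts, power=0.75):
    """Unigram distribution raised to the 3/4 power, boosting rare words relative to raw frequency."""
    counts = np.asarray(word_counts, dtype=np.float64)
    probs = counts ** power
    return probs / probs.sum()

counts = [100, 10, 1]                 # toy corpus counts (illustrative)
print(noise_distribution(counts))     # roughly [0.83, 0.15, 0.03] vs. raw unigram [0.90, 0.09, 0.01]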

References

[A] Smith, N. A., and Eisner, J. (2005). Contrastive estimation: Training log-linear models on unlabeled data.

[B] Gutmann, M., and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.

[C] Collobert, R., and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning.

[D] Mnih, A., and Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language models.

[E] Mnih, A., and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation.
