caffe hinge_loss layer

The hinge loss function (Hinge Loss)

From caffe study (3), on activation functions and loss functions:

After summation, this becomes the hinge loss.
   
This is exactly the loss used by SVMs. The hinge loss is a refinement of the 0-1 loss in two respects: first, for $t \cdot y$ in $[0, 1]$ the penalty is no longer "hard" but "soft"; second, on $(-\infty, 0]$ the loss is no longer a fixed 1, but a linear function that penalizes misclassifications in proportion to how wrong they are.

For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

$$\ell(y) = \max(0,\, 1 - t \cdot y)$$

Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. E.g., in linear SVMs, $y = \mathbf{w} \cdot \mathbf{x} + b$, where $(\mathbf{w}, b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the point to classify.

It can be seen that when t and y have the same sign (meaning y predicts the right class) and $|y| \ge 1$, the hinge loss $\ell(y) = 0$; but when they have opposite sign, $\ell(y)$ increases linearly with y (one-sided error).

   

From <http://en.wikipedia.org/wiki/Hinge_loss>

Figure: Plot of hinge loss (blue) vs. zero-one loss (misclassification, green: y < 0) for t = 1 and variable y. Note that the hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.
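To make the hard-vs-soft distinction concrete, here is a minimal standalone C++ sketch (not Caffe code; the function names are mine) comparing the two losses at a few scores for t = 1:

```cpp
#include <algorithm>
#include <cstdio>

// Binary hinge loss: t is the intended output (+1 or -1), y is the raw
// classifier score (not the predicted class label).
double hinge_loss(double t, double y) {
  return std::max(0.0, 1.0 - t * y);
}

// 0-1 loss for comparison: 1 on a misclassification, 0 otherwise.
double zero_one_loss(double t, double y) {
  return (t * y > 0.0) ? 0.0 : 1.0;
}

int main() {
  const double t = 1.0;  // intended output
  // Scores inside the margin (0 < y < 1) count as correct under the 0-1
  // loss but are still penalized (softly, linearly) by the hinge loss.
  for (double y : {-2.0, -0.5, 0.5, 1.0, 2.0}) {
    std::printf("y = %+.1f  hinge = %.2f  zero-one = %.0f\n",
                y, hinge_loss(t, y), zero_one_loss(t, y));
  }
  return 0;
}
```

At y = 0.5 the 0-1 loss is already 0 while the hinge loss still charges 0.5: this is precisely the margin penalty described in the figure caption.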

   


   

   

In the paper Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, the objective is

$$\min_{w} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{(x, y) \in S} \max\{0,\; 1 - y \langle w, x \rangle\}$$

Here the first term is viewed as the regularization part and the second as the error part; compare this with Andrew Ng's lecture notes on SVMs.

Without regularization, only the average hinge loss is minimized:

$$\min_{w} \; \frac{1}{m} \sum_{(x, y) \in S} \max\{0,\; 1 - y \langle w, x \rangle\}$$

With regularization, the $\frac{\lambda}{2} \|w\|^2$ term is added back, giving the full Pegasos objective above.
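Pegasos minimizes this objective by stochastic sub-gradient descent. A minimal sketch of one step, assuming a single example per iteration and dense vectors (the helper name pegasos_step is mine, not from the paper):

```cpp
#include <cstddef>
#include <vector>

// One Pegasos sub-gradient step at iteration t (1-based) for a single
// example (x, y), y in {-1, +1}, with step size eta_t = 1 / (lambda * t).
// Sketch only: the paper also covers mini-batches and an optional
// projection of w onto the ball of radius 1/sqrt(lambda).
void pegasos_step(std::vector<double>& w, const std::vector<double>& x,
                  double y, double lambda, int t) {
  const double eta = 1.0 / (lambda * t);

  // Margin y * <w, x>: decides whether the hinge term contributes.
  double margin = 0.0;
  for (std::size_t i = 0; i < w.size(); ++i) margin += w[i] * x[i];
  margin *= y;

  // Sub-gradient of the regularization term: shrink w.
  for (double& wi : w) wi *= (1.0 - eta * lambda);

  // Sub-gradient of the hinge term, active only when the margin is < 1.
  if (margin < 1.0) {
    for (std::size_t i = 0; i < w.size(); ++i) w[i] += eta * y * x[i];
  }
}
```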


Input:

Shape: $(N \times C \times H \times W)$. The predictions $t$ are the predicted scores for each of the $K = CHW$ classes. (Note: $K = CHW$ means the network design need not vectorize the predictions; it suffices that the flattened element count equals the number of classes.) In an SVM, $t$ is the result of the inner product $X^T W$ of the $D$-dimensional features $X$ with the learned hyperplane parameters $W$.
Thus, a network consisting of nothing but a fully connected layer + hinge loss, with no other learnable parameters, is equivalent to an SVM.

Labels:

The labels $l$ are integer values $l_n \in \{0, 1, \ldots, K - 1\}$ indicating the correct class among the $K$ classes; shape $(N \times 1 \times 1 \times 1)$.

Output:

Shape: $(1 \times 1 \times 1 \times 1)$.
Loss computation, for the $L^p$ norm (default is $p = 1$, the L1 norm; the L2 norm, as in L2-SVM, is also implemented):

$$E = \frac{1}{N} \sum_{n=1}^N \sum_{k=1}^K \left[\max(0,\; 1 - \delta\{l_n = k\}\, t_{nk})\right]^p$$

where $\delta\{l_n = k\}$ is $1$ if $l_n = k$ and $-1$ otherwise.

Use cases:

Used for one-of-many classification, analogous to an SVM.


From Caffe's HingeLossLayer documentation:

Computes the hinge loss for a one-of-many classification task.

Parameters
bottom: input Blob vector (length 2)
  1. $ (N \times C \times H \times W) $ the predictions $ t $, a Blob with values in $ [-\infty, +\infty] $ indicating the predicted score for each of the $ K = CHW $ classes. In an SVM, $ t $ is the result of taking the inner product $ X^T W $ of the D-dimensional features $ X \in \mathcal{R}^{D \times N} $ and the learned hyperplane parameters $ W \in \mathcal{R}^{D \times K} $, so a Net with just an InnerProductLayer (with num_output = $ K $) providing predictions to a HingeLossLayer and no other learnable parameters or losses is equivalent to an SVM.
  2. $ (N \times 1 \times 1 \times 1) $ the labels $ l $, an integer-valued Blob with values $ l_n \in [0, 1, 2, ..., K - 1] $ indicating the correct class label among the $ K $ classes
top: output Blob vector (length 1)
  1. $ (1 \times 1 \times 1 \times 1) $ the computed hinge loss: $ E = \frac{1}{N} \sum\limits_{n=1}^N \sum\limits_{k=1}^K [\max(0, 1 - \delta\{l_n = k\} t_{nk})] ^ p $, for the $ L^p $ norm (defaults to $ p = 1 $, the L1 norm; L2 norm, as in L2-SVM, is also available), and $ \delta\{\mathrm{condition}\} = \left\{ \begin{array}{lr} 1 & \mbox{if condition} \\ -1 & \mbox{otherwise} \end{array} \right. $

In an SVM, $ t \in \mathcal{R}^{N \times K} $ is the result of taking the inner product $ X^T W $ of the features $ X \in \mathcal{R}^{D \times N} $ and the learned hyperplane parameters $ W \in \mathcal{R}^{D \times K} $. So, a Net with just an InnerProductLayer (with num_output = $K$) providing predictions to a HingeLossLayer is equivalent to an SVM (assuming it has no other learnable parameters outside the InnerProductLayer and no other losses outside the HingeLossLayer).
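As a concrete check of the formula above, here is a self-contained C++ sketch of the forward computation on flat arrays (it follows the documented math directly, not Caffe's actual Blob API; the names are mine):

```cpp
#include <algorithm>
#include <vector>

// Multiclass hinge loss over N samples and K classes.
// t[n * K + k] is the predicted score of class k for sample n,
// label[n] in [0, K-1] is the correct class, and p is 1 (L1) or 2 (L2).
double hinge_loss_forward(const std::vector<double>& t,
                          const std::vector<int>& label,
                          int N, int K, int p) {
  double loss = 0.0;
  for (int n = 0; n < N; ++n) {
    for (int k = 0; k < K; ++k) {
      // delta{l_n = k} is +1 for the true class and -1 for all others.
      const double delta = (label[n] == k) ? 1.0 : -1.0;
      const double m = std::max(0.0, 1.0 - delta * t[n * K + k]);
      loss += (p == 1) ? m : m * m;
    }
  }
  return loss / N;
}
```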



Computes the hinge loss error gradient w.r.t. the predictions.

Gradients cannot be computed with respect to the label inputs (bottom[1]), so this method ignores bottom[1] and requires !propagate_down[1], crashing if propagate_down[1] is set.

Parameters
top: output Blob vector (length 1), providing the error gradient with respect to the outputs
  1. $ (1 \times 1 \times 1 \times 1) $ This Blob's diff will simply contain the loss_weight* $ \lambda $, as $ \lambda $ is the coefficient of this layer's output $\ell_i$ in the overall Net loss $ E = \lambda_i \ell_i + \mbox{other loss terms}$; hence $ \frac{\partial E}{\partial \ell_i} = \lambda_i $. (*Assuming that this top Blob is not used as a bottom (input) by any other layer of the Net.)
propagate_down: see Layer::Backward. propagate_down[1] must be false as we can't compute gradients with respect to the labels.
bottom: input Blob vector (length 2)
  1. $ (N \times C \times H \times W) $ the predictions $t$; Backward computes diff $ \frac{\partial E}{\partial t} $
  2. $ (N \times 1 \times 1 \times 1) $ the labels – ignored as we can't compute their error gradients
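For the L1 norm, differentiating the loss above gives $\frac{\partial E}{\partial t_{nk}} = -\frac{\lambda}{N}\, \delta\{l_n = k\}$ wherever the hinge is active and $0$ elsewhere. A sketch in the same flat-array style as the forward sketch (again, not Caffe's actual implementation):

```cpp
#include <vector>

// Gradient of the L1 multiclass hinge loss w.r.t. the predictions t.
// diff gets the same flat layout as t; loss_weight is the coefficient
// lambda of this loss in the overall objective. The labels themselves
// receive no gradient.
void hinge_loss_backward_l1(const std::vector<double>& t,
                            const std::vector<int>& label,
                            int N, int K, double loss_weight,
                            std::vector<double>& diff) {
  diff.assign(t.size(), 0.0);
  for (int n = 0; n < N; ++n) {
    for (int k = 0; k < K; ++k) {
      const double delta = (label[n] == k) ? 1.0 : -1.0;
      // The hinge is active only where 1 - delta * t_nk > 0.
      if (1.0 - delta * t[n * K + k] > 0.0) {
        diff[n * K + k] = -delta * loss_weight / N;
      }
    }
  }
}
```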
