Loss function: Hinge Loss (max margin)

From Wikipedia, the free encyclopedia
Figure: Plot of hinge loss (blue) vs. zero-one loss (misclassification, green: y < 0) for t = 1 and variable y. Note that the hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1] For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

\ell(y) = \max(0, 1 - t \cdot y)

Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, y = \mathbf{w} \cdot \mathbf{x} + b, where (\mathbf{w}, b) are the parameters of the hyperplane and \mathbf{x} is the point to classify.

It can be seen that when t and y have the same sign (meaning y predicts the right class) and |y| \ge 1, the hinge loss \ell(y) = 0, but when they have opposite sign, \ell(y) increases linearly with y (one-sided error).
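As a minimal sketch of this definition (the function and variable names below are illustrative, not from the article), the hinge loss for a linear SVM score can be computed as:

import numpy as np

def hinge_loss(y_score, t):
    # Binary hinge loss max(0, 1 - t*y) for a raw score y_score and a label t in {-1, +1}.
    return np.maximum(0.0, 1.0 - t * y_score)

# Example with a linear SVM score y = w.x + b (weights chosen arbitrarily for illustration)
w, b = np.array([2.0, -1.0]), 0.5
x = np.array([0.3, 0.8])
t = 1                        # intended output
y = w @ x + b                # raw decision value, not the predicted class label
print(hinge_loss(y, t))      # 0.7, since t*y = 0.3 < 1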

Extensions

While SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] there exists a "true" multiclass version of the hinge loss due to Crammer and Singer,[3] defined for a linear classifier as[4]

\ell(y) = \max(0, 1 + \max_{t \neq y} \mathbf{w}_t \mathbf{x} - \mathbf{w}_y \mathbf{x})
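A minimal sketch of this multiclass variant, assuming a weight matrix W with one row w_t per class (the function name and shapes are illustrative):

import numpy as np

def multiclass_hinge_loss(W, x, y):
    # Crammer-Singer hinge loss: max(0, 1 + max_{t != y} w_t.x - w_y.x)
    # W: (n_classes, n_features) weight matrix, x: feature vector, y: index of the true class.
    scores = W @ x                              # w_t . x for every class t
    best_other = np.max(np.delete(scores, y))   # maximum score over t != y
    return max(0.0, 1.0 + best_other - scores[y])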

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where w denotes the SVM's parameters, φ the joint feature function, and Δ the Hamming loss:

\begin{aligned}
\ell(\mathbf{y}) &= \max(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) \\
&= \max(0, \max_{\mathbf{y} \in \mathcal{Y}} \left( \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle \right) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle)
\end{aligned}
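A sketch of this margin-rescaling variant, assuming the output space (or a tractable candidate subset of it) can be enumerated and that the joint feature map phi and the task loss delta are supplied by the caller; all names here are illustrative:

import numpy as np

def structured_hinge_loss(w, x, t, candidates, phi, delta):
    # Margin-rescaled structured hinge loss:
    #   max(0, max_{y in Y} (delta(y, t) + <w, phi(x, y)>) - <w, phi(x, t)>)
    # w: parameter vector, x: input, t: true structured output,
    # candidates: iterable over the output space Y, phi: joint feature map returning a vector,
    # delta: task loss, e.g. the Hamming loss.
    true_score = w @ phi(x, t)
    augmented = max(delta(y, t) + w @ phi(x, y) for y in candidates)  # loss-augmented inference
    return max(0.0, augmented - true_score)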

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to the model parameters w of a linear SVM with score function y = \mathbf{w} \cdot \mathbf{x} that is given by

\frac{\partial \ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y < 1 \\ 0 & \text{otherwise} \end{cases}
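As a hedged sketch, this subgradient can drive a simple per-example subgradient-descent update; the step size eta and the function names are illustrative choices:

import numpy as np

def hinge_subgradient(w, x, t):
    # Subgradient of max(0, 1 - t * (w.x)) with respect to w.
    return -t * x if t * (w @ x) < 1 else np.zeros_like(w)

def sgd_step(w, x, t, eta=0.1):
    # One subgradient-descent update on a single training example.
    return w - eta * hinge_subgradient(w, x, t)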
Figure: Plot of three variants of the hinge loss as a function of z = ty: the "ordinary" variant (blue), its square (green), and the piece-wise smooth version by Rennie and Srebro (red).

However, since the derivative of the hinge loss at ty = 1 is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[5]

\ell(y) = \begin{cases} \frac{1}{2} - ty & \text{if } ty \leq 0, \\ \frac{1}{2}(1 - ty)^2 & \text{if } 0 < ty \leq 1, \\ 0 & \text{if } 1 \leq ty \end{cases}
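A direct transcription of this piece-wise definition as a scalar function (illustrative names):

def smooth_hinge(y_score, t):
    # Rennie and Srebro's piece-wise smooth hinge loss as a function of z = t*y.
    z = t * y_score
    if z <= 0:
        return 0.5 - z
    if z <= 1:
        return 0.5 * (1.0 - z) ** 2
    return 0.0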

or the quadratically smoothed

\ell(y) = \frac{1}{2\gamma} \max(0, 1 - ty)^2

suggested by Zhang.[6] The modified Huber loss is a special case of this loss function with \gamma = 2.[6]
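And a sketch of the quadratically smoothed variant (illustrative names; per the text above, gamma = 2 corresponds to the modified Huber loss):

def quad_smoothed_hinge(y_score, t, gamma=1.0):
    # Quadratically smoothed hinge loss: (1 / (2*gamma)) * max(0, 1 - t*y)^2.
    return max(0.0, 1.0 - t * y_score) ** 2 / (2.0 * gamma)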

References

  1. Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?" (PDF). Neural Computation 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.
  2. Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study" (PDF). Multiple Classifier Systems. LNCS 3541. pp. 278–285. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7.
  3. Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines" (PDF). J. Machine Learning Research 2: 265–292.
  4. Moore, Robert C.; DeNero, John (2011). "L1 and L2 regularization for multiclass hinge loss models" (PDF). Proc. Symp. on Machine Learning in Speech and Language Processing.
  5. Rennie, Jason D. M.; Srebro, Nathan (2005). "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels" (PDF). Proc. IJCAI Multidisciplinary Workshop on Advances in Preference Handling.
  6. Zhang, Tong (2004). "Solving large scale linear prediction problems using stochastic gradient descent algorithms". ICML.