Loss function: Hinge Loss (max margin)

From Wikipedia, the free encyclopedia
Figure: Plot of hinge loss (blue) vs. zero-one loss (misclassification, green: y < 0) for t = 1 and variable y. Note that the hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1] For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

\ell(y) = \max(0, 1 - t \cdot y)

Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, y = \mathbf{w} \cdot \mathbf{x} + b, where (\mathbf{w}, b) are the parameters of the hyperplane and \mathbf{x} is the point to classify.

It can be seen that when t and y have the same sign (meaning y predicts the right class) and |y| \ge 1, the hinge loss \ell(y) = 0, but when they have opposite sign, \ell(y) increases linearly with y (one-sided error).
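As a minimal sketch of this definition (the function and variable names below are illustrative, not from the article), the hinge loss for a linear SVM score can be computed as:

import numpy as np

def hinge_loss(y_score, t):
    # Binary hinge loss max(0, 1 - t*y) for a raw score y_score and a label t in {-1, +1}.
    return np.maximum(0.0, 1.0 - t * y_score)

# Example with a linear SVM score y = w.x + b (weights chosen arbitrarily for illustration)
w, b = np.array([2.0, -1.0]), 0.5
x = np.array([0.3, 0.8])
t = 1                        # intended output
y = w @ x + b                # raw decision value, not the predicted class label
print(hinge_loss(y, t))      # 0.7, since t*y = 0.3 < 1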

Extensions

While SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] there exists a "true" multiclass version of the hinge loss due to Crammer and Singer,[3] defined for a linear classifier as[4]

\ell(y) = \max(0, 1 + \max_{t \neq y} \mathbf{w}_t \mathbf{x} - \mathbf{w}_y \mathbf{x})
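A minimal sketch of this multiclass variant, assuming a weight matrix W with one row w_t per class (the function name and shapes are illustrative):

import numpy as np

def multiclass_hinge_loss(W, x, y):
    # Crammer-Singer hinge loss: max(0, 1 + max_{t != y} w_t.x - w_y.x)
    # W: (n_classes, n_features) weight matrix, x: feature vector, y: index of the true class.
    scores = W @ x                              # w_t . x for every class t
    best_other = np.max(np.delete(scores, y))   # maximum score over t != y
    return max(0.0, 1.0 + best_other - scores[y])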

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where w denotes the SVM's parameters, φ the joint feature function, and Δ the Hamming loss:

\begin{aligned}
\ell(\mathbf{y}) &= \max(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) \\
&= \max(0, \max_{\mathbf{y} \in \mathcal{Y}} \left( \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle \right) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle)
\end{aligned}
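A sketch of this margin-rescaling variant, assuming the output space (or a tractable candidate subset of it) can be enumerated and that the joint feature map phi and the task loss delta are supplied by the caller; all names here are illustrative:

import numpy as np

def structured_hinge_loss(w, x, t, candidates, phi, delta):
    # Margin-rescaled structured hinge loss:
    #   max(0, max_{y in Y} (delta(y, t) + <w, phi(x, y)>) - <w, phi(x, t)>)
    # w: parameter vector, x: input, t: true structured output,
    # candidates: iterable over the output space Y, phi: joint feature map returning a vector,
    # delta: task loss, e.g. the Hamming loss.
    true_score = w @ phi(x, t)
    augmented = max(delta(y, t) + w @ phi(x, y) for y in candidates)  # loss-augmented inference
    return max(0.0, augmented - true_score)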

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to the model parameters w of a linear SVM with score function y = \mathbf{w} \cdot \mathbf{x} that is given by

\frac{\partial \ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y < 1 \\ 0 & \text{otherwise} \end{cases}
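As a hedged sketch, this subgradient can drive a simple per-example subgradient-descent update; the step size eta and the function names are illustrative choices:

import numpy as np

def hinge_subgradient(w, x, t):
    # Subgradient of max(0, 1 - t * (w.x)) with respect to w.
    return -t * x if t * (w @ x) < 1 else np.zeros_like(w)

def sgd_step(w, x, t, eta=0.1):
    # One subgradient-descent update on a single training example.
    return w - eta * hinge_subgradient(w, x, t)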
Figure: Plot of three variants of the hinge loss as a function of z = ty: the "ordinary" variant (blue), its square (green), and the piece-wise smooth version by Rennie and Srebro (red).

However, since the derivative of the hinge loss at ty = 1 is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[5]

\ell(y) = \begin{cases} \frac{1}{2} - ty & \text{if } ty \leq 0, \\ \frac{1}{2}(1 - ty)^2 & \text{if } 0 < ty \leq 1, \\ 0 & \text{if } 1 \leq ty \end{cases}
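A direct transcription of this piece-wise definition as a scalar function (illustrative names):

def smooth_hinge(y_score, t):
    # Rennie and Srebro's piece-wise smooth hinge loss as a function of z = t*y.
    z = t * y_score
    if z <= 0:
        return 0.5 - z
    if z <= 1:
        return 0.5 * (1.0 - z) ** 2
    return 0.0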

or the quadratically smoothed

\ell(y) = \frac{1}{2\gamma} \max(0, 1 - ty)^2

suggested by Zhang.[6] The modified Huber loss is a special case of this loss function with \gamma = 2.[6]
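And a sketch of the quadratically smoothed variant (illustrative names; per the text above, gamma = 2 corresponds to the modified Huber loss):

def quad_smoothed_hinge(y_score, t, gamma=1.0):
    # Quadratically smoothed hinge loss: (1 / (2*gamma)) * max(0, 1 - t*y)^2.
    return max(0.0, 1.0 - t * y_score) ** 2 / (2.0 * gamma)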

References

  1. Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?" (PDF). Neural Computation 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.
  2. Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study" (PDF). Multiple Classifier Systems. LNCS 3541. pp. 278–285. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7.
  3. Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines" (PDF). J. Machine Learning Research 2: 265–292.
  4. Moore, Robert C.; DeNero, John (2011). "L1 and L2 regularization for multiclass hinge loss models" (PDF). Proc. Symp. on Machine Learning in Speech and Language Processing.
  5. Rennie, Jason D. M.; Srebro, Nathan (2005). "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels" (PDF). Proc. IJCAI Multidisciplinary Workshop on Advances in Preference Handling.
  6. Zhang, Tong (2004). "Solving large scale linear prediction problems using stochastic gradient descent algorithms". ICML.