Surrogate Loss Functions in Machine Learning

转载 2015年11月19日 18:33:34

TL; DR These are some notes on calibration of surrogate loss functions in the context of machine learning. But mostly it is an excuse to post some images I made.

In the binary-class classification setting we are given n training samples {(X1,Y1),,(Xn,Yn)}, where Xi belongs to some sample space X, usually Rp but for the purpose of this post we can keep i abstract, and yi{1,1} is an integer representing the class label.

We are also given a loss function :{1,1}×{1,1}R that measures the error of a given prediction. The value of the loss function  at an arbitrary point (y,y^) is interpreted as the cost incurred by predicting y^ when the true label is y. In classification this function is often the zero-one loss, that is, (y,y^) is zero when y=y^ and one otherwise.

The goal is to find a function h:X[k], the classifier, with the smallest expected loss on a new sample. In other words, we seek to find a function h that minimizes the expected -risk, given by


In theory, we could directly minimize the -risk and we would have the optimal classifier, also known as Bayes predictor. However, there are several problems associated with this approach. One is that the probability distribution of X×Y is unknown, thus computing the exact expected value is not feasible. It must be approximated by the empirical risk. Another issue is that this quantity is difficult to optimize because the function  is discontinuous. Take for example a problem in which X=R2,k=2, and we seek to find the linear function f(X)=sign(Xw),wR2 and that minimizes the -risk. As a function of the parameter w this function looks something like

loss as function of w

This function is discontinuous with large, flat regions and is thus extremely hard to optimize using gradient-based methods. For this reason it is usual to consider a proxy to the loss called a surrogate loss function. For computational reasons this is usually convex function Ψ:RR+. An example of such surrogate loss functions is the hinge lossΨ(t)=max(1t,0), which is the loss used by Support Vector Machines (SVMs). Another example is the logistic loss,Ψ(t)=1/(1+exp(t)), used by the logistic regression model. If we consider the logistic loss, minimizing the Ψ-risk, given by EX×Y[Ψ(Y,f(X))], of the function f(X)=Xw becomes a much more more tractable optimization problem:

In short, we have replaced the -risk which is computationally difficult to optimize with the Ψ-risk which has more advantageous properties. A natural questions to ask is how much have we lost by this change. The property of whether minimizing the Ψ-risk leads to a function that also minimizes the -risk is often referred to as consistency or calibration. For a more formal definition see [1] and [2]. This property will depend on the surrogate function Ψ: for some functions Ψ it will be verified the consistency property and for some not. One of the most useful characterizations was given in [1] and states that if Ψ is convex then it is consistent if and only if it is differentiable at zero and Ψ(0)<0. This includes most of the commonly used surrogate loss functions, including hinge, logistic regression and Huber loss functions.

  1. P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity , Classification , and Risk Bounds,” J. Am. Stat. Assoc., pp. 1–36, 2003. 

  2. A. Tewari and P. L. Bartlett, “On the Consistency of Multiclass Classification Methods,” J. Mach. Learn. Res., vol. 8, pp. 1007–1025, 2007. 



Andrew NG 《machine learning》week 5,class1 —Cost functions and Backpropagation

Andrew NG 《machine learning》week 5,class1 —Cost functions and Backpropagation1.1 cost function如下图所示,...

【机器学习实战】Machine Learning in Action 代码 视频 项目案例

MachineLearning 欢迎任何人参与和完善:一个人可以走的很快,但是一群人却可以走的更远 ApacheCN - 学习机器学习群【629470233】Machine Learn...

<Machine Learning in Action >之一 k-近邻算法 C#实现手写识别

def classify0(inX, dataSet, labels, k): dataSetSize = dataSet.shape[0] diffMat = tile(inX, (...

《Machine Learning in Action》 读书笔记之三:朴素贝叶斯(naive Bayes)

1.之所以叫naive bayes,是因为该分类器基于一些naive的假设,即假设数据集的各个特征之间是独立无关的。 2.利用贝叶斯分类器进行分档分类: 1)由训练集文档生成词典: de...

Machine Learning in Action/机器学习实践:kNN算法之约会问题

因为自己比较忙,就没有太多时间去学习python的基础知识,所以简单的看了一点py的语法,就直接上书开始敲代码了,中间遇到比较多的问题,所以就加了很多注释(用的python3)from numpy i...
  • Karo_cs
  • Karo_cs
  • 2017年11月16日 16:20
  • 41

Machine Learning in action --朴素贝叶斯(已勘误)

最近在自学机器学习,应导师要求,先把《Machine Learning with R》动手刷了一遍,感觉R真不能算是一门计算机语言,感觉也就是一个功能复杂的计算器。所以这次就决定使用经典教材《Mach...

读《Machine Learning in Action》的感想

今天刚刚读完了一本叫《Machine Learning in Action》的书,历时还是挺长的,中途因为课程紧张,停顿了几个月的时间。总体感觉还不错,让我开了眼界。 不过这不是一本入门的书,主要讲...

"Machine Learning in Action" Chapter 2

This chapter covers  • The k-Nearest Neighbors classification algorithm  • Parsing and importing dat...

《Machine Learning in Action》 读书笔记之五:AdaBoost Classification

1.bagging:对给定的原始数据集,“有放回”地随机取S次数量为N的数据集,这样得到S个新dataset,学习训练得到S个classifiers,对测试集进行均等投票得到分类结果。 2.boos...
您举报文章:Surrogate Loss Functions in Machine Learning