Surrogate Loss Functions in Machine Learning

Reposted November 19, 2015

TL;DR: These are some notes on the calibration of surrogate loss functions in the context of machine learning. But mostly it is an excuse to post some images I made.

In the binary classification setting we are given $n$ training samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $X_i$ belongs to some sample space $\mathcal{X}$, usually $\mathbb{R}^p$, but for the purpose of this post we can keep it abstract, and $Y_i \in \{-1, 1\}$ is an integer representing the class label.

We are also given a loss function $\ell: \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}$ that measures the error of a given prediction. The value of the loss function $\ell$ at an arbitrary point $(y, \hat{y})$ is interpreted as the cost incurred by predicting $\hat{y}$ when the true label is $y$. In classification this function is often the zero-one loss, that is, $\ell(y, \hat{y})$ is zero when $y = \hat{y}$ and one otherwise.
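As a quick sanity check, here is a minimal NumPy sketch of the zero-one loss; the labels and predictions below are made up for the example, not data from the post:

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """Zero-one loss: 0 when the prediction matches the label, 1 otherwise."""
    return np.where(y == y_hat, 0.0, 1.0)

# Made-up labels and predictions, purely for illustration.
y_true = np.array([1, -1, 1, 1])
y_pred = np.array([1, 1, -1, 1])
print(zero_one_loss(y_true, y_pred))         # [0. 1. 1. 0.]
print(zero_one_loss(y_true, y_pred).mean())  # empirical risk: 0.5
```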

The goal is to find a function $h: \mathcal{X} \to \{-1, 1\}$, the classifier, with the smallest expected loss on a new sample. In other words, we seek to find a function $h$ that minimizes the expected $\ell$-risk, given by

$$\mathcal{R}_{\ell}(h) = \mathbb{E}_{X \times Y}\left[\ell(Y, h(X))\right].$$
In theory, we could directly minimize the $\ell$-risk and we would have the optimal classifier, also known as the Bayes predictor. However, there are several problems associated with this approach. One is that the probability distribution of $X \times Y$ is unknown, so computing the exact expected value is not feasible; it must be approximated by the empirical risk. Another issue is that this quantity is difficult to optimize because the function $\ell$ is discontinuous. Take for example a problem in which $\mathcal{X} = \mathbb{R}^2$ and we seek the linear function $f(X) = \operatorname{sign}(X^T w)$, $w \in \mathbb{R}^2$, that minimizes the $\ell$-risk. As a function of the parameter $w$, this risk looks something like

[Figure: the $\ell$-risk as a function of the parameter $w$]
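The shape of that surface is easy to reproduce. Below is a small sketch that evaluates the empirical zero-one risk of $f(X) = \operatorname{sign}(X^T w)$ on a grid of parameters; the synthetic dataset is my own stand-in, not the one behind the figure, and plotting `risk` with any contour tool gives a surface like the one above:

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic, linearly separable dataset in R^2 (a stand-in for illustration).
n = 100
X = rng.randn(n, 2)
Y = np.sign(X @ np.array([1.0, -2.0]))

def zero_one_risk(w):
    """Empirical zero-one risk of the classifier f(X) = sign(X w)."""
    return np.mean(np.sign(X @ w) != Y)

# Evaluate the risk on a grid of parameter vectors w = (w1, w2).
grid = np.linspace(-3.0, 3.0, 200)
risk = np.array([[zero_one_risk(np.array([a, b])) for a in grid] for b in grid])

# The surface is piecewise constant: it takes only a few distinct values,
# so its gradient is zero almost everywhere and gradient descent stalls.
print(np.unique(risk))
```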

This function is discontinuous, with large, flat regions, and is thus extremely hard to optimize using gradient-based methods. For this reason it is usual to consider a proxy to the loss called a surrogate loss function. For computational reasons this is usually a convex function $\Psi: \mathbb{R} \to \mathbb{R}_+$. An example of such a surrogate loss function is the hinge loss, $\Psi(t) = \max(1 - t, 0)$, which is the loss used by Support Vector Machines (SVMs). Another example is the logistic loss, $\Psi(t) = \log(1 + \exp(-t))$, used by the logistic regression model. If we consider the logistic loss, minimizing the $\Psi$-risk, given by $\mathbb{E}_{X \times Y}[\Psi(Y f(X))]$, of the function $f(X) = X^T w$ becomes a much more tractable optimization problem:

[Figure: the logistic $\Psi$-risk as a function of the parameter $w$]
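To make the contrast concrete, the sketch below minimizes the empirical $\Psi$-risk with the logistic surrogate by plain gradient descent, on the same kind of synthetic data as before; the step size and iteration count are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.RandomState(0)

# Same kind of synthetic dataset as in the previous sketch.
n = 100
X = rng.randn(n, 2)
Y = np.sign(X @ np.array([1.0, -2.0]))

def psi_risk(w):
    """Empirical Psi-risk with the logistic surrogate, Psi(t) = log(1 + exp(-t))."""
    # np.logaddexp(0, -t) computes log(1 + exp(-t)) without overflow.
    return np.mean(np.logaddexp(0.0, -Y * (X @ w)))

def psi_risk_grad(w):
    """Gradient of the empirical Psi-risk; smooth, unlike the zero-one risk."""
    margins = Y * (X @ w)
    # d/dt log(1 + exp(-t)) = -1 / (1 + exp(t))
    coef = -Y / (1.0 + np.exp(margins))
    return X.T @ coef / n

# Plain gradient descent now works because the objective is smooth and convex.
w = np.zeros(2)
for _ in range(500):
    w -= 0.5 * psi_risk_grad(w)

print("w:", w, " Psi-risk:", psi_risk(w))
print("zero-one risk:", np.mean(np.sign(X @ w) != Y))
```
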
In short, we have replaced the $\ell$-risk, which is computationally difficult to optimize, with the $\Psi$-risk, which has more advantageous properties. A natural question to ask is how much we have lost by this change. The property of whether minimizing the $\Psi$-risk leads to a function that also minimizes the $\ell$-risk is often referred to as consistency or calibration. For a more formal definition see [1] and [2]. This property depends on the surrogate function $\Psi$: for some functions $\Psi$ the consistency property holds, and for some it does not. One of the most useful characterizations was given in [1] and states that if $\Psi$ is convex, then it is consistent if and only if it is differentiable at zero and $\Psi'(0) < 0$. This includes most of the commonly used surrogate loss functions, including the hinge, logistic, and Huber losses.
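This characterization is easy to check numerically with one-sided finite differences. In the sketch below, the perceptron-style loss $\max(-t, 0)$ is my own added counterexample (convex but kinked at zero), not a loss discussed in the post:

```python
import numpy as np

# Candidate convex surrogates Psi(t).
surrogates = {
    "hinge":      lambda t: np.maximum(1.0 - t, 0.0),
    "logistic":   lambda t: np.logaddexp(0.0, -t),
    "squared":    lambda t: (1.0 - t) ** 2,
    # Perceptron-style loss: convex but kinked at zero (hypothetical counterexample).
    "perceptron": lambda t: np.maximum(-t, 0.0),
}

h = 1e-6
for name, psi in surrogates.items():
    left = (psi(0.0) - psi(-h)) / h   # one-sided derivative from the left
    right = (psi(h) - psi(0.0)) / h   # one-sided derivative from the right
    # Calibrated (for convex Psi) iff differentiable at 0 and Psi'(0) < 0.
    calibrated = np.isclose(left, right) and right < 0
    print(f"{name:>10}: Psi'(0-) = {left:+.3f}, Psi'(0+) = {right:+.3f}, "
          f"calibrated = {bool(calibrated)}")
```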

  1. P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, "Convexity, Classification, and Risk Bounds," J. Am. Stat. Assoc., pp. 1–36, 2003. 

  2. A. Tewari and P. L. Bartlett, “On the Consistency of Multiclass Classification Methods,” J. Mach. Learn. Res., vol. 8, pp. 1007–1025, 2007. 
