Surrogate Loss Functions in Machine Learning

TL;DR: These are some notes on the calibration of surrogate loss functions in the context of machine learning. But mostly it is an excuse to post some images I made.

In the binary classification setting we are given $n$ training samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $X_i$ belongs to some sample space $\mathcal{X}$, usually $\mathbb{R}^p$, although for the purpose of this post we can keep it abstract, and $Y_i \in \{-1, 1\}$ is an integer representing the class label.

We are also given a loss function $\ell: \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}$ that measures the error of a given prediction. The value of the loss function $\ell$ at an arbitrary point $(y, \hat{y})$ is interpreted as the cost incurred by predicting $\hat{y}$ when the true label is $y$. In classification this function is often the zero-one loss, that is, $\ell(y, \hat{y})$ is zero when $y = \hat{y}$ and one otherwise.
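For concreteness, here is a minimal sketch of the zero-one loss and its average over a handful of predictions (plain NumPy; the helper name is mine, not from any particular library):

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """Zero-one loss: 0 if the prediction matches the label, 1 otherwise."""
    return np.where(y == y_hat, 0.0, 1.0)

# Average zero-one loss over a small sample of labels and predictions.
y     = np.array([ 1, -1,  1,  1, -1])
y_hat = np.array([ 1,  1,  1, -1, -1])
print(zero_one_loss(y, y_hat).mean())  # 0.4: two mistakes out of five
```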

The goal is to find a function $h: \mathcal{X} \to \{-1, 1\}$, the classifier, with the smallest expected loss on a new sample. In other words, we seek a function $h$ that minimizes the expected $\ell$-risk, given by

$$\mathcal{R}_{\ell}(h) = \mathbb{E}_{X \times Y}\left[\ell(Y, h(X))\right]$$

In theory, we could directly minimize the $\ell$-risk and we would have the optimal classifier, also known as the Bayes predictor. However, there are several problems with this approach. One is that the probability distribution of $X \times Y$ is unknown, so computing the exact expected value is not feasible; it must be approximated by the empirical risk. Another issue is that this quantity is difficult to optimize because the function $\ell$ is discontinuous. Take for example a problem with two classes in which $\mathcal{X} = \mathbb{R}^2$ and we seek the linear function $f(X) = \operatorname{sign}(X w)$, $w \in \mathbb{R}^2$, that minimizes the $\ell$-risk. As a function of the parameter $w$, this risk looks something like

[Figure: the zero-one loss as a function of the parameter $w$]
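As an illustration (not the plot from the original post), here is a sketch of how such a surface could be computed: it evaluates the empirical zero-one risk of $f(X) = \operatorname{sign}(X w)$ on a grid of parameters $w$ for some synthetic data. The data-generating choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data in R^2.
n = 100
X = rng.normal(size=(n, 2))
Y = np.sign(X @ np.array([1.0, -2.0]))

def empirical_zero_one_risk(w, X, Y):
    """Average zero-one loss of the linear classifier sign(X w)."""
    return np.mean(np.sign(X @ w) != Y)

# Evaluate the risk on a grid of parameter values.
w1, w2 = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
risk = np.array([
    empirical_zero_one_risk(np.array([a, b]), X, Y)
    for a, b in zip(w1.ravel(), w2.ravel())
]).reshape(w1.shape)
# `risk` is piecewise constant in w: plotting it (e.g. with matplotlib's
# plot_surface) shows the large flat regions and the jumps between them.
```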

This function is discontinuous, with large flat regions, and is thus extremely hard to optimize using gradient-based methods. For this reason it is usual to consider a proxy to the loss called a surrogate loss function. For computational reasons this is usually a convex function $\Psi: \mathbb{R} \to \mathbb{R}_+$. An example of such a surrogate loss function is the hinge loss, $\Psi(t) = \max(1 - t, 0)$, which is the loss used by Support Vector Machines (SVMs). Another example is the logistic loss, $\Psi(t) = \log(1 + \exp(-t))$, used by the logistic regression model. If we consider the logistic loss, minimizing the $\Psi$-risk, given by $\mathbb{E}_{X \times Y}[\Psi(Y f(X))]$, of the function $f(X) = X w$ becomes a much more tractable optimization problem.
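Continuing with the synthetic example, here is a minimal sketch (plain NumPy; the step size and iteration count are arbitrary choices of mine) of minimizing the empirical logistic $\Psi$-risk of a linear model $f(X) = X w$ with plain gradient descent, something that is not possible for the zero-one risk itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same kind of synthetic data as above: two classes in R^2.
n = 100
X = rng.normal(size=(n, 2))
Y = np.sign(X @ np.array([1.0, -2.0]))

def logistic(t):
    """Logistic surrogate loss: Psi(t) = log(1 + exp(-t))."""
    return np.log1p(np.exp(-t))

def empirical_logistic_risk(w, X, Y):
    """Empirical Psi-risk of f(X) = X w: the mean of Psi(Y * (X w))."""
    return np.mean(logistic(Y * (X @ w)))

def gradient(w, X, Y):
    """Gradient of the empirical logistic risk with respect to w."""
    margins = Y * (X @ w)
    # d/dt log(1 + exp(-t)) = -1 / (1 + exp(t))
    coeff = -Y / (1.0 + np.exp(margins))
    return X.T @ coeff / len(Y)

# The surrogate risk is smooth and convex, so gradient descent works.
w = np.zeros(2)
for _ in range(500):
    w -= 0.5 * gradient(w, X, Y)

print(empirical_logistic_risk(w, X, Y))
print(np.mean(np.sign(X @ w) != Y))  # zero-one risk of the resulting classifier
```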

In short, we have replaced the $\ell$-risk, which is computationally difficult to optimize, with the $\Psi$-risk, which has more advantageous properties. A natural question to ask is how much we have lost by this change. The property of whether minimizing the $\Psi$-risk leads to a function that also minimizes the $\ell$-risk is often referred to as consistency or calibration. For a more formal definition see [1] and [2]. This property depends on the surrogate function $\Psi$: some surrogate functions satisfy it and others do not. One of the most useful characterizations was given in [1] and states that if $\Psi$ is convex, then it is consistent if and only if it is differentiable at zero and $\Psi'(0) < 0$. This includes most of the commonly used surrogate loss functions, including the hinge, logistic, and Huber loss functions.
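As a quick check of this characterization, the sketch below estimates $\Psi'(0)$ with a central finite difference for a few common convex surrogates (hinge, logistic, and squared loss; the helper names are mine): all of them give a negative value, consistent with these losses being calibrated. For the logistic loss this can also be seen directly, since $\Psi'(t) = -1/(1 + e^{t})$ and hence $\Psi'(0) = -1/2$.

```python
import numpy as np

def hinge(t):
    return np.maximum(1.0 - t, 0.0)

def logistic(t):
    return np.log1p(np.exp(-t))

def squared(t):
    return (1.0 - t) ** 2

def derivative_at_zero(psi, eps=1e-6):
    """Central finite-difference estimate of Psi'(0)."""
    return (psi(eps) - psi(-eps)) / (2 * eps)

for name, psi in [("hinge", hinge), ("logistic", logistic), ("squared", squared)]:
    print(name, derivative_at_zero(psi))  # -1, -0.5, -2: all negative
```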


  1. P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, Classification, and Risk Bounds,” J. Am. Stat. Assoc., pp. 1–36, 2003.

  2. A. Tewari and P. L. Bartlett, “On the Consistency of Multiclass Classification Methods,” J. Mach. Learn. Res., vol. 8, pp. 1007–1025, 2007. 
