TL;DR: These are some notes on the calibration of surrogate loss functions in the context of machine learning. But mostly it is an excuse to post some images I made.
In the binary classification setting we are given $n$ training samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $X_i$ belongs to some sample space $\mathcal{X}$, usually $\mathbb{R}^p$, but for the purpose of this post we can keep it abstract, and $Y_i \in \{-1, 1\}$ is an integer representing the class label.
We are also given a loss function $\ell: \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}$ that measures the error of a given prediction. The value of the loss function $\ell$ at an arbitrary point $(y, \hat{y})$ is interpreted as the cost incurred by predicting $\hat{y}$ when the true label is $y$. In classification this function is often the zero-one loss, that is, $\ell(y, \hat{y})$ is zero when $y = \hat{y}$ and one otherwise.
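To fix notation, here is a minimal sketch (function names are my own, purely illustrative) of the zero-one loss and the corresponding empirical risk of a classifier on a sample:

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """Zero-one loss: 0 if the prediction matches the label, 1 otherwise."""
    return np.where(y == y_hat, 0.0, 1.0)

def empirical_risk(h, X, y):
    """Average zero-one loss of the classifier h over the sample (X, y)."""
    return np.mean(zero_one_loss(y, h(X)))
```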
The goal is to find a function $h: \mathcal{X} \to \{-1, 1\}$, the classifier, with the smallest expected loss on a new sample. In other words, we seek a function $h$ that minimizes the expected $\ell$-risk, given by

$$\mathcal{R}_{\ell}(h) = \mathbb{E}_{X \times Y}[\ell(Y, h(X))]~.$$
In theory, we could directly minimize the $\ell$-risk and we would obtain the optimal classifier, also known as the Bayes predictor. However, there are several problems with this approach. One is that the probability distribution of $X \times Y$ is unknown, so computing the exact expected value is not feasible; it must be approximated by the empirical risk. Another is that this quantity is difficult to optimize because the function $\ell$ is discontinuous. Take for example a problem in which $\mathcal{X} = \mathbb{R}^2$ and we seek the linear function $f(X) = \text{sign}(Xw)$, $w \in \mathbb{R}^2$, that minimizes the $\ell$-risk. As a function of the parameter $w$, this risk looks something like
![loss as function of w](https://i-blog.csdnimg.cn/blog_migrate/18395b7ee707b07387ccc41aad8e257d.png)
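A rough sketch of how a surface like this can be computed; the data is synthetic, and the "true" weight vector, grid range, and resolution are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 2)                      # synthetic samples in R^2
y = np.sign(X @ np.array([1.0, -2.0]))     # labels from an arbitrary "true" w

# Empirical zero-one risk of f(X) = sign(X w) over a grid of parameters w.
w1, w2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
risk = np.zeros_like(w1)
for i in range(w1.shape[0]):
    for j in range(w1.shape[1]):
        pred = np.sign(X @ np.array([w1[i, j], w2[i, j]]))
        risk[i, j] = np.mean(pred != y)    # piecewise constant in w
```

The resulting `risk` array is piecewise constant, which is exactly the flat, discontinuous landscape shown in the figure.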
This function is discontinuous, with large flat regions, and is thus extremely hard to optimize using gradient-based methods. For this reason it is usual to consider a proxy to the loss called a surrogate loss function. For computational reasons this is usually a convex function $\Psi: \mathbb{R} \to \mathbb{R}_+$. An example of such a surrogate loss function is the hinge loss, $\Psi(t) = \max(1 - t, 0)$, which is the loss used by Support Vector Machines (SVMs). Another example is the logistic loss, $\Psi(t) = \log(1 + \exp(-t))$, used by the logistic regression model. If we consider the logistic loss, minimizing the $\Psi$-risk, given by $\mathbb{E}_{X \times Y}[\Psi(Y f(X))]$, of the function $f(X) = Xw$ becomes a much more tractable optimization problem:
![](https://i-blog.csdnimg.cn/blog_migrate/bf69784bd6b17df425d59e900df37916.png)
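Continuing the synthetic example from the previous sketch (reusing its `X` and `y`), the empirical logistic $\Psi$-risk is smooth and convex in $w$, so it can be handed directly to a gradient-based optimizer. A minimal sketch with scipy:

```python
import numpy as np
from scipy.optimize import minimize

def logistic_risk(w, X, y):
    """Empirical Psi-risk for the logistic loss Psi(t) = log(1 + exp(-t))."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0, -margins))  # numerically stable log(1 + e^{-t})

# Smooth and convex in w, so a quasi-Newton method converges without trouble.
result = minimize(logistic_risk, x0=np.zeros(2), args=(X, y), method="L-BFGS-B")
w_hat = result.x
```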
In short, we have replaced the $\ell$-risk, which is computationally difficult to optimize, with the $\Psi$-risk, which has more advantageous properties. A natural question to ask is how much we have lost by this change. The property of whether minimizing the $\Psi$-risk leads to a function that also minimizes the $\ell$-risk is often referred to as consistency or calibration. For a more formal definition see [1] and [2]. This property depends on the surrogate function $\Psi$: for some functions $\Psi$ the consistency property holds and for others it does not. One of the most useful characterizations was given in [1] and states that if $\Psi$ is convex then it is consistent if and only if it is differentiable at zero and $\Psi'(0) < 0$. This includes most of the commonly used surrogate loss functions, including the hinge, logistic and Huber losses.
![](https://i-blog.csdnimg.cn/blog_migrate/853ab5d82e78ee1cb5214d36bd2a9b4f.png)
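As a small illustration of this criterion (my own sketch, not taken from the references), one can numerically estimate the slope at zero of a few convex surrogates; a strictly negative value is exactly the condition in the characterization above:

```python
import numpy as np

def slope_at_zero(psi, eps=1e-6):
    """Central finite-difference estimate of Psi'(0)."""
    return (psi(eps) - psi(-eps)) / (2 * eps)

surrogates = {
    "hinge":    lambda t: np.maximum(1 - t, 0),   # differentiable at 0, even if not everywhere
    "logistic": lambda t: np.logaddexp(0, -t),    # log(1 + exp(-t))
    "squared":  lambda t: (1 - t) ** 2,
}

for name, psi in surrogates.items():
    print(name, slope_at_zero(psi))   # all strictly negative, hence calibrated
```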