TL;DR: These are some notes on the calibration of surrogate loss functions in the context of machine learning. But mostly it is an excuse to post some images I made.
In the binary classification setting we are given $n$ training samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $X_i$ belongs to some sample space $\mathcal{X}$, usually $\mathbb{R}^p$, but for the purpose of this post we can keep it abstract, and $Y_i \in \{-1, 1\}$ is an integer representing the class label.
We are also given a loss function $\ell: \{-1, 1\} \times \{-1, 1\} \to \mathbb{R}$ that measures the error of a given prediction. The value of the loss function $\ell$ at an arbitrary point $(y, \hat{y})$ is interpreted as the cost incurred by predicting $\hat{y}$ when the true label is $y$. In classification this function is often the zero-one loss, that is, $\ell(y, \hat{y})$ is zero when $y = \hat{y}$ and one otherwise.
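To fix notation, here is a minimal sketch (function names are my own, purely illustrative) of the zero-one loss and the corresponding empirical risk of a classifier on a sample:

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """Zero-one loss: 0 if the prediction matches the label, 1 otherwise."""
    return np.where(y == y_hat, 0.0, 1.0)

def empirical_risk(h, X, y):
    """Average zero-one loss of the classifier h over the sample (X, y)."""
    return np.mean(zero_one_loss(y, h(X)))
```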
The goal is to find a function $h: \mathcal{X} \to \{-1, 1\}$, the classifier, with the smallest expected loss on a new sample. In other words, we seek a function $h$ that minimizes the expected $\ell$-risk, given by

$$\mathcal{R}_{\ell}(h) = \mathbb{E}_{X \times Y}[\ell(Y, h(X))]~.$$
In theory, we could directly minimize the $\ell$-risk and we would obtain the optimal classifier, also known as the Bayes predictor. However, there are several problems with this approach. One is that the probability distribution of $X \times Y$ is unknown, so computing the exact expected value is not feasible; it must be approximated by the empirical risk. Another is that this quantity is difficult to optimize because the function $\ell$ is discontinuous. Take for example a problem in which $\mathcal{X} = \mathbb{R}^2$ and we seek the linear function $f(X) = \text{sign}(Xw)$, $w \in \mathbb{R}^2$, that minimizes the $\ell$-risk. As a function of the parameter $w$, this risk looks something like
![loss as function of w](https://i-blog.csdnimg.cn/blog_migrate/18395b7ee707b07387ccc41aad8e257d.png)
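A rough sketch of how a surface like this can be computed; the data is synthetic, and the "true" weight vector, grid range, and resolution are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 2)                      # synthetic samples in R^2
y = np.sign(X @ np.array([1.0, -2.0]))     # labels from an arbitrary "true" w

# Empirical zero-one risk of f(X) = sign(X w) over a grid of parameters w.
w1, w2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
risk = np.zeros_like(w1)
for i in range(w1.shape[0]):
    for j in range(w1.shape[1]):
        pred = np.sign(X @ np.array([w1[i, j], w2[i, j]]))
        risk[i, j] = np.mean(pred != y)    # piecewise constant in w
```

The resulting `risk` array is piecewise constant, which is exactly the flat, discontinuous landscape shown in the figure.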
This function is discontinuous, with large flat regions, and is thus extremely hard to optimize using gradient-based methods. For this reason it is usual to consider a proxy to the loss called a surrogate loss function. For computational reasons this is usually a convex function $\Psi: \mathbb{R} \to \mathbb{R}_+$. An example of such a surrogate loss function is the hinge loss, $\Psi(t) = \max(1 - t, 0)$, which is the loss used by Support Vector Machines (SVMs). Another example is the logistic loss, $\Psi(t) = \log(1 + \exp(-t))$, used by the logistic regression model. If we consider the logistic loss, minimizing the $\Psi$-risk, given by $\mathbb{E}_{X \times Y}[\Psi(Y f(X))]$, of the function $f(X) = Xw$ becomes a much more tractable optimization problem:
![](https://i-blog.csdnimg.cn/blog_migrate/bf69784bd6b17df425d59e900df37916.png)
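Continuing the synthetic example from the previous sketch (reusing its `X` and `y`), the empirical logistic $\Psi$-risk is smooth and convex in $w$, so it can be handed directly to a gradient-based optimizer. A minimal sketch with scipy:

```python
import numpy as np
from scipy.optimize import minimize

def logistic_risk(w, X, y):
    """Empirical Psi-risk for the logistic loss Psi(t) = log(1 + exp(-t))."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0, -margins))  # numerically stable log(1 + e^{-t})

# Smooth and convex in w, so a quasi-Newton method converges without trouble.
result = minimize(logistic_risk, x0=np.zeros(2), args=(X, y), method="L-BFGS-B")
w_hat = result.x
```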
In short, we have replaced the $\ell$-risk, which is computationally difficult to optimize, with the $\Psi$-risk, which has more advantageous properties. A natural question to ask is how much we have lost by this change. The property of whether minimizing the $\Psi$-risk leads to a function that also minimizes the $\ell$-risk is often referred to as consistency or calibration. For a more formal definition see [1] and [2]. This property depends on the surrogate function $\Psi$: for some functions $\Psi$ the consistency property holds and for others it does not. One of the most useful characterizations was given in [1] and states that if $\Psi$ is convex then it is consistent if and only if it is differentiable at zero and $\Psi'(0) < 0$. This includes most of the commonly used surrogate loss functions, including the hinge, logistic and Huber losses.
![](https://i-blog.csdnimg.cn/blog_migrate/853ab5d82e78ee1cb5214d36bd2a9b4f.png)
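As a small illustration of this criterion (my own sketch, not taken from the references), one can numerically estimate the slope at zero of a few convex surrogates; a strictly negative value is exactly the condition in the characterization above:

```python
import numpy as np

def slope_at_zero(psi, eps=1e-6):
    """Central finite-difference estimate of Psi'(0)."""
    return (psi(eps) - psi(-eps)) / (2 * eps)

surrogates = {
    "hinge":    lambda t: np.maximum(1 - t, 0),   # differentiable at 0, even if not everywhere
    "logistic": lambda t: np.logaddexp(0, -t),    # log(1 + exp(-t))
    "squared":  lambda t: (1 - t) ** 2,
}

for name, psi in surrogates.items():
    print(name, slope_at_zero(psi))   # all strictly negative, hence calibrated
```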