A routine I used to be very comfortable with: after fitting all sorts of complex machine-learning models, simply call the predict function to obtain predicted probabilities (here referring, of course, to a binary outcome, since this type is by far the most common in clinical medicine).
The code is as follows:
library(caret)

## fitControl was used but not defined in the original post; a typical
## setup (classProbs and twoClassSummary are required for metric = "ROC"):
fitControl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                           summaryFunction = twoClassSummary)

## tuning grid for xgboost
gridXgboost <- expand.grid(nrounds = (1:10) * 10,
                           max_depth = 6,
                           eta = 0.1,
                           gamma = 0.1,
                           colsample_bytree = 1,
                           min_child_weight = c(0.5, 0.8, 0.9, 1),
                           subsample = c(0.3, 0.4, 0.5, 0.8))

modXgboost <- train(x = trainSetX,   # predictors (must be supplied)
                    y = trainSetY,   # binary outcome factor
                    method = "xgbTree",
                    trControl = fitControl,
                    tuneGrid = gridXgboost,
                    metric = "ROC")
plotTuningXgboost <- plot(modXgboost)   # tuning profile across the grid
## probability of the second class (the event) on the test set
PredXgboost <- predict(modXgboost, newdata = testSetX, type = "prob")[, 2]
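Swapping caret's method string (and the tuning grid) is all it takes to reuse the same scaffold with a different learner; for instance, a minimal sketch with a random forest (the mtry grid here is illustrative, not from the original post):

## same pipeline, different learner: caret's "rf" method tunes mtry
gridRf <- expand.grid(mtry = c(2, 4, 8))   # illustrative values
modRf <- train(x = trainSetX, y = trainSetY, method = "rf",
               trControl = fitControl, tuneGrid = gridRf, metric = "ROC")
PredRf <- predict(modRf, newdata = testSetX, type = "prob")[, 2]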
In this way you can build any number of models with other machine-learning methods. But the event probability that predict ultimately computes carries a different bias depending on the learning method, which makes the calibration plot look skewed. This deviation does not mean your model is bad; rather, the bias needs to be corrected before the model's performance is evaluated. How did I discover this "bug"? Mainly by reading a recent classic article, and then tracking down a conference paper devoted specifically to calibrating predicted probabilities, which deserves a careful close reading:
http://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
Abstract
We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1 yielding a characteristic sigmoid shaped distortion in the predicted probabilities. Models such as Naive Bayes, which make unrealistic independence assumptions, push probabilities toward 0 and 1. Other models such as neural nets and bagged trees do not have these biases and predict well calibrated probabilities. We experiment with two ways of correcting the biased probabilities predicted by some learning methods: Platt Scaling and Isotonic Regression. We qualitatively examine what kinds of distortions these calibration methods are suitable for and quantitatively examine how much data they need to be effective. The empirical results show that after calibration boosted trees, random forests, and SVMs predict the best probabilities.
- First, be clear about the two objects being compared: the probabilities predicted by the various models vs. the true posterior probabilities.
- Different machine-learning models have different biases: boosted trees (and stumps) push probability mass away from 0 and 1, producing a characteristic sigmoid-shaped distortion, while naive Bayes tends to push predicted probabilities toward 0 or 1; others, such as neural nets and bagged trees, produce well-calibrated probabilities.
- Two methods for correcting biased probabilities: Platt Scaling and Isotonic Regression.
- The word "calibration" here actually has two senses: correcting the predicted probabilities, and comparing them against observed probabilities to assess model performance. The latter is the more common usage; the former was new to me.
Introduction
In many applications it is important to predict well calibrated probabilities; good accuracy or area under the ROC curve are not sufficient. This paper examines the probabilities predicted by ten supervised learning algorithms: SVMs, neural nets, decision trees, memory-based learning, bagged trees, random forests, boosted trees, boosted stumps, naive bayes and logistic regression. We show how maximum margin methods such as SVMs, boosted trees, and boosted stumps tend to push predicted probabilities away from 0 and 1. This hurts the quality of the probabilities they predict and yields a characteristic sigmoid-shaped distortion in the predicted probabilities. Other methods such as naive bayes have the opposite bias and tend to push predictions closer to 0 and 1. And some learning methods such as bagged trees and neural nets have little or no bias and predict well-calibrated probabilities.
- A model that is accurate only at the binary-classification level is not enough; its predicted probabilities must also be accurate. Its performance should be reliable both for the discrete classification and on the continuous probability scale.
- The paper examines ten supervised learning methods.
- Some models push predicted probabilities toward 0 or 1, some push them away from 0 and 1, and some have little or no bias.
After examining the distortion (or lack of) characteristic to each learning method, we experiment with two calibration methods for correcting these distortions.
Platt Scaling: a method for transforming SVM outputs from [−∞, +∞] to posterior probabilities (Platt, 1999)
Isotonic Regression: the method used by Zadrozny and Elkan (2002; 2001) to calibrate predictions from boosted naive bayes, SVM, and decision tree models
Platt Scaling is most effective when the distortion in the predicted probabilities is sigmoid-shaped. Isotonic Regression is a more powerful calibration method that can correct any monotonic distortion. Unfortunately, this extra power comes at a price. A learning curve analysis shows that Isotonic Regression is more prone to overfitting, and thus performs worse than Platt Scaling, when data is scarce.
- Platt Scaling (effective when the distortion is sigmoid-shaped) and Isotonic Regression (effective for any monotonic distortion) are the two methods the paper focuses on for calibrating predicted probabilities.
- When data is scarce, Isotonic Regression is prone to overfitting.
Finally, we examine how good are the probabilities predicted by each learning method after each method’s predictions have been calibrated. Experiments with eight classification problems suggest that random forests, neural nets and bagged decision trees are the best learning methods for predicting well-calibrated probabilities prior to calibration, but after calibration the best methods are boosted trees, random forests and SVMs.
- The calibration quality of the various models differs before and after calibration: before calibration the best are random forests, neural nets, and bagged trees; after calibration, boosted trees, random forests, and SVMs.
Calibration Methods
Platt Calibration
Suppose we have a machine-learning model that outputs a raw score f(x) for each sample. Platt calibration passes this score through a sigmoid:

$$P(y=1\mid f)=\frac{1}{1+\exp(Af+B)}$$

where A and B are parameters estimated from the data by maximum likelihood, the dataset being the calibration pairs

$$\{(f_i, y_i)\},\quad y_i\in\{0,1\}$$

where $f_i$ is the model's raw score and $y_i$ the observed outcome. This is really just fitting an ordinary (logistic-type) model; in R, functions such as glm can obtain these estimates easily (see the sketch after the target transformation below).
The main questions here are: what training set should be used for this calibration model, and how can overfitting be avoided? Clearly, if we keep using the same data that trained the machine-learning model, the scores f(x) will be overly optimistic and the fitted sigmoid will be biased; the sigmoid should instead be fit on an independent calibration set (or via cross-validation).
Another thought: suppose our …
A key trick of Platt calibration is to transform the outcome variable so that the targets are no longer the hard values 0 or 1:

$$t_+=\frac{N_++1}{N_++2},\qquad t_-=\frac{1}{N_-+2}$$

where $N_+$ and $N_-$ are the numbers of positive and negative examples in the calibration set. This transformation acts as a regularizer and keeps the fitted probabilities away from exactly 0 and 1.
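A minimal sketch of this in R, assuming hypothetical objects calScore (raw model scores on a held-out calibration set) and calY (the observed 0/1 outcomes there); no dedicated package is needed, just glm:

## Platt scaling sketch: transform targets, then fit the sigmoid with glm
nPos <- sum(calY == 1)
nNeg <- sum(calY == 0)
## Platt's regularized targets instead of hard 0/1 labels
t <- ifelse(calY == 1, (nPos + 1) / (nPos + 2), 1 / (nNeg + 2))
## glm fits 1/(1 + exp(-(b0 + b1*f))), i.e. A = -b1 and B = -b0;
## quasibinomial avoids glm's warning about non-integer responses
plattFit <- glm(t ~ calScore, family = quasibinomial)
## calibrated probabilities for new raw scores `newScore` (hypothetical)
calibProb <- predict(plattFit,
                     newdata = data.frame(calScore = newScore),
                     type = "response")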
Isotonic Regression
This can be computed with R's base stats package; the output is not a set of model parameters but a sequence of values, directly giving each patient (observation) a calibrated score.
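A minimal sketch with base R's isoreg, using the same hypothetical calScore/calY objects as above; as.stepfun turns the fit into a function usable on new scores:

## isotonic-regression calibration via pool-adjacent-violators (base R)
iso <- isoreg(calScore, calY)     # fits a monotone step function to the labels
isoFun <- as.stepfun(iso)         # piecewise-constant calibration map
calibProbIso <- isoFun(newScore)  # calibrated probabilities for new scores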
Isotonic / Monotone Regression (R documentation, stat.ethz.ch)

Practical Results
Next, let's see how these two methods perform on real cases. The paper uses eight datasets (with different feature sets) and ten supervised learning methods; all eight datasets pose binary classification problems.
This figure shows boosted-tree models on the eight datasets. The top row gives histograms of the predicted probabilities, and the distributions differ markedly across datasets, indicating that the uncalibrated predicted probabilities cannot accurately reflect each patient's actual risk; most of the mass sits in the middle, away from 0 and 1. In LETTER.P1 most predicted probabilities concentrate near 0, because only 3% of the samples in that dataset are positive, making it severely imbalanced. On closer inspection, even with such a low positive rate, the histogram still drops off sharply as the predicted probability approaches 0. This tendency to pile up in the middle is what gives the calibration plots below their sigmoid shape: very few cases fall near the two ends, so the observed event rates there easily approach 0 or 1, whereas most samples fall in the middle, where a tiny change in predicted probability produces a large change in the observed positive rate. (I'm actually leaving a question open here.)
The middle row, called "reliability diagrams", fits the points with Platt's method, while the bottom row fits them with Isotonic Regression, hence the step functions. The points in both rows are identical: they are the raw, uncalibrated predicted probabilities; only the curve-fitting method differs. This is distinct from actually using Platt Scaling or Isotonic Regression to correct the raw predicted probabilities, which comes next.
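For intuition, here is a rough sketch of how such a reliability diagram can be drawn in base R (assumed inputs: pred, the raw predicted probabilities, and y, the observed 0/1 outcomes; ten equal-width bins as in the paper):

## reliability diagram sketch: observed event rate vs mean predicted probability
bins <- cut(pred, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
meanPred <- tapply(pred, bins, mean)   # mean predicted probability per bin
obsRate  <- tapply(y,    bins, mean)   # observed event rate per bin
plot(meanPred, obsRate, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean predicted probability", ylab = "Observed fraction of positives")
abline(0, 1, lty = 2)                  # perfect-calibration diagonal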
The figures show that calibration undoes the shift in probability mass caused by boosting: after calibration many more cases have predicted probabilities near 0 and 1. The reliability diagrams are closer to diagonal, and the S-shape characteristic of boosted tree predictions is gone. On each problem, transforming predictions using Platt Scaling or Isotonic Regression yields a significant improvement in the predicted probabilities, leading to much lower squared error and log-loss. One difference between Isotonic Regression and Platt Scaling is apparent in the histograms: because Isotonic Regression generates a piecewise constant function, the histograms are coarse, while the histograms generated by Platt Scaling are smoother. See (Niculescu-Mizil & Caruana, 2005) for a more thorough analysis of boosting from the point-of-view of predicting well-calibrated probabilities.
This figure shows the predicted probabilities after correction with Platt's method; calibration improves markedly, and much more of the probability mass moves toward 0 and 1. In most cases both Platt Scaling and Isotonic Regression achieve good calibration.
Comparing the two figures, the histograms produced by Isotonic Regression are less smooth, because it fits a piecewise-constant (step) function.
The next part of the paper discusses each learning algorithm in turn, examining how its predictions change after calibration.
Learning Curve Analysis
The learning curve here refers to how prediction accuracy changes as the calibration sample size varies, with accuracy measured by squared error (the Brier score).
The plots in Figure 7 show the average squared error over the eight test problems. For each problem, we perform ten trials. Error bars are shown on the plots, but are so narrow that they may be difficult to see. Calibration learning curves are shown for nine of the ten learning methods (decision trees are left out). The nearly horizontal lines in the graphs show the squared error prior to calibration. These lines are not perfectly horizontal only because the test sets change as more data is moved into the calibration sets. Each plot shows the squared error after calibration with Platt’s method or Isotonic Regression as the size of the calibration set varies from small to large. When the calibration set is small (less than about 200-1000 cases), Platt Scaling outperforms Isotonic Regression with all nine learning methods. This happens because Isotonic Regression is less constrained than Platt Scaling, so it is easier for it to overfit when the calibration set is small. Platt’s method also has some overfitting control built in (see Section 2). As the size of the calibration set increases, the learning curves for Platt Scaling and Isotonic Regression join, or even cross. When there are 1000 or more points in the calibration set, Isotonic Regression always yields performance as good as, or better than, Platt Scaling.
This analysis uses the eight datasets and the learning methods (nine of the ten are shown) while varying the calibration sample size. At each sample size, each method yields eight calibration accuracies (squared errors), from which point estimates and error bars can be computed. The uncalibrated predictions keep an essentially constant error, appearing as a nearly horizontal line. An interesting phenomenon: when the calibration set is small (roughly under 200-1000 cases), Platt Scaling outperforms Isotonic Regression, because Isotonic Regression is non-parametric and less constrained, and so overfits easily on small samples. Once the calibration set reaches about 1000 cases or more, Isotonic Regression matches or beats Platt Scaling.
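A toy sketch of this learning-curve experiment, under the assumption that calScore/calY hold a large calibration pool and testScore/testY a fixed test set (all hypothetical names):

## learning-curve sketch: Brier score on the test set vs calibration-set size
brier <- function(p, y) mean((p - y)^2)
sizes <- c(32, 64, 128, 256, 512, 1024, 2048, 4096)
errs <- sapply(sizes, function(n) {
  idx <- sample(seq_along(calY), n)                         # random calibration subset
  fit <- glm(calY[idx] ~ calScore[idx], family = binomial)  # Platt-style sigmoid fit
  b <- coef(fit)
  p <- 1 / (1 + exp(-(b[1] + b[2] * testScore)))            # calibrated test probabilities
  brier(p, testY)
})
plot(sizes, errs, log = "x", type = "b",
     xlab = "Calibration set size", ylab = "Squared error")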
For learning methods that make well calibrated predictions such as neural nets, bagged trees, and logistic regression, neither Platt Scaling nor Isotonic Regression yields much improvement in performance even when the calibration set is very large. With these methods calibration is not beneficial, and actually hurts performance when the calibration sets are small. For the max margin methods, boosted trees, boosted stumps and SVMs, calibration provides an improvement even when the calibration set is small. In Section 4 we saw that a sigmoid is a good match for boosted trees, boosted stumps, and SVMs. As expected, for these methods Platt Scaling performs better than Isotonic Regression for small to medium sized calibration (less than 1000 cases), and is virtually indistinguishable for larger calibration sets. As expected, calibration improves the performance of Naive Bayes models for almost all calibration set sizes, with Isotonic Regression outperforming Platt Scaling when there is more data. For the rest of the models: KNN, RF and DT (not shown) post-calibration helps once the calibration sets are large enough.
For models that are already well calibrated, such as neural nets, logistic regression, and bagged trees, these calibration methods yield little further improvement, and with small calibration sets they actually hurt performance. For most models, an adequate calibration sample size is necessary.
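To close the loop on our own pipeline, a one-liner sketch comparing squared error before and after calibration (hypothetical objects: rawProb from predict, plattProb and isoProb from the sketches above, and testY):

## Brier score of raw vs calibrated probabilities on the test set
sapply(list(raw = rawProb, platt = plattProb, isotonic = isoProb),
       function(p) mean((p - testY)^2))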