This installment introduces one of the best-performing ensemble classification methods, the Gradient Boosting Machine (GBM), and demonstrates its application to a concrete dataset, in particular clinical data.
Introduction
The Gradient Boosting Machine (GBM) is a boosting algorithm. Its main idea is to generate a sequence of weak learners, each of which fits the negative gradient of the loss function of the accumulated model built so far, so that adding the new learner moves the cumulative loss in the negative-gradient direction. The base learners are combined linearly with different weights, so that well-performing learners are reused. The most common base learner is a tree model.
Gradient boosting can also be understood as gradient descent in function space. The gradient descent we are more familiar with operates in parameter space: when training a neural network, for example, each iteration computes the gradient of the current loss with respect to the parameters and updates the parameters.
In gradient boosting, each iteration instead produces a weak learner that fits the gradient of the loss with respect to the current accumulated model; the weak learner is then added to the ensemble, gradually lowering the ensemble's loss. In other words, gradient descent in parameter space uses the gradient to adjust parameters and reduce the loss, while gradient descent in function space uses the gradient to fit a new function that reduces the loss.
AdaBoost, covered earlier in this series, reduces bias by reweighting data points; GBM instead reduces bias by having each new learner fit the negative gradient. A minimal sketch of this idea follows.
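To make the function-space view concrete, here is a minimal from-scratch sketch for squared-error loss, where the negative gradient is simply the residual: each round fits a shallow rpart tree to the current residuals and adds a shrunken copy to the ensemble. The toy data, learning rate, and number of rounds are illustrative assumptions, not part of this tutorial's workflow:
library(rpart)
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
dat <- data.frame(x = x, y = y)
nu <- 0.1                         # learning rate (shrinkage); illustrative value
F_hat <- rep(mean(y), length(y))  # F0: the best constant model under squared loss
for (m in 1:50) {
    dat$r <- y - F_hat                        # negative gradient = residuals
    stump <- rpart(r ~ x, data = dat, control = rpart.control(maxdepth = 2))
    F_hat <- F_hat + nu * predict(stump, dat) # gradient step in function space
}
mean((y - F_hat)^2)               # training MSE of the boosted ensemble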
Package Installation
Here we mainly cover caret; two other packages can likewise implement the GBM algorithm. Install the package as follows:
if(!require('caret')) {
install.packages('caret')
}
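Note that caret's method = "gbm" trains through the gbm package behind the scenes, so it is convenient to install it up front as well; a small optional addition in the same style:
if(!require('gbm')) {
install.packages('gbm')
}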
Data Loading
Here we use the BreastCancer dataset employed earlier in this machine-learning series, so the results can be compared with our previous methods:
Building Clinical Prediction Models Based on Machine Learning
MachineLearning 2. Factor Analysis
MachineLearning 3. Cluster Analysis
MachineLearning 4. K-Nearest Neighbors (KNN) for Cancer Diagnosis
MachineLearning 5. Support Vector Machines (SVM) for Cancer Diagnosis and Molecular Subtyping
MachineLearning 6. Classification Trees for Cancer Diagnosis
MachineLearning 7. Regression Trees for Cancer Diagnosis
MachineLearning 8. Random Forest for Cancer Diagnosis
MachineLearning 9. Gradient Boosting for Cancer Diagnosis
MachineLearning 10. Neural Networks for Cancer Diagnosis
MachineLearning 11. Random Forest Survival Analysis (randomForestSRC)
MachineLearning 12. Dimensionality Reduction and Visualization with t-SNE (Rtsne)
MachineLearning 13. Dimensionality Reduction and Visualization with UMAP (umap)
MachineLearning 14. Ensemble Classifier (AdaBoost)
MachineLearning 15. Ensemble Classifier (LogitBoost)
library(caret)
BreastCancer <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
BreastCancer = BreastCancer[, -1]  # drop the id (first) column
dim(BreastCancer)
## [1] 568 31
str(BreastCancer)
## 'data.frame': 568 obs. of 31 variables:
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 20.6 19.7 11.4 20.3 12.4 ...
## $ texture_mean : num 17.8 21.2 20.4 14.3 15.7 ...
## $ perimeter_mean : num 132.9 130 77.6 135.1 82.6 ...
## $ area_mean : num 1326 1203 386 1297 477 ...
## $ smoothness_mean : num 0.0847 0.1096 0.1425 0.1003 0.1278 ...
## $ compactne_mean : num 0.0786 0.1599 0.2839 0.1328 0.17 ...
## $ concavity_mean : num 0.0869 0.1974 0.2414 0.198 0.1578 ...
## $ concave_points_mean : num 0.0702 0.1279 0.1052 0.1043 0.0809 ...
## $ symmetry_mean : num 0.181 0.207 0.26 0.181 0.209 ...
## $ fractal_dimension_mean : num 0.0567 0.06 0.0974 0.0588 0.0761 ...
## $ radius_se : num 0.543 0.746 0.496 0.757 0.335 ...
## $ texture_se : num 0.734 0.787 1.156 0.781 0.89 ...
## $ perimeter_se : num 3.4 4.58 3.44 5.44 2.22 ...
## $ area_se : num 74.1 94 27.2 94.4 27.2 ...
## $ smoothness_se : num 0.00522 0.00615 0.00911 0.01149 0.00751 ...
## $ compactne_se : num 0.0131 0.0401 0.0746 0.0246 0.0335 ...
## $ concavity_se : num 0.0186 0.0383 0.0566 0.0569 0.0367 ...
## $ concave_points_se : num 0.0134 0.0206 0.0187 0.0188 0.0114 ...
## $ symmetry_se : num 0.0139 0.0225 0.0596 0.0176 0.0216 ...
## $ fractal_dimension_se : num 0.00353 0.00457 0.00921 0.00511 0.00508 ...
## $ radius_worst : num 25 23.6 14.9 22.5 15.5 ...
## $ texture_worst : num 23.4 25.5 26.5 16.7 23.8 ...
## $ perimeter_worst : num 158.8 152.5 98.9 152.2 103.4 ...
## $ area_worst : num 1956 1709 568 1575 742 ...
## $ smoothness_worst : num 0.124 0.144 0.21 0.137 0.179 ...
## $ compactne_worst : num 0.187 0.424 0.866 0.205 0.525 ...
## $ concavity_worst : num 0.242 0.45 0.687 0.4 0.535 ...
## $ concave_points_worst : num 0.186 0.243 0.258 0.163 0.174 ...
## $ symmetry_worst : num 0.275 0.361 0.664 0.236 0.399 ...
## $ fractal_dimension_worst: num 0.089 0.0876 0.173 0.0768 0.1244 ...
table(BreastCancer$diagnosis)
##
## B M
## 357 211
Worked Example
Data Preprocessing
Data preprocessing involves five steps. First determine whether the data contain missing values, and how many (a minimal check follows this list), then proceed as follows:
Remove low-variance variables
Remove variables that are strongly correlated with other predictors
Remove multicollinearity (exact linear combinations)
Standardize the data and impute missing values
Feature selection via recursive feature elimination (RFE)
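As the preamble above suggests, start by counting the missing values; a minimal check (none are expected in this dataset):
sum(is.na(BreastCancer))      # total number of missing entries
colSums(is.na(BreastCancer))  # missing entries per variable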
# Remove near-zero-variance variables
zerovar = nearZeroVar(BreastCancer[, -1])
zerovar
## integer(0)
# BreastCancer = BreastCancer[, -zerovar]  # skipped: nearZeroVar() found nothing to drop
# Next, remove strongly correlated variables
descrCorr = cor(BreastCancer[, -1])
descrCorr[1:5, 1:5]
## radius_mean texture_mean perimeter_mean area_mean
## radius_mean 1.0000000 0.32938305 0.9978764 0.9873442
## texture_mean 0.3293830 1.00000000 0.3359176 0.3261929
## perimeter_mean 0.9978764 0.33591759 1.0000000 0.9865482
## area_mean 0.9873442 0.32619289 0.9865482 1.0000000
## smoothness_mean 0.1680940 -0.01776898 0.2045046 0.1748380
## smoothness_mean
## radius_mean 0.16809398
## texture_mean -0.01776898
## perimeter_mean 0.20450464
## area_mean 0.17483805
## smoothness_mean 1.00000000
highCorr = findCorrelation(descrCorr, 0.9)
highCorr
## [1] 7 8 23 21 3 24 1 13 14 2
BreastCancer = BreastCancer[, -(highCorr + 1)]  # +1 offset: indices refer to BreastCancer[, -1]
dim(BreastCancer)
## [1] 568 21
# Next, address multicollinearity; this example has no multicollinearity problem
comboInfo = findLinearCombos(BreastCancer[, -1])
comboInfo
## $linearCombos
## list()
##
## $remove
## NULL
# BreastCancer = BreastCancer[, -(comboInfo$remove + 1)]  # +1 offset as above; skipped since $remove is NULL
Process = preProcess(BreastCancer)
Process
## Created from 568 samples and 21 variables
##
## Pre-processing:
## - centered (20)
## - ignored (1)
## - scaled (20)
BreastCancer = predict(Process, BreastCancer)
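Note that without a method argument preProcess defaults to centering and scaling only, as the output above shows; the imputation mentioned in the step list would need an explicit method. A hedged sketch, left commented out because the frame has already been transformed and contains no missing values anyway:
# Process <- preProcess(BreastCancer, method = c("center", "scale", "knnImpute"))
# BreastCancer <- predict(Process, BreastCancer)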
Feature Selection
When mining data we do not need to feed every predictor into the model; instead we select a subset of the most important variables, which is known as feature selection. One such algorithm is backward selection: start with all variables in the model, compute a performance measure (such as error or prediction accuracy) together with a variable-importance ranking, keep the most important variables, recompute the performance, and iterate until a suitable number of predictors is found. A drawback of this algorithm is that it can overfit, so an outer loop of resampled data splits is wrapped around it. The rfe command in the caret package performs this task. Its functions argument determines which model is used to rank the predictors, and includes:
rfFuncs (random forest),
lmFuncs (linear regression),
nbFuncs (naive Bayes),
treebagFuncs (bagged trees),
caretFuncs (a custom model fitted via train).
The method argument sets the resampling scheme: cv for cross-validation, boot for bootstrap resampling, and LOOCV for leave-one-out cross-validation.
ctrl = rfeControl(functions = caretFuncs, method = "repeatedcv", verbose = FALSE,
returnResamp = "final")
BreastCancer$diagnosis = as.factor(BreastCancer$diagnosis)
Profile = rfe(BreastCancer[, -1], BreastCancer$diagnosis, rfeControl = ctrl)
print(Profile)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 1 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 4 0.9367 0.8644 0.02881 0.06139
## 8 0.9507 0.8934 0.02839 0.06189
## 16 0.9613 0.9160 0.02457 0.05404 *
## 20 0.9613 0.9165 0.02457 0.05287
##
## The top 5 variables (out of 16):
## concave_points_worst, area_mean, concavity_worst, radius_se, compactne_worst
plot(Profile)
xyplot(Profile$results$Kappa ~ Profile$results$Variables, ylab = "Kappa", xlab = "Variables",
type = c("g", "p", "l"), auto.key = TRUE)
xyplot(Profile$results$Accuracy ~ Profile$results$Variables, ylab = "Accuracy", xlab = "Variables",
type = c("g", "p", "l"), auto.key = TRUE)
Data Splitting
Data splitting divides the data into a training set and a test set; for background, see Topic 5 of this series on sample-size determination and splitting. The steps are as follows:
library(tidyverse)
library(sampling)
set.seed(123)
# Draw 70% within each stratum; rev() is needed because strata() expects the
# counts in the order the strata appear in the data (M first here), while
# table() sorts levels alphabetically (B, M)
train_id <- strata(BreastCancer, "diagnosis", size = rev(round(table(BreastCancer$diagnosis) *
0.7)))$ID_unit
# Training data
trainData <- BreastCancer[train_id, ]
# Test data
testData <- BreastCancer[-train_id, ]
# Check the class proportions in the training and test data
prop.table(table(trainData$diagnosis))
##
## B M
## 0.6281407 0.3718593
prop.table(table(testData$diagnosis))
##
## B M
## 0.6294118 0.3705882
prop.table(table(BreastCancer$diagnosis))
##
## B M
## 0.6285211 0.3714789
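As an aside, caret's own createDataPartition performs the same stratified 70/30 split without the sampling package; a hedged equivalent sketch in which idx, trainData2, and testData2 are illustrative names:
set.seed(123)
idx <- createDataPartition(BreastCancer$diagnosis, p = 0.7, list = FALSE)
trainData2 <- BreastCancer[idx, ]   # stratified 70% training split
testData2 <- BreastCancer[-idx, ]   # remaining 30% test split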
Visualizing Important Variables
We can use the featurePlot() function to visualize each predictor's range and how its distribution differs between classes.
For classification models choose: box, strip, density, pairs or ellipse
For regression models choose: pairs or scatter
# Visualize each predictor's distribution by class using featurePlot()
featurePlot(x = trainData[, 2:21],
y = as.factor(trainData$diagnosis),
plot = "box", #For classification:box, strip, density, pairs or ellipse. For regression, pairs or scatter
strip=strip.custom(par.strip.text=list(cex=.7)),
scales = list(x = list(relation="free"),
y = list(relation="free"))
)
Defining Training Parameters
Before formal training, define the training parameters with the trainControl function: method sets the resampling method, number sets the number of folds, and repeats sets the number of repetitions.
fitControl <- trainControl(
method = 'cv', # k-fold cross validation
number = 5, # number of folds
savePredictions = 'final', # saves predictions for optimal tuning parameter
classProbs = T, # should class probabilities be returned
summaryFunction=twoClassSummary # results summary function
)
Building the GBM Classifier
We train the model with train(); this example uses the gbm method. Several parameters can be tuned by hand, including interaction.depth, n.trees, shrinkage, and n.minobsinnode (a tuning-grid sketch follows the training call below), or the defaults can be used. The model methods available to train can be listed first:
names(getModelInfo())
## [1] "ada" "AdaBag" "AdaBoost.M1"
## [4] "adaboost" "amdai" "ANFIS"
## [7] "avNNet" "awnb" "awtan"
## [10] "bag" "bagEarth" "bagEarthGCV"
## [13] "bagFDA" "bagFDAGCV" "bam"
## [16] "bartMachine" "bayesglm" "binda"
## [19] "blackboost" "blasso" "blassoAveraged"
## [22] "bridge" "brnn" "BstLm"
## [25] "bstSm" "bstTree" "C5.0"
## [28] "C5.0Cost" "C5.0Rules" "C5.0Tree"
## [31] "cforest" "chaid" "CSimca"
## [34] "ctree" "ctree2" "cubist"
## [37] "dda" "deepboost" "DENFIS"
## [40] "dnn" "dwdLinear" "dwdPoly"
## [43] "dwdRadial" "earth" "elm"
## [46] "enet" "evtree" "extraTrees"
## [49] "fda" "FH.GBML" "FIR.DM"
## [52] "foba" "FRBCS.CHI" "FRBCS.W"
## [55] "FS.HGD" "gam" "gamboost"
## [58] "gamLoess" "gamSpline" "gaussprLinear"
## [61] "gaussprPoly" "gaussprRadial" "gbm_h2o"
## [64] "gbm" "gcvEarth" "GFS.FR.MOGUL"
## [67] "GFS.LT.RS" "GFS.THRIFT" "glm.nb"
## [70] "glm" "glmboost" "glmnet_h2o"
## [73] "glmnet" "glmStepAIC" "gpls"
## [76] "hda" "hdda" "hdrda"
## [79] "HYFIS" "icr" "J48"
## [82] "JRip" "kernelpls" "kknn"
## [85] "knn" "krlsPoly" "krlsRadial"
## [88] "lars" "lars2" "lasso"
## [91] "lda" "lda2" "leapBackward"
## [94] "leapForward" "leapSeq" "Linda"
## [97] "lm" "lmStepAIC" "LMT"
## [100] "loclda" "logicBag" "LogitBoost"
## [103] "logreg" "lssvmLinear" "lssvmPoly"
## [106] "lssvmRadial" "lvq" "M5"
## [109] "M5Rules" "manb" "mda"
## [112] "Mlda" "mlp" "mlpKerasDecay"
## [115] "mlpKerasDecayCost" "mlpKerasDropout" "mlpKerasDropoutCost"
## [118] "mlpML" "mlpSGD" "mlpWeightDecay"
## [121] "mlpWeightDecayML" "monmlp" "msaenet"
## [124] "multinom" "mxnet" "mxnetAdam"
## [127] "naive_bayes" "nb" "nbDiscrete"
## [130] "nbSearch" "neuralnet" "nnet"
## [133] "nnls" "nodeHarvest" "null"
## [136] "OneR" "ordinalNet" "ordinalRF"
## [139] "ORFlog" "ORFpls" "ORFridge"
## [142] "ORFsvm" "ownn" "pam"
## [145] "parRF" "PART" "partDSA"
## [148] "pcaNNet" "pcr" "pda"
## [151] "pda2" "penalized" "PenalizedLDA"
## [154] "plr" "pls" "plsRglm"
## [157] "polr" "ppr" "pre"
## [160] "PRIM" "protoclass" "qda"
## [163] "QdaCov" "qrf" "qrnn"
## [166] "randomGLM" "ranger" "rbf"
## [169] "rbfDDA" "Rborist" "rda"
## [172] "regLogistic" "relaxo" "rf"
## [175] "rFerns" "RFlda" "rfRules"
## [178] "ridge" "rlda" "rlm"
## [181] "rmda" "rocc" "rotationForest"
## [184] "rotationForestCp" "rpart" "rpart1SE"
## [187] "rpart2" "rpartCost" "rpartScore"
## [190] "rqlasso" "rqnc" "RRF"
## [193] "RRFglobal" "rrlda" "RSimca"
## [196] "rvmLinear" "rvmPoly" "rvmRadial"
## [199] "SBC" "sda" "sdwd"
## [202] "simpls" "SLAVE" "slda"
## [205] "smda" "snn" "sparseLDA"
## [208] "spikeslab" "spls" "stepLDA"
## [211] "stepQDA" "superpc" "svmBoundrangeString"
## [214] "svmExpoString" "svmLinear" "svmLinear2"
## [217] "svmLinear3" "svmLinearWeights" "svmLinearWeights2"
## [220] "svmPoly" "svmRadial" "svmRadialCost"
## [223] "svmRadialSigma" "svmRadialWeights" "svmSpectrumString"
## [226] "tan" "tanSearch" "treebag"
## [229] "vbmpRadial" "vglmAdjCat" "vglmContRatio"
## [232] "vglmCumulative" "widekernelpls" "WM"
## [235] "wsrf" "xgbDART" "xgbLinear"
## [238] "xgbTree" "xyf"
set.seed(2863)
model_GBM <- train(diagnosis ~ ., data = trainData, method = "gbm", tuneLength = 2,
metric = "ROC", trControl = fitControl)
plot(model_GBM, main = "GBM")
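To tune the parameters named above explicitly instead of letting tuneLength choose candidates, a grid can be passed through train's tuneGrid argument; a hedged sketch in which the grid values and the object name model_GBM_grid are illustrative assumptions:
# Candidate values for the four gbm tuning parameters
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5),
                       n.trees = c(50, 100, 150),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
set.seed(2863)
model_GBM_grid <- train(diagnosis ~ ., data = trainData, method = "gbm",
                        tuneGrid = gbmGrid, metric = "ROC",
                        trControl = fitControl, verbose = FALSE)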
Computing the Confusion Matrix
For a classification model, the confusion matrix gives a clear picture of how correct the classifications are.
# Compute the confusion matrix on the test set
predProb <- predict(model_GBM, testData, type = "prob")
head(predProb)
## B M
## 1 0.01417378 0.9858262
## 2 0.07429899 0.9257010
## 3 0.62433290 0.3756671
## 4 0.02298048 0.9770195
## 5 0.02675321 0.9732468
## 6 0.03445710 0.9655429
predicted = predict(model_GBM, testData)
testData$predProb = predProb$B
testData$diagnosis = as.factor(testData$diagnosis)
confusionMatrix(reference = testData$diagnosis, data = predicted, mode = "everything",
positive = "B")
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 102 6
## M 5 57
##
## Accuracy : 0.9353
## 95% CI : (0.8872, 0.9673)
## No Information Rate : 0.6294
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8608
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9533
## Specificity : 0.9048
## Pos Pred Value : 0.9444
## Neg Pred Value : 0.9194
## Precision : 0.9444
## Recall : 0.9533
## F1 : 0.9488
## Prevalence : 0.6294
## Detection Rate : 0.6000
## Detection Prevalence : 0.6353
## Balanced Accuracy : 0.9290
##
## 'Positive' Class : B
##
Plotting the ROC Curve
After building the model we still need to assess its accuracy, so we compute the AUC and draw the ROC curve (an AUC sketch follows the plot):
library(ROCR)
# testData$predProb stores P(diagnosis = "B"), so declare "B" the positive
# class via label.ordering = c(negative, positive); plot TPR against FPR
pred = prediction(testData$predProb, testData$diagnosis, label.ordering = c("M", "B"))
perf = performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, lwd = 2, col = "blue", main = "ROC")
abline(a = 0, b = 1, col = 2, lwd = 1, lty = 2)
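The paragraph above also calls for the AUC, which ROCR can derive from the same prediction object. A minimal sketch:
# Area under the ROC curve on the test set
auc <- performance(pred, measure = "auc")@y.values[[1]]
auc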
Comparing Multiple Classifiers
Finally, we train several other classifiers (random forest, AdaBoost, LogitBoost) with the same trainControl settings so that their resampled performance can be compared with the GBM model:
# Train the model using rf
model_rf = train(diagnosis ~ ., data = trainData, method = "rf", tuneLength = 2,
trControl = fitControl)
model_rf
## Random Forest
##
## 398 samples
## 20 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 318, 319, 319, 318, 318
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9836966 0.984 0.9114943
## 20 0.9734851 0.952 0.9317241
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Train the model using adaboost
model_adaboost = train(diagnosis ~ ., data = trainData, method = "adaboost", tuneLength = 2,
trControl = fitControl)
model_adaboost
## AdaBoost Classification Trees
##
## 398 samples
## 20 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 318, 319, 318, 319, 318
## Resampling results across tuning parameters:
##
## nIter method ROC Sens Spec
## 50 Adaboost.M1 0.9913149 0.980 0.9528736
## 50 Real adaboost 0.8188874 0.988 0.9257471
## 100 Adaboost.M1 0.9918621 0.980 0.9459770
## 100 Real adaboost 0.8698092 0.988 0.9324138
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nIter = 100 and method = Adaboost.M1.
# Train the model using Logitboost
model_LogitBoost = train(diagnosis ~ ., data = trainData, method = "LogitBoost",
tuneLength = 2, trControl = fitControl)
model_LogitBoost
## Boosted Logistic Regression
##
## 398 samples
## 20 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 319, 318, 319, 318, 318
## Resampling results across tuning parameters:
##
## nIter ROC Sens Spec
## 11 0.9885425 0.948 0.9462069
## 21 0.9886874 0.972 0.9459770
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was nIter = 21.
models_compare <- resamples(list(ADABOOST = model_adaboost, RF = model_rf, LOGITBOOST = model_LogitBoost,
GBM = model_GBM))
summary(models_compare)
##
## Call:
## summary.resamples(object = models_compare)
##
## Models: ADABOOST, RF, LOGITBOOST, GBM
## Number of resamples: 5
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ADABOOST 0.9806897 0.9886667 0.9920000 0.9918621 0.9986207 0.9993333 0
## RF 0.9372414 0.9872414 0.9960000 0.9836966 0.9986667 0.9993333 0
## LOGITBOOST 0.9770000 0.9837931 0.9876667 0.9886874 0.9956667 0.9993103 0
## GBM 0.9713333 0.9903448 0.9937931 0.9904276 0.9966667 1.0000000 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ADABOOST 0.94 0.98 0.98 0.980 1.00 1 0
## RF 0.96 0.96 1.00 0.984 1.00 1 0
## LOGITBOOST 0.94 0.96 0.98 0.972 0.98 1 0
## GBM 0.94 0.94 0.98 0.968 0.98 1 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ADABOOST 0.8666667 0.9310345 0.9655172 0.9459770 0.9666667 1.0000000 0
## RF 0.7241379 0.9000000 0.9666667 0.9114943 0.9666667 1.0000000 0
## LOGITBOOST 0.9310345 0.9333333 0.9333333 0.9459770 0.9655172 0.9666667 0
## GBM 0.8666667 0.9310345 0.9655172 0.9459770 0.9666667 1.0000000 0
# Draw box plots to compare models
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(models_compare, scales = scales)
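Beyond the box plots, the resampled metrics can be compared formally: caret's diff() method for resamples objects computes pairwise differences with simple significance tests. A short sketch:
# Pairwise differences in ROC, Sens and Spec across the resamples
diffs <- diff(models_compare)
summary(diffs)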