MachineLearning 19. Neural Network Classifier (NNET)



Introduction

A neural network is a computational model composed of a large number of interconnected nodes (also called "neurons" or "units"). Each node represents a particular output function, called an activation function. Every connection between two nodes carries a weight applied to the signal passing through it; these weights act as the memory of the artificial neural network. The network's output depends on how the nodes are connected, on the weight values, and on the activation functions. The network itself is usually an approximation of some natural algorithm or function, or an expression of a logical strategy.
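To make the role of weights and activation functions concrete, here is a minimal sketch in R of a forward pass through one hidden layer; all weights, biases, and the sigmoid activation are illustrative values, not taken from the data analysed below.

# Minimal forward pass of a one-hidden-layer network (illustrative values)
sigmoid <- function(z) 1/(1 + exp(-z))
x  <- c(0.5, -1.2)                        # input features
W1 <- matrix(c(0.4, -0.6, 0.1, 0.8), 2)   # input-to-hidden weights (2 x 2)
b1 <- c(0.1, -0.3)                        # hidden-layer biases
W2 <- c(1.2, -0.7)                        # hidden-to-output weights
b2 <- 0.05                                # output bias
h <- sigmoid(t(W1) %*% x + b1)            # hidden activations
y <- sigmoid(sum(W2 * h) + b2)            # predicted probability
y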

The motivation for using neural networks is the need for powerful nonlinear models. For example, suppose you want to solve a nonlinear classification problem: logistic regression only yields a workable solution when the number of features is small, and it is not a good way to learn complex nonlinear models. Neural networks do much better, especially when the number of features is large. Neural network algorithms can effectively discover the underlying structure of data and have been successfully applied to a wide range of problems, from image classification to natural language processing and speech recognition.

• Disease diagnosis: a patient goes to the hospital for a battery of liver-function and urine tests; the results are fed into a machine, which must decide whether the patient is ill and, if so, with what disease.

• Classifier: the goal of a classifier is to make the proportion of correct classifications as high as possible. Typically you first collect some samples, manually label them with the correct classes, train the classifier on these labeled data, and the trained classifier can then work on new feature vectors.


Package Installation

Here we mainly use caret; two other packages can also implement neural network algorithms (see the note after the block below). The package can be installed as follows:

if (!require(caret))
  install.packages("caret")

Reading the Data

Here we use the BreastCancer data set already analysed in earlier machine-learning posts, so the results can be compared with our previous methods:

library(caret)
BreastCancer <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
BreastCancer = BreastCancer[, -1]
dim(BreastCancer)
## [1] 568  31
str(BreastCancer)
## 'data.frame':	568 obs. of  31 variables:
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  20.6 19.7 11.4 20.3 12.4 ...
##  $ texture_mean           : num  17.8 21.2 20.4 14.3 15.7 ...
##  $ perimeter_mean         : num  132.9 130 77.6 135.1 82.6 ...
##  $ area_mean              : num  1326 1203 386 1297 477 ...
##  $ smoothness_mean        : num  0.0847 0.1096 0.1425 0.1003 0.1278 ...
##  $ compactne_mean         : num  0.0786 0.1599 0.2839 0.1328 0.17 ...
##  $ concavity_mean         : num  0.0869 0.1974 0.2414 0.198 0.1578 ...
##  $ concave_points_mean    : num  0.0702 0.1279 0.1052 0.1043 0.0809 ...
##  $ symmetry_mean          : num  0.181 0.207 0.26 0.181 0.209 ...
##  $ fractal_dimension_mean : num  0.0567 0.06 0.0974 0.0588 0.0761 ...
##  $ radius_se              : num  0.543 0.746 0.496 0.757 0.335 ...
##  $ texture_se             : num  0.734 0.787 1.156 0.781 0.89 ...
##  $ perimeter_se           : num  3.4 4.58 3.44 5.44 2.22 ...
##  $ area_se                : num  74.1 94 27.2 94.4 27.2 ...
##  $ smoothness_se          : num  0.00522 0.00615 0.00911 0.01149 0.00751 ...
##  $ compactne_se           : num  0.0131 0.0401 0.0746 0.0246 0.0335 ...
##  $ concavity_se           : num  0.0186 0.0383 0.0566 0.0569 0.0367 ...
##  $ concave_points_se      : num  0.0134 0.0206 0.0187 0.0188 0.0114 ...
##  $ symmetry_se            : num  0.0139 0.0225 0.0596 0.0176 0.0216 ...
##  $ fractal_dimension_se   : num  0.00353 0.00457 0.00921 0.00511 0.00508 ...
##  $ radius_worst           : num  25 23.6 14.9 22.5 15.5 ...
##  $ texture_worst          : num  23.4 25.5 26.5 16.7 23.8 ...
##  $ perimeter_worst        : num  158.8 152.5 98.9 152.2 103.4 ...
##  $ area_worst             : num  1956 1709 568 1575 742 ...
##  $ smoothness_worst       : num  0.124 0.144 0.21 0.137 0.179 ...
##  $ compactne_worst        : num  0.187 0.424 0.866 0.205 0.525 ...
##  $ concavity_worst        : num  0.242 0.45 0.687 0.4 0.535 ...
##  $ concave_points_worst   : num  0.186 0.243 0.258 0.163 0.174 ...
##  $ symmetry_worst         : num  0.275 0.361 0.664 0.236 0.399 ...
##  $ fractal_dimension_worst: num  0.089 0.0876 0.173 0.0768 0.1244 ...
table(BreastCancer$diagnosis)
## 
##   B   M 
## 357 211

Building Clinical Prediction Models with Machine Learning

MachineLearning 1. Principal Component Analysis (PCA)

MachineLearning 2. Factor Analysis

MachineLearning 3. Cluster Analysis

MachineLearning 4. Cancer Diagnosis with K-Nearest Neighbors (KNN)

MachineLearning 5. Cancer Diagnosis and Molecular Subtyping with Support Vector Machines (SVM)

MachineLearning 6. Cancer Diagnosis with Classification Trees

MachineLearning 7. Cancer Diagnosis with Regression Trees

MachineLearning 8. Cancer Diagnosis with Random Forest

MachineLearning 9. Cancer Diagnosis with Gradient Boosting

MachineLearning 10. Cancer Diagnosis with Neural Networks

MachineLearning 11. Random Forest Survival Analysis (randomForestSRC)

MachineLearning 12. Dimensionality Reduction and Visualization with t-SNE (Rtsne)

MachineLearning 13. Dimensionality Reduction and Visualization with UMAP (umap)

MachineLearning 14. Ensemble Classifier (AdaBoost)

MachineLearning 15. Ensemble Classifier (LogitBoost)

MachineLearning 16. Gradient Boosting Machine (GBM)

MachineLearning 17. Partitioning Around Medoids (PAM)

MachineLearning 18. Naive Bayes Classifier

Worked Example

Data Preprocessing

Data preprocessing comprises five steps. First check whether the data contain missing values and how many (a short check is sketched after this list), then proceed as follows:

  1. Remove variables with zero or near-zero variance

  2. Remove variables strongly correlated with other predictors

  3. Remove multicollinearity

  4. Center and scale the data and impute missing values

  5. Feature selection with recursive feature elimination (RFE)
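The code below does not show the missing-value check mentioned above; a minimal sketch using the BreastCancer data frame loaded earlier:

# Check whether any values are missing and count them per column
anyNA(BreastCancer)
colSums(is.na(BreastCancer))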

# Remove variables with (near) zero variance
zerovar = nearZeroVar(BreastCancer[, -1])
zerovar
## integer(0)
# BreastCancer=BreastCancer[,-zerovar]

# First remove strongly correlated variables
descrCorr = cor(BreastCancer[, -1])
descrCorr[1:5, 1:5]
##                 radius_mean texture_mean perimeter_mean area_mean
## radius_mean       1.0000000   0.32938305      0.9978764 0.9873442
## texture_mean      0.3293830   1.00000000      0.3359176 0.3261929
## perimeter_mean    0.9978764   0.33591759      1.0000000 0.9865482
## area_mean         0.9873442   0.32619289      0.9865482 1.0000000
## smoothness_mean   0.1680940  -0.01776898      0.2045046 0.1748380
##                 smoothness_mean
## radius_mean          0.16809398
## texture_mean        -0.01776898
## perimeter_mean       0.20450464
## area_mean            0.17483805
## smoothness_mean      1.00000000
highCorr = findCorrelation(descrCorr, 0.9)
highCorr
##  [1]  7  8 23 21  3 24  1 13 14  2
BreastCancer = BreastCancer[, -(highCorr + 1)]
dim(BreastCancer)
## [1] 568  21
# Then address multicollinearity; none is present in this example
comboInfo = findLinearCombos(BreastCancer[, -1])
comboInfo
## $linearCombos
## list()
## 
## $remove
## NULL
# BreastCancer = BreastCancer[, -(comboInfo$remove + 1)]
Process = preProcess(BreastCancer)
Process
## Created from 568 samples and 21 variables
## 
## Pre-processing:
##   - centered (20)
##   - ignored (1)
##   - scaled (20)
BreastCancer = predict(Process, BreastCancer)

Feature Selection

In data mining we do not need to use all predictors to build the model; instead we select a subset of the most important variables, a process known as feature selection. One such algorithm is backward selection: start with all variables in the model, compute performance (e.g., error or prediction accuracy) and a variable-importance ranking, keep the most important variables, recompute performance, and iterate until a suitable number of predictors is found. A drawback of this algorithm is the risk of overfitting, so an outer resampling loop is wrapped around it. The rfe function in the caret package performs this task. The functions argument determines which model is used to rank the predictors, including:

  1. rfFuncs (random forest),

  2. lmFuncs (linear regression),

  3. nbFuncs (naive Bayes),

  4. treebagFuncs (bagged trees),

  5. caretFuncs (a user-specified model trained with caret).

The method argument sets the resampling scheme: cv for cross-validation, boot for bootstrapping, and LOOCV for leave-one-out cross-validation.

ctrl = rfeControl(functions = caretFuncs, method = "repeatedcv", verbose = FALSE,
    returnResamp = "final")
BreastCancer$diagnosis = as.factor(BreastCancer$diagnosis)
Profile = rfe(BreastCancer[, -1], BreastCancer$diagnosis, rfeControl = ctrl)
print(Profile)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 1 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          4   0.9367 0.8639    0.03710 0.08212         
##          8   0.9560 0.9051    0.02768 0.06153         
##         16   0.9578 0.9081    0.03324 0.07487        *
##         20   0.9543 0.9002    0.03111 0.07052         
## 
## The top 5 variables (out of 16):
##    concave_points_worst, area_mean, concavity_worst, radius_se, compactne_mean
plot(Profile)

[Figure: plot(Profile): RFE performance across subset sizes]

xyplot(Profile$results$Kappa ~ Profile$results$Variables, ylab = "Kappa", xlab = "Variables",
    type = c("g", "p", "l"), auto.key = TRUE)

[Figure: Kappa versus number of variables]

xyplot(Profile$results$Accuracy ~ Profile$results$Variables, ylab = "Accuracy", xlab = "Variables",
    type = c("g", "p", "l"), auto.key = TRUE)

[Figure: Accuracy versus number of variables]
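To list exactly which variables the selected subset contains, caret's predictors() can be applied to the rfe object (a minimal sketch):

# Variables retained in the optimal RFE subset
predictors(Profile)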

Data Splitting

Data splitting divides the data into a training set and a test set; see "Topic 5. Sample Size Determination and Splitting" for background. The procedure is as follows:

library(tidyverse)
library(sampling)
set.seed(123)
# Sample 70% of the data within each stratum
train_id <- strata(BreastCancer, "diagnosis", size = rev(round(table(BreastCancer$diagnosis) *
    0.7)))$ID_unit
# Training data
trainData <- BreastCancer[train_id, ]
# Test data
testData <- BreastCancer[-train_id, ]

# Check class proportions in the training and test data
prop.table(table(trainData$diagnosis))
## 
##         B         M 
## 0.6281407 0.3718593

prop.table(table(testData$diagnosis))
## 
##         B         M 
## 0.6294118 0.3705882

prop.table(table(BreastCancer$diagnosis))
## 
##         B         M 
## 0.6285211 0.3714789
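As an aside, caret's createDataPartition performs the same kind of stratified 70/30 split without the sampling package; a hedged alternative sketch (object names are illustrative):

set.seed(123)
idx <- createDataPartition(BreastCancer$diagnosis, p = 0.7, list = FALSE)
trainData2 <- BreastCancer[idx, ]
testData2 <- BreastCancer[-idx, ]
prop.table(table(trainData2$diagnosis))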

Visualizing Important Variables

The featurePlot() function can be used to visualize the range of each predictor and to compare its distribution across classes.

For classification models choose: box, strip, density, pairs or ellipse

For regression models choose: pairs or scatter

#4. How to visualize the importance of variables using featurePlot()
featurePlot(x = trainData[, 2:21], 
            y = as.factor(trainData$diagnosis), 
            plot = "box", #For classification:box, strip, density, pairs or ellipse. For regression, pairs or scatter
            strip=strip.custom(par.strip.text=list(cex=.7)),
            scales = list(x = list(relation="free"), 
                          y = list(relation="free"))
)

[Figure: featurePlot box plots of each predictor by diagnosis]

Defining Training-Control Parameters

Before training, the model-training parameters are defined with the trainControl function: method sets the cross-validation resampling scheme, number sets the number of folds, and repeats sets the number of repetitions (a repeated-CV sketch follows the block below).

fitControl <- trainControl(
  method = 'cv',                   # k-fold cross validation
  number = 5,                      # number of folds
  savePredictions = 'final',       # saves predictions for optimal tuning parameter
  classProbs = T,                  # should class probabilities be returned
  summaryFunction=twoClassSummary  # results summary function
)
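The block above uses plain 5-fold cross-validation; if the repeated cross-validation mentioned in the text is wanted instead, method and repeats can be set accordingly (a sketch, not used in the fits below; the number of repeats is illustrative):

fitControl_rep <- trainControl(
  method = 'repeatedcv',           # repeated k-fold cross validation
  number = 5,                      # number of folds
  repeats = 3,                     # number of repetitions
  savePredictions = 'final',
  classProbs = T,
  summaryFunction = twoClassSummary
)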

Building the Neural Network Classifier

The model is trained with train; in this example the method is the neural network ("nnet"). Its tuning parameters, size (number of hidden units) and decay (weight decay), can be tuned manually (see the tuneGrid sketch after the fit below) or left at their defaults. names(getModelInfo()) lists every model tag that train supports, and "nnet" is among them:

names(getModelInfo())
##   [1] "ada"                 "AdaBag"              "AdaBoost.M1"        
##   [4] "adaboost"            "amdai"               "ANFIS"              
##   [7] "avNNet"              "awnb"                "awtan"              
##  [10] "bag"                 "bagEarth"            "bagEarthGCV"        
##  [13] "bagFDA"              "bagFDAGCV"           "bam"                
##  [16] "bartMachine"         "bayesglm"            "binda"              
##  [19] "blackboost"          "blasso"              "blassoAveraged"     
##  [22] "bridge"              "brnn"                "BstLm"              
##  [25] "bstSm"               "bstTree"             "C5.0"               
##  [28] "C5.0Cost"            "C5.0Rules"           "C5.0Tree"           
##  [31] "cforest"             "chaid"               "CSimca"             
##  [34] "ctree"               "ctree2"              "cubist"             
##  [37] "dda"                 "deepboost"           "DENFIS"             
##  [40] "dnn"                 "dwdLinear"           "dwdPoly"            
##  [43] "dwdRadial"           "earth"               "elm"                
##  [46] "enet"                "evtree"              "extraTrees"         
##  [49] "fda"                 "FH.GBML"             "FIR.DM"             
##  [52] "foba"                "FRBCS.CHI"           "FRBCS.W"            
##  [55] "FS.HGD"              "gam"                 "gamboost"           
##  [58] "gamLoess"            "gamSpline"           "gaussprLinear"      
##  [61] "gaussprPoly"         "gaussprRadial"       "gbm_h2o"            
##  [64] "gbm"                 "gcvEarth"            "GFS.FR.MOGUL"       
##  [67] "GFS.LT.RS"           "GFS.THRIFT"          "glm.nb"             
##  [70] "glm"                 "glmboost"            "glmnet_h2o"         
##  [73] "glmnet"              "glmStepAIC"          "gpls"               
##  [76] "hda"                 "hdda"                "hdrda"              
##  [79] "HYFIS"               "icr"                 "J48"                
##  [82] "JRip"                "kernelpls"           "kknn"               
##  [85] "knn"                 "krlsPoly"            "krlsRadial"         
##  [88] "lars"                "lars2"               "lasso"              
##  [91] "lda"                 "lda2"                "leapBackward"       
##  [94] "leapForward"         "leapSeq"             "Linda"              
##  [97] "lm"                  "lmStepAIC"           "LMT"                
## [100] "loclda"              "logicBag"            "LogitBoost"         
## [103] "logreg"              "lssvmLinear"         "lssvmPoly"          
## [106] "lssvmRadial"         "lvq"                 "M5"                 
## [109] "M5Rules"             "manb"                "mda"                
## [112] "Mlda"                "mlp"                 "mlpKerasDecay"      
## [115] "mlpKerasDecayCost"   "mlpKerasDropout"     "mlpKerasDropoutCost"
## [118] "mlpML"               "mlpSGD"              "mlpWeightDecay"     
## [121] "mlpWeightDecayML"    "monmlp"              "msaenet"            
## [124] "multinom"            "mxnet"               "mxnetAdam"          
## [127] "naive_bayes"         "nb"                  "nbDiscrete"         
## [130] "nbSearch"            "neuralnet"           "nnet"               
## [133] "nnls"                "nodeHarvest"         "null"               
## [136] "OneR"                "ordinalNet"          "ordinalRF"          
## [139] "ORFlog"              "ORFpls"              "ORFridge"           
## [142] "ORFsvm"              "ownn"                "pam"                
## [145] "parRF"               "PART"                "partDSA"            
## [148] "pcaNNet"             "pcr"                 "pda"                
## [151] "pda2"                "penalized"           "PenalizedLDA"       
## [154] "plr"                 "pls"                 "plsRglm"            
## [157] "polr"                "ppr"                 "pre"                
## [160] "PRIM"                "protoclass"          "qda"                
## [163] "QdaCov"              "qrf"                 "qrnn"               
## [166] "randomGLM"           "ranger"              "rbf"                
## [169] "rbfDDA"              "Rborist"             "rda"                
## [172] "regLogistic"         "relaxo"              "rf"                 
## [175] "rFerns"              "RFlda"               "rfRules"            
## [178] "ridge"               "rlda"                "rlm"                
## [181] "rmda"                "rocc"                "rotationForest"     
## [184] "rotationForestCp"    "rpart"               "rpart1SE"           
## [187] "rpart2"              "rpartCost"           "rpartScore"         
## [190] "rqlasso"             "rqnc"                "RRF"                
## [193] "RRFglobal"           "rrlda"               "RSimca"             
## [196] "rvmLinear"           "rvmPoly"             "rvmRadial"          
## [199] "SBC"                 "sda"                 "sdwd"               
## [202] "simpls"              "SLAVE"               "slda"               
## [205] "smda"                "snn"                 "sparseLDA"          
## [208] "spikeslab"           "spls"                "stepLDA"            
## [211] "stepQDA"             "superpc"             "svmBoundrangeString"
## [214] "svmExpoString"       "svmLinear"           "svmLinear2"         
## [217] "svmLinear3"          "svmLinearWeights"    "svmLinearWeights2"  
## [220] "svmPoly"             "svmRadial"           "svmRadialCost"      
## [223] "svmRadialSigma"      "svmRadialWeights"    "svmSpectrumString"  
## [226] "tan"                 "tanSearch"           "treebag"            
## [229] "vbmpRadial"          "vglmAdjCat"          "vglmContRatio"      
## [232] "vglmCumulative"      "widekernelpls"       "WM"                 
## [235] "wsrf"                "xgbDART"             "xgbLinear"          
## [238] "xgbTree"             "xyf"
set.seed(2863)
model_NNET <- train(diagnosis ~ ., data = trainData, method = "nnet", tuneLength = 2,
    metric = "ROC", trControl = fitControl)
plot(model_NNET, main = "NNET")

[Figure: plot(model_NNET): resampled ROC across nnet tuning parameters]
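Instead of tuneLength, an explicit grid of size and decay values can be supplied through tuneGrid for manual tuning; the grid values below are illustrative:

nnetGrid <- expand.grid(size = c(1, 3, 5), decay = c(0, 0.01, 0.1))
set.seed(2863)
model_NNET_grid <- train(diagnosis ~ ., data = trainData, method = "nnet",
    tuneGrid = nnetGrid, metric = "ROC", trControl = fitControl, trace = FALSE)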

Computing Variable Importance

#6.2 How to compute variable importance?
varimp_NNET <- varImp(model_NNET)
plot(varimp_NNET, main="Variable Importance with BreastCancer")

[Figure: Variable importance for the NNET model]

Computing the Confusion Matrix

For a classification model, the confusion matrix gives a clear picture of how accurate the predictions are.

# 6.5 Compute the confusion matrix
predProb <- predict(model_NNET, testData, type = "prob")
head(predProb)
##             B          M
## 2  0.01380695 0.98619305
## 3  0.01445181 0.98554819
## 10 0.12982585 0.87017415
## 11 0.01691759 0.98308241
## 12 0.96866714 0.03133286
## 15 0.01384554 0.98615446
predicted = predict(model_NNET, testData)
testData$predProb = predProb$B
testData$diagnosis = as.factor(testData$diagnosis)
confusionMatrix(reference = testData$diagnosis, data = predicted, mode = "everything",
    positive = "B")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 104   4
##          M   3  59
##                                          
##                Accuracy : 0.9588         
##                  95% CI : (0.917, 0.9833)
##     No Information Rate : 0.6294         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9114         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9720         
##             Specificity : 0.9365         
##          Pos Pred Value : 0.9630         
##          Neg Pred Value : 0.9516         
##               Precision : 0.9630         
##                  Recall : 0.9720         
##                      F1 : 0.9674         
##              Prevalence : 0.6294         
##          Detection Rate : 0.6118         
##    Detection Prevalence : 0.6353         
##       Balanced Accuracy : 0.9542         
##                                          
##        'Positive' Class : B              
##

Plotting the ROC Curve

After the model is built, its accuracy needs to be assessed, so we compute the AUC and plot the ROC curve (an AUC sketch follows the plot code).

library(ROCR)
# predProb holds P(diagnosis = "B"), so declare "B" as the positive class
pred = prediction(testData$predProb, testData$diagnosis, label.ordering = c("M", "B"))
perf = performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, lwd = 2, col = "blue", main = "ROC")
abline(a = 0, b = 1, col = 2, lwd = 1, lty = 2)
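The AUC mentioned above can be read off the same ROCR prediction object (a minimal sketch):

# Area under the ROC curve
auc <- performance(pred, measure = "auc")@y.values[[1]]
auc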

[Figure: ROC curve for the NNET model on the test data]

Comparing Multiple Classifiers

# Train the model using rf
model_rf = train(diagnosis ~ ., data = trainData, method = "rf", tuneLength = 2,
    trControl = fitControl)
model_rf
## Random Forest 
## 
## 398 samples
##  20 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 319, 318, 319, 318, 318 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens   Spec     
##    2    0.9869126  0.964  0.9195402
##   20    0.9793701  0.948  0.9193103
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Train the model using adaboost
model_adaboost = train(diagnosis ~ ., data = trainData, method = "adaboost", tuneLength = 2,
    trControl = fitControl)
model_adaboost
## AdaBoost Classification Trees 
## 
## 398 samples
##  20 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 318, 319, 318, 319, 318 
## Resampling results across tuning parameters:
## 
##   nIter  method         ROC        Sens   Spec     
##    50    Adaboost.M1    0.9921655  0.972  0.9321839
##    50    Real adaboost  0.8474414  0.984  0.9319540
##   100    Adaboost.M1    0.9917563  0.976  0.9183908
##   100    Real adaboost  0.8507287  0.984  0.9250575
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nIter = 50 and method = Adaboost.M1.
# Train the model using Logitboost
model_LogitBoost = train(diagnosis ~ ., data = trainData, method = "LogitBoost",
    tuneLength = 2, trControl = fitControl)
model_LogitBoost
## Boosted Logistic Regression 
## 
## 398 samples
##  20 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 319, 318, 319, 318, 318 
## Resampling results across tuning parameters:
## 
##   nIter  ROC        Sens   Spec     
##   11     0.9881264  0.972  0.9388506
##   21     0.9878874  0.960  0.9457471
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was nIter = 11.

# Train the model using GBM
model_GBM = train(diagnosis ~ ., data = trainData, method = "gbm", tuneLength = 2,
    trControl = fitControl)
model_GBM
## Stochastic Gradient Boosting 
## 
## 398 samples
##  20 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 318, 318, 319, 319, 318 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  ROC        Sens   Spec     
##   1                   50      0.9857057  0.968  0.9186207
##   1                  100      0.9870989  0.968  0.9186207
##   2                   50      0.9885103  0.960  0.9183908
##   2                  100      0.9900184  0.976  0.9252874
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
##  2, shrinkage = 0.1 and n.minobsinnode = 10.

# Train the model using PAM
model_PAM = train(diagnosis ~ ., data = trainData, method = "pam", tuneLength = 2,
    trControl = fitControl)
## 123456789101112131415161718192021222324252627282930111111
model_PAM
## Nearest Shrunken Centroids 
## 
## 398 samples
##  20 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 318, 319, 318, 318, 319 
## Resampling results across tuning parameters:
## 
##   threshold   ROC        Sens  Spec     
##    0.3730189  0.9581609  0.94  0.8045977
##   10.4445296  0.5000000  1.00  0.0000000
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was threshold = 0.3730189.

# Train the model using NB
model_NB = train(diagnosis ~ ., data = trainData, method = "naive_bayes", tuneLength = 2,
    trControl = fitControl)
model_NB
## Naive Bayes 
## 
## 398 samples
##  20 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 318, 318, 319, 319, 318 
## Resampling results across tuning parameters:
## 
##   usekernel  ROC        Sens   Spec     
##   FALSE      0.9717885  0.936  0.9188506
##    TRUE      0.9625701  0.912  0.9124138
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
##  and adjust = 1.

models_compare <- resamples(list(ADABOOST = model_adaboost, RF = model_rf, LOGITBOOST = model_LogitBoost,
    GBM = model_GBM, PAM = model_PAM, NB = model_NB, NNET = model_NNET))
summary(models_compare)
## 
## Call:
## summary.resamples(object = models_compare)
## 
## Models: ADABOOST, RF, LOGITBOOST, GBM, PAM, NB, NNET 
## Number of resamples: 5 
## 
## ROC 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## ADABOOST   0.9880000 0.9917241 0.9920000 0.9921655 0.9931034 0.9960000    0
## RF         0.9623333 0.9889655 0.9920000 0.9869126 0.9933333 0.9979310    0
## LOGITBOOST 0.9717241 0.9806667 0.9913333 0.9881264 0.9972414 0.9996667    0
## GBM        0.9733333 0.9868966 0.9940000 0.9900184 0.9958621 1.0000000    0
## PAM        0.9213793 0.9433333 0.9700000 0.9581609 0.9733333 0.9827586    0
## NB         0.9413333 0.9703448 0.9779310 0.9717885 0.9833333 0.9860000    0
## NNET       0.9760000 0.9866667 0.9951724 0.9911540 0.9979310 1.0000000    0
## 
## Sens 
##            Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
## ADABOOST   0.92    0.98   0.98 0.972    0.98 1.00    0
## RF         0.94    0.94   0.98 0.964    0.98 0.98    0
## LOGITBOOST 0.92    0.98   0.98 0.972    0.98 1.00    0
## GBM        0.94    0.96   0.98 0.976    1.00 1.00    0
## PAM        0.88    0.94   0.96 0.940    0.96 0.96    0
## NB         0.92    0.92   0.92 0.936    0.94 0.98    0
## NNET       0.96    0.96   0.98 0.980    1.00 1.00    0
## 
## Spec 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## ADABOOST   0.8965517 0.9000000 0.9310345 0.9321839 0.9333333 1.0000000    0
## RF         0.8333333 0.9000000 0.9333333 0.9195402 0.9655172 0.9655172    0
## LOGITBOOST 0.8275862 0.9000000 0.9666667 0.9388506 1.0000000 1.0000000    0
## GBM        0.8620690 0.8666667 0.9310345 0.9252874 0.9666667 1.0000000    0
## PAM        0.7000000 0.7586207 0.8000000 0.8045977 0.8333333 0.9310345    0
## NB         0.8333333 0.8965517 0.9310345 0.9188506 0.9333333 1.0000000    0
## NNET       0.8666667 0.9000000 0.9310345 0.9395402 1.0000000 1.0000000    0
# Draw box plots to compare models
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(models_compare, scales = scales)

[Figure: Box plots of resampled ROC, Sens and Spec across models]
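Beyond the box plots, caret can also summarise pairwise differences between the models' resampled metrics (a minimal sketch):

# Pairwise differences in resampled ROC, sensitivity and specificity
model_diffs <- diff(models_compare)
summary(model_diffs)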

Generating Test-Set Results

Plotting Calibration Curves

## Generate the test set results
results <- data.frame(Diagnosis = testData$diagnosis)
results$RF <- predict(model_rf, testData, type = "prob")[, "B"]
results$adaboost <- predict(model_adaboost, testData, type = "prob")[, "B"]
results$LogitBoost <- predict(model_LogitBoost, testData, type = "prob")[, "B"]
results$GBM <- predict(model_GBM, testData, type = "prob")[, "B"]
results$PAM <- predict(model_PAM, testData, type = "prob")[, "B"]
results$NB <- predict(model_NB, testData, type = "prob")[, "B"]
results$NNET <- predict(model_NNET, testData, type = "prob")[, "B"]
head(results)
##   Diagnosis    RF  adaboost   LogitBoost         GBM          PAM           NB
## 1         M 0.010 0.0000000 0.0066928509 0.003343993 1.538730e-02 2.842085e-20
## 2         M 0.208 0.2198671 0.0474258732 0.066966943 6.673408e-05 1.290245e-56
## 3         M 0.760 0.7028569 0.7310585786 0.531578623 9.392887e-01 9.999659e-01
## 4         M 0.032 0.0517809 0.0009110512 0.003528009 1.155797e-01 1.031544e-09
## 5         M 0.194 0.2142548 0.0066928509 0.008309093 4.748940e-03 7.900317e-23
## 6         M 0.048 0.1512978 0.0474258732 0.012181739 9.553552e-03 2.765838e-19
##         NNET
## 1 0.01380695
## 2 0.01445181
## 3 0.12982585
## 4 0.01691759
## 5 0.96866714
## 6 0.01384554
trellis.par.set(caretTheme())
cal_obj <- calibration(Diagnosis ~ RF + adaboost + LogitBoost + GBM + PAM + NB +
    NNET, data = results, cuts = 13)
plot(cal_obj, type = "l", auto.key = list(columns = 5, lines = TRUE, points = FALSE))

[Figure: Calibration curves for all models]

There is also a ggplot method that shows confidence intervals for the event proportion within each bin:

ggplot(cal_obj)

[Figure: Calibration curves with confidence intervals (ggplot)]

