ML 44. Machine Learning: Gradient Boosted Regression Trees for Survival Data (BlackBoost)


Introduction

Gradient Boosted Regression Trees (GBRT) is a regression method built from tree models. Gradient boosting constructs trees sequentially, each new tree trying to correct the errors of the ensemble built so far. By default there is no randomization in gradient boosted regression trees; strong pre-pruning is used instead. The trees are typically very shallow, which keeps the model small in memory and fast at prediction.
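To make the "each tree corrects the previous one" idea concrete, here is a minimal hand-rolled sketch for squared-error loss. It uses rpart stumps purely for illustration (blackboost itself uses conditional inference trees from partykit), and all object names are hypothetical:

library(rpart)

# Toy data: y = sin(x) + noise
set.seed(1)
x <- runif(200, 0, 2 * pi)
d <- data.frame(x = x, y = sin(x) + rnorm(200, sd = 0.2))

nu   <- 0.1                       # step size (learning rate)
pred <- rep(mean(d$y), nrow(d))   # start from the constant model
for (m in 1:100) {
  d$r   <- d$y - pred             # residuals = negative gradient of the squared-error loss
  stump <- rpart(r ~ x, data = d, # shallow base learner
                 control = rpart.control(maxdepth = 2, cp = 0))
  pred  <- pred + nu * predict(stump, d)  # each tree corrects the current ensemble
}
mean((d$y - pred)^2)              # training MSE of the boosted ensemble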


Package Installation

if(!require(mboost))
  install.packages("mboost")

Data Loading

Data description:

  • futime: survival or censoring time

  • fustat: censoring status

  • age: in years

  • resid.ds: residual disease present (1=no,2=yes)

  • rx: treatment group

  • ecog.ps: ECOG performance status (1 is better, see reference)

data(cancer, package="survival")
head(ovarian)
##   futime fustat     age resid.ds rx ecog.ps
## 1     59      1 72.3315        2  1       1
## 2    115      1 74.4932        2  1       1
## 3    156      1 66.4658        2  1       2
## 4    421      0 53.3644        2  2       1
## 5    431      1 50.3397        2  1       1
## 6    448      0 56.4301        1  1       2
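As a side note, resid.ds, rx and ecog.ps are stored as integer codes. When experimenting beyond this tutorial, it can be cleaner to recode them as factors so that tree base-learners treat them as categorical; a sketch (the copy ov is hypothetical, and the code below keeps the original numeric coding):

ov <- ovarian
ov[c("resid.ds", "rx", "ecog.ps")] <-
  lapply(ov[c("resid.ds", "rx", "ecog.ps")], factor)
str(ov)  # the three covariates are now factors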

Worked Example

Parameter description:

blackboost(formula, data = list(),
           weights = NULL, na.action = na.pass,
           offset = NULL, family = Gaussian(), 
           control = boost_control(),
           oobweights = NULL,
           tree_controls = partykit::ctree_control(
               teststat = "quad",
               testtype = "Teststatistic",
               mincriterion = 0,
               minsplit = 10, 
               minbucket = 4,
               maxdepth = 2, 
               saveinfo = FALSE),
           ...)
Arguments:

  • formula: a symbolic description of the model to be fit.

  • data: a data frame containing the variables in the model.

  • weights: an optional vector of weights to be used in the fitting process.

  • na.action: a function which indicates what should happen when the data contain NAs.

  • offset: a numeric vector to be used as offset (optional).

  • family: a Family object.

  • control: a list of parameters controlling the algorithm. For more details see boost_control.

  • oobweights: an additional vector of out-of-bag weights, used for the out-of-bag risk (i.e., if boost_control(risk = "oobag")). This argument is also used internally by cvrisk.

  • tree_controls: an object of class "TreeControl", obtained via ctree_control, defining hyper-parameters for the trees used as base-learners. Make sure you understand the consequences of altering any of its arguments. By default, two-way interactions (but not deeper trees) are fitted.

  • ...: additional arguments passed to mboost_fit, including weights, offset, family and control. For default values see mboost_fit.
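For example, the depth of the tree base-learners can be adjusted through tree_controls. A minimal sketch that allows three-way interactions (ctrl and fit3 are illustrative names, and fm is the survival formula defined in the next section):

ctrl <- partykit::ctree_control(teststat = "quad", testtype = "Teststatistic",
                                mincriterion = 0, minsplit = 10,
                                minbucket = 4, maxdepth = 3, saveinfo = FALSE)
# fit3 <- blackboost(fm, data = ovarian, family = CoxPH(), tree_controls = ctrl)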

Building the Model

Following the parameter description above, build the model on the ovarian dataset:

library(survivalmodels)
library(survival)
library(mboost)
set.seed(123)
fm <- Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps
fit <- blackboost(fm, data = ovarian, family = CoxPH(), control = boost_control(mstop = 500))
summary(fit)
## 
## 	 Model-based Boosting
## 
## Call:
## blackboost(formula = fm, data = ovarian, family = CoxPH(), control = boost_control(mstop = 500))
## 
## 
## 	 Cox Partial Likelihood 
## 
## Loss function:  
## 
## Number of boosting iterations: mstop = 500 
## Step size:  0.1 
## Offset:  0 
## Number of baselearners:  1 
## 
## Selection frequencies:
## btree(mf, tree_controls = tree_controls) 
##                                        1

Cross-Validation

set.seed(123)
cv <- cv(model.weights(fit), type = "kfold")     # 10-fold cross-validation folds
cvm <- cvrisk(fit, folds = cv, papply = lapply)  # out-of-sample empirical risk at each mstop
mstop(cvm)
## [1] 18
plot(cvm)

[Figure: cross-validated risk from plot(cvm) across boosting iterations]
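cv() builds 10 folds by default; other resampling schemes are available through its type and B arguments. A sketch (cv_boot and cvm_boot are illustrative names, not used below):

cv_boot  <- cv(model.weights(fit), type = "bootstrap", B = 25)  # 25 bootstrap samples
cvm_boot <- cvrisk(fit, folds = cv_boot, papply = lapply)
mstop(cvm_boot)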

Optimal Model

Refit the model at the selected stopping iteration (in practice this value is taken from mstop(cvm) above):

######### Optimal model
OptFit <- blackboost(fm, data = ovarian, family = CoxPH(),
                     control = boost_control(mstop = 55))

summary(OptFit)
## 
## 	 Model-based Boosting
## 
## Call:
## blackboost(formula = fm, data = ovarian, family = CoxPH(), control = boost_control(mstop = 55))
## 
## 
## 	 Cox Partial Likelihood 
## 
## Loss function:  
## 
## Number of boosting iterations: mstop = 55 
## Step size:  0.1 
## Offset:  0 
## Number of baselearners:  1 
## 
## Selection frequencies:
## btree(mf, tree_controls = tree_controls) 
##                                        1
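Instead of refitting from scratch, a boosted mboost model can also be reduced to the cross-validated stopping iteration by subsetting; a sketch (note the documented mboost behavior that subsetting modifies the original object by reference):

fit2 <- fit[mstop(cvm)]          # shrink the model to the optimal mstop
# equivalently: mstop(fit) <- mstop(cvm)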

Prediction

pred <- predict(OptFit, newdata = ovarian, type = "link")  # linear predictor for each subject
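With family = CoxPH(), type = "link" returns the linear predictor, i.e. the log relative risk; exponentiating it gives a relative-risk scale (risk is an illustrative name):

risk <- exp(as.numeric(pred))    # relative risk versus a subject with linear predictor 0
head(round(risk, 3))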

Concordance

library(Hmisc)
rcorr.cens(pred, Surv(ovarian$futime, ovarian$fustat))
##        C Index            Dxy           S.D.              n        missing 
##      0.1605505     -0.6788991      0.1422779     26.0000000      0.0000000 
##     uncensored Relevant Pairs     Concordant      Uncertain 
##     12.0000000    436.0000000     70.0000000    214.0000000
# pred is a risk score (larger = higher risk), so the usable concordance
# is 1 minus the reported C Index:
1 - 0.1605505
## [1] 0.8394495
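As a cross-check, the concordance can also be computed with survival::concordance, with reverse = TRUE because larger predictions indicate shorter survival (a sketch using the already-loaded survival package):

concordance(Surv(futime, fustat) ~ as.numeric(pred),
            data = ovarian, reverse = TRUE)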

Survival Analysis

Survival curves can be obtained directly from the boosted Cox model:

S1 <- survFit(OptFit)
S1
## $surv
##            [,1]
##  [1,] 0.9885428
##  [2,] 0.9743340
##  [3,] 0.9555477
##  [4,] 0.9275716
##  [5,] 0.8711831
##  [6,] 0.8169079
##  [7,] 0.7611885
##  [8,] 0.6886802
##  [9,] 0.6147990
## [10,] 0.5434993
## [11,] 0.4556460
## [12,] 0.3744336
## 
## $time
##   1   2   3  22  23  24  25   5   7   8  10  11 
##  59 115 156 268 329 353 365 431 464 475 563 638 
## 
## $n.event
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1
## 
## attr(,"class")
## [1] "survFit"
newdata <- ovarian[c(1, 3, 12), ]
S2 <- survFit(fit, newdata = newdata)  # curves for three patients, from the 500-iteration fit
S2
## $surv
##                   1            3        12
##  [1,]  6.490381e-01 9.427433e-01 0.9999928
##  [2,]  3.031142e-01 8.497489e-01 0.9999800
##  [3,]  1.246987e-02 5.498874e-01 0.9999267
##  [4,]  4.387625e-05 2.544462e-01 0.9998322
##  [5,]  8.815378e-16 8.841313e-03 0.9994203
##  [6,]  5.603102e-27 2.625797e-04 0.9989893
##  [7,]  1.388944e-40 3.659526e-06 0.9984659
##  [8,]  3.909110e-62 4.206316e-09 0.9976375
##  [9,]  3.182394e-95 1.289695e-13 0.9963668
## [10,] 3.625818e-139 1.307896e-19 0.9946817
## [11,] 1.166763e-218 1.878159e-30 0.9916408
## [12,]  0.000000e+00 1.834657e-46 0.9871676
## 
## $time
##   1   2   3  22  23  24  25   5   7   8  10  11 
##  59 115 156 268 329 353 365 431 464 475 563 638 
## 
## $n.event
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1
## 
## attr(,"class")
## [1] "survFit"
par(mfrow = c(1, 2), mai = par("mai") * c(1, 1, 1, 2.5))
plot(S1, col = "red")
plot(S2)

[Figure: predicted survivor functions; left: S1, right: S2 for observations 1, 3 and 12]

ROC Curve

library(pROC)
# name the object roc_obj to avoid shadowing pROC::roc; the plotting
# arguments belong to plot(), not to roc()
roc_obj <- roc(ovarian$fustat, as.numeric(pred))
roc_obj$auc
## Area under the curve: 0.878
plot(roc_obj, legacy.axes = TRUE, col = "red", lwd = 2, main = "ROC curve")
text(0.2, 0.2, paste("AUC:", round(roc_obj$auc, 3)))

[Figure: ROC curve, AUC = 0.878]
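Note that treating fustat as a plain binary label ignores the censoring times. For survival data, a time-dependent ROC is often preferable; a sketch using the timeROC package (an extra dependency not used elsewhere in this post), evaluating the AUC at one year:

library(timeROC)
troc <- timeROC(T = ovarian$futime, delta = ovarian$fustat,
                marker = as.numeric(pred), cause = 1, times = 365)
troc$AUC  # AUC at t = 365 days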

References
  1. Peter Buehlmann and Torsten Hothorn (2007), Boosting algorithms: regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.

  2. Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.

  3. Yoav Freund and Robert E. Schapire (1996), Experiments with a new boosting algorithm. In Machine Learning: Proc. Thirteenth International Conference, 148–156.

  4. Jerome H. Friedman (2001), Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189–1232.

  5. Greg Ridgeway (1999), The state of boosting. Computing Science and Statistics, 31, 172–181.


Building Clinical Prediction Models with Machine Learning

MachineLearning 1. Principal Component Analysis (PCA)

MachineLearning 2. Factor Analysis

MachineLearning 3. Cluster Analysis

MachineLearning 4. K-Nearest Neighbors (KNN) for Cancer Diagnosis

MachineLearning 5. Support Vector Machines (SVM) for Cancer Diagnosis and Molecular Subtyping

MachineLearning 6. Classification Trees for Cancer Diagnosis

MachineLearning 7. Regression Trees for Cancer Diagnosis

MachineLearning 8. Random Forest for Cancer Diagnosis

MachineLearning 9. Gradient Boosting for Cancer Diagnosis

MachineLearning 10. Neural Networks for Cancer Diagnosis

MachineLearning 11. Random Survival Forests (randomForestSRC)

MachineLearning 12. Dimensionality Reduction and Visualization with t-SNE (Rtsne)

MachineLearning 13. Dimensionality Reduction and Visualization with UMAP (umap)

MachineLearning 14. Ensemble Classifiers (AdaBoost)

MachineLearning 15. Ensemble Classifiers (LogitBoost)

MachineLearning 16. Gradient Boosting Machines (GBM)

MachineLearning 17. Partitioning Around Medoids (PAM)

MachineLearning 18. Naive Bayes Classifiers (Naive Bayes)

MachineLearning 19. Neural Network Classifiers (NNET)

MachineLearning 20. Bagged Classification and Regression Trees (Bagged CART)

MachineLearning 21. Survival Analysis in Clinical Medicine (xgboost)

MachineLearning 22. Supervised Principal Components for Gene Screening (SuperPC)

MachineLearning 23. Ridge Regression for Genotype and Phenotype Prediction (Ridge)

MachineLearning 24. Likelihood-Based Boosted Cox Models for Variable Selection and Prognosis (CoxBoost)

MachineLearning 25. Support Vector Machines for Survival Analysis (survivalsvm)

MachineLearning 26. Elastic Net for Survival Analysis (Enet)

MachineLearning 27. Stepwise Cox Regression for Variable Selection (StepCox)

MachineLearning 28. Partial Least Squares Regression for Survival Analysis (plsRcox)

MachineLearning 29. Nested Cross-Validation (Nested CV)

MachineLearning 30. Feature Selection with Boruta

MachineLearning 31. RNA-seq-Based Gene Feature Selection (GeneSelectR)

MachineLearning 32. Feature Selection via SVM Recursive Feature Elimination (mSVM-RFE)

MachineLearning 33. Time-to-Event Prediction with Neural Networks and Cox Regression

MachineLearning 34. Deep Learning for Competing-Risks Survival Analysis (DeepHit)

MachineLearning 35. Lasso + Cox Regression for Variable Selection (LassoCox)

MachineLearning 36. Neural-Network-Based Cox Proportional Hazards Models (DeepSurv)

MachineLearning 37. Oblique Random Survival Forests (obliqueRSF)

MachineLearning 38. Tumor Subtype Classification with Nearest Shrunken Centroids (pamr)

MachineLearning 39. Clinical Prediction with Conditional Random Forest Survival Analysis (CForest)

MachineLearning 40. Clinical Prediction with Conditional Inference Tree Survival Analysis (CTree)

MachineLearning 41. Parametric Survival Regression Models (survreg)

MachineLearning 42. Akritas Conditional Nonparametric Survival Estimation (akritas)

Kyoho Gene, forging your success!

The Kyoho Gene official account will keep releasing machine-learning tutorials for bioinformatics analysis. Stay tuned!

The Kyoho Gene official website is now live. Please follow us, and feel free to point out anything we can improve: http://www.kyohogene.com/

Kyoho Gene has partnered with TopEdit: manuscript editing is 15% off. Enter the exclusive coupon code KYOHOGENE on the site, upload your manuscript, and select the coupon at checkout to receive the discount: https://www.topeditsci.com/

