ML 44. Machine Learning: Gradient Boosted Regression Trees for Survival Data (BlackBoost)


Introduction

Gradient Boosted Regression Trees (GBRT) is a regression method built from tree models. Gradient boosting constructs trees sequentially, each new tree trying to correct the errors of the ensemble built so far. By default there is no randomization in gradient boosted regression trees; strong pre-pruning is used instead. The trees are typically very shallow, which keeps the model small in memory and fast at prediction.
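To make the "each tree corrects the previous one" idea concrete, here is a minimal hand-rolled sketch for squared-error loss. It uses rpart stumps purely for illustration (blackboost itself uses conditional inference trees from partykit), and all object names are hypothetical:

library(rpart)

# Toy data: y = sin(x) + noise
set.seed(1)
x <- runif(200, 0, 2 * pi)
d <- data.frame(x = x, y = sin(x) + rnorm(200, sd = 0.2))

nu   <- 0.1                       # step size (learning rate)
pred <- rep(mean(d$y), nrow(d))   # start from the constant model
for (m in 1:100) {
  d$r   <- d$y - pred             # residuals = negative gradient of the squared-error loss
  stump <- rpart(r ~ x, data = d, # shallow base learner
                 control = rpart.control(maxdepth = 2, cp = 0))
  pred  <- pred + nu * predict(stump, d)  # each tree corrects the current ensemble
}
mean((d$y - pred)^2)              # training MSE of the boosted ensemble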


Package Installation

if(!require(mboost))
  install.packages("mboost")

Data Loading

Data description:

  • futime: survival or censoring time

  • fustat: censoring status

  • age: in years

  • resid.ds: residual disease present (1=no,2=yes)

  • rx: treatment group

  • ecog.ps: ECOG performance status (1 is better, see reference)

data(cancer, package="survival")
head(ovarian)
##   futime fustat     age resid.ds rx ecog.ps
## 1     59      1 72.3315        2  1       1
## 2    115      1 74.4932        2  1       1
## 3    156      1 66.4658        2  1       2
## 4    421      0 53.3644        2  2       1
## 5    431      1 50.3397        2  1       1
## 6    448      0 56.4301        1  1       2
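As a side note, resid.ds, rx and ecog.ps are stored as integer codes. When experimenting beyond this tutorial, it can be cleaner to recode them as factors so that tree base-learners treat them as categorical; a sketch (the copy ov is hypothetical, and the code below keeps the original numeric coding):

ov <- ovarian
ov[c("resid.ds", "rx", "ecog.ps")] <-
  lapply(ov[c("resid.ds", "rx", "ecog.ps")], factor)
str(ov)  # the three covariates are now factors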

Worked Example

Parameter description:

blackboost(formula, data = list(),
           weights = NULL, na.action = na.pass,
           offset = NULL, family = Gaussian(), 
           control = boost_control(),
           oobweights = NULL,
           tree_controls = partykit::ctree_control(
               teststat = "quad",
               testtype = "Teststatistic",
               mincriterion = 0,
               minsplit = 10, 
               minbucket = 4,
               maxdepth = 2, 
               saveinfo = FALSE),
           ...)
Arguments:

  • formula: a symbolic description of the model to be fit.

  • data: a data frame containing the variables in the model.

  • weights: an optional vector of weights to be used in the fitting process.

  • na.action: a function which indicates what should happen when the data contain NAs.

  • offset: a numeric vector to be used as offset (optional).

  • family: a Family object.

  • control: a list of parameters controlling the algorithm. For more details see boost_control.

  • oobweights: an additional vector of out-of-bag weights, used for the out-of-bag risk (i.e., if boost_control(risk = "oobag")). This argument is also used internally by cvrisk.

  • tree_controls: an object of class "TreeControl", obtained via ctree_control, defining hyper-parameters for the trees used as base-learners. Make sure you understand the consequences of altering any of its arguments. By default, two-way interactions (but not deeper trees) are fitted.

  • ...: additional arguments passed to mboost_fit, including weights, offset, family and control. For default values see mboost_fit.
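For example, the depth of the tree base-learners can be adjusted through tree_controls. A minimal sketch that allows three-way interactions (ctrl and fit3 are illustrative names, and fm is the survival formula defined in the next section):

ctrl <- partykit::ctree_control(teststat = "quad", testtype = "Teststatistic",
                                mincriterion = 0, minsplit = 10,
                                minbucket = 4, maxdepth = 3, saveinfo = FALSE)
# fit3 <- blackboost(fm, data = ovarian, family = CoxPH(), tree_controls = ctrl)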

Building the Model

Following the parameter description above, build the model on the ovarian dataset:

library(survivalmodels)
library(survival)
library(mboost)
set.seed(123)
fm <- Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps
fit <- blackboost(fm, data = ovarian, family = CoxPH(), control = boost_control(mstop = 500))
summary(fit)
## 
## 	 Model-based Boosting
## 
## Call:
## blackboost(formula = fm, data = ovarian, family = CoxPH(), control = boost_control(mstop = 500))
## 
## 
## 	 Cox Partial Likelihood 
## 
## Loss function:  
## 
## Number of boosting iterations: mstop = 500 
## Step size:  0.1 
## Offset:  0 
## Number of baselearners:  1 
## 
## Selection frequencies:
## btree(mf, tree_controls = tree_controls) 
##                                        1

Cross-Validation

set.seed(123)
cv <- cv(model.weights(fit), type = "kfold")     # 10-fold cross-validation folds
cvm <- cvrisk(fit, folds = cv, papply = lapply)  # out-of-sample empirical risk at each mstop
mstop(cvm)
## [1] 18
plot(cvm)

[Figure: cross-validated risk from plot(cvm) across boosting iterations]
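cv() builds 10 folds by default; other resampling schemes are available through its type and B arguments. A sketch (cv_boot and cvm_boot are illustrative names, not used below):

cv_boot  <- cv(model.weights(fit), type = "bootstrap", B = 25)  # 25 bootstrap samples
cvm_boot <- cvrisk(fit, folds = cv_boot, papply = lapply)
mstop(cvm_boot)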

Optimal Model

Refit the model at the selected stopping iteration (in practice this value is taken from mstop(cvm) above):

######### Optimal model
OptFit <- blackboost(fm, data = ovarian, family = CoxPH(),
                     control = boost_control(mstop = 55))

summary(OptFit)
## 
## 	 Model-based Boosting
## 
## Call:
## blackboost(formula = fm, data = ovarian, family = CoxPH(), control = boost_control(mstop = 55))
## 
## 
## 	 Cox Partial Likelihood 
## 
## Loss function:  
## 
## Number of boosting iterations: mstop = 55 
## Step size:  0.1 
## Offset:  0 
## Number of baselearners:  1 
## 
## Selection frequencies:
## btree(mf, tree_controls = tree_controls) 
##                                        1
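Instead of refitting from scratch, a boosted mboost model can also be reduced to the cross-validated stopping iteration by subsetting; a sketch (note the documented mboost behavior that subsetting modifies the original object by reference):

fit2 <- fit[mstop(cvm)]          # shrink the model to the optimal mstop
# equivalently: mstop(fit) <- mstop(cvm)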

Prediction

pred <- predict(OptFit, newdata = ovarian, type = "link")  # linear predictor for each subject
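With family = CoxPH(), type = "link" returns the linear predictor, i.e. the log relative risk; exponentiating it gives a relative-risk scale (risk is an illustrative name):

risk <- exp(as.numeric(pred))    # relative risk versus a subject with linear predictor 0
head(round(risk, 3))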

Concordance

library(Hmisc)
rcorr.cens(pred, Surv(ovarian$futime, ovarian$fustat))
##        C Index            Dxy           S.D.              n        missing 
##      0.1605505     -0.6788991      0.1422779     26.0000000      0.0000000 
##     uncensored Relevant Pairs     Concordant      Uncertain 
##     12.0000000    436.0000000     70.0000000    214.0000000
# pred is a risk score (larger = higher risk), so the usable concordance
# is 1 minus the reported C Index:
1 - 0.1605505
## [1] 0.8394495
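As a cross-check, the concordance can also be computed with survival::concordance, with reverse = TRUE because larger predictions indicate shorter survival (a sketch using the already-loaded survival package):

concordance(Surv(futime, fustat) ~ as.numeric(pred),
            data = ovarian, reverse = TRUE)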

Survival Analysis

Survival curves can be obtained directly from the boosted Cox model:

S1 <- survFit(OptFit)
S1
## $surv
##            [,1]
##  [1,] 0.9885428
##  [2,] 0.9743340
##  [3,] 0.9555477
##  [4,] 0.9275716
##  [5,] 0.8711831
##  [6,] 0.8169079
##  [7,] 0.7611885
##  [8,] 0.6886802
##  [9,] 0.6147990
## [10,] 0.5434993
## [11,] 0.4556460
## [12,] 0.3744336
## 
## $time
##   1   2   3  22  23  24  25   5   7   8  10  11 
##  59 115 156 268 329 353 365 431 464 475 563 638 
## 
## $n.event
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1
## 
## attr(,"class")
## [1] "survFit"
newdata <- ovarian[c(1, 3, 12), ]
S2 <- survFit(fit, newdata = newdata)  # curves for three patients, from the 500-iteration fit
S2
## $surv
##                   1            3        12
##  [1,]  6.490381e-01 9.427433e-01 0.9999928
##  [2,]  3.031142e-01 8.497489e-01 0.9999800
##  [3,]  1.246987e-02 5.498874e-01 0.9999267
##  [4,]  4.387625e-05 2.544462e-01 0.9998322
##  [5,]  8.815378e-16 8.841313e-03 0.9994203
##  [6,]  5.603102e-27 2.625797e-04 0.9989893
##  [7,]  1.388944e-40 3.659526e-06 0.9984659
##  [8,]  3.909110e-62 4.206316e-09 0.9976375
##  [9,]  3.182394e-95 1.289695e-13 0.9963668
## [10,] 3.625818e-139 1.307896e-19 0.9946817
## [11,] 1.166763e-218 1.878159e-30 0.9916408
## [12,]  0.000000e+00 1.834657e-46 0.9871676
## 
## $time
##   1   2   3  22  23  24  25   5   7   8  10  11 
##  59 115 156 268 329 353 365 431 464 475 563 638 
## 
## $n.event
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1
## 
## attr(,"class")
## [1] "survFit"
par(mfrow = c(1, 2), mai = par("mai") * c(1, 1, 1, 2.5))
plot(S1, col = "red")
plot(S2)

[Figure: predicted survivor functions; left: S1, right: S2 for observations 1, 3 and 12]

ROC Curve

library(pROC)
# name the object roc_obj to avoid shadowing pROC::roc; the plotting
# arguments belong to plot(), not to roc()
roc_obj <- roc(ovarian$fustat, as.numeric(pred))
roc_obj$auc
## Area under the curve: 0.878
plot(roc_obj, legacy.axes = TRUE, col = "red", lwd = 2, main = "ROC curve")
text(0.2, 0.2, paste("AUC:", round(roc_obj$auc, 3)))

[Figure: ROC curve, AUC = 0.878]
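Note that treating fustat as a plain binary label ignores the censoring times. For survival data, a time-dependent ROC is often preferable; a sketch using the timeROC package (an extra dependency not used elsewhere in this post), evaluating the AUC at one year:

library(timeROC)
troc <- timeROC(T = ovarian$futime, delta = ovarian$fustat,
                marker = as.numeric(pred), cause = 1, times = 365)
troc$AUC  # AUC at t = 365 days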

References
  1. Peter Buehlmann and Torsten Hothorn (2007), Boosting algorithms: regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.

  2. Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.

  3. Yoav Freund and Robert E. Schapire (1996), Experiments with a new boosting algorithm. In Machine Learning: Proc. Thirteenth International Conference, 148–156.

  4. Jerome H. Friedman (2001), Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189–1232.

  5. Greg Ridgeway (1999), The state of boosting. Computing Science and Statistics, 31, 172–181.


Building Clinical Prediction Models with Machine Learning

MachineLearning 1. Principal Component Analysis (PCA)

MachineLearning 2. Factor Analysis

MachineLearning 3. Cluster Analysis

MachineLearning 4. K-Nearest Neighbors (KNN) for Cancer Diagnosis

MachineLearning 5. Support Vector Machines (SVM) for Cancer Diagnosis and Molecular Subtyping

MachineLearning 6. Classification Trees for Cancer Diagnosis

MachineLearning 7. Regression Trees for Cancer Diagnosis

MachineLearning 8. Random Forest for Cancer Diagnosis

MachineLearning 9. Gradient Boosting for Cancer Diagnosis

MachineLearning 10. Neural Networks for Cancer Diagnosis

MachineLearning 11. Random Survival Forests (randomForestSRC)

MachineLearning 12. Dimensionality Reduction and Visualization with t-SNE (Rtsne)

MachineLearning 13. Dimensionality Reduction and Visualization with UMAP (umap)

MachineLearning 14. Ensemble Classifiers (AdaBoost)

MachineLearning 15. Ensemble Classifiers (LogitBoost)

MachineLearning 16. Gradient Boosting Machines (GBM)

MachineLearning 17. Partitioning Around Medoids (PAM)

MachineLearning 18. Naive Bayes Classifiers (Naive Bayes)

MachineLearning 19. Neural Network Classifiers (NNET)

MachineLearning 20. Bagged Classification and Regression Trees (Bagged CART)

MachineLearning 21. Survival Analysis in Clinical Medicine (xgboost)

MachineLearning 22. Supervised Principal Components for Gene Screening (SuperPC)

MachineLearning 23. Ridge Regression for Genotype and Phenotype Prediction (Ridge)

MachineLearning 24. Likelihood-Based Boosted Cox Models for Variable Selection and Prognosis (CoxBoost)

MachineLearning 25. Support Vector Machines for Survival Analysis (survivalsvm)

MachineLearning 26. Elastic Net for Survival Analysis (Enet)

MachineLearning 27. Stepwise Cox Regression for Variable Selection (StepCox)

MachineLearning 28. Partial Least Squares Regression for Survival Analysis (plsRcox)

MachineLearning 29. Nested Cross-Validation (Nested CV)

MachineLearning 30. Feature Selection with Boruta

MachineLearning 31. RNA-seq-Based Gene Feature Selection (GeneSelectR)

MachineLearning 32. Feature Selection via SVM Recursive Feature Elimination (mSVM-RFE)

MachineLearning 33. Time-to-Event Prediction with Neural Networks and Cox Regression

MachineLearning 34. Deep Learning for Competing-Risks Survival Analysis (DeepHit)

MachineLearning 35. Lasso + Cox Regression for Variable Selection (LassoCox)

MachineLearning 36. Neural-Network-Based Cox Proportional Hazards Models (DeepSurv)

MachineLearning 37. Oblique Random Survival Forests (obliqueRSF)

MachineLearning 38. Tumor Subtype Classification with Nearest Shrunken Centroids (pamr)

MachineLearning 39. Clinical Prediction with Conditional Random Forest Survival Analysis (CForest)

MachineLearning 40. Clinical Prediction with Conditional Inference Tree Survival Analysis (CTree)

MachineLearning 41. Parametric Survival Regression Models (survreg)

MachineLearning 42. Akritas Conditional Nonparametric Survival Estimation (akritas)

Kyoho Gene, forging your success!

The Kyoho Gene official account will keep releasing machine-learning tutorials for bioinformatics analysis. Stay tuned!

The Kyoho Gene official website is now live. Please follow us, and feel free to point out anything we can improve: http://www.kyohogene.com/

Kyoho Gene has partnered with TopEdit: manuscript editing is 15% off. Enter the exclusive coupon code KYOHOGENE on the site, upload your manuscript, and select the coupon at checkout to receive the discount: https://www.topeditsci.com/

