MachineLearning 39. Clinical Prediction via Survival Analysis with Conditional Random Forests (CForest)



Introduction

Conditional random forest (cforest) is an R function, from the party package, for building random forest models. A random forest is a machine-learning algorithm that predicts and classifies by ensembling many decision trees: a large number of trees are built, each grown independently of the others, and the final prediction combines the predictions of all the individual trees.

In this article we introduce the usage of cforest and provide some example code.


Package Installation

Install the package as follows:

if(!require("party"))
  install.packages("party")

Data Loading

The data set is GBSG2 from the TH.data package, which contains data on 686 women:

  • horTh hormonal therapy, a factor at two levels no and yes.

  • age of the patients in years.

  • menostat menopausal status, a factor at two levels pre (premenopausal) and post (postmenopausal).

  • tsize tumor size (in mm).

  • tgrade tumor grade, an ordered factor at levels I < II < III.

  • pnodes number of positive nodes.

  • progrec progesterone receptor (in fmol).

  • estrec estrogen receptor (in fmol).

  • time recurrence free survival time (in days).

  • cens censoring indicator (0 = censored, 1 = event).

require("TH.data")
require("survival")
data("GBSG2", package = "TH.data")
head(GBSG2)
##   horTh age menostat tsize tgrade pnodes progrec estrec time cens
## 1    no  70     Post    21     II      3      48     66 1814    1
## 2   yes  56     Post    12     II      7      61     77 2018    1
## 3   yes  58     Post    35     II      9      52    271  712    1
## 4   yes  59     Post    17     II      4      60     29 1807    1
## 5    no  73     Post    35     II      1      26     65  772    1
## 6    no  32      Pre    57    III     24       0     13  448    1

Worked Example

Building the Model

Parameter description:

  • formula a symbolic description of the model to be fit. Note that symbols like : and - will not work and the tree will make use of all variables listed on the rhs of formula.

  • data a data frame containing the variables in the model.

  • subset an optional vector specifying a subset of observations to be used in the fitting process.

  • weights an optional vector of weights to be used in the fitting process. Non-negative integer-valued weights are allowed as well as non-negative real weights. Observations are sampled (with or without replacement) according to probabilities weights / sum(weights). The fraction of observations to be sampled (without replacement) is computed based on the sum of the weights if all weights are integer-valued, and based on the number of weights greater than zero otherwise. Alternatively, weights can be a double matrix directly defining case weights for all ncol(weights) trees in the forest. This requires more storage but gives the user more control.

  • controls an object of class ForestControl-class, which can be obtained using cforest_control (and its convenience interfaces cforest_unbiased and cforest_classical).

  • xtrafo a function to be applied to all input variables. By default, the ptrafo function is applied.

  • ytrafo a function to be applied to all response variables. By default, the ptrafo function is applied.

  • scores an optional named list of scores to be attached to ordered factors.

  • object an object as returned by cforest.

  • newdata an optional data frame containing test data.
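As a sketch of how these arguments fit together, the forest can be fit with explicit controls and with numeric scores attached to the ordered factor tgrade. The ntree/mtry values and the score spacing below are illustrative assumptions, not recommendations:

```r
## Illustrative sketch: cforest with explicit controls and ordered-factor scores.
library(party)
library(survival)  # for Surv()
library(TH.data)
data("GBSG2", package = "TH.data")

ctrl <- cforest_unbiased(ntree = 100, mtry = 3)  # unbiased variable selection
fit <- cforest(Surv(time, cens) ~ ., data = GBSG2,
               control = ctrl,
               scores = list(tgrade = c(1, 2, 4)))  # assumed spacing of I < II < III
fit
```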

library(party)
bst <- cforest(Surv(time, cens) ~ ., data = GBSG2, control = cforest_unbiased(ntree = 50))
bst
## 
## 	 Random Forest using Conditional Inference Trees
## 
## Number of trees:  50 
## 
## Response:  Surv(time, cens) 
## Inputs:  horTh, age, menostat, tsize, tgrade, pnodes, progrec, estrec 
## Number of observations:  686

Variable Importance

Variable importance can be computed with varimp() and then plotted with barplot():

### compare variable importances and absolute z-statistics
vi <- varimp(bst)
## 
## Variable importance for survival forests; this feature is _experimental_
layout(matrix(1:2))
barplot(vi, las = 2, main = "CForest Model")
barplot(abs(summary(coxph(Surv(time, cens) ~ ., data = GBSG2))$coeff[, "z"]), las = 2,
    main = "Coxph Model")
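Because some inputs here are correlated (e.g. progrec and estrec), varimp() also offers a conditional permutation scheme; a hedged sketch:

```r
## Sketch: conditional permutation importance adjusts for correlated
## predictors; it is slower to compute but less biased in that setting.
vi_cond <- varimp(bst, conditional = TRUE)
sort(vi_cond, decreasing = TRUE)
```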

[Figure: variable importance from the CForest model (top) and absolute z-statistics from the Cox model (bottom)]

Response Distribution in Terminal Nodes

### don't use the mean but the median as prediction in each terminal node
pmedian <- sapply(weights(bst), function(w) median(GBSG2$time[rep(1:nrow(GBSG2),
    w)]))
pmean <- sapply(weights(bst), function(w) weighted.mean(GBSG2$time, w))
layout(matrix(1:2))
plot(GBSG2$time, pmean, col = "red")
points(GBSG2$time, pmedian, col = "blue")
### distribution of responses in the terminal nodes
plot(GBSG2$time ~ as.factor(unlist(where(bst)[1])), ylab = "Time", xlab = "Responses",
    cex.axis = 0.5)

[Figure: mean (red) and median (blue) terminal-node predictions vs. observed time (top); distribution of observed times by terminal node (bottom)]

Validation

The type argument of predict() offers two choices, prob and response, which return the estimated survival curves and the predicted responses (here, predicted survival times), respectively:

# type = "response" returns the predicted survival time
GBSG2$resposnse <- predict(bst, newdata = GBSG2, type = "response")
GBSG2 <- GBSG2[is.finite(GBSG2$resposnse), ]  ### drop infinite predictions

Concordance

First obtain predicted values with predict(..., type = "response"), then use coxph() to compute the concordance index:

C_index <- data.frame(Cindex = as.numeric(summary(coxph(Surv(time, cens) ~ resposnse,
    GBSG2))$concordance[1]))
C_index
##     Cindex
## 1 0.725659
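Alternatively (a sketch, assuming the resposnse column created above and a recent survival package), Harrell's C can be computed directly with survival::concordance(), without refitting a Cox model:

```r
## Sketch: direct concordance between predicted survival times and outcome.
## Larger predictions should mean longer survival here, so the default
## direction (reverse = FALSE) is assumed to apply.
library(survival)
cc <- concordance(Surv(time, cens) ~ resposnse, data = GBSG2)
cc$concordance
```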

Survival Analysis

treeresponse() estimates conditional Kaplan-Meier curves for individual patients:

### estimate conditional Kaplan-Meier curves
treeresponse(bst, newdata = GBSG2[1, ], OOB = TRUE)
## $`1`
## Call: survfit(formula = y ~ 1, weights = weights)
## 
##      records  n events median 0.95LCL 0.95UCL
## [1,]     211 50   28.3   1601    1218    1990
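Since treeresponse() returns a list of survfit objects, each patient's estimated curve can be plotted directly; a minimal sketch:

```r
## Sketch: plot the conditional Kaplan-Meier estimate for one patient.
km <- treeresponse(bst, newdata = GBSG2[1, ], OOB = TRUE)
plot(km[[1]], xlab = "Days", ylab = "Survival probability",
     main = "Estimated survival curve, patient 1")
```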

Plotting the ROC Curve

Because the model is closely tied to time, we use the pROC package, which can quickly compute an ROC curve from the predictions and plot it:

library(pROC)
roc <- roc(GBSG2$cens, GBSG2$resposnse)
roc$auc
## Area under the curve: 0.746
plot(roc, legacy.axes = T, col = "red", lwd = 2)
text(0.2, 0.2, paste("AUC: ", round(roc$auc, 2)))
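pROC can also attach a confidence interval to the AUC (a sketch using the roc object built above):

```r
## Sketch: DeLong confidence interval for the AUC of the fitted roc object.
ci.auc(roc)
```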

[Figure: ROC curve (AUC = 0.75)]



Building Clinical Prediction Models with Machine Learning

MachineLearning 1. Principal Component Analysis (PCA)

MachineLearning 2. Factor Analysis

MachineLearning 3. Cluster Analysis

MachineLearning 4. K-Nearest Neighbors (KNN) for Cancer Diagnosis

MachineLearning 5. Support Vector Machines (SVM) for Cancer Diagnosis and Molecular Subtyping

MachineLearning 6. Classification Trees for Cancer Diagnosis

MachineLearning 7. Regression Trees for Cancer Diagnosis

MachineLearning 8. Random Forest for Cancer Diagnosis

MachineLearning 9. Gradient Boosting for Cancer Diagnosis

MachineLearning 10. Neural Networks for Cancer Diagnosis

MachineLearning 11. Random Forest Survival Analysis (randomForestSRC)

MachineLearning 12. Dimensionality Reduction and Visualization with t-SNE (Rtsne)

MachineLearning 13. Dimensionality Reduction and Visualization with UMAP (umap)

MachineLearning 14. Ensemble Classifier (AdaBoost)

MachineLearning 15. Ensemble Classifier (LogitBoost)

MachineLearning 16. Gradient Boosting Machine (GBM)

MachineLearning 17. Partitioning Around Medoids (PAM)

MachineLearning 18. Naive Bayes Classifier (Naive Bayes)

MachineLearning 19. Neural Network Classifier (NNET)

MachineLearning 20. Bagged Classification and Regression Trees (Bagged CART)

MachineLearning 21. Survival Analysis in Clinical Medicine (xgboost)

MachineLearning 22. Supervised Principal Component Analysis for Gene Selection (SuperPC)

MachineLearning 23. Ridge Regression for Predicting Genotype and Phenotype (Ridge)

MachineLearning 24. Likelihood-Boosted Cox Proportional Hazards Models for Variable Selection and Prognosis (CoxBoost)

MachineLearning 25. Support Vector Machines for Survival Analysis (survivalsvm)

MachineLearning 26. Elastic Net for Survival Analysis (Enet)

MachineLearning 27. Stepwise Cox Regression for Variable Selection (StepCox)

MachineLearning 28. Partial Least Squares Regression for Survival Analysis (plsRcox)

MachineLearning 29. Nested Cross-Validation (Nested CV)

MachineLearning 30. Feature Selection with Boruta

MachineLearning 31. RNA-seq-Based Gene Feature Selection (GeneSelectR)

MachineLearning 32. Feature Selection via SVM Recursive Feature Elimination (mSVM-RFE)

MachineLearning 33. Time-to-Event Prediction with Neural Networks and Cox Regression

MachineLearning 34. Deep Learning for Competing-Risks Survival Analysis (DeepHit)

MachineLearning 35. Lasso + Cox Regression for Variable Selection (LassoCox)

MachineLearning 36. Neural-Network-Based Cox Proportional Hazards Models (Deepsurv)

MachineLearning 37. Oblique Random Survival Forests (obliqueRSF)

MachineLearning 38. Tumor Subtype Classification with Nearest Shrunken Centroids (pamr)

