MachineLearning 40. Machine Learning: Survival Analysis and Clinical Prediction with Conditional Inference Trees (CTree)

Article link: https://blog.csdn.net/weixin_41368414/article/details/139667888

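Introduction

Conditional inference trees (CTree) grow a tree by recursive partitioning and use significance tests to choose each split. The algorithm works as follows: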

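(1) Compute a p-value for the association between the response variable and each predictor.

(2) Select the predictor with the smallest p-value.

(3) Try all possible binary splits of the selected predictor against the response (using permutation tests) and choose the most significant split.

(4) Split the dataset into two groups and repeat the steps above on each subgroup.

(5) Continue until no split is significant or the minimum node size has been reached.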
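Steps (1) and (2) can be illustrated with a small base-R sketch. It is only a toy under stated assumptions: `perm_pvalue()` and `select_variable()` are hypothetical helpers written for this post, the test statistic is a plain absolute correlation, and the p-value comes from Monte Carlo permutations. `ctree()` itself uses its own test statistics and reference distributions, so this is not what the package runs internally.

```r
# Toy sketch of steps (1)-(2): one permutation p-value per predictor,
# then pick the predictor with the smallest p-value.
# perm_pvalue() and select_variable() are hypothetical helpers, not partykit functions.
perm_pvalue <- function(x, y, n_perm = 999) {
  observed <- abs(cor(x, y))
  # Shuffling y breaks any association with x while keeping both marginals intact
  permuted <- replicate(n_perm, abs(cor(x, sample(y))))
  (1 + sum(permuted >= observed)) / (n_perm + 1)
}

select_variable <- function(data, response, n_perm = 999) {
  predictors <- setdiff(names(data), response)
  pvals <- vapply(
    predictors,
    function(v) perm_pvalue(data[[v]], data[[response]], n_perm),
    numeric(1)
  )
  list(pvalues = pvals, selected = names(pvals)[which.min(pvals)])
}

set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)  # only x1 is related to y

result <- select_variable(toy, "y")
print(result$pvalues)
print(result$selected)
```

In this simulated data only `x1` drives `y`, so it should be the variable chosen for the first split.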

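How conditional inference trees differ from decision trees

A conditional inference tree is a tree-based algorithm that resembles a traditional decision tree: ctree() also partitions the data recursively. The difference lies in how variables and splits are chosen. A conditional inference tree selects them with significance tests (permutation tests), whereas a traditional decision tree relies on a purity or homogeneity measure. A traditional decision tree may, for example, use the Gini index and pick the split that reduces impurity the most.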
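For contrast, here is a minimal base-R sketch of the impurity-based criterion that traditional decision trees use. The helpers `gini()` and `gini_gain()` and the six-observation dataset are made up for illustration and are not taken from any package.

```r
# Gini impurity of a set of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Reduction in impurity obtained by splitting x at `cut`
gini_gain <- function(x, labels, cut) {
  left <- labels[x <= cut]
  right <- labels[x > cut]
  n <- length(labels)
  gini(labels) - (length(left) / n * gini(left) + length(right) / n * gini(right))
}

x <- c(1, 2, 3, 4, 5, 6)
labels <- c("a", "a", "a", "b", "b", "b")
cuts <- c(1.5, 2.5, 3.5, 4.5, 5.5)

gains <- sapply(cuts, function(cut) gini_gain(x, labels, cut))
best_cut <- cuts[which.max(gains)]
print(data.frame(cut = cuts, gain = gains))
print(best_cut)
```

An impurity-based tree simply takes the cut with the largest gain and performs no significance test, which is the step a conditional inference tree replaces.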

Package installation

Install the package if it is not already available:

if(!require("partykit"))
  install.packages("partykit")

Loading the data

The dataset is GBSG2 from the TH.data package. It contains data on 686 women:

horTh: hormonal therapy, a factor at two levels, no and yes.
age: age of the patients in years.
menostat: menopausal status, a factor at two levels, pre (premenopausal) and post (postmenopausal).
tsize: tumor size (in mm).
tgrade: tumor grade, an ordered factor at levels I < II < III.
pnodes: number of positive nodes.
progrec: progesterone receptor (in fmol).
estrec: estrogen receptor (in fmol).
time: recurrence-free survival time (in days).
cens: censoring indicator (0 = censored, 1 = event).

require("TH.data")
require("survival")
data("GBSG2", package = "TH.data")
head(GBSG2)
##   horTh age menostat tsize tgrade pnodes progrec estrec time cens
## 1    no  70     Post    21     II      3      48     66 1814    1
## 2   yes  56     Post    12     II      7      61     77 2018    1
## 3   yes  58     Post    35     II      9      52    271  712    1
## 4   yes  59     Post    17     II      4      60     29 1807    1
## 5    no  73     Post    35     II      1      26     65  772    1
## 6    no  32      Pre    57    III     24       0     13  448    1

Worked example

Building the model

Arguments:

formula a symbolic description of the model to be fit.
data a data frame containing the variables in the model.
subset an optional vector specifying a subset of observations to be used in the fitting process.
weights an optional vector of weights to be used in the fitting process. Only non-negative integer valued weights are allowed.
offset an optional vector of offset values.
cluster an optional factor indicating independent clusters. Highly experimental, use at your own risk.
na.action a function which indicates what should happen when the data contain missing values.

control a list with control parameters, see ctree_control.

ytrafo an optional named list of functions to be applied to the response variable(s) before testing their association with the explanatory variables. Note that this transformation is only performed once for the root node and does not take weights into account. Alternatively, ytrafo can be a function of data and weights. In this case, the transformation is computed for every node with corresponding weights. This feature is experimental and the user interface likely to change.
converged an optional function for checking user-defined criteria before splits are implemented. This is not to be used and very likely to change.
scores an optional named list of scores to be attached to ordered factors.
doFit a logical, if FALSE, the tree is not fitted.
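As a hedged illustration of the control argument, the sketch below passes stricter settings through ctree_control(). The values alpha = 0.01 and minbucket = 30 are arbitrary choices for demonstration, not recommendations, and the object names strict_ctrl and bst_strict are invented here.

```r
library(partykit)
library(survival)
data("GBSG2", package = "TH.data")

# Require stronger evidence for a split (alpha) and larger terminal nodes (minbucket).
# Both values are illustrative assumptions.
strict_ctrl <- ctree_control(alpha = 0.01, minbucket = 30)

bst_strict <- ctree(Surv(time, cens) ~ ., data = GBSG2, control = strict_ctrl)
print(bst_strict)
```

A smaller alpha makes splits harder to accept, so the resulting tree will generally be no larger than the default one.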

library(partykit)
bst <- ctree(Surv(time, cens) ~ ., data = GBSG2)
bst
## 
## Model formula:
## Surv(time, cens) ~ horTh + age + menostat + tsize + tgrade + 
##     pnodes + progrec + estrec
## 
## Fitted party:
## [1] root
## |   [2] pnodes <= 3
## |   |   [3] horTh in no: 2093.000 (n = 248)
## |   |   [4] horTh in yes: Inf (n = 128)
## |   [5] pnodes > 3
## |   |   [6] progrec <= 20: 624.000 (n = 144)
## |   |   [7] progrec > 20: 1701.000 (n = 166)
## 
## Number of inner nodes:    3
## Number of terminal nodes: 4

plot(bst)

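Note: running the code above draws the conditional inference tree, with a Kaplan-Meier survival curve in each terminal node. As the fitted tree shows, node 6 (pnodes > 3 and progrec <= 20) has the shortest median survival time (624 days), followed by node 7 (1701 days). In node 4 the median is not reached, which is why it is reported as Inf.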
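The per-node summaries can also be checked numerically. The sketch below refits the same tree, assigns every patient to a terminal node with predict(type = "node"), and fits one Kaplan-Meier curve per node with survfit(). The names GBSG2_nodes and km are introduced here only for illustration.

```r
library(partykit)
library(survival)
data("GBSG2", package = "TH.data")

bst <- ctree(Surv(time, cens) ~ ., data = GBSG2)

# Work on a copy so the original data frame stays untouched
GBSG2_nodes <- GBSG2
GBSG2_nodes$node <- factor(predict(bst, newdata = GBSG2, type = "node"))

# One Kaplan-Meier fit per terminal node; print() lists n, events and median survival
km <- survfit(Surv(time, cens) ~ node, data = GBSG2_nodes)
print(km)
```

The medians printed by survfit() should line up with the values shown next to each terminal node in the fitted tree.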

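Validation

The type argument of predict() accepts, among others, "prob" and "response". For a survival tree, "response" returns the predicted median survival time of the terminal node, and "prob" returns that node's Kaplan-Meier curve:

# type = "response" returns the predicted median survival time
# (Inf when the median is not reached in the node)
GBSG2$resposnse <- predict(bst, newdata = GBSG2, type = "response")
GBSG2 <- GBSG2[is.finite(GBSG2$resposnse), ]  # drop rows whose prediction is Inf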

Concordance

First obtain the predicted values with predict() and type = "response", then use coxph() to compute the concordance index:

C_index <- data.frame(Cindex = as.numeric(summary(coxph(Surv(time, cens) ~ resposnse,
    GBSG2))$concordance[1]))
C_index
##     Cindex
## 1 0.645079
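To show what a concordance index measures, here is a hand-rolled version in base R on a four-patient toy dataset. The function c_index() is a hypothetical helper that ignores tied event times, so it is a simplified sketch and not the exact computation performed by coxph().

```r
# Simplified Harrell's C: among comparable pairs (the shorter time is an event),
# count how often the patient who failed earlier also has the shorter predicted survival.
# c_index() is a hypothetical helper; tied predictions count as 0.5.
c_index <- function(time, event, pred) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (time[i] < time[j] && event[i] == 1) {
        comparable <- comparable + 1
        if (pred[i] < pred[j]) {
          concordant <- concordant + 1
        } else if (pred[i] == pred[j]) {
          concordant <- concordant + 0.5
        }
      }
    }
  }
  concordant / comparable
}

time  <- c(5, 8, 12, 20)           # observed follow-up times
event <- c(1, 1, 0, 1)             # 1 = event, 0 = censored
pred  <- c(600, 1700, 600, 2000)   # predicted median survival (made-up values)

c_val <- c_index(time, event, pred)
print(c_val)
```

Here five pairs are comparable, three are concordant and one is tied on the prediction, giving 3.5 / 5 = 0.7. A tree yields only a few distinct predicted values, so ties of this kind are common.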

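Plotting the ROC curve

Here we use pROC to compute and plot an ROC curve quickly. Note that roc() treats the censoring indicator cens as a plain binary outcome and ignores follow-up time, so the result is not a time-dependent ROC. ROC curves at specific time points require a dedicated time-dependent method.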

library(pROC)
# Binary ROC: cens is the outcome, the predicted median survival time is the score
roc_obj <- roc(GBSG2$cens, GBSG2$resposnse)
roc_obj$auc
## Area under the curve: 0.6503

plot(roc_obj, legacy.axes = TRUE, col = "red", lwd = 2)
text(0.2, 0.2, paste("AUC: ", round(roc_obj$auc, 2)))
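The AUC reported above has a simple pairwise interpretation, sketched below in base R. The helper auc_pairwise() and the toy vectors are assumptions made for illustration. The direction is fixed by hand so that a lower predicted survival time counts as evidence for an event, whereas pROC picks the direction automatically.

```r
# AUC as the share of (event, non-event) pairs in which the event case
# has the lower predicted survival time; ties count as 0.5.
# auc_pairwise() is a hypothetical helper, not part of pROC.
auc_pairwise <- function(outcome, score) {
  cases <- score[outcome == 1]
  controls <- score[outcome == 0]
  cmp <- outer(cases, controls, function(a, b) (a < b) + 0.5 * (a == b))
  mean(cmp)
}

outcome <- c(1, 1, 1, 0, 0, 0)                  # 1 = event, 0 = censored
score <- c(600, 600, 1700, 1700, 2000, 600)     # made-up predicted median survival

auc_val <- auc_pairwise(outcome, score)
print(auc_val)
```

Of the nine pairs, 6.5 are ordered correctly once ties are counted as one half, so the AUC is 6.5 / 9.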
