MachineLearning 28. 机器学习之偏最小二乘回归应用于生存分析 (plsRcox)

最新推荐文章于 2024-04-08 12:06:15 发布

桓峰基因

最新推荐文章于 2024-04-08 12:06:15 发布

阅读量1.1k

点赞数 26

文章标签：机器学习回归人工智能数据挖掘算法

本文链接：https://blog.csdn.net/weixin_41368414/article/details/136753937

版权

简介

偏最小二乘回归（Partial Least Squares Regression，PLS Regression）是一种常用的统计建模方法，用于解决多元线性回归中自变量间高度相关的问题。在偏最小二乘回归中，通过将原始自变量转换为一组新的综合变量（称为主成分或潜在变量），然后再使用这些主成分进行回归分析，从而减少自变量之间的共线性，并且提高模型的稳定性和预测能力。偏小二乘回归提供一种多对多线性回归建模的方法，特别当两组变量的个数很多，且都存在多重相关性，而观测数据的数量（样本量）又较少时，用偏小二乘回归建立的模型具有传统的经典回归分析等方法所没有的优点。偏小二乘回归分析在建模过程中集中了主成分分析，典型相关分析和线性回归分析方法的特点，因此在分析结果中，除了可以提供一个更为合理的回归模型外，还可以同时完成一些类似于主成分分析和典型相关分析的研究内容，提供更丰富、深入的一些信息。Cox模型的偏最小二乘回归及相关技术(plsRcox: Partial Least Squares Regression for Cox Models and Related Techniques)是将偏最小二乘回归广义线性模型应用于Cox回归中，解决临床上的问题。

软件包安装

if(require(plsRcox))
  devtools::install_github("fbertran/plsRcox")

数据读取

lung <- read.table("TCGA_LUSC.m6A_gene.exp_clin.txt", header = T)
table(lung$status)
## 
##   1   2 
## 343 158
lung$status <- ifelse(lung$status == 2, 1, 0)
lung <- na.omit(lung)  # 去掉NA
head(lung)
##   Tumor_Sample_Barcode gender vital_status status time    YTHDC2   IGF2BP2
## 1         TCGA-18-3406   MALE         Dead      1  371 10.490851  8.870365
## 2         TCGA-18-3407   MALE         Dead      1  136 11.032735  8.603626
## 3         TCGA-18-3408 FEMALE        Alive      0 2099  9.796040 11.279611
## 4         TCGA-18-3409   MALE        Alive      0 2417  9.856426 12.023061
## 5         TCGA-18-3410   MALE         Dead      1  146 11.074810  6.584963
## 6         TCGA-18-3411 FEMALE        Alive      0 1415  9.881114 12.384514
##     YTHDC1   ALKBH5   HNRNPC     FMR1 HNRNPA2B1   ZC3H13   IGF2BP3      FTO
## 1 10.69870 10.68738 14.38485 10.59059  13.69675 10.23122  6.523562 10.58684
## 2 12.28511 11.84823 14.44876 11.49935  15.05431 11.94361  9.579316 11.26268
## 3 11.57128 12.10165 14.67325 11.72877  14.47168 11.24911 10.005625 11.33204
## 4 11.92258 12.32839 14.16034 11.19168  14.44320 11.83368  4.954196 11.58825
## 5 11.98762 12.39714 14.92435 11.17555  15.51209 11.36632  9.786270 10.91289
## 6 11.56701 12.23751 15.74026 10.48784  15.16632 10.94910 10.558421 11.18177
##     METTL14     WTAP   YTHDF1   IGF2BP1    RBM15 KIAA1429    METTL3   YTHDF3
## 1  9.063395 13.19583 10.61102  2.584963 8.164907 10.85720  8.924813 12.03032
## 2  9.552669 12.67331 12.07046  1.584963 9.087463 11.58543 10.689124 12.26004
## 3  9.949827 12.61218 11.92518  2.584963 8.280771 11.45224 10.134426 12.23182
## 4 10.675957 12.28135 11.25798  2.321928 9.038919 12.19476 10.029287 12.48734
## 5 10.990104 14.66217 11.67375  7.643856 9.594325 12.09902 10.813781 12.80272
## 6  9.914385 12.97549 11.51668 11.499348 9.204571 12.02721 11.310613 12.48382
##     YTHDF2
## 1 11.31175
## 2 12.31175
## 3 11.53916
## 4 12.17773
## 5 12.20427
## 6 11.96939

数据分割

library(sampling)
set.seed(123)
# 每层抽取70%的数据
train_id <- strata(lung, "status", size = rev(round(table(lung$status) * 0.7)))$ID_unit
# 训练数据
trainData <- lung[train_id, 4:24]
# 测试数据
testData <- lung[-train_id, 4:24]

实例操作

X and Y datasets

train.x = data.frame(trainData[, 3:21])
train.T = trainData$time
train.E = trainData$status

###构建模型

plsRcox() 函数实现了对Cox模型的偏最小二乘回归的扩展。

library(plsRcox)
modpls <- plsRcox(train.x, time = trainData$time, event = trainData$status, nt = 5)
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Component____ 5 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
print(modpls)
## Number of required components:
## [1] 5
## Number of successfully computed components:
## [1] 5
## Coefficients:
##                 [,1]
## YTHDC2     -16.07881
## IGF2BP2    -24.82209
## YTHDC1    -260.53853
## ALKBH5    -134.04808
## HNRNPC    -241.78865
## FMR1      -339.85551
## HNRNPA2B1  741.89587
## ZC3H13      78.12486
## IGF2BP3    -71.62873
## FTO        115.24575
## METTL14   -272.76625
## WTAP      -141.72605
## YTHDF1     248.33078
## IGF2BP1    -17.36155
## RBM15     -174.08545
## KIAA1429   474.04073
## METTL3    -291.91006
## YTHDF3      78.98179
## YTHDF2     400.19217
## Information criteria and Fit statistics:
##                AIC      BIC
## Nb_Comp_0 993.9860 993.9860
## Nb_Comp_1 990.8065 994.6442
## Nb_Comp_2 982.0783 989.7538
## Nb_Comp_3 977.7540 989.2672
## Nb_Comp_4 978.2605 993.6114
## Nb_Comp_5 979.5256 998.7142

(偏差)残差计算

par(mfrow = c(1, 2))
DR_coxph(train.T, train.E, plot = TRUE)
##            1            2            3            4            5            6 
##  1.337823863  1.622600157  2.080195812  0.790732498  0.691439652  1.448139607 
##            7            8            9           10           11           12 
##  0.462187937  0.711772712 -0.147358283  1.380608251  1.790094759  0.154022689 

DR_coxph(train.T, train.E, scaleY = FALSE, plot = TRUE)

交叉验证

cv.plsRcox = cv.plsRcox(list(x = train.x, time = train.T, status = train.E), nt = 10,
    verbose = FALSE)

cv.plsRcox$lambda.min5
## [1] 8
OptModpls <- plsRcox(Xplan = train.x, time = train.T, event = train.E, nt = cv.plsRcox$lambda.min5,
    alpha.pvals.expli = 0.05, sparse = TRUE, pvals.expli = TRUE)
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## No more significant predictors (<0.05) found
## Warning only 2 components were thus extracted
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****

验证

library(InformationValue)
test.x = data.frame(testData[, 3:21])
RiskScore <- predict(OptModpls, newdata = test.x, type = "risk")
actuals <- testData$status
misClassError(actuals, RiskScore)
## [1] 0.6871
sensitivity(actuals, RiskScore)
## [1] 0.9782609
specificity(actuals, RiskScore)
## [1] 0.00990099
optimalCutoff(actuals, RiskScore)
## [1] 2.232423
precision(actuals, RiskScore)
## [1] 0.3103448
npv(actuals, RiskScore)
## [1] 0.5

生存分析

library(survival)
library(survminer)
groups <- ifelse(RiskScore>median(RiskScore),"high","low")
f <- survfit(Surv(testData$time, testData$status) ~ groups)
f
## Call: survfit(formula = Surv(testData$time, testData$status) ~ groups)
## 
##              n events median 0.95LCL 0.95UCL
## groups=high 73     24   1067     716      NA
## groups=low  74     22   1640     669      NA

ggsurvplot(f,
           data = testData,
           surv.median.line = "hv", 
           #legend.title = "Risk Group",
           #legend.labs = c("Low Risk", "High Risk"),
           pval = TRUE, 
           ggtheme = theme_bw()
)

一致性

Concordance(actuals, RiskScore)
## $Concordance
## [1] 0.4928971
## 
## $Discordance
## [1] 0.5071029
## 
## $Tied
## [1] -1.110223e-16
## 
## $Pairs
## [1] 4646

绘制 ROC

plotROC(actuals, RiskScore)

计算基于sPLSR的Cox模型

coxsplsDR()(Fitting a sPLSR model on the (Deviance) Residuals)函数计算基于sPLSR组件计算模型的Cox模型。

作为响应:没有协变量的cox模型的残差

作为解释变量:Xplan。

使用包spls执行SPLSR中的第一步，然后使用mixOmics执行PLSR步骤拟合。

cox_splsDR_fit = coxsplsDR(train.x, time = train.T, event = train.E, ncomp = 6, eta = 0.5)
cox_splsDR_fit
## Call:
## coxph(formula = YCsurv ~ ., data = tt_splsDR)
## 
##          coef exp(coef) se(coef)     z       p
## dim.1 0.18549   1.20381  0.05641 3.288 0.00101
## dim.2 0.10265   1.10811  0.04528 2.267 0.02338
## dim.3 0.27791   1.32037  0.11220 2.477 0.01326
## dim.4 0.08678   1.09065  0.13165 0.659 0.50981
## dim.5 0.02858   1.02899  0.14666 0.195 0.84550
## dim.6 0.00610   1.00612  0.13003 0.047 0.96258
## 
## Likelihood ratio test=20.14  on 6 df, p=0.002615
## n= 343, number of events= 108
set.seed(123456)
cv.coxsplsDR.res = cv.coxsplsDR(list(x = train.x, time = train.T, status = train.E),
    nt = 10, eta = 0.1)
## CV Fold 1 
## CV Fold 2 
## CV Fold 3 
## CV Fold 4 
## CV Fold 5

cv.coxsplsDR.res$lambda.min10
## [1] 2
cox_DKsplsDR_fit = coxDKsplsDR(train.x, train.T, train.E, ncomp = cv.coxsplsDR.res$lambda.min10,
    validation = "CV", eta = 0.5)
## Kernel :  rbfdot 
## Estimated_sigma  0.03533116
cox_DKsplsDR_fit
## Call:
## coxph(formula = YCsurv ~ ., data = tt_DKsplsDR)
## 
##          coef exp(coef) se(coef)     z      p
## dim.1 0.25752   1.29372  0.10210 2.522 0.0117
## dim.2 0.13689   1.14670  0.07711 1.775 0.0759
## 
## Likelihood ratio test=9.57  on 2 df, p=0.00834
## n= 343, number of events= 108

DKplsRcox()函数同样可以实现

DKplsRcox(train.x, time = train.T, event = train.E, nt = 5)
## Kernel :  rbfdot 
## Estimated_sigma  0.0362187 
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Component____ 5 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
## Call:
## coxph(formula = YCsurv ~ ., data = tt_DKplsRcox)
## 
##           coef exp(coef) se(coef)     z       p
## Comp_1 0.25539   1.29096  0.09058 2.819 0.00481
## Comp_2 0.35469   1.42574  0.12010 2.953 0.00314
## Comp_3 1.32550   3.76406  0.34961 3.791 0.00015
## Comp_4 0.07951   1.08276  0.11511 0.691 0.48972
## Comp_5 0.61531   1.85024  0.20946 2.938 0.00331
## 
## Likelihood ratio test=35.29  on 5 df, p=1.317e-06
## n= 343, number of events= 108
DKplsRcox(Xplan = train.x, time = train.T, event = train.E, nt = 5, sparse = TRUE,
    alpha.pvals.expli = 0.05)
## Kernel :  rbfdot 
## Estimated_sigma  0.03446764 
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## No more significant predictors (<0.05) found
## Warning only 3 components were thus extracted
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
## Call:
## coxph(formula = YCsurv ~ ., data = tt_DKplsRcox)
## 
##          coef exp(coef) se(coef)     z       p
## Comp_1 0.4046    1.4987   0.2318 1.745 0.08092
## Comp_2 0.4619    1.5872   0.1562 2.957 0.00310
## Comp_3 1.5964    4.9353   0.5256 3.037 0.00239
## 
## Likelihood ratio test=24.23  on 3 df, p=2.238e-05
## n= 343, number of events= 108

Reference

plsRcox, Cox-Models in a high dimensional setting in R, Frederic Bertrand, Philippe Bastien, Nicolas Meyer and Myriam Maumy-Bertrand (2014). Proceedings of User2014!, Los Angeles, page 152.
Deviance residuals-based sparse PLS and sparse kernel PLS regression for censored data, Philippe Bastien, Frederic Bertrand, Nicolas Meyer and Myriam Maumy-Bertrand (2015), Bioinformatics, 31(3):397-404, doi:10.1093/bioinformatics/btu660.