MachineLearning 28. 机器学习之偏最小二乘回归应用于生存分析 (plsRcox)

383f2d3c9f483f7755a55003597a1d26.png


简         介

偏最小二乘回归(Partial Least Squares Regression,PLS Regression)是一种常用的统计建模方法,用于解决多元线性回归中自变量间高度相关的问题。在偏最小二乘回归中,通过将原始自变量转换为一组新的综合变量(称为主成分或潜在变量),然后再使用这些主成分进行回归分析,从而减少自变量之间的共线性,并且提高模型的稳定性和预测能力。偏小二乘回归提供一种多对多线性回归建模的方法,特别当两组变量的个数很多,且都存在多重相关性,而观测数据的数量(样本量)又较少时,用偏小二乘回归建立的模型具有传统的经典回归分析等方法所没有的优点。偏小二乘回归分析在建模过程中集中了主成分分析,典型相关分析和线性回归分析方法的特点,因此在分析结果中,除了可以提供一个更为合理的回归模型外,还可以同时完成一些类似于主成分分析和典型相关分析的研究内容,提供更丰富、深入的一些信息。Cox模型的偏最小二乘回归及相关技术(plsRcox: Partial Least Squares Regression for Cox Models and Related Techniques)是将偏最小二乘回归广义线性模型应用于Cox回归中, 解决临床上的问题。

24d0118555496ddfa975f428dacd5380.png

软件包安装

if(require(plsRcox))
  devtools::install_github("fbertran/plsRcox")

数据读取

lung <- read.table("TCGA_LUSC.m6A_gene.exp_clin.txt", header = T)
table(lung$status)
## 
##   1   2 
## 343 158
lung$status <- ifelse(lung$status == 2, 1, 0)
lung <- na.omit(lung)  # 去掉NA
head(lung)
##   Tumor_Sample_Barcode gender vital_status status time    YTHDC2   IGF2BP2
## 1         TCGA-18-3406   MALE         Dead      1  371 10.490851  8.870365
## 2         TCGA-18-3407   MALE         Dead      1  136 11.032735  8.603626
## 3         TCGA-18-3408 FEMALE        Alive      0 2099  9.796040 11.279611
## 4         TCGA-18-3409   MALE        Alive      0 2417  9.856426 12.023061
## 5         TCGA-18-3410   MALE         Dead      1  146 11.074810  6.584963
## 6         TCGA-18-3411 FEMALE        Alive      0 1415  9.881114 12.384514
##     YTHDC1   ALKBH5   HNRNPC     FMR1 HNRNPA2B1   ZC3H13   IGF2BP3      FTO
## 1 10.69870 10.68738 14.38485 10.59059  13.69675 10.23122  6.523562 10.58684
## 2 12.28511 11.84823 14.44876 11.49935  15.05431 11.94361  9.579316 11.26268
## 3 11.57128 12.10165 14.67325 11.72877  14.47168 11.24911 10.005625 11.33204
## 4 11.92258 12.32839 14.16034 11.19168  14.44320 11.83368  4.954196 11.58825
## 5 11.98762 12.39714 14.92435 11.17555  15.51209 11.36632  9.786270 10.91289
## 6 11.56701 12.23751 15.74026 10.48784  15.16632 10.94910 10.558421 11.18177
##     METTL14     WTAP   YTHDF1   IGF2BP1    RBM15 KIAA1429    METTL3   YTHDF3
## 1  9.063395 13.19583 10.61102  2.584963 8.164907 10.85720  8.924813 12.03032
## 2  9.552669 12.67331 12.07046  1.584963 9.087463 11.58543 10.689124 12.26004
## 3  9.949827 12.61218 11.92518  2.584963 8.280771 11.45224 10.134426 12.23182
## 4 10.675957 12.28135 11.25798  2.321928 9.038919 12.19476 10.029287 12.48734
## 5 10.990104 14.66217 11.67375  7.643856 9.594325 12.09902 10.813781 12.80272
## 6  9.914385 12.97549 11.51668 11.499348 9.204571 12.02721 11.310613 12.48382
##     YTHDF2
## 1 11.31175
## 2 12.31175
## 3 11.53916
## 4 12.17773
## 5 12.20427
## 6 11.96939

数据分割

library(sampling)
set.seed(123)
# 每层抽取70%的数据
train_id <- strata(lung, "status", size = rev(round(table(lung$status) * 0.7)))$ID_unit
# 训练数据
trainData <- lung[train_id, 4:24]
# 测试数据
testData <- lung[-train_id, 4:24]

实例操作

X and Y datasets

train.x = data.frame(trainData[, 3:21])
train.T = trainData$time
train.E = trainData$status

###构建模型

plsRcox() 函数实现了对Cox模型的偏最小二乘回归的扩展。

library(plsRcox)
modpls <- plsRcox(train.x, time = trainData$time, event = trainData$status, nt = 5)
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Component____ 5 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
print(modpls)
## Number of required components:
## [1] 5
## Number of successfully computed components:
## [1] 5
## Coefficients:
##                 [,1]
## YTHDC2     -16.07881
## IGF2BP2    -24.82209
## YTHDC1    -260.53853
## ALKBH5    -134.04808
## HNRNPC    -241.78865
## FMR1      -339.85551
## HNRNPA2B1  741.89587
## ZC3H13      78.12486
## IGF2BP3    -71.62873
## FTO        115.24575
## METTL14   -272.76625
## WTAP      -141.72605
## YTHDF1     248.33078
## IGF2BP1    -17.36155
## RBM15     -174.08545
## KIAA1429   474.04073
## METTL3    -291.91006
## YTHDF3      78.98179
## YTHDF2     400.19217
## Information criteria and Fit statistics:
##                AIC      BIC
## Nb_Comp_0 993.9860 993.9860
## Nb_Comp_1 990.8065 994.6442
## Nb_Comp_2 982.0783 989.7538
## Nb_Comp_3 977.7540 989.2672
## Nb_Comp_4 978.2605 993.6114
## Nb_Comp_5 979.5256 998.7142

(偏差)残差计算

par(mfrow = c(1, 2))
DR_coxph(train.T, train.E, plot = TRUE)
##            1            2            3            4            5            6 
##  1.337823863  1.622600157  2.080195812  0.790732498  0.691439652  1.448139607 
##            7            8            9           10           11           12 
##  0.462187937  0.711772712 -0.147358283  1.380608251  1.790094759  0.154022689 

DR_coxph(train.T, train.E, scaleY = FALSE, plot = TRUE)

815ed4b13da9970b6bee475925b1b1c6.png

交叉验证

cv.plsRcox = cv.plsRcox(list(x = train.x, time = train.T, status = train.E), nt = 10,
    verbose = FALSE)

435f4a1abafecc7a4eaf2494e1487208.png

cv.plsRcox$lambda.min5
## [1] 8
OptModpls <- plsRcox(Xplan = train.x, time = train.T, event = train.E, nt = cv.plsRcox$lambda.min5,
    alpha.pvals.expli = 0.05, sparse = TRUE, pvals.expli = TRUE)
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## No more significant predictors (<0.05) found
## Warning only 2 components were thus extracted
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****

验证

library(InformationValue)
test.x = data.frame(testData[, 3:21])
RiskScore <- predict(OptModpls, newdata = test.x, type = "risk")
actuals <- testData$status
misClassError(actuals, RiskScore)
## [1] 0.6871
sensitivity(actuals, RiskScore)
## [1] 0.9782609
specificity(actuals, RiskScore)
## [1] 0.00990099
optimalCutoff(actuals, RiskScore)
## [1] 2.232423
precision(actuals, RiskScore)
## [1] 0.3103448
npv(actuals, RiskScore)
## [1] 0.5

生存分析

library(survival)
library(survminer)
groups <- ifelse(RiskScore>median(RiskScore),"high","low")
f <- survfit(Surv(testData$time, testData$status) ~ groups)
f
## Call: survfit(formula = Surv(testData$time, testData$status) ~ groups)
## 
##              n events median 0.95LCL 0.95UCL
## groups=high 73     24   1067     716      NA
## groups=low  74     22   1640     669      NA

ggsurvplot(f,
           data = testData,
           surv.median.line = "hv", 
           #legend.title = "Risk Group",
           #legend.labs = c("Low Risk", "High Risk"),
           pval = TRUE, 
           ggtheme = theme_bw()
)

79d3701c87688752edc92105d22e0f98.png

一致性

Concordance(actuals, RiskScore)
## $Concordance
## [1] 0.4928971
## 
## $Discordance
## [1] 0.5071029
## 
## $Tied
## [1] -1.110223e-16
## 
## $Pairs
## [1] 4646

绘制 ROC

plotROC(actuals, RiskScore)

eafc8241b2ae1196306ad59b99fe605f.png

计算基于sPLSR的Cox模型

coxsplsDR()(Fitting a sPLSR model on the (Deviance) Residuals)函数计算基于sPLSR组件计算模型的Cox模型。

作为响应:没有协变量的cox模型的残差

作为解释变量:Xplan。

使用包spls执行SPLSR中的第一步,然后使用mixOmics执行PLSR步骤拟合。

cox_splsDR_fit = coxsplsDR(train.x, time = train.T, event = train.E, ncomp = 6, eta = 0.5)
cox_splsDR_fit
## Call:
## coxph(formula = YCsurv ~ ., data = tt_splsDR)
## 
##          coef exp(coef) se(coef)     z       p
## dim.1 0.18549   1.20381  0.05641 3.288 0.00101
## dim.2 0.10265   1.10811  0.04528 2.267 0.02338
## dim.3 0.27791   1.32037  0.11220 2.477 0.01326
## dim.4 0.08678   1.09065  0.13165 0.659 0.50981
## dim.5 0.02858   1.02899  0.14666 0.195 0.84550
## dim.6 0.00610   1.00612  0.13003 0.047 0.96258
## 
## Likelihood ratio test=20.14  on 6 df, p=0.002615
## n= 343, number of events= 108
set.seed(123456)
cv.coxsplsDR.res = cv.coxsplsDR(list(x = train.x, time = train.T, status = train.E),
    nt = 10, eta = 0.1)
## CV Fold 1 
## CV Fold 2 
## CV Fold 3 
## CV Fold 4 
## CV Fold 5

1be9184114712402d7b5fcd33dd3df08.png

cv.coxsplsDR.res$lambda.min10
## [1] 2
cox_DKsplsDR_fit = coxDKsplsDR(train.x, train.T, train.E, ncomp = cv.coxsplsDR.res$lambda.min10,
    validation = "CV", eta = 0.5)
## Kernel :  rbfdot 
## Estimated_sigma  0.03533116
cox_DKsplsDR_fit
## Call:
## coxph(formula = YCsurv ~ ., data = tt_DKsplsDR)
## 
##          coef exp(coef) se(coef)     z      p
## dim.1 0.25752   1.29372  0.10210 2.522 0.0117
## dim.2 0.13689   1.14670  0.07711 1.775 0.0759
## 
## Likelihood ratio test=9.57  on 2 df, p=0.00834
## n= 343, number of events= 108
DKplsRcox()函数同样可以实现
DKplsRcox(train.x, time = train.T, event = train.E, nt = 5)
## Kernel :  rbfdot 
## Estimated_sigma  0.0362187 
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Component____ 5 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
## Call:
## coxph(formula = YCsurv ~ ., data = tt_DKplsRcox)
## 
##           coef exp(coef) se(coef)     z       p
## Comp_1 0.25539   1.29096  0.09058 2.819 0.00481
## Comp_2 0.35469   1.42574  0.12010 2.953 0.00314
## Comp_3 1.32550   3.76406  0.34961 3.791 0.00015
## Comp_4 0.07951   1.08276  0.11511 0.691 0.48972
## Comp_5 0.61531   1.85024  0.20946 2.938 0.00331
## 
## Likelihood ratio test=35.29  on 5 df, p=1.317e-06
## n= 343, number of events= 108
DKplsRcox(Xplan = train.x, time = train.T, event = train.E, nt = 5, sparse = TRUE,
    alpha.pvals.expli = 0.05)
## Kernel :  rbfdot 
## Estimated_sigma  0.03446764 
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## No more significant predictors (<0.05) found
## Warning only 3 components were thus extracted
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
## Call:
## coxph(formula = YCsurv ~ ., data = tt_DKplsRcox)
## 
##          coef exp(coef) se(coef)     z       p
## Comp_1 0.4046    1.4987   0.2318 1.745 0.08092
## Comp_2 0.4619    1.5872   0.1562 2.957 0.00310
## Comp_3 1.5964    4.9353   0.5256 3.037 0.00239
## 
## Likelihood ratio test=24.23  on 3 df, p=2.238e-05
## n= 343, number of events= 108
Reference
  1. plsRcox, Cox-Models in a high dimensional setting in R, Frederic Bertrand, Philippe Bastien, Nicolas Meyer and Myriam Maumy-Bertrand (2014). Proceedings of User2014!, Los Angeles, page 152.

  2. Deviance residuals-based sparse PLS and sparse kernel PLS regression for censored data, Philippe Bastien, Frederic Bertrand, Nicolas Meyer and Myriam Maumy-Bertrand (2015), Bioinformatics, 31(3):397-404, doi:10.1093/bioinformatics/btu660.


基于机器学习构建临床预测模型

MachineLearning 1. 主成分分析(PCA)

MachineLearning 2. 因子分析(Factor Analysis)

MachineLearning 3. 聚类分析(Cluster Analysis)

MachineLearning 4. 癌症诊断方法之 K-邻近算法(KNN)

MachineLearning 5. 癌症诊断和分子分型方法之支持向量机(SVM)

MachineLearning 6. 癌症诊断机器学习之分类树(Classification Trees)

MachineLearning 7. 癌症诊断机器学习之回归树(Regression Trees)

MachineLearning 8. 癌症诊断机器学习之随机森林(Random Forest)

MachineLearning 9. 癌症诊断机器学习之梯度提升算法(Gradient Boosting)

MachineLearning 10. 癌症诊断机器学习之神经网络(Neural network)

MachineLearning 11. 机器学习之随机森林生存分析(randomForestSRC)

MachineLearning 12. 机器学习之降维方法t-SNE及可视化(Rtsne)

MachineLearning 13. 机器学习之降维方法UMAP及可视化 (umap)

MachineLearning 14. 机器学习之集成分类器(AdaBoost)

MachineLearning 15. 机器学习之集成分类器(LogitBoost)

MachineLearning 16. 机器学习之梯度提升机(GBM)

MachineLearning 17. 机器学习之围绕中心点划分算法(PAM)

MachineLearning 18. 机器学习之贝叶斯分析类器(Naive Bayes)

MachineLearning 19. 机器学习之神经网络分类器(NNET)

MachineLearning 20. 机器学习之袋装分类回归树(Bagged CART)

MachineLearning 21. 机器学习之临床医学上的生存分析 (xgboost)

MachineLearning 22. 机器学习之有监督主成分分析筛选基因 (SuperPC)

MachineLearning 23. 机器学习之岭回归预测基因型和表型 (Ridge)

MachineLearning 24. 机器学习之似然增强Cox 比例风险模型筛选变量及预后估计 (CoxBoost)

MachineLearning 25. 机器学习之支持向量机应用于生存分析 (survivalsvm)

MachineLearning 26. 机器学习之弹性网络算法应用于生存分析 (Enet)

MachineLearning 27. 机器学习之逐步Cox回归筛选变量 (StepCox)

号外号外,桓峰基因单细胞生信分析免费培训课程即将开始快来报名吧!

桓峰基因,铸造成功的您!

未来桓峰基因公众号将不间断的推出单细胞系列生信分析教程,

敬请期待!!

桓峰基因官网正式上线,请大家多多关注,还有很多不足之处,大家多多指正!http://www.kyohogene.com/

桓峰基因和投必得合作,文章润色优惠85折,需要文章润色的老师可以直接到网站输入领取桓峰基因专属优惠券码:KYOHOGENE,然后上传,付款时选择桓峰基因优惠券即可享受85折优惠哦!https://www.topeditsci.com/

30316ad4e509c90c1f4d4402abae981d.png

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值