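In recent years, prediction models have become a major research focus. A rough PubMed search by Xingchen shows that the number of prediction model studies has grown roughly exponentially, with 5,908 papers published in 2022 alone.
Prediction models are mainly built on cross-sectional studies (diagnostic models) or cohort studies (prognostic models). When writing a grant proposal, the sample size calculation is a very important part.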
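When reporting the sample size for a regression model, many people cite the rule that "each variable needs at least 10 samples", known in the English literature as 10 events per variable (EPV). Stated precisely, the rule says that estimating n parameters requires at least 10*n outcome events, not 10*n subjects. Note that categorical predictors increase the number of parameters once they are converted to dummy variables: a 3-category variable contributes 2 dummy variables, so it alone needs at least 20 events.
"10 EPV" is only a rule of thumb and does not hold in every situation, particularly in special data analysis and modeling scenarios. In practice, the sample size and the choice of predictors should be decided according to the specific question and the characteristics of the data.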
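The pmsampsize R package follows the criteria proposed by Riley et al. (2018) for calculating the minimum sample size required to develop a new multivariable prediction model. It handles models with continuous, binary, or survival (time-to-event) outcomes and checks a set of criteria that the sample size should satisfy. The aim is to minimize overfitting and to ensure precise estimation of key parameters in the prediction model.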
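Before moving on to pmsampsize, the 10 EPV arithmetic described above can be made concrete. The following is a minimal base R sketch, not part of the original post: the data frame `candidates` and its columns are hypothetical, and the prevalence of 0.174 is simply borrowed from the logistic example in this post. It counts parameters after dummy coding and converts the required events into a total sample size.

```r
# Hypothetical candidate predictors: one continuous, one binary, one 3-level factor.
candidates <- data.frame(
  age   = c(54, 61, 47, 70, 58, 66),
  sex   = factor(c("F", "M", "F", "M", "F", "M")),
  stage = factor(c("I", "II", "III", "I", "II", "III"))
)

# model.matrix() expands factors into dummy columns; subtract the intercept column.
n_parameters <- ncol(model.matrix(~ ., data = candidates)) - 1

# Rule-of-thumb number of events, then total subjects at an assumed prevalence.
epv_target      <- 10
events_needed   <- epv_target * n_parameters
prevalence      <- 0.174  # assumption, taken from the logistic example
subjects_needed <- ceiling(events_needed / prevalence)

cat("parameters:", n_parameters,
    "| events:", events_needed,
    "| subjects:", subjects_needed, "\n")
```

The 3-level factor `stage` contributes two parameters here, which is why three variables turn into four parameters under the rule.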
Next, Xingchen demonstrates the calculation for logistic, Cox, and linear prediction models.
First, install and load the pmsampsize R package:
install.packages("pmsampsize")
library("pmsampsize")
1. Binary outcomes (Logistic prediction models)
pmsampsize(type = "b", rsquared = 0.288, parameters = 24, prevalence = 0.174)
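type: the kind of analysis for which the sample size is calculated. "c" is for a prediction model with a continuous outcome, "b" for a binary outcome, and "s" for a survival (time-to-event) outcome.
In this example there are 24 candidate predictor parameters, previous literature gives an outcome prevalence of 0.174 (17.4%), and the lower bound for the new model's R-squared, taken from the adjusted Cox-Snell R-squared of an existing prediction model, is 0.288.
The resulting minimum sample size is 662, with Events per Predictor Parameter (EPP) = 4.8. EPP plays the role of the rule-of-thumb EPV, but it counts predictor parameters rather than variables.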
> pmsampsize(type = "b", rsquared = 0.288, parameters = 24, prevalence = 0.174)
NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
NB: Assuming 0.05 margin of error in estimation of intercept
NB: Events per Predictor Parameter (EPP) assumes prevalence = 0.174
           Samp_size Shrinkage Parameter CS_Rsq Max_Rsq Nag_Rsq  EPP
Criteria 1       623     0.900        24  0.288   0.603   0.477 4.52
Criteria 2       662     0.905        24  0.288   0.603   0.477 4.80
Criteria 3       221     0.905        24  0.288   0.603   0.477 1.60
Final            662     0.905        24  0.288   0.603   0.477 4.80
Minimum sample size required for new model development based on user inputs = 662,
with 116 events (assuming an outcome prevalence = 0.174) and an EPP = 4.8
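To see where the three criteria in this output come from, here is a base R sketch of the closed-form expressions for binary outcomes as described by Riley et al. It is an illustration written for this post, not the package's source code; the helper `n_for_shrinkage` is a hypothetical name, and 1.96 is used as the normal quantile for a 95% interval.

```r
p          <- 24     # candidate predictor parameters
r2_cs      <- 0.288  # anticipated Cox-Snell R-squared
prevalence <- 0.174  # outcome proportion
delta      <- 0.05   # margin used in criteria 2 and 3

# Sample size for which the expected uniform shrinkage factor is at least s.
n_for_shrinkage <- function(p, r2_cs, s) {
  ceiling(p / ((s - 1) * log(1 - r2_cs / s)))
}

# Criterion 1: shrinkage factor of at least 0.9.
n1 <- n_for_shrinkage(p, r2_cs, 0.9)

# Maximum Cox-Snell R-squared at this prevalence, and the Nagelkerke R-squared.
loglik_null   <- prevalence * log(prevalence) + (1 - prevalence) * log(1 - prevalence)
max_r2        <- 1 - exp(2 * loglik_null)
nagelkerke_r2 <- r2_cs / max_r2

# Criterion 2: apparent minus adjusted Nagelkerke R-squared of at most delta.
s2 <- r2_cs / (r2_cs + delta * max_r2)
n2 <- n_for_shrinkage(p, r2_cs, s2)

# Criterion 3: overall outcome proportion estimated to within +/- delta.
n3 <- ceiling((1.96 / delta)^2 * prevalence * (1 - prevalence))

# The final sample size is the largest of the three.
n_final <- max(n1, n2, n3)
events  <- ceiling(n_final * prevalence)
epp     <- n_final * prevalence / p

print(c(n1 = n1, n2 = n2, n3 = n3, final = n_final, events = events))
print(round(c(max_r2 = max_r2, nagelkerke = nagelkerke_r2, shrinkage2 = s2, epp = epp), 3))
```

Under these assumptions the sketch reproduces the Samp_size, Max_Rsq, Nag_Rsq and EPP columns above, which shows that criterion 2 is the binding one in this example.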
2. Survival outcomes (Cox prediction models)
pmsampsize(type = "s", rsquared = 0.051, parameters = 30, rate = 0.065,
timepoint = 2, meanfup = 2.07)
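In this example there are 30 candidate predictor parameters. From previous prediction models in the same field, the overall event rate is 0.065 and the adjusted R-squared is 0.051. The mean follow-up is 2.07 years, and predictions are wanted at the 2-year time point.
The resulting minimum sample size is 5143, with Events per Predictor Parameter (EPP) = 23.07.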
> pmsampsize(type = "s", rsquared = 0.051, parameters = 30, rate = 0.065,
+ timepoint = 2, meanfup = 2.07)
NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
NB: Assuming 0.05 margin of error in estimation of overall risk at time point = 2
NB: Events per Predictor Parameter (EPP) assumes overall event rate = 0.065
             Samp_size Shrinkage Parameter CS_Rsq Max_Rsq Nag_Rsq   EPP
Criteria 1        5143     0.900        30  0.051   0.555   0.092 23.07
Criteria 2        1039     0.648        30  0.051   0.555   0.092  4.66
Criteria 3 *      5143     0.900        30  0.051   0.555   0.092 23.07
Final SS          5143     0.900        30  0.051   0.555   0.092 23.07
Minimum sample size required for new model development based on user inputs = 5143,
corresponding to 10646 person-time** of follow-up, with 692 outcome events
assuming an overall event rate = 0.065 and therefore an EPP = 23.07
* 95% CI for overall risk = (0.113, 0.13), for true value of 0.122 and sample size n = 5143
**where time is in the units mean follow-up time was specified in
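The person-time, event count and EPP in this output follow from simple arithmetic on the inputs. The base R sketch below retraces it; it is an illustration only, and the overall-risk line assumes a constant event rate (exponential survival), which is an assumption made here because it agrees with the reported value of 0.122.

```r
n         <- 5143   # final sample size reported in the output
mean_fup  <- 2.07   # mean follow-up, in years
rate      <- 0.065  # overall event rate per person-year
timepoint <- 2      # prediction time point, in years
p         <- 30     # candidate predictor parameters

# Criterion 1 (shrinkage of at least 0.9) with Cox-Snell R-squared of 0.051.
n1 <- ceiling(p / ((0.9 - 1) * log(1 - 0.051 / 0.9)))

# Total follow-up, expected events, and events per predictor parameter.
person_time <- n * mean_fup
events      <- rate * person_time
epp         <- events / p

# Overall risk by the time point, assuming a constant event rate.
risk_at_timepoint <- 1 - exp(-rate * timepoint)

print(round(c(n1 = n1, person_time = person_time, events = events), 0))
print(round(c(epp = epp, risk = risk_at_timepoint), 3))
```

Because the event rate is low, a large number of subjects is needed to accumulate the roughly 692 events, which is why the survival example ends up with a far larger sample size than the binary one.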
3. Continuous outcomes (Linear prediction models)
pmsampsize(type = "c", rsquared = 0.2, parameters = 25, intercept = 1.9, sd = 0.6)
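In this example there are 25 candidate predictor parameters. From previous prediction models in the same field, the adjusted R-squared is 0.2, and the outcome in the population has a mean of 1.9 and a standard deviation of 0.6.
The resulting minimum sample size is 918, with Subjects per Predictor Parameter (SPP) = 36.72. For continuous outcomes the output reports SPP rather than EPP, because every subject contributes an outcome value.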
> pmsampsize(type = "c", rsquared = 0.2, parameters = 25, intercept = 1.9, sd = 0.6)
NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
NB: Assuming MMOE <= 1.1 in estimation of intercept & residual standard deviation
SPP - Subjects per Predictor Parameter
            Samp_size Shrinkage Parameter Rsq   SPP
Criteria 1        918     0.900        25 0.2 36.72
Criteria 2        401     0.801        25 0.2 16.04
Criteria 3        259     0.727        25 0.2 10.36
Criteria 4*       918     0.900        25 0.2 36.72
Final             918     0.900        25 0.2 36.72
Minimum sample size required for new model development based on user inputs = 918
* 95% CI for intercept = (1.87, 1.93), for sample size n = 918
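The SPP column is just the sample size for each criterion divided by the number of parameters, and the final answer is the largest of the four. The base R sketch below retraces this from the numbers printed above. The `234 + p` check for criterion 3 reflects the rule stated by Riley et al. for estimating the residual standard deviation with a multiplicative margin of error of 1.1; treat it as this post's reading of that paper rather than as the package's code.

```r
p <- 25  # candidate predictor parameters

# Sample sizes per criterion, copied from the pmsampsize output above.
n_by_criterion <- c(criterion_1 = 918, criterion_2 = 401,
                    criterion_3 = 259, criterion_4 = 918)

# Subjects per predictor parameter for each criterion.
spp <- round(n_by_criterion / p, 2)

# The recommended sample size is the maximum across criteria.
n_final <- max(n_by_criterion)

# Criterion 3 as a closed form (assumed from Riley et al.): n >= 234 + p.
n3_rule <- 234 + p

print(spp)
cat("final sample size:", n_final, "| criterion 3 rule:", n3_rule, "\n")
```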
That's all for today.
If this was helpful, remember to share it with anyone who needs it!