Department of Statistical Science, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies, Tokyo, Japan.
Office of Biostatistics, Department of Biometrics, Headquarters of Clinical Development, Otsuka Pharmaceutical Co., Ltd., Tokyo, Japan.
BMC Med Res Methodol. 2021 Jan 7;21(1):9. doi: 10.1186/s12874-020-01201-w.
Multivariable prediction models are important statistical tools for providing synthetic diagnosis and prognostic algorithms based on patients' multiple characteristics. Apparent measures of their predictive accuracy are usually biased upward (an overestimation known as 'optimism') relative to their actual performance in external populations. Existing statistical evidence and guidelines suggest that three bootstrap-based bias correction methods are preferable in practice, namely Harrell's bias correction and the .632 and .632+ estimators. Although Harrell's method has been widely adopted in clinical studies, simulation-based evidence indicates that the .632+ estimator may perform better than the other two methods. However, the actual comparative effectiveness of these methods is still unclear because numerical evidence is limited.
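As a rough illustration of how the three corrections combine their ingredients, the following minimal sketch takes already-computed C-statistics as inputs. The function and argument names are hypothetical and are not taken from the paper. The sketch assumes the usual textbook forms of the estimators, a no-information value of 0.5 for the C-statistic, and a relative overfitting rate restricted to the range 0 to 1; the study's exact conventions may differ.

```python
from statistics import mean


def harrell_corrected(c_app, c_boot_on_boot, c_boot_on_orig):
    """Harrell's bias correction (sketch).

    c_app          : apparent C-statistic of the model on the original data
    c_boot_on_boot : C-statistics of each bootstrap model on its own bootstrap sample
    c_boot_on_orig : C-statistics of the same bootstrap models on the original data
    """
    # Optimism is the average drop when a bootstrap model is applied to the original data.
    optimism = mean(b - o for b, o in zip(c_boot_on_boot, c_boot_on_orig))
    return c_app - optimism


def estimator_632(c_app, c_oob):
    """.632 estimator: fixed-weight average of the apparent and out-of-bag C-statistics."""
    return 0.368 * c_app + 0.632 * c_oob


def estimator_632_plus(c_app, c_oob, gamma=0.5):
    """.632+ estimator (sketch); gamma is the assumed no-information C-statistic."""
    # Assumption: an out-of-bag value below gamma is raised to gamma.
    c_oob_adj = max(c_oob, gamma)
    if c_app > gamma and c_app > c_oob_adj:
        # Relative overfitting rate, in [0, 1] by construction.
        r = (c_app - c_oob_adj) / (c_app - gamma)
    else:
        r = 0.0
    # The weight grows from 0.632 (no overfitting) towards 1 (severe overfitting).
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * c_app + w * c_oob_adj
```

In this form the .632+ estimate is never above the .632 estimate for the same inputs, because its weight on the out-of-bag value is at least 0.632.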
We conducted extensive simulation studies to compare the effectiveness of these three bootstrapping methods under various model-building strategies: conventional logistic regression, stepwise variable selection, Firth's penalized likelihood method, and ridge, lasso, and elastic-net regression. We generated the simulation data based on the Global Utilization of Streptokinase and Tissue plasminogen activator for Occluded coronary arteries (GUSTO-I) trial Western dataset and considered how the events per variable, the event fraction, the number of candidate predictors, and the regression coefficients of the predictors affected performance. The internal validity of the C-statistic was evaluated.
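For readers unfamiliar with the quantities named above, the following minimal sketch shows one common way to compute the C-statistic for a binary outcome (the proportion of event/non-event pairs in which the event has the higher predicted risk, with ties counted as one half) and the events per variable. The function names are hypothetical, and the paper's own implementation is not shown in this abstract.

```python
def c_statistic(y_true, y_score):
    """Concordance (C) statistic for binary outcomes coded 0/1."""
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    nonevents = [s for y, s in zip(y_true, y_score) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in nonevents:
            if e > n:
                concordant += 1.0   # event ranked above non-event
            elif e == n:
                concordant += 0.5   # tie counts as half
    return concordant / (len(events) * len(nonevents))


def events_per_variable(y_true, n_candidate_predictors):
    """Number of events divided by the number of candidate predictors."""
    return sum(y_true) / n_candidate_predictors
```

The pairwise loop is quadratic in the sample size, which is acceptable for illustration; a rank-based formula would be preferable for large datasets.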
Under relatively large sample settings (roughly, events per variable ≥ 10), the three bootstrap-based methods were comparable and performed well. However, all three methods were biased under small sample settings, and the directions and sizes of the biases were inconsistent. In general, Harrell's and the .632 methods had overestimation biases when the event fraction became larger. In addition, the .632+ method had a slight underestimation bias when the event fraction was very small. Although the bias of the .632+ estimator was relatively small, its root mean squared error (RMSE) was comparable to or sometimes larger than those of the other two methods, especially for the regularized estimation methods.
In general, the three bootstrap estimators were comparable, but the .632+ estimator performed relatively well under small sample settings, except when the regularized estimation methods were adopted.