Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America.
Virginia Commonwealth University School of Medicine, Richmond, VA, United States of America.
PLoS One. 2024 Jul 2;19(7):e0306359. doi: 10.1371/journal.pone.0306359. eCollection 2024.
Sleep is critical to a person's physical and mental health, and there is a need both to build high-performing machine learning models and to understand how such models rank covariates.
The study aimed to compare how different model metrics rank the importance of various covariates.
DESIGN, SETTING, AND PARTICIPANTS: A retrospective cross-sectional cohort study was conducted using the publicly available National Health and Nutrition Examination Survey (NHANES).
This study used univariate logistic models to select strong, independent covariates associated with the sleep disorder outcome. These covariates were then used in machine-learning models, and the best-performing model was chosen. That model was used to rank covariates by gain, cover, and frequency to identify risk factors for sleep disorder, and feature importance was also evaluated using univariable and multivariable t-statistics. A correlation matrix was created to determine how similar the variable importance rankings were across the different model metrics.
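The correlation-matrix step can be illustrated with a minimal sketch. It assumes each importance metric has already been reduced to one score per covariate; the scores, the metric names used as dictionary keys, and the helper functions `pearson` and `correlation_matrix` are hypothetical and are not taken from the study, which does not state how its matrix was computed.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(scores):
    """Pairwise Pearson correlations between importance metrics.

    `scores` maps a metric name to a list of importance values,
    one per covariate, in the same covariate order for every metric.
    """
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


# Hypothetical importance scores for five covariates (illustrative only,
# NOT values from the study).
importance = {
    "gain": [0.40, 0.25, 0.20, 0.10, 0.05],
    "cover": [0.35, 0.30, 0.15, 0.12, 0.08],
    "frequency": [0.30, 0.20, 0.25, 0.15, 0.10],
    "univariable_t": [1.2, 2.5, 0.8, 3.1, 2.0],
}

matrix = correlation_matrix(importance)
for row_name, row in matrix.items():
    print(row_name, {col: round(r, 2) for col, r in row.items()})
```

A rank-based coefficient such as Spearman's could be substituted by converting each score list to ranks before calling `pearson`; which coefficient is appropriate depends on whether the magnitudes or only the orderings of the importance scores are of interest.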
The XGBoost model had the highest mean AUROC of 0.865 (SD = 0.010), with accuracy of 0.762 (SD = 0.019), F1 of 0.875 (SD = 0.766), sensitivity of 0.768 (SD = 0.023), specificity of 0.782 (SD = 0.025), positive predictive value of 0.806 (SD = 0.025), and negative predictive value of 0.737 (SD = 0.034). The machine learning model metrics gain and cover were strongly positively correlated with one another (r > 0.70). Metrics from the multivariable and univariable models were weakly negatively correlated with the machine learning model metrics (r between -0.3 and 0).
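For reference, the threshold-based metrics reported above follow from the standard confusion-matrix definitions. The sketch below applies those definitions to hypothetical counts; the function name `classification_metrics` and the counts are illustrative assumptions, not data from the study, and AUROC is omitted because it requires predicted probabilities rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of PPV and sensitivity.
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
    }


# Hypothetical counts for a single test fold (illustrative only).
metrics = classification_metrics(tp=80, fp=20, tn=70, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Computing these per fold and then taking the mean and standard deviation across folds is one common way to arrive at summary figures of the form "mean (SD)".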
The rankings of important variables associated with sleep disorder in this cohort from the machine learning models were not related to those from the regression models.