Li Xiaolong, Ding Fan, Zhang Lu, Zhao Shi, Hu Zengyun, Ma Zhanbing, Li Feng, Zhang Yuhong, Zhao Yi, Zhao Yu
School of Public Health, Ningxia Medical University, Yinchuan, Ningxia, 750004, China.
NHC Key Laboratory of Metabolic Cardiovascular Diseases Research, Ningxia Medical University, Yinchuan, 750004, China.
BMC Public Health. 2025 Mar 26;25(1):1145. doi: 10.1186/s12889-025-22419-7.
The incidence of type 2 diabetes mellitus (T2DM) continues to rise, with a substantial impact on human health. Early prediction of pre-diabetes risk has become an important public health concern in recent years. Machine learning methods have proven effective in improving prediction accuracy, but existing approaches may lack interpretability regarding the underlying mechanisms. We therefore aimed to apply an interpretable machine learning approach to nationwide cross-sectional data to predict pre-diabetes risk and quantify the impact of potential risk factors.
LASSO regression was used for feature selection among 30 candidate factors and identified nine features with non-zero coefficients associated with pre-diabetes: age, triglycerides (TG), total cholesterol (TC), body mass index (BMI), apolipoprotein B (ApoB), total protein (TP), white blood cell (WBC) count, high-density lipoprotein cholesterol (HDL-C), and hypertension. Seven machine learning algorithms were compared on predictive performance: Extreme Gradient Boosting (XGBoost), Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB), Artificial Neural Networks (ANNs), Decision Trees (DT), and Logistic Regression (LR). The interpretable machine learning approach was intended both to improve the accuracy of pre-diabetes risk prediction and to quantify the impact and relative importance of each potential risk factor.
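The feature selection step can be illustrated with a minimal sketch of how an L1 penalty drives some coefficients to exactly zero. This is not the authors' pipeline: it assumes a squared-error LASSO fitted by cyclic coordinate descent, whereas the study's binary outcome may well have used a penalized logistic model. The data are synthetic, and the penalty value `lam`, the function names, and the feature indices are all hypothetical.

```python
import random

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; anything within [-gamma, gamma] becomes 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent.

    Assumes the columns of X and the vector y are already centred,
    so no intercept term is fitted.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = list(y)  # residual y - Xw, with w starting at zero
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
    return w

def centre_columns(X):
    """Subtract each column's mean."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

# Synthetic data (not study data): 6 candidate features, only 0 and 2 matter.
random.seed(0)
n_samples, n_features = 200, 6
X_raw = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y_raw = [2.0 * row[0] - 1.5 * row[2] + random.gauss(0, 0.1) for row in X_raw]

X = centre_columns(X_raw)
y_mean = sum(y_raw) / n_samples
y = [v - y_mean for v in y_raw]

w = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("coefficients:", [round(wj, 3) for wj in w])
print("selected feature indices:", selected)
```

Features whose coefficient stays at exactly zero are dropped; the rest are kept, which is the sense in which the abstract reports nine features with non-zero coefficients. In practice the penalty strength is usually chosen by cross-validation rather than fixed by hand as it is here.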
A total of 8,277 individuals were selected from the China Health and Nutrition Survey (CHNS) data, with a disease prevalence of 7.13%. The XGBoost model performed best, with an area under the curve (AUC) of 0.939, surpassing the RF, SVM, DT, ANN, NB, and LR models. In addition, SHapley Additive exPlanations (SHAP) analysis indicated that age, BMI, TC, ApoB, TG, hypertension, TP, HDL-C, and WBC count may serve as risk factors for pre-diabetes.
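The AUC used to compare the models can be computed directly from predicted scores. The sketch below uses the rank-based (Mann-Whitney) formulation, in which AUC is the probability that a randomly chosen positive case is scored above a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are assumptions for illustration and are unrelated to the study's data.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via mid-ranks (Mann-Whitney U statistic).

    y_true holds 0/1 labels; scores holds the model's predicted risk.
    """
    pairs = sorted(zip(scores, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[k] = mid_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined unless both classes are present")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)

# Toy example: four cases, two of them positive.
labels = [0, 0, 1, 1]
predicted_risk = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, predicted_risk))  # 0.75
```

Because AUC depends only on how the cases are ranked, it suits a low-prevalence outcome such as the 7.13% reported here, where raw accuracy would be dominated by the majority class.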
The resulting model relies on nine easily obtained predictors and was highly effective in predicting pre-diabetes risk. We also quantified the specific impact of each predictor on that risk and ranked the predictors by influence. The model may serve as a convenient tool for early identification of individuals at high risk of pre-diabetes and may help guide efforts to prevent progression from pre-diabetes to T2DM.
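The idea of quantifying and ranking each predictor's contribution can be sketched with exact Shapley values, the quantity that SHAP estimates efficiently for larger models. The example assumes a hypothetical three-feature risk score with made-up coefficients and a single baseline individual standing in for "feature absent"; none of the numbers come from the study, and the SHAP library itself uses faster, model-specific algorithms.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline value.
    Cost grows as 2^p, so this is only practical for a handful of features.
    """
    p = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else baseline[j] for j in range(p)]
        return model(z)

    phi = [0.0] * p
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Shapley weight for coalitions of this size that exclude feature j.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi

def toy_risk_score(z):
    """Hypothetical score with an age-by-BMI interaction (coefficients invented)."""
    age, bmi, tg = z
    return 0.02 * age + 0.05 * bmi + 0.3 * tg + 0.001 * age * bmi

feature_names = ["age", "BMI", "TG"]
individual = [60.0, 30.0, 2.5]       # hypothetical person being explained
reference = [45.0, 24.0, 1.5]        # hypothetical baseline person

phi = shapley_values(toy_risk_score, individual, reference)
ranking = sorted(zip(feature_names, phi), key=lambda item: abs(item[1]), reverse=True)
for name, contribution in ranking:
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the individual's score and the baseline score, which is what allows each predictor's share of the risk to be reported and ranked on a common scale.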