Department of Clinical Laboratory, Tianjin Medical University Cancer Institute & Hospital, Tianjin, 300060, China.
National Clinical Research Center for Cancer, Tianjin, 300060, China.
BMC Med Inform Decis Mak. 2023 Nov 29;23(1):276. doi: 10.1186/s12911-023-02377-z.
Breast cancer is the most common malignancy diagnosed in women worldwide. Its prevalence and incidence are increasing every year; early diagnosis combined with reliable detection of relapse is therefore an important strategy for improving prognosis. This study aimed to compare different machine learning (ML) algorithms and select the best model for predicting breast cancer recurrence. Prediction models were developed using eleven ML algorithms: logistic regression (LR), random forest (RF), support vector classification (SVC), extreme gradient boosting (XGBoost), gradient boosting decision tree (GBDT), decision tree, multilayer perceptron (MLP), linear discriminant analysis (LDA), adaptive boosting (AdaBoost), Gaussian naive Bayes (GaussianNB), and light gradient boosting machine (LightGBM). The area under the curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score were used to evaluate the performance of each model. The best-performing algorithm was selected on the basis of these metrics, and feature importance was ranked by Shapley Additive Explanation (SHAP) values. AdaBoost showed the best performance of the eleven algorithms in predicting breast cancer recurrence and was adopted to establish the prediction model. CA125, CEA, Fbg, and tumor diameter were the most important features in our dataset for predicting recurrence. More importantly, our study is the first to use the SHAP method to make an AdaBoost-based breast cancer recurrence prediction model more interpretable for clinicians. The AdaBoost-based model offers clinical decision support and successfully identifies breast cancer recurrence.
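The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `binary_metrics` and `auc`, the label coding (1 = recurrence, 0 = no recurrence), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration only.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix metrics, assuming 1 = recurrence and 0 = no recurrence."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num: float, den: float) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),      # recall for the recurrence class
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),              # precision
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),  # harmonic mean of PPV and sensitivity
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case receives a higher
    score than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and model-predicted recurrence scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = binary_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Comparing several algorithms, as the study does, would amount to computing this set of metrics for each model's predictions on the same held-out data and ranking the models on the results. AUC is computed from the continuous scores, while the other six metrics depend on the chosen decision threshold.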