Chen Jinxiang, Wang Miao, Zhao Defeng, Li Fuyi, Wu Hao, Liu Quanzhong, Li Shuqin
College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China.
Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC, 3000, Australia.
Interdiscip Sci. 2023 Mar;15(1):100-110. doi: 10.1007/s12539-022-00544-w. Epub 2022 Nov 9.
Microsatellite instability (MSI), an important mutator phenotype caused by DNA mismatch repair deficiency, is frequently observed in several tumor types. MSI is recognized as a critical molecular biomarker for diagnosis, prognosis, and treatment selection in several cancers. The current gold-standard methods for identifying MSI status rely on experimental analysis and are laborious, time-consuming, and costly. Although several machine learning-based computational methods have been proposed to identify MSI status, it remains unclear which machine learning model is best suited to MSI identification and which feature subset is most strongly related to MSI. Answering these questions would allow more effective machine learning-based methods to be developed and would improve the performance of MSI status identification. In this work, we present MSINGB, an NGBoost-based method for identifying MSI status from tumor somatic mutation annotation data. MSINGB first evaluates the prediction performance of 11 popular machine learning algorithms and 9 deep learning models in identifying MSI. Among these 20 models, NGBoost, a novel natural gradient boosting method, achieves the best overall performance. MSINGB then applies two feature selection strategies to find a compact feature subset that is strongly related to MSI, and uses the SHAP approach to interpret how the selected features affect the model's predictions. MSINGB achieves better prediction performance than state-of-the-art methods on both the tenfold cross-validation test and the independent test.
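The tenfold cross-validation protocol mentioned in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code. The stratified splitter, the single synthetic feature, the class sizes, and the `fit_threshold` stand-in model are all assumptions made for illustration, and the abstract does not say whether the authors stratified their folds. In a real pipeline the stand-in would be replaced by an actual classifier such as NGBoost.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping class ratios similar per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin across the k folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx


def fit_threshold(xs, ys):
    """Hypothetical stand-in model: midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def cross_validate(xs, ys, k=10, seed=0):
    """Return the per-fold accuracy of the stand-in model."""
    scores = []
    for train_idx, test_idx in stratified_kfold(ys, k, seed):
        threshold = fit_threshold([xs[i] for i in train_idx],
                                  [ys[i] for i in train_idx])
        # A sample is predicted positive when its feature exceeds the threshold.
        correct = sum((xs[i] > threshold) == (ys[i] == 1) for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Synthetic data (assumption): 30 positive and 170 negative samples,
# described by one made-up numeric feature.
data_rng = random.Random(42)
ys = [1] * 30 + [0] * 170
xs = [data_rng.gauss(5.0, 1.0) if y == 1 else data_rng.gauss(0.0, 1.0) for y in ys]

scores = cross_validate(xs, ys)
print(f"mean accuracy over {len(scores)} folds: {sum(scores) / len(scores):.3f}")
```

Stratifying the folds keeps the minority class represented in every test fold, which matters when one class is much rarer than the other, as in this synthetic example.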