Muthamilselvan Sangeetha, Palaniappan Ashok
Department of Bioinformatics, School of Chemical and Biotechnology, SASTRA University, Thanjavur, Tamil Nadu, India.
Front Bioinform. 2023 May 23;3:1103493. doi: 10.3389/fbinf.2023.1103493. eCollection 2023.
Breast cancer is the foremost cancer in worldwide incidence, having surpassed lung cancer even though it occurs predominantly in women. One in four cancer cases among women is attributable to breast cancer, which is also the leading cause of cancer death in women. Reliable options for early detection of breast cancer are needed. Using public-domain datasets, we screened transcriptomic profiles of breast cancer samples and identified progression-significant linear and ordinal model genes using stage-informed models. We then applied a sequence of machine learning techniques, namely feature selection, principal components analysis, and k-means clustering, to train a learner to discriminate "cancer" from "normal" based on the expression levels of the identified biomarkers. Our computational pipeline yielded an optimal set of nine biomarker features for training the learner: NEK2, PKMYT1, MMP11, CPA1, COL10A1, HSD17B13, CA4, MYOC, and LYVE1. Validation of the learned model on an independent test dataset yielded an accuracy of 99.5%. Blind validation on an out-of-domain external dataset yielded a balanced accuracy of 95.5%, demonstrating that the model has effectively reduced the dimensionality of the problem and learnt the solution. The model was rebuilt using the full dataset and then deployed as a web app for non-profit purposes at: https://apalania.shinyapps.io/brcadx/. To our knowledge, this is the best-performing freely available tool for the high-confidence diagnosis of breast cancer, and it represents a promising aid to medical diagnosis.
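The clustering and scoring steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it skips feature selection and principal components analysis, uses made-up two-dimensional coordinates as stand-ins for projected samples, seeds k-means with the first and last point, and assumes each cluster is named by majority vote of known labels. Balanced accuracy is computed as the mean of sensitivity and specificity, which is why it is informative when "cancer" samples outnumber "normal" ones.

```python
from collections import Counter
from statistics import mean


def kmeans_two_clusters(points, n_iter=20):
    """Plain Lloyd's k-means with k=2.

    Seeding with the first and last point is an assumption made
    here only to keep the sketch deterministic.
    """
    centroids = [points[0], points[-1]]
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignments = [
            min((0, 1), key=lambda c: sum((x - m) ** 2
                                          for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in (0, 1):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return assignments


def balanced_accuracy(y_true, y_pred, positive="cancer"):
    """Mean of sensitivity and specificity for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


# Hypothetical toy data (NOT from the study): six "cancer" and three
# "normal" samples, one cancer sample lying close to the normal group.
points = [(2.1, 0.3), (1.8, -0.2), (-1.5, 0.0), (2.5, 0.1), (1.9, 0.4),
          (2.2, -0.1), (-2.0, 0.1), (-1.7, -0.3), (-2.2, 0.2)]
labels = ["cancer"] * 6 + ["normal"] * 3

clusters = kmeans_two_clusters(points)

# Name each cluster after the majority label of its members (assumption).
cluster_label = {
    c: Counter(l for l, a in zip(labels, clusters) if a == c).most_common(1)[0][0]
    for c in (0, 1)
}
predicted = [cluster_label[a] for a in clusters]

accuracy = mean(t == p for t, p in zip(labels, predicted))
bal_acc = balanced_accuracy(labels, predicted)
print(f"accuracy={accuracy:.3f} balanced_accuracy={bal_acc:.3f}")
```

In this toy case one cancer sample falls into the "normal" cluster, so sensitivity drops while specificity stays perfect. Plain accuracy and balanced accuracy then differ, which is the kind of gap that makes it useful to report both figures.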