Shah Syed Naseer Ahmad, Parveen Rafat
Department of Computer Science, Jamia Millia Islamia, New Delhi, India.
Biomarkers. 2025 Mar;30(2):200-215. doi: 10.1080/1354750X.2025.2461698. Epub 2025 Feb 10.
Lung cancer is a major global health concern, responsible for a considerable portion of cancer-related deaths worldwide. Understanding its molecular complexity is crucial for identifying potential treatment targets. The goal is to intervene early and slow disease progression, preventing the development of advanced lung cancer. Hence, there is an urgent need for new biomarkers that can detect lung cancer in its early stages.
The study conducted RNA-Seq analysis of lung cancer samples from the publicly available SRA database (NCBI SRP009408), including both control and tumour samples. Genes differentially expressed between tumour and healthy tissues were identified using R and Bioconductor. Five machine learning (ML) techniques (Random Forest, Lasso, XGBoost, Gradient Boosting and Elastic Net) were employed to select significant genes, and samples were then classified with Multilayer Perceptron (MLP), Support Vector Machine (SVM) and k-Nearest Neighbours (k-NN) classifiers. Gene ontology and pathway analyses were performed on the significant differentially expressed genes (DEGs). The top genes from the DEG and machine learning analyses were combined for protein-protein interaction (PPI) analysis, identifying 10 hub genes essential for lung cancer progression.
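To make the differential expression step concrete, the following is a minimal sketch of calling up- and downregulated genes by log2 fold change. It is a simplified stand-in and not the authors' pipeline: the abstract says only that R and Bioconductor were used, without naming a package or cut-off. The function names, the pseudocount, the threshold of 1.0 and the toy expression values are all assumptions made for illustration, and the sketch omits the statistical testing that a real analysis would include.

```python
import math
from statistics import mean


def log2_fold_change(tumour, control, pseudocount=1.0):
    """log2 ratio of mean tumour to mean control expression.

    The pseudocount (an assumed value) avoids division by zero for
    genes with no reads in one of the groups.
    """
    return math.log2((mean(tumour) + pseudocount) / (mean(control) + pseudocount))


def call_degs(expression, threshold=1.0):
    """Split genes into up- and downregulated lists by |log2FC| >= threshold."""
    up, down = [], []
    for gene, (tumour, control) in expression.items():
        lfc = log2_fold_change(tumour, control)
        if lfc >= threshold:
            up.append(gene)
        elif lfc <= -threshold:
            down.append(gene)
    return sorted(up), sorted(down)


# Hypothetical normalised counts: gene -> (tumour samples, control samples).
# The gene names and values are placeholders, not data from the study.
toy_expression = {
    "GENE_A": ([31.0, 31.0, 31.0], [7.0, 7.0, 7.0]),      # higher in tumour
    "GENE_B": ([3.0, 3.0, 3.0], [15.0, 15.0, 15.0]),      # lower in tumour
    "GENE_C": ([10.0, 12.0, 11.0], [11.0, 10.0, 12.0]),   # unchanged
}

up_genes, down_genes = call_degs(toy_expression)
print("upregulated:", up_genes)
print("downregulated:", down_genes)
```

In practice a fold-change filter like this would be paired with a multiple-testing-corrected significance test, so the sketch shows only the direction-and-magnitude part of DEG calling.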
The integrated analysis of ML-selected genes and DEGs highlighted the significance of specific genes in lung cancer samples. It identified the top 5 upregulated genes (COL11A1, TOP2A, SULF1, DIO2, MIR196A2) and the top 5 downregulated genes (PDK4, FOSB, FLYWCH1, CYB5D2, MIR328), along with associated genes implicated in the same pathways or co-expression networks. Among the various algorithms employed, Random Forest and XGBoost proved effective in identifying common genes, underscoring the potential significance of those genes in lung cancer pathogenesis. The MLP exhibited the highest accuracy in classifying samples using all genes. Additionally, the PPI analysis identified 10 hub genes that are pivotal in lung cancer pathogenesis: COL1A1, SOX2, SPP1, THBS2, POSTN, COL5A1, COL11A1, TIMP1, TOP2A and PKP1.
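As an illustration of how hub genes can be picked from a PPI network, the sketch below ranks nodes by degree, meaning the number of distinct interaction partners. This is an assumption: the abstract does not say which ranking criterion or tool the authors used, and degree is only one common choice. The function name `top_hub_genes` and the toy edge list with placeholder node names are hypothetical and do not represent real interactions.

```python
from collections import Counter


def top_hub_genes(edges, k=10):
    """Return the k highest-degree nodes of an undirected network.

    Duplicate edges (in either orientation) and self-loops are ignored.
    Ties are broken alphabetically so the result is deterministic.
    """
    unique_edges = {tuple(sorted(edge)) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for a, b in unique_edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical interaction list using placeholder protein names.
toy_edges = [
    ("P1", "P2"),
    ("P1", "P3"),
    ("P1", "P4"),
    ("P2", "P3"),
    ("P3", "P2"),  # duplicate of the previous edge, reversed
    ("P4", "P5"),
]

hubs = top_hub_genes(toy_edges, k=2)
print(hubs)
```

Other centrality measures, such as betweenness or closeness, can rank the same network differently, so the choice of criterion affects which genes are reported as hubs.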
The study contributes to the early prediction of lung cancer by identifying potential biomarkers that could enhance early diagnosis and pave the way for practical clinical applications in the future. Integrating DEGs and machine learning-derived significant genes for PPI analysis offers a robust approach to uncovering critical molecular targets for lung cancer treatment.