Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China.
Health Science Center, Hubei Minzu University, Enshi, 445000, China.
Adv Sci (Weinh). 2024 Aug;11(30):e2308243. doi: 10.1002/advs.202308243. Epub 2024 Jun 17.
Cell-free DNA (cfDNA) fragmentation patterns have immense potential for early cancer detection. However, the definition of fragmentation varies, ranging from the entire genome to specific genomic regions. These patterns have not been systematically compared, impeding broader research and practical implementation. Here, 1382 plasma cfDNA sequencing samples from 8 cancer types are collected. Considering that cfDNA within open chromatin regions is more susceptible to fragmentation, 10 fragmentation patterns within open chromatin regions are examined as features, and machine learning techniques are employed to evaluate their performance. All fragmentation patterns demonstrated discernible classification capabilities, with the end motif showing the highest diagnostic value in cross-validation. Combining the cross-validation and independent validation results revealed that fragmentation patterns incorporating both fragment length and coverage information exhibited robust predictive capacities. Despite their diagnostic potential, the predictive power of individual fragmentation patterns is unstable. To address this limitation, an ensemble classifier that integrates all fragmentation patterns is developed, which demonstrated notable improvements in cancer detection and tissue-of-origin determination. Further functional bioinformatics investigations of the significant feature intervals in the model revealed its ability to identify critical regulatory regions involved in cancer pathogenesis.
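The abstract names the end motif as the best-performing single pattern in cross-validation but does not define how it is computed. The following is a minimal sketch of one common formulation, assuming the end motif is the 4-mer at the 5' end of each fragment and that the feature vector is the relative frequency of all 256 possible 4-mers. The function name `end_motif_frequencies`, the parameter `k`, and the toy sequences are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import product


def end_motif_frequencies(fragments, k=4):
    """Return the relative frequency of each k-mer found at fragment 5' ends.

    Assumption: the end motif is the first k bases of each fragment sequence.
    """
    # Enumerate all 4**k motifs so every sample yields a vector of the same length.
    motifs = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for seq in fragments:
        motif = seq[:k].upper()
        # Skip fragments that are too short or contain ambiguous bases such as N.
        if len(motif) == k and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    return {m: (counts[m] / total if total else 0.0) for m in motifs}


# Toy input: hypothetical fragment sequences, not data from the study.
fragments = ["CCCAGTTA", "CCCATGAC", "CCAGGATT", "NNCAGGTA", "ACG"]
freqs = end_motif_frequencies(fragments)
```

In practice the fragment sequences would be read from aligned sequencing data, possibly restricted to fragments overlapping open chromatin regions. The resulting 256-value vector per sample could then serve as input to a classifier.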