Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA.
Cellino Bio, 750 Main Street, Cambridge, MA, 02143, USA.
Sci Rep. 2024 Mar 25;14(1):7028. doi: 10.1038/s41598-024-57439-7.
Accurate indel calling plays an important role in precision medicine. A benchmarking indel set is essential for thoroughly evaluating the indel calling performance of bioinformatics pipelines. A reference sample with a set of known-positive variants was developed in the FDA-led Sequencing Quality Control Phase 2 (SEQC2) project, but the number of known indels in that known-positive set was limited. This project sought to provide an enriched set of known indels that would be more translationally relevant by focusing on additional cancer-related regions. A thorough manual review process completed by 42 reviewers, two advisors, and a judging panel of three researchers enriched the known indel set by an additional 516 indels. The extended benchmarking indel set covers a large range of variant allele frequencies (VAFs), and 87% of the indels have a VAF below 20% in reference Sample A. Reference Sample A and the indel set can therefore be used to benchmark indel calling across a wider range of VAF values, particularly at low VAF. Indel length was also variable, but the majority were under 10 base pairs (bp). Most of the indels were within coding regions, with the remainder in gene regulatory regions. Although high confidence can be derived from the robust study design and meticulous human review, this extended indel set has not undergone orthogonal validation. The extended benchmarking indel set, along with the indels in the previously published known-positive set, was the truth set used to benchmark indel calling pipelines in a community challenge hosted on the precisionFDA platform. This benchmarking indel set and the reference samples can be used for a comprehensive evaluation of indel calling pipelines. Additionally, the insights and solutions obtained during the manual review process can aid in improving the performance of these pipelines.
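The two summary statistics quoted in the abstract (the share of indels with a VAF below 20% and the share shorter than 10 bp) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Indel` class, the `fraction` helper, and the toy records are hypothetical, and VAF is assumed here to be the number of reads supporting the alternate allele divided by the total read depth at the site.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Indel:
    """Hypothetical record for one indel call (not the paper's data model)."""
    ref: str          # reference allele, VCF-style with anchor base
    alt: str          # alternate allele, VCF-style with anchor base
    alt_reads: int    # reads supporting the alternate allele
    total_reads: int  # total read depth at the site

    @property
    def vaf(self) -> float:
        # Assumed definition: alt-supporting reads / total depth.
        if self.total_reads <= 0:
            raise ValueError("total_reads must be positive")
        return self.alt_reads / self.total_reads

    @property
    def length(self) -> int:
        # Indel length in bp: difference between allele lengths.
        return abs(len(self.alt) - len(self.ref))


def fraction(indels: Iterable[Indel], predicate: Callable[[Indel], bool]) -> float:
    """Return the share of indels for which predicate is true."""
    items = list(indels)
    if not items:
        raise ValueError("indel set is empty")
    return sum(1 for indel in items if predicate(indel)) / len(items)


# Toy data for illustration only; these are NOT values from the study.
TOY_INDELS = [
    Indel(ref="A", alt="ATG", alt_reads=5, total_reads=100),            # 2 bp insertion
    Indel(ref="CTTA", alt="C", alt_reads=12, total_reads=80),           # 3 bp deletion
    Indel(ref="G", alt="GACGTACGTACGT", alt_reads=30, total_reads=60),  # 12 bp insertion
    Indel(ref="TAG", alt="T", alt_reads=9, total_reads=90),             # 2 bp deletion
]

low_vaf_share = fraction(TOY_INDELS, lambda indel: indel.vaf < 0.20)
short_share = fraction(TOY_INDELS, lambda indel: indel.length < 10)

print(f"share with VAF < 20%: {low_vaf_share:.2f}")
print(f"share shorter than 10 bp: {short_share:.2f}")
```

Applied to a real truth set, the same two predicates would be evaluated over every record after parsing the allele and read-depth fields from the variant file; the parsing step is omitted here because the file layout is not described in the abstract.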