Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery.

Author information

Bioinformatics Institute, St. Petersburg, Russia.

Department of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology and Reproductology, St. Petersburg, Russia.

Publication information

BMC Genomics. 2022 Feb 22;23(1):155. doi: 10.1186/s12864-022-08365-3.

Abstract

BACKGROUND

Accurate variant detection in the coding regions of the human genome is a key requirement for molecular diagnostics of Mendelian disorders. Efficiency of variant discovery from next-generation sequencing (NGS) data depends on multiple factors, including reproducible coverage biases of NGS methods and the performance of read alignment and variant calling software. Although variant caller benchmarks are published constantly, no previous publications have leveraged the full extent of available gold standard whole-genome (WGS) and whole-exome (WES) sequencing datasets.

RESULTS

In this work, we systematically evaluated the performance of 4 popular short-read aligners (Bowtie2, BWA, Isaac, and Novoalign) and 9 novel and well-established variant calling and filtering methods (Clair3, DeepVariant, Octopus, GATK, FreeBayes, and Strelka2) using a set of 14 "gold standard" WES and WGS datasets available from the Genome in a Bottle (GIAB) consortium. Additionally, we indirectly evaluated each pipeline's performance using a set of 6 non-GIAB samples of African and Russian ethnicity. In our benchmark, Bowtie2 performed significantly worse than the other aligners, suggesting it should not be used for medical variant calling. When the other aligners were considered, the accuracy of variant discovery depended mostly on the variant caller and not on the read aligner. Among the tested variant callers, DeepVariant consistently showed the best performance and the highest robustness. Other actively developed tools, such as Clair3, Octopus, and Strelka2, also performed well, although their efficiency depended more strongly on the quality and type of the input data. We also compared the consistency of variant calls in GIAB and non-GIAB samples. With a few important caveats, the best-performing tools showed little evidence of overfitting.

CONCLUSIONS

The results show surprisingly large differences in the performance of cutting-edge tools, even in high-confidence regions of the coding genome. This highlights the importance of regular benchmarking of quickly evolving tools and pipelines. We also discuss the need for a more diverse set of gold standard genomes that would include samples of African, Hispanic, or mixed ancestry. There is also a need for better variant caller assessment in the repetitive regions of the coding genome.
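
As an illustration of how pipeline comparisons like the one described in the abstract are commonly summarized, the minimal Python sketch below computes precision, recall, and F1 from counts of true-positive, false-positive, and false-negative calls against a truth set. The abstract does not state which metrics the study used, so the choice of metrics, the names `CallMetrics` and `summarize_calls`, the pipeline labels, and all counts are illustrative assumptions and not data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallMetrics:
    precision: float
    recall: float
    f1: float


def summarize_calls(tp: int, fp: int, fn: int) -> CallMetrics:
    """Summarize variant calls compared against a truth set.

    tp: calls that match a truth-set variant
    fp: calls that are absent from the truth set
    fn: truth-set variants that were not called
    """
    if min(tp, fp, fn) < 0:
        raise ValueError("counts must be non-negative")
    # Guard against empty denominators so an empty call set yields 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return CallMetrics(precision, recall, f1)


# Hypothetical counts for two aligner + caller combinations
# (placeholder labels and numbers, not results from the study).
pipelines = {
    "aligner_A+caller_X": (990, 10, 10),
    "aligner_B+caller_Y": (950, 50, 30),
}

for name, counts in pipelines.items():
    m = summarize_calls(*counts)
    print(f"{name}: precision={m.precision:.3f} "
          f"recall={m.recall:.3f} F1={m.f1:.3f}")
```

Computing the same summary for every aligner and caller combination, separately for each sample and data type (WES or WGS), is one way to separate the aligner's contribution to accuracy from the caller's.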

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1c22/8862519/aad72e306a23/12864_2022_8365_Fig1_HTML.jpg
