Division of Genetics and Epidemiology, Institute of Cancer Research, Surrey, United Kingdom.
PLoS One. 2012;7(11):e49110. doi: 10.1371/journal.pone.0049110. Epub 2012 Nov 12.
Pipelines for the analysis of Next-Generation Sequencing (NGS) data are generally composed of a set of different publicly available software tools, configured together in order to map short reads of a genome and call variants. The fidelity of pipelines is variable. We have developed ArtificialFastqGenerator, which takes a reference genome sequence as input and outputs artificial paired-end FASTQ files containing Phred quality scores. Since these artificial FASTQs are derived from the reference genome, they provide a gold standard for read alignment and variant calling, thereby enabling the performance of any NGS pipeline to be evaluated. The user can customise DNA template/read length, the modelling of coverage based on GC content, whether to use real Phred base quality scores taken from existing FASTQ files, and whether to simulate sequencing errors. Detailed coverage and error summary statistics are output. Here we describe ArtificialFastqGenerator and illustrate its implementation in evaluating a typical bespoke NGS analysis pipeline under different experimental conditions. ArtificialFastqGenerator was released in January 2012. Source code, example files and binaries are freely available under the terms of the GNU General Public License v3.0 from https://sourceforge.net/projects/artfastqgen/.
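The general idea of deriving paired-end reads from a reference sequence can be illustrated with a minimal Python sketch. This is not ArtificialFastqGenerator's own code or algorithm: the function names (`simulate_pairs`, `add_errors`), the fixed sliding step, the uniform quality score and the read-naming scheme are all assumptions made for illustration, and GC-dependent coverage is omitted. The sketch slides a template window along the reference, takes the forward read from the start of the template and the reverse read from the reverse complement of its end, and optionally substitutes bases with probability 10^(-Q/10), the error probability implied by a Phred score Q.

```python
import random

# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def phred_to_char(q):
    """Encode a Phred score as a FASTQ quality character (Phred+33)."""
    return chr(q + 33)


def add_errors(read, quals, rng):
    """Substitute each base with probability 10^(-Q/10) for its Phred score Q."""
    out = []
    for base, q in zip(read, quals):
        if rng.random() < 10 ** (-q / 10):
            # Pick a different base uniformly at random (illustrative error model).
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


def simulate_pairs(reference, template_len=20, read_len=8, step=4,
                   quality=30, simulate_errors=False, seed=0):
    """Return a list of (read1_record, read2_record) FASTQ strings.

    All parameters are hypothetical; a uniform quality score is assumed.
    """
    rng = random.Random(seed)
    quals = [quality] * read_len
    qual_str = "".join(phred_to_char(q) for q in quals)
    pairs = []
    for start in range(0, len(reference) - template_len + 1, step):
        template = reference[start:start + template_len]
        fwd = template[:read_len]
        rev = reverse_complement(template)[:read_len]
        if simulate_errors:
            fwd = add_errors(fwd, quals, rng)
            rev = add_errors(rev, quals, rng)
        name = f"@ref:{start + 1}"  # 1-based template start, known by construction
        pairs.append((f"{name}/1\n{fwd}\n+\n{qual_str}",
                      f"{name}/2\n{rev}\n+\n{qual_str}"))
    return pairs


if __name__ == "__main__":
    for r1, r2 in simulate_pairs("ACGTACGTAGCTAGCTAGGATCCA"):
        print(r1)
        print(r2)
```

Because each read name records the template's true origin, the output of an aligner run on such reads can be compared against the known positions, which is the sense in which reference-derived reads act as a gold standard.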