Cormier Nathan, Kolisnik Tyler, Bieda Mark
Department of Biochemistry and Molecular Biology, University of Calgary Cumming School of Medicine, Rm HSC1151, 3330 Hospital Dr. NW, Calgary, AB, T2N4N1, Canada.
BMC Bioinformatics. 2016 Jul 5;17(1):270. doi: 10.1186/s12859-016-1125-3.
Use of chromatin immunoprecipitation followed by sequencing (ChIP-seq) has expanded enormously. Analysis of large-scale ChIP-seq datasets involves a complex series of steps and the production of several specialized graphical outputs. A number of systems have emphasized custom development of ChIP-seq pipelines. These systems either rely on custom programming of a single, complex pipeline or supply libraries of modules, and they do not produce the full range of outputs commonly generated for ChIP-seq datasets. More comprehensive pipelines are desirable, in particular ones that address common metadata tasks, such as pathway analysis, and ones that produce standard complex graphical outputs. Ideally, these would be highly modular systems, available as both turnkey pipelines and individual modules, that are easy to understand, modify and extend so that they can be altered rapidly in response to new analysis developments in this growing area. Data provenance tracking is a further advantage.
We present a set of 20 ChIP-seq analysis software modules implemented in the Kepler workflow system; most (18/20) were also implemented as standalone, fully functional R scripts. The set consists of four full turnkey pipelines and 16 component modules. The turnkey pipelines in Kepler allow data provenance tracking. Implementation emphasized the use of common R packages and widely used external tools (e.g., MACS for peak finding), along with custom programming. This software provides comprehensive solutions and easily repurposed code blocks for ChIP-seq analysis and pipeline creation. Tasks include mapping raw reads, peak finding via MACS, summary statistics, peak location statistics, summary plots centered on the transcription start site (TSS), gene ontology, pathway analysis, and de novo motif finding, among others.
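To make the "peak location statistics" and TSS-centered summaries mentioned above more concrete, the following is a minimal sketch of one such calculation: the strand-aware distance from each peak midpoint to the nearest TSS, tallied into fixed-width bins. It is written in Python for illustration only and is not the authors' Kepler or R implementation. The function names, the tuple layouts for peaks and TSS records, and the bin and window sizes are all assumptions introduced here.

```python
from bisect import bisect_left
from collections import Counter, defaultdict


def index_tss(tss_records):
    """Group TSS positions by chromosome and sort them for binary search.

    tss_records: iterable of (chrom, position, strand) tuples (assumed layout).
    """
    by_chrom = defaultdict(list)
    for chrom, pos, strand in tss_records:
        by_chrom[chrom].append((pos, strand))
    for chrom in by_chrom:
        by_chrom[chrom].sort()
    return by_chrom


def signed_distance_to_nearest_tss(peak, tss_index):
    """Return the distance from the peak midpoint to the nearest TSS.

    peak: (chrom, start, end) tuple (assumed layout).
    Negative values mean the peak lies upstream of the TSS relative to the
    gene's strand. Returns None if the chromosome has no annotated TSS.
    """
    chrom, start, end = peak
    sites = tss_index.get(chrom)
    if not sites:
        return None
    mid = (start + end) // 2
    # Only the TSS just before and just after the midpoint can be nearest.
    i = bisect_left(sites, (mid,))
    candidates = sites[max(i - 1, 0):i + 1]
    pos, strand = min(candidates, key=lambda site: abs(mid - site[0]))
    offset = mid - pos
    return offset if strand == "+" else -offset


def bin_distances(distances, bin_size=500, window=2000):
    """Count distances in fixed-width bins within +/- window of the TSS.

    Each key is the lower edge of a bin; distances outside the window and
    None values are skipped.
    """
    counts = Counter()
    for d in distances:
        if d is not None and -window <= d < window:
            counts[(d // bin_size) * bin_size] += 1
    return dict(sorted(counts.items()))


# Hypothetical toy data, not taken from any real dataset.
example_tss = [("chr1", 1000, "+"), ("chr1", 5000, "-"), ("chr2", 3000, "+")]
example_peaks = [
    ("chr1", 900, 1000),
    ("chr1", 5100, 5300),
    ("chr2", 3400, 3600),
    ("chr3", 10, 20),
]

tss_index = index_tss(example_tss)
example_distances = [
    signed_distance_to_nearest_tss(p, tss_index) for p in example_peaks
]
print(example_distances)
print(bin_distances(example_distances))
```

The binned counts are the kind of table that a TSS-centered summary plot could be drawn from. In practice, peak coordinates would come from a peak finder such as MACS and TSS positions from a gene annotation, rather than from hard-coded tuples.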
These pipelines range from those performing a single task to those performing full analyses of ChIP-seq data. The pipelines are supplied as both Kepler workflows, which allow data provenance tracking, and, in the majority of cases, as standalone R scripts. These pipelines are designed for ease of modification and repurposing.