Kuan Pei Fen, Chung Dongjun, Pan Guangjin, Thomson James A, Stewart Ron, Keleş Sündüz
Departments of Statistics and of Biostatistics and Medical Informatics.
Genome Center of Wisconsin and Morgridge Institute for Research.
J Am Stat Assoc. 2011;106(495):891-903. doi: 10.1198/jasa.2011.ap09706. Epub 2012 Jan 24.
Chromatin immunoprecipitation followed by sequencing (ChIP-Seq) has revolutionized genome-wide profiling of DNA-binding proteins, histone modifications, and nucleosome occupancy. As the cost of sequencing decreases, many researchers are switching from microarray-based technologies (ChIP-chip) to ChIP-Seq for genome-wide studies of transcriptional regulation. Despite the increasing and well-deserved popularity of ChIP-Seq, little work has investigated and accounted for the sources of bias in the technology. These biases typically arise from both the standard pre-processing protocol and the underlying DNA sequence of the generated data. We study data from a naked DNA sequencing experiment, which sequences non-cross-linked DNA after deproteinizing and shearing, to understand the factors that affect the background distribution of data generated in a ChIP-Seq experiment. We introduce a background model that accounts for apparent sources of bias, such as mappability and GC content, and develop a flexible mixture model named MOSAiCS for detecting peaks in both one- and two-sample analyses of ChIP-Seq data. We illustrate that our model fits observed ChIP-Seq data well and further demonstrate the advantages of MOSAiCS over commonly used tools for ChIP-Seq data analysis with several case studies.
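The abstract does not give the form of the background or mixture model, so the following is only an illustrative sketch of the general idea, not the MOSAiCS implementation. It assumes a two-component mixture over genomic bins in which the background read count is Poisson (chosen here for brevity) with a mean that depends on the bin's mappability and GC content. The function names, the log-linear form, and every coefficient value are hypothetical.

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability of observing count k with mean mu (computed in log space)."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def background_mean(mappability, gc, b0=0.5, b_map=1.5, b_gc=-4.0, gc_opt=0.45):
    """Hypothetical background mean for one bin.

    Assumption: expected background reads rise with mappability and fall
    as GC content moves away from an assumed optimum (gc_opt).
    """
    return math.exp(b0 + b_map * mappability + b_gc * (gc - gc_opt) ** 2)


def posterior_bound(count, mappability, gc, pi_bound=0.05, enrichment=10.0):
    """Posterior probability that a bin belongs to the enriched component.

    Assumption: enriched bins carry the same background plus an extra
    Poisson signal with mean `enrichment`; pi_bound is the assumed prior
    fraction of enriched bins.
    """
    mu_bg = background_mean(mappability, gc)
    p_bg = (1.0 - pi_bound) * poisson_pmf(count, mu_bg)
    p_sig = pi_bound * poisson_pmf(count, mu_bg + enrichment)
    return p_sig / (p_bg + p_sig)


if __name__ == "__main__":
    # The same read count is scored differently depending on the bin's covariates.
    for mappability in (0.2, 1.0):
        prob = posterior_bound(count=8, mappability=mappability, gc=0.45)
        print(f"mappability={mappability:.1f}  P(enriched | count=8) = {prob:.3f}")
```

Under these assumptions, a given read count is stronger evidence of enrichment in a bin with low expected background (for example, low mappability) than in a bin where the background alone would produce similar counts, which is the motivation for modeling these covariates before calling peaks.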