Sun Xiaoxiao, Dalpiaz David, Wu Di, Liu Jun S, Zhong Wenxuan, Ma Ping
Department of Statistics, University of Georgia, 101 Cedar Street, Athens, 30602, USA.
Department of Statistics, University of Illinois at Urbana-Champaign, 725 South Wright Street, Champaign, 61820, USA.
BMC Bioinformatics. 2016 Aug 26;17(1):324. doi: 10.1186/s12859-016-1180-9.
Accurate identification of differentially expressed (DE) genes in time course RNA-Seq data is crucial for understanding the dynamics of transcriptional regulatory networks. However, most available methods treat gene expression at different time points as replicates and test the significance of the mean expression difference between treatments or conditions irrespective of time. They thus fail to identify many DE genes whose profiles differ across time. In this article, we propose a negative binomial mixed-effect model (NBMM) to identify DE genes in time course RNA-Seq data. In the NBMM, mean gene expression is characterized by a fixed effect, and time dependency is described by random effects. The NBMM is flexible and can be fitted to both unreplicated and replicated time course RNA-Seq data via a penalized likelihood method. By comparing gene expression profiles over time, we further classify the DE genes into two subtypes to enhance the understanding of expression dynamics. A significance test for detecting DE genes is derived using a Kullback-Leibler distance ratio. Additionally, a significance test for gene sets is developed using a gene set score.
Simulation analysis shows that the NBMM outperforms currently available methods for detecting DE genes and gene sets. Moreover, our analysis of fruit fly developmental time course RNA-Seq data demonstrates that the NBMM identifies biologically relevant genes, which are well supported by gene ontology analysis.
The proposed method is powerful and efficient for detecting biologically relevant DE genes and gene sets in time course RNA-Seq data.
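As a rough illustration of the modeling idea in the abstract, the sketch below evaluates a penalized negative binomial log-likelihood for one gene observed over several time points. It is not the authors' implementation. The log link, the second-difference roughness penalty, and the names `nb_log_pmf`, `penalized_neg_loglik`, `size`, and `lam` are all assumptions made for this example; the abstract states only that a fixed effect captures mean expression, random effects capture time dependency, and fitting uses a penalized likelihood.

```python
import math


def nb_log_pmf(y, mu, size):
    """Log-probability of count y under a negative binomial distribution
    with mean mu and dispersion parameter size (variance = mu + mu**2 / size)."""
    return (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )


def penalized_neg_loglik(counts, beta, f, size, lam):
    """Negative NB log-likelihood plus a roughness penalty on the time effects.

    counts: observed read counts, one per time point
    beta:   fixed effect (mean expression on the log scale)
    f:      time-specific deviations, one per time point
            (assumed stand-in for the random effects)
    size:   NB dispersion parameter
    lam:    smoothing parameter (assumed name and form)
    """
    # Assumed log link: log(mu_j) = beta + f[j].
    loglik = sum(
        nb_log_pmf(y, math.exp(beta + fj), size) for y, fj in zip(counts, f)
    )
    # Squared second differences penalize wiggly time profiles;
    # a perfectly linear trend in f incurs no penalty.
    penalty = sum(
        (f[j - 1] - 2 * f[j] + f[j + 1]) ** 2 for j in range(1, len(f) - 1)
    )
    return -loglik + lam * penalty
```

In this sketch, minimizing `penalized_neg_loglik` over `beta` and `f` with any general-purpose optimizer would give a smoothed expression profile for the gene. Larger values of `lam` pull the fitted time effects toward a straight line.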