Finding the active genes in deep RNA-seq gene expression studies.

Author information

Hart Traver, Komori H Kiyomi, LaMere Sarah, Podshivalova Katie, Salomon Daniel R

Affiliation

Donnelly Centre, Banting & Best Department of Medical Research, University of Toronto, Toronto, Canada.

Publication information

BMC Genomics. 2013 Nov 11;14:778. doi: 10.1186/1471-2164-14-778.

Abstract

BACKGROUND

Early application of second-generation sequencing technologies to transcript quantitation (RNA-seq) has hinted at a vast mammalian transcriptome, including transcripts from nearly all known genes, which might be fully measured only by ultradeep sequencing. Subsequent studies suggested that low-abundance transcripts might be the result of technical or biological noise rather than active transcripts; moreover, most RNA-seq experiments did not provide enough read depth to generate high-confidence estimates of gene expression for low-abundance transcripts. As a result, the community adopted several heuristics for RNA-seq analysis, most notably an arbitrary expression threshold of 0.3 - 1 FPKM for downstream analysis. However, advances in RNA-seq library preparation, sequencing technology, and informatic analysis have addressed many of the systemic sources of uncertainty and undermined the assumptions that drove the adoption of these heuristics. We provide an updated view of the accuracy and efficiency of RNA-seq experiments, using genomic data from large-scale studies like the ENCODE project to provide orthogonal information against which to validate our conclusions.

RESULTS

We show that a human cell's transcriptome can be divided into active genes carrying out the work of the cell and other genes that are likely the by-products of biological or experimental noise. We use ENCODE data on chromatin state to show that ultralow-expression genes are predominantly associated with repressed chromatin; we provide a novel normalization metric, zFPKM, that identifies the threshold between active and background gene expression; and we show that this threshold is robust to experimental and analytical variations.
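The abstract does not spell out how zFPKM is computed, so the following is only a minimal sketch under stated assumptions: log2(FPKM) values are centred on the main peak of their distribution and scaled by a spread estimated from the right-hand side of that peak, on the idea that the left-hand side is contaminated by background. The function names, the histogram-based peak finder, the bin width, and the cutoff of -3 are all illustrative choices, not details taken from the text above.

```python
import math
from collections import Counter


def zfpkm_sketch(fpkm_by_gene, bin_width=0.25):
    """Return ({gene: z}, mu, sigma) for genes with FPKM > 0.

    Assumed procedure (not the authors' code):
      1. take log2 of every positive FPKM value;
      2. estimate the main peak (mu) as the centre of the most
         populated histogram bin;
      3. estimate sigma from values to the right of mu only,
         treating that half as one side of a symmetric Gaussian;
      4. z = (log2(FPKM) - mu) / sigma.
    """
    logs = {g: math.log2(v) for g, v in fpkm_by_gene.items() if v > 0}
    if not logs:
        raise ValueError("no genes with FPKM > 0")

    # Crude mode estimate; assumes the active-gene peak is the tallest one.
    bins = Counter(math.floor(x / bin_width) for x in logs.values())
    mode_bin = max(bins, key=bins.get)
    mu = (mode_bin + 0.5) * bin_width

    # Root-mean-square deviation of the right half about mu.
    right = [x - mu for x in logs.values() if x > mu]
    if not right:
        raise ValueError("no values above the estimated peak")
    sigma = math.sqrt(sum(d * d for d in right) / len(right))

    z = {g: (x - mu) / sigma for g, x in logs.items()}
    return z, mu, sigma


def active_genes(z_by_gene, cutoff=-3.0):
    """Genes at or above a z cutoff (the default is a hypothetical choice)."""
    return {g for g, z in z_by_gene.items() if z >= cutoff}
```

Calling `zfpkm_sketch` on a dictionary of per-gene FPKM values and passing the resulting scores to `active_genes` yields a candidate set of active genes. A kernel density estimate would likely locate the peak more smoothly than a fixed-width histogram; the histogram is used here only to keep the sketch dependency-free.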

CONCLUSIONS

The zFPKM normalization method accurately separates the biologically relevant genes in a cell, which are associated with active promoters, from the ultralow-expression noisy genes that have repressed promoters. A read depth of twenty to thirty million mapped reads allows high-confidence quantitation of genes expressed at this threshold, providing important guidance for the design of RNA-seq studies of gene expression. Moreover, we offer an example for using extensive ENCODE chromatin state information to validate RNA-seq analysis pipelines.
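To make the read-depth guidance concrete, the sketch below uses the standard definition of FPKM (fragments per kilobase of transcript per million mapped fragments) to estimate how many fragments a gene would receive at a given depth. The function names and the example gene length are hypothetical, and this is not presented as the way the authors arrived at the twenty to thirty million figure.

```python
def fpkm(fragments, length_bp, total_mapped):
    """FPKM = fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


def expected_fragments(fpkm_value, length_bp, total_mapped):
    """Invert the FPKM definition: fragments implied by a given FPKM and depth."""
    return fpkm_value * (length_bp / 1e3) * (total_mapped / 1e6)


# Hypothetical example: a 2 kb transcript expressed at 1 FPKM in a
# library with 25 million mapped fragments.
n = expected_fragments(1.0, 2000, 25_000_000)
print(n)  # 50.0
```

Under these assumptions a gene near the conventional 1 FPKM threshold is supported by a few dozen fragments at this depth, and halving the depth halves that count.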

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/038d/3870982/b52078449eaf/1471-2164-14-778-1.jpg
