Comparison and evaluation of statistical error models for scRNA-seq.

Authors

Choudhary Saket, Satija Rahul

Affiliations

New York Genome Center, 101 Avenue of the Americas, New York, 10013, USA.

Center for Genomics and Systems Biology, New York University, 12 Waverly Pl, New York, 10003, USA.

Publication

Genome Biol. 2022 Jan 18;23(1):27. doi: 10.1186/s13059-021-02584-9.

Abstract

BACKGROUND

Heterogeneity in single-cell RNA-seq (scRNA-seq) data is driven by multiple sources, including biological variation in cellular state as well as technical variation introduced during experimental processing. Deconvolving these effects is a key challenge for preprocessing workflows. Recent work has demonstrated the importance and utility of count models for scRNA-seq analysis, but there is a lack of consensus on which statistical distributions and parameter settings are appropriate.

RESULTS

Here, we analyze 59 scRNA-seq datasets that span a wide range of technologies, systems, and sequencing depths in order to evaluate the performance of different error models. We find that while a Poisson error model appears appropriate for sparse datasets, there is clear evidence of overdispersion for genes with sufficient sequencing depth in all biological systems, necessitating the use of a negative binomial model. Moreover, we find that the degree of overdispersion varies widely across datasets, systems, and gene abundances, which argues for a data-driven approach to parameter estimation.
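To make the Poisson versus negative binomial distinction concrete, the following is a minimal simulation sketch, not the authors' method. It assumes the common parameterization in which negative binomial counts have variance mu + mu^2/theta, so that the overdispersion 1/theta is zero for a Poisson model. The mean, the theta value, the number of cells, and all function names are hypothetical choices for illustration; negative binomial counts are drawn as a gamma-Poisson mixture, and the dispersion is recovered with a simple method-of-moments estimate.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (fine for small rates)."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(mu, theta, rng):
    """Draw one NB count as a gamma-Poisson mixture with mean mu and variance mu + mu**2 / theta."""
    rate = rng.gammavariate(theta, mu / theta)  # shape = theta, scale = mu / theta
    return sample_poisson(rate, rng)


def moment_dispersion(counts):
    """Method-of-moments estimate of the overdispersion 1/theta, clipped at zero."""
    mu = statistics.fmean(counts)
    var = statistics.variance(counts)  # sample variance
    return max((var - mu) / mu**2, 0.0)


rng = random.Random(0)
n_cells = 20000   # hypothetical number of cells
mu = 5.0          # hypothetical mean count for one gene
theta = 10.0      # hypothetical inverse overdispersion (1/theta = 0.1)

poisson_counts = [sample_poisson(mu, rng) for _ in range(n_cells)]
nb_counts = [sample_negative_binomial(mu, theta, rng) for _ in range(n_cells)]

alpha_poisson = moment_dispersion(poisson_counts)
alpha_nb = moment_dispersion(nb_counts)

print(f"estimated overdispersion, Poisson counts: {alpha_poisson:.3f}")
print(f"estimated overdispersion, NB counts:      {alpha_nb:.3f}")
```

In this sketch the Poisson counts should give an estimate near zero, while the negative binomial counts should give an estimate near the simulated value of 0.1. On real data the estimate would be computed per gene and would depend on sequencing depth, which is the kind of variation the abstract describes.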

CONCLUSIONS

Based on these analyses, we provide a set of recommendations for modeling variation in scRNA-seq data, particularly when using generalized linear models or likelihood-based approaches for preprocessing and downstream analysis.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4142/8764781/0218d1f757de/13059_2021_2584_Fig1_HTML.jpg
