Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA.
Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, 48109, USA.
Genome Biol. 2022 May 18;23(1):118. doi: 10.1186/s13059-022-02684-0.
Spatial transcriptomics comprises a set of new technologies that profile gene expression in tissues together with spatial localization information. With technological advances, recent spatial transcriptomics data often take the form of sparse counts with an excessive number of zero values.
We perform a comprehensive analysis of 20 spatial transcriptomics datasets collected from 11 distinct technologies to characterize the distributional properties of the expression count data and understand the statistical nature of the zero values. Across datasets, we show that a substantial fraction of genes display overdispersion and/or zero inflation that cannot be accounted for by a Poisson model, and that the genes displaying overdispersion overlap substantially with the genes displaying zero inflation. In addition, we find that either the Poisson or the negative binomial model is sufficient for modeling the majority of genes across most spatial transcriptomics technologies. We further show that major sources of overdispersion and zero inflation in spatial transcriptomics include gene expression heterogeneity across tissue locations and the spatial distribution of cell types. In particular, when we focus on a relatively homogeneous set of tissue locations or control for cell type composition, the number of detected overdispersed and/or zero-inflated genes is substantially reduced, and a simple Poisson model is often sufficient to fit the gene expression data in those settings.
Our study provides the first comprehensive evidence that excessive zeros in spatial transcriptomics are not due to zero inflation, supporting the use of count models without a zero inflation component for modeling spatial transcriptomics.
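The link between heterogeneity across tissue locations and apparent zero inflation can be illustrated with a small analytic sketch. This is not the authors' analysis pipeline. It is a hypothetical toy calculation that assumes a gene's counts follow a Poisson distribution within each cell-type region, and that pooling regions yields a Poisson mixture. The function name `poisson_mixture_summary` and the example means and weights are illustrative assumptions, not values from the study.

```python
import math


def poisson_mixture_summary(means, weights):
    """Analytic summary of a Poisson mixture (hypothetical helper).

    means   -- Poisson mean of the gene in each region (assumed values)
    weights -- fraction of tissue locations in each region (must sum to 1)
    """
    mean = sum(w * m for m, w in zip(means, weights))
    # Law of total variance: each Poisson component contributes its mean
    # as within-region variance, plus the between-region variance of means.
    var = mean + sum(w * (m - mean) ** 2 for m, w in zip(means, weights))
    # Probability of a zero count under the mixture.
    p_zero = sum(w * math.exp(-m) for m, w in zip(means, weights))
    return {
        "mean": mean,
        # Equals 1 for a single Poisson; values above 1 indicate overdispersion.
        "dispersion_index": var / mean,
        "zero_fraction": p_zero,
        # Zero fraction a single Poisson with the same overall mean would give.
        "poisson_zero_fraction": math.exp(-mean),
    }


# One homogeneous region: behaves exactly like a Poisson model.
homogeneous = poisson_mixture_summary([2.0], [1.0])

# Two regions with very different expression, pooled together.
mixed = poisson_mixture_summary([0.1, 5.0], [0.5, 0.5])

for name, summary in [("homogeneous", homogeneous), ("mixed", mixed)]:
    print(name, {k: round(v, 4) for k, v in summary.items()})
```

In this toy setting, the pooled data look overdispersed and show more zeros than a single Poisson model predicts, even though no zero-inflation component is present in either region. Restricting attention to one region removes both effects, which is consistent in spirit with the abstract's observation about homogeneous tissue locations.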