Institute for Ophthalmic Research, University of Tübingen, Tübingen, Germany.
Bernstein Center for Computational Neuroscience, University of Tübingen, Tübingen, Germany.
Nat Commun. 2019 Nov 28;10(1):5416. doi: 10.1038/s41467-019-13056-x.
Single-cell transcriptomics yields ever-growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). t-SNE excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings; for example, the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls and develop a protocol for creating more faithful t-SNE visualisations. The protocol includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE.
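Two ingredients of the protocol, PCA initialisation and multi-scale similarity kernels, can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pca_init`, `conditional_affinities`, `multiscale_affinities`), the target standard deviation of 1e-4 for the initial layout, the default perplexity pair, and the choice to combine scales by averaging row-normalised Gaussian kernels are all hypothetical choices made for the example. The dense pairwise-distance matrix limits it to small data sets.

```python
import numpy as np


def pca_init(X, n_components=2, target_std=1e-4):
    """Initial 2D layout from the leading principal components.

    The layout is rescaled so that the first coordinate has a small
    standard deviation (target_std is an assumed value).
    """
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    Y = U[:, :n_components] * S[:n_components]
    return Y / Y[:, 0].std() * target_std


def conditional_affinities(D2, perplexity, n_steps=50):
    """Row-wise Gaussian affinities p(j|i) with a per-point bandwidth.

    For each row, the precision beta is found by bisection so that the
    entropy of p(.|i) approximately equals log(perplexity).
    """
    n = D2.shape[0]
    P = np.zeros((n, n))
    target = np.log(perplexity)
    for i in range(n):
        d = np.delete(D2[i], i)
        d = d - d.min()  # shift for numerical stability; p is unchanged
        lo, hi, beta = 0.0, np.inf, 1.0
        for _ in range(n_steps):
            w = np.exp(-beta * d)
            p = w / w.sum()
            entropy = -np.sum(p * np.log(p + 1e-12))
            if entropy > target:  # too flat: use a narrower kernel
                lo = beta
                beta = beta * 2 if np.isinf(hi) else (beta + hi) / 2
            else:                 # too peaked: use a wider kernel
                hi = beta
                beta = (lo + beta) / 2
        P[i, np.arange(n) != i] = p
    return P


def multiscale_affinities(X, perplexities=(30, 300)):
    """Symmetric joint affinities that mix several perplexities.

    Assumption: scales are combined by averaging the row-normalised
    kernels, then symmetrising and normalising to sum to one.
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    P = np.mean([conditional_affinities(D2, p) for p in perplexities], axis=0)
    return (P + P.T) / (2 * n)
```

Each perplexity must be smaller than the number of points minus one. The resulting layout and affinity matrix would then be passed to a t-SNE optimiser that accepts a custom initialisation and precomputed affinities; the optimisation itself, the learning rate, and exaggeration are not covered by this sketch.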