Lakkis Justin, Schroeder Amelia, Su Kenong, Lee Michelle Y Y, Bashore Alexander C, Reilly Muredach P, Li Mingyao
Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
Nat Mach Intell. 2022 Nov;4(11):940-952. doi: 10.1038/s42256-022-00545-w. Epub 2022 Oct 27.
CITE-seq, a single-cell multi-omics technology that measures RNA and protein expression simultaneously in single cells, has been widely applied in biomedical research, especially in immune-related disorders and other diseases such as influenza and COVID-19. Despite the proliferation of CITE-seq, generating such data remains costly. Integrating multiple CITE-seq and single-cell RNA-seq (scRNA-seq) datasets is therefore important, because it allows as much data as possible to be used to uncover cell population heterogeneity. Although data integration can increase information content, it raises computational challenges. First, combining multiple datasets is prone to batch effects that need to be addressed. Second, it is difficult to combine multiple CITE-seq datasets because the protein panels in different datasets may only partially overlap. To overcome these challenges, we present sciPENN, a multi-use deep learning approach that supports CITE-seq and scRNA-seq data integration, protein expression prediction for scRNA-seq, protein expression imputation for CITE-seq, quantification of prediction and imputation uncertainty, and cell type label transfer from CITE-seq to scRNA-seq. Comprehensive evaluations spanning multiple datasets demonstrate that sciPENN outperforms other current state-of-the-art methods.
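To make the partial-overlap problem concrete, the following is a minimal sketch of one common way to combine datasets whose protein panels differ: align every cell to the union of all panels, record which proteins were actually measured, and score predictions only on the measured entries. It is an illustration under stated assumptions, not sciPENN's implementation; the abstract does not describe how the method handles unmeasured proteins. The function names `align_to_union` and `masked_mse`, the input layout, and the example protein names and values are all hypothetical.

```python
from math import nan


def align_to_union(datasets):
    """Align cells from several datasets to the union of their protein panels.

    `datasets` is a list of (protein_names, cells) pairs, where `cells` is a
    list of per-cell value lists ordered like `protein_names`.
    (Hypothetical input layout chosen for this sketch.)
    """
    # Union of all protein names, sorted for a stable column order.
    panel = sorted({p for names, _ in datasets for p in names})
    col = {p: j for j, p in enumerate(panel)}

    values, mask = [], []
    for names, cells in datasets:
        for cell in cells:
            row = [nan] * len(panel)        # NaN marks "not measured"
            seen = [False] * len(panel)     # True where a value was measured
            for p, v in zip(names, cell):
                row[col[p]] = v
                seen[col[p]] = True
            values.append(row)
            mask.append(seen)
    return panel, values, mask


def masked_mse(pred, target, mask):
    """Mean squared error over measured entries only."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:  # skip proteins absent from this cell's panel
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


# Two toy datasets whose panels share only CD4 (made-up values).
datasets = [
    (["CD3", "CD4"], [[1.0, 2.0]]),
    (["CD4", "CD8"], [[3.0, 4.0]]),
]
panel, values, mask = align_to_union(datasets)
zeros = [[0.0] * len(panel) for _ in values]
loss = masked_mse(zeros, values, mask)
```

Masking keeps unmeasured proteins from contributing to the error, so each dataset informs only the proteins it actually measured while the shared proteins link the datasets together.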