Caliskan Aylin, Caliskan Deniz, Rasbach Lauritz, Yu Weimeng, Dandekar Thomas, Breitenbach Tim
Department of Bioinformatics, Biocenter, University of Würzburg, Am Hubland, 97074 Würzburg, Germany.
Comput Struct Biotechnol J. 2023 Jun 5;21:3293-3314. doi: 10.1016/j.csbj.2023.06.002. eCollection 2023.
Machine learning techniques are well suited to analyzing expression data from single cells. They affect all areas of the field, from cell annotation and clustering to signature identification. The presented framework evaluates how well a selected gene set separates defined phenotypes or cell groups. This overcomes a current limitation, namely the difficulty of objectively and correctly identifying a small gene set with high information content for separating phenotypes; corresponding code scripts are provided. The small but meaningful subset of the original genes (the feature space) makes the differences between phenotypes, including those found by machine learning, easier for humans to interpret, and may even help turn correlations between genes and phenotypes into a causal explanation. For feature selection, principal feature analysis (PFA) is used, which reduces redundant information while selecting genes that carry the information needed to separate the phenotypes. In this context, the framework provides explainability for unsupervised learning, as it reveals cell-type-specific signatures. In addition to a Seurat preprocessing tool and the PFA script, the pipeline can optionally use mutual information to balance accuracy against the size of the gene set. A validation part is also provided to evaluate the information content of the selected genes with respect to separating the phenotypes; binary classification and multiclass classification of 3 or 4 groups are studied. Results from different single-cell data sets are presented. In each, only about ten out of more than 30000 genes are identified as carrying the relevant information. The code is provided in a GitHub repository at https://github.com/AC-PHD/Seurat_PFA_pipeline.
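To make the role of mutual information more concrete, the following minimal sketch ranks genes by the mutual information between a discretized expression value and the phenotype label, then keeps only genes above a threshold. It is an illustration under stated assumptions, not the PFA algorithm and not the code from the repository: the function names, the mean-based binarization, the threshold `min_mi`, and the toy data are all hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * log2(count * n / (px[x] * py[y]))
    return mi


def binarize(values):
    """Assumed discretization: 1 if the value is above the mean, else 0."""
    mean = sum(values) / len(values)
    return [1 if v > mean else 0 for v in values]


def rank_genes(expression, labels):
    """Return (gene, MI) pairs sorted from most to least informative."""
    scores = {
        gene: mutual_information(binarize(values), labels)
        for gene, values in expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def select_genes(ranked, min_mi):
    """Keep genes whose MI with the labels reaches the (hypothetical) threshold.

    Raising min_mi shrinks the gene set; lowering it keeps more genes.
    """
    return [gene for gene, mi in ranked if mi >= min_mi]


# Toy data (invented for illustration): six cells, two phenotypes.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "gene_1": [0.1, 0.2, 0.0, 5.0, 4.8, 5.2],  # separates A from B
    "gene_2": [1.0, 3.0, 1.1, 2.9, 1.0, 3.1],  # weakly related to the labels
    "gene_3": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # constant, carries no information
}

ranked = rank_genes(expression, labels)
selected = select_genes(ranked, min_mi=0.5)
```

In this sketch the threshold is the single knob that trades gene-set size against retained information; a real analysis would also need a more careful discretization of expression values and a validation step such as the classification described in the abstract.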