Federated horizontally partitioned principal component analysis for biomedical applications.

Author information

Hartebrodt Anne, Röttger Richard

Affiliation

Department of Mathematics and Computer Science, University of Southern Denmark, Odense 5230, Denmark.

Publication information

Bioinform Adv. 2022 Apr 26;2(1):vbac026. doi: 10.1093/bioadv/vbac026. eCollection 2022.

Abstract

MOTIVATION

Federated learning enables privacy-preserving machine learning in the medical domain because the sensitive patient data remain with the owner and only parameters are exchanged between the data holders. The federated scenario introduces specific challenges related to the decentralized nature of the data, such as batch effects and differences in study population between the sites. Here, we investigate the challenges of moving classical analysis methods to the federated domain, specifically principal component analysis (PCA), a versatile and widely used tool, often serving as an initial step in machine learning and visualization workflows. We provide implementations of different federated PCA algorithms and evaluate them regarding their accuracy for high-dimensional biological data using realistic sample distributions over multiple data sites, and their ability to preserve downstream analyses.

RESULTS

Federated subspace iteration converges to the centralized solution even for unfavorable data distributions, while approximate methods introduce error. Larger sample sizes at the study sites improve the accuracy of the approximate methods. Approximate methods may be sufficient for coarse data visualization, but are vulnerable to outliers and batch effects. Before the analysis, both the PCA algorithm and the number of eigenvectors should be chosen carefully to avoid unnecessary communication overhead.
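The contrast between an exact iterative scheme and a one-shot approximation can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (their code is in the repositories listed under AVAILABILITY AND IMPLEMENTATION). The function names, the synthetic data, the deliberately skewed site split, and the particular one-shot scheme (each site shares its top-k singular vectors scaled by their singular values) are all assumptions made for illustration only.

```python
import numpy as np


def federated_center(sites):
    """Center every site with the global mean, built from local sums and counts."""
    total = sum(X.sum(axis=0) for X in sites)
    n = sum(X.shape[0] for X in sites)
    return [X - total / n for X in sites]


def federated_subspace_iteration(sites, k, n_iter=500, tol=1e-12, seed=0):
    """Top-k principal subspace of the pooled (already centered) data."""
    d = sites[0].shape[1]
    rng = np.random.default_rng(seed)
    G, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(n_iter):
        # Each site computes X_i^T (X_i G); only this d x k matrix is shared.
        # The sum over sites equals X^T X G for the pooled matrix X.
        H = sum(X.T @ (X @ G) for X in sites)
        G_new, _ = np.linalg.qr(H)
        # Sum of squared sines of the principal angles between iterates.
        change = k - np.linalg.norm(G.T @ G_new) ** 2
        G = G_new
        if change < tol:
            break
    return G


def one_shot_approximation(sites, k):
    """Hypothetical approximate scheme: stack scaled local top-k vectors."""
    summaries = []
    for X in sites:
        _, s, Vt = np.linalg.svd(X, full_matrices=False)
        summaries.append(s[:k, None] * Vt[:k])  # k x d summary per site
    _, _, Vt = np.linalg.svd(np.vstack(summaries), full_matrices=False)
    return Vt[:k].T


def subspace_error(A, B):
    """Frobenius distance between the projectors onto span(A) and span(B)."""
    return np.linalg.norm(A @ A.T - B @ B.T)


# Synthetic data (assumption): 300 samples, 20 features, 3 dominant directions.
rng = np.random.default_rng(42)
scales = np.r_[10.0, 8.0, 6.0, np.ones(17)]
X = rng.standard_normal((300, 20)) * scales

# Unfavorable split: sort by the first feature, then use unequal site sizes.
X = X[np.argsort(X[:, 0])]
raw_sites = [X[:20], X[20:100], X[100:]]
k = 3

sites = federated_center(raw_sites)

# Centralized reference: SVD of the pooled, centered matrix.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V_central = Vt[:k].T

exact_err = subspace_error(federated_subspace_iteration(sites, k), V_central)
approx_err = subspace_error(one_shot_approximation(sites, k), V_central)
print("subspace iteration error:", exact_err)
print("one-shot approximation error:", approx_err)
```

In this sketch the iterative variant exchanges one d x k matrix per site in every round, so its communication cost grows with both the number of eigenvectors k and the number of iterations, whereas the one-shot variant sends a single k x d summary per site.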

AVAILABILITY AND IMPLEMENTATION

Simulation code and notebooks for federated PCA can be found at https://gitlab.com/roettgerlab/federatedPCA; the code for the federated app is available at https://github.com/AnneHartebrodt/fc-federated-pca.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics Advances online.
