Kolobkov Dmitry, Mishra Sharma Satyarth, Medvedev Aleksandr, Lebedev Mikhail, Kosaretskiy Egor, Vakhitov Ruslan
GENXT, Hinxton, United Kingdom.
Laboratory of Ecological Genetics, Vavilov Institute of General Genetics, Moscow, Russia.
Front Big Data. 2024 Feb 29;7:1266031. doi: 10.3389/fdata.2024.1266031. eCollection 2024.
Combining training data from multiple sources increases sample size and reduces confounding, leading to more accurate and less biased machine learning models. In healthcare, however, direct pooling of data is often not allowed by data custodians, who are accountable for minimizing the exposure of sensitive information. Federated learning offers a promising solution to this problem: the model is trained in a decentralized manner, which reduces the risk of data leakage. Although federated learning is increasingly used on clinical data, its efficacy on individual-level genomic data has not been studied. This study lays the groundwork for the adoption of federated learning for genomic data by investigating its applicability in two scenarios: phenotype prediction on the UK Biobank data and ancestry prediction on the 1000 Genomes Project data. We show that federated models trained on data split into independent nodes achieve performance close to that of centralized models, even in the presence of significant inter-node heterogeneity. Additionally, we investigate how federated model accuracy is affected by communication frequency and suggest approaches to reduce computational complexity or communication costs.
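To make the decentralized training idea in the abstract concrete, the following is a minimal sketch of federated averaging (FedAvg), a common baseline in which each node trains locally and only model weights are sent to a coordinator. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not name the aggregation algorithm, and the linear model, the synthetic data, and all function names (`local_update`, `federated_average`, `train_federated`, `make_node`) are hypothetical. The `local_steps` parameter stands in for communication frequency: more local steps per round means fewer exchanges with the coordinator.

```python
import random


def local_update(weights, data, lr, local_steps):
    """Run `local_steps` full-batch gradient steps of least squares on one node."""
    w = list(weights)
    n = len(data)
    for _ in range(local_steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w


def federated_average(node_weights, node_sizes):
    """Average node weight vectors, weighting each node by its sample count."""
    total = sum(node_sizes)
    dim = len(node_weights[0])
    return [
        sum(w[j] * n for w, n in zip(node_weights, node_sizes)) / total
        for j in range(dim)
    ]


def train_federated(nodes, dim, rounds, local_steps, lr=0.05):
    """Coordinator loop: raw samples stay on the nodes, only weights are exchanged."""
    w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(w, data, lr, local_steps) for data in nodes]
        w = federated_average(updates, [len(data) for data in nodes])
    return w


def make_node(rng, n, true_w, x_shift):
    """Synthetic node (assumption): the feature mean differs per node to mimic heterogeneity."""
    data = []
    for _ in range(n):
        x = [1.0, rng.gauss(x_shift, 1.0)]  # intercept term plus one feature
        y = sum(wi * xi for wi, xi in zip(true_w, x)) + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


rng = random.Random(0)
true_w = [0.5, -1.5]
nodes = [make_node(rng, 200, true_w, shift) for shift in (-1.0, 0.0, 2.0)]
w_fed = train_federated(nodes, dim=2, rounds=50, local_steps=5)
print("federated estimate:", w_fed)
```

Changing `rounds` and `local_steps` while keeping their product fixed is one simple way to explore the trade-off between communication cost and accuracy on this toy problem. Real genomic features, models, and privacy safeguards are far more involved than this sketch suggests.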