Department of Cancer Biology, Wake Forest University School of Medicine, Winston Salem, NC.
Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA.
JCO Clin Cancer Inform. 2021 Jan;5:1-11. doi: 10.1200/CCI.20.00060.
Building well-performing machine learning (ML) models in health care has always been challenging because of data-sharing concerns, yet ML approaches often require larger training samples than a single institution can provide. This paper explores several federated learning implementations, applying them both in a simulated environment and in an actual deployment using electronic health record data from two academic medical centers on a Microsoft Azure Cloud Databricks platform.
Using two separate cloud tenants, ML models were created, trained, and exchanged between institutions via a GitHub repository. Federated learning processes were applied to both artificial neural network (ANN) and logistic regression (LR) models on horizontal data sets that varied in size and availability. Incremental and cyclic federated learning models were tested in both simulated and real environments.
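As a rough illustration of the cyclic process described above, the sketch below passes a single neural network between two locally held data sets, retraining it at each stop. The model class, hyperparameters, and the train_cyclic helper are illustrative assumptions rather than the authors' actual implementation, and the GitHub exchange step is represented only by a comment.

```python
# Illustrative sketch of cyclic federated learning between two institutions
# (assumed names and settings; not the authors' code). Model weights, never
# raw data, move between sites; in the paper this exchange happens through
# a shared GitHub repository.
from sklearn.neural_network import MLPClassifier

def train_cyclic(local_datasets, n_cycles=5):
    """Pass one ANN around the participating institutions for several cycles."""
    model = MLPClassifier(hidden_layer_sizes=(16,), warm_start=True,
                          max_iter=50, random_state=0)
    for _ in range(n_cycles):
        for X, y in local_datasets:      # each (X, y) never leaves its institution
            model.fit(X, y)              # warm_start=True continues from current weights
            # here the updated weights would be committed and pulled via the shared repo
    return model
```

Under the same assumptions, the incremental variant would correspond to a single pass (n_cycles=1), with each institution training the model further before handing it on.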
The cyclically trained ANN showed a 3% increase in performance, a significant improvement across most attempts (P < .05). Single-weight neural network models showed improvement in some cases, whereas LR models showed little improvement after the federated learning processes. The specific process that improved performance differed with the ML model and with how federated learning was implemented. Moreover, we confirmed that the order of the institutions during training influenced the overall performance gain.
Unlike previous studies, our work demonstrates the implementation and effectiveness of federated learning processes beyond simulation. In addition, we identified several federated learning models that achieved statistically significant performance improvements. More work is needed to achieve effective federated learning in biomedicine while preserving the security and privacy of the data.