Department of Statistics, Computer Science and Applications "Giuseppe Parenti", University of Florence, 50134, Florence, Italy.
"Nello Carrara" Institute of Applied Physics (IFAC), National Research Council (CNR), 50019, Sesto Fiorentino, Florence, Italy.
Sci Data. 2024 Jan 23;11(1):115. doi: 10.1038/s41597-023-02421-7.
Pooling publicly available MRI data from multiple sites makes it possible to assemble large groups of subjects, increase statistical power, and promote data reuse with machine learning techniques. Multicenter data must be harmonized to reduce the confounding effect of non-biological sources of variability. However, when harmonization is applied to the entire dataset before machine learning, it causes data leakage: information from outside the training set can influence model building, so performance may be overestimated. We propose 1) a measure of the efficacy of data harmonization and 2) a harmonizer transformer, i.e., an implementation of ComBat harmonization that can be encapsulated among the preprocessing steps of a machine learning pipeline, avoiding data leakage by design. We tested these tools on brain T1-weighted MRI data from 1740 healthy subjects acquired at 36 sites. After harmonization, the site effect was removed or reduced. We also demonstrated the data leakage effect when predicting individual age from MRI data, which shows that introducing the harmonizer transformer into a machine learning pipeline avoids data leakage by design.
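The leakage argument can be illustrated with a minimal sketch. This is not the authors' harmonizer transformer and it is not ComBat: it is a simplified per-site location and scale adjustment with no empirical Bayes step and no covariate preservation. The class name `SiteStandardizer`, the convention of storing the site label in column 0, and the synthetic data are all assumptions made for illustration. The point is only that the statistics are estimated in `fit` from training rows and then reused unchanged in `transform` on held-out rows.

```python
import numpy as np


class SiteStandardizer:
    """Hypothetical, simplified stand-in for a harmonizer (NOT ComBat).

    Assumed layout: column 0 of X is an integer site label,
    the remaining columns are the features to harmonize.
    """

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        sites, feats = X[:, 0].astype(int), X[:, 1:]
        # Pooled statistics of the training data define the target scale.
        self.grand_mean_ = feats.mean(axis=0)
        self.grand_std_ = feats.std(axis=0) + 1e-12
        # Per-site statistics, estimated from the rows passed to fit only.
        self.site_stats_ = {
            s: (feats[sites == s].mean(axis=0),
                feats[sites == s].std(axis=0) + 1e-12)
            for s in np.unique(sites)
        }
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        sites, feats = X[:, 0].astype(int), X[:, 1:]
        out = np.empty_like(feats)
        for s in np.unique(sites):
            # Raises KeyError for a site that was not seen during fit.
            mean, std = self.site_stats_[s]
            mask = sites == s
            out[mask] = (feats[mask] - mean) / std * self.grand_std_ + self.grand_mean_
        return out


# Synthetic data: two sites, the second with an additive offset on every feature.
rng = np.random.default_rng(0)
n = 200
site = rng.integers(0, 2, size=n)
features = rng.normal(size=(n, 3)) + 5.0 * site[:, None]
X = np.column_stack([site, features])

train, test = X[:150], X[150:]

# Leakage-free order: estimate on training rows, then apply to both splits.
harmonizer = SiteStandardizer().fit(train)
train_h = harmonizer.transform(train)
test_h = harmonizer.transform(test)

# Leaky order, shown only for contrast: statistics estimated on all rows,
# so the held-out rows influence the transformation.
leaky = SiteStandardizer().fit(X)
```

Because the class exposes `fit(X, y=None)` and `transform(X)`, the same pattern is what allows a harmonization step to sit inside a cross-validated pipeline, where it is refitted on each training fold.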