Department of Pathology, Boston Children's Hospital, and Department of Pathology, Harvard Medical School, Boston, Massachusetts 02115, United States.
Department of Neuropsychology and Psychopharmacology, EURON, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht 6229ER, The Netherlands.
J Proteome Res. 2022 Nov 4;21(11):2810-2814. doi: 10.1021/acs.jproteome.2c00278. Epub 2022 Oct 6.
Combining robust proteomics instrumentation with high-throughput liquid chromatography (LC) systems (e.g., timsTOF Pro and the Evosep One system, respectively) has enabled mapping the proteomes of thousands of samples. Fragpipe is one of the few computational protein identification and quantification frameworks that allow for the time-efficient analysis of such large data sets. However, it requires large amounts of computational power and data storage space, which leaves even state-of-the-art workstations underpowered for the analysis of proteomics data sets comprising thousands of LC mass spectrometry runs. To address this issue, we developed and optimized a Fragpipe-based analysis strategy for a high-performance computing environment and analyzed 3348 plasma samples (6.4 TB) that were longitudinally collected from hospitalized COVID-19 patients under the auspices of the Immunophenotyping Assessment in a COVID-19 Cohort (IMPACC) study. Our parallelization strategy reduced the total runtime by ∼90%, from 116 (theoretical) days to just 9 days in the high-performance computing environment. All code is open-source and can be deployed in any Simple Linux Utility for Resource Management (SLURM) high-performance computing environment, enabling the analysis of large-scale high-throughput proteomics studies.
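As a quick check of the figures reported above, the runtime reduction and the implied speedup can be recomputed from the two runtimes. This is only an arithmetic sketch: the function name `runtime_reduction` is hypothetical, and the per-sample size assumes decimal terabytes (1 TB = 1000 GB), which the abstract does not specify.

```python
def runtime_reduction(baseline_days: float, parallel_days: float) -> tuple[float, float]:
    """Return (fractional runtime reduction, speedup factor)."""
    if baseline_days <= 0 or parallel_days <= 0:
        raise ValueError("runtimes must be positive")
    reduction = 1.0 - parallel_days / baseline_days
    speedup = baseline_days / parallel_days
    return reduction, speedup


# Figures taken from the abstract: 116 theoretical days vs. 9 days on the cluster.
reduction, speedup = runtime_reduction(116, 9)

# Average raw data per sample, assuming 1 TB = 1000 GB (an assumption).
gb_per_sample = 6.4 * 1000 / 3348

print(f"reduction: {reduction:.1%}")              # reduction: 92.2%
print(f"speedup: {speedup:.1f}x")                 # speedup: 12.9x
print(f"data per sample: {gb_per_sample:.2f} GB") # data per sample: 1.91 GB
```

The computed reduction of about 92% is consistent with the "∼90%" stated in the abstract, and corresponds to a roughly 13-fold speedup.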