IMGT®, The International ImMunoGeneTics Information System®, Montpellier, France.
Institute of Human Genetics (IGH), Montpellier, France.
Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae552.
Accurate prediction of peptide-major histocompatibility complex (MHC) class I binding probabilities is a critical task in immunoinformatics, with broad implications for vaccine development and immunotherapies. Although recent deep neural network-based approaches have shown promise in peptide-MHC (pMHC) prediction, they have two shortcomings that limit their practicality: (i) they rely on hand-crafted pseudo-sequence extraction, and (ii) they do not generalize well to different datasets. Whereas existing methods rely on a 34-amino-acid pseudo-sequence, our findings reveal that 147 positions are involved in direct interactions between MHC and peptide. We further show that neural architectures can learn the intricacies of pMHC binding even from full sequences. To this end, we present PerceiverpMHC, which learns accurate representations from full sequences by leveraging efficient transformer-based architectures. Additionally, we propose IMGT/RobustpMHC, which harnesses unlabeled data to improve the robustness of pMHC binding predictions through a self-supervised learning strategy. We extensively evaluate RobustpMHC on eight different datasets and show an overall improvement of over 6% in binding prediction accuracy compared to state-of-the-art approaches. We compile CrystalIMGT, a crystallography-verified dataset that challenges existing approaches because its pMHC distributions differ significantly. Finally, to mitigate this distribution gap, we develop a transfer learning pipeline.
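The contrast between hand-crafted pseudo-sequence extraction and full-sequence input can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy MHC string, and the contact positions are all hypothetical, chosen only to show the idea of keeping a fixed subset of residues versus passing the whole chain to the model.

```python
# Illustrative sketch only. The positions below are HYPOTHETICAL and are not
# the 34 positions used by existing pseudo-sequence methods.
HYPOTHETICAL_CONTACT_POSITIONS = [6, 8, 23, 44, 58, 61, 62, 65]  # 0-based

# Toy stand-in for an MHC heavy-chain sequence (not a real allele).
TOY_MHC_SEQUENCE = "ACDEFGHIKLMNPQRSTVWY" * 4


def extract_pseudo_sequence(mhc_sequence, positions):
    """Hand-crafted approach: keep only residues at preselected positions."""
    if max(positions) >= len(mhc_sequence):
        raise ValueError("sequence is shorter than the largest position")
    return "".join(mhc_sequence[i] for i in positions)


def build_model_input(peptide, mhc_sequence, positions=None):
    """Return the (peptide, MHC representation) pair fed to a predictor.

    With positions=None the full MHC sequence is kept, so no residue is
    discarded before the model sees it.
    """
    if positions is None:
        return peptide, mhc_sequence
    return peptide, extract_pseudo_sequence(mhc_sequence, positions)


toy_peptide = "AAAAAAAAA"  # toy 9-mer, not a real epitope
pseudo_input = build_model_input(
    toy_peptide, TOY_MHC_SEQUENCE, HYPOTHETICAL_CONTACT_POSITIONS
)
full_input = build_model_input(toy_peptide, TOY_MHC_SEQUENCE)
print(len(pseudo_input[1]), len(full_input[1]))
```

In this sketch the pseudo-sequence path discards every residue outside the preselected list, which is the information loss that a full-sequence model avoids at the cost of a longer input.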