Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Lyngby, Denmark.
eLife. 2024 Mar 4;12:RP93934. doi: 10.7554/eLife.93934.
Predicting the interaction between Major Histocompatibility Complex (MHC) class I-presented peptides and T-cell receptors (TCR) holds significant implications for vaccine development, cancer treatment, and autoimmune disease therapies. However, limited paired-chain TCR data, skewed towards well-studied epitopes, hampers the development of pan-specific machine-learning (ML) models. Leveraging a larger peptide-TCR dataset, we explore various alterations to the ML architectures and training strategies to address data imbalance. These changes improve overall performance, particularly for peptides with scant TCR data. However, challenges persist for unseen peptides, especially those distant from training examples. We demonstrate that such ML models can be used to detect potential outliers, which, when removed from the training data, lead to improved performance. Integrating pan-specific and peptide-specific models with similarity-based predictions further improves overall performance, especially when a low false positive rate is desirable. On the IMMREP22 benchmark, this modeling framework attained state-of-the-art performance. Moreover, combining these strategies results in acceptable predictive accuracy for peptides characterized by as few as 15 positive TCRs. This observation holds great promise for rapidly expanding the peptide coverage of current models for predicting TCR specificity. The NetTCR 2.2 model incorporating these advances is available on GitHub (https://github.com/mnielLab/NetTCR-2.2) and as a web server at https://services.healthtech.dtu.dk/services/NetTCR-2.2/.
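The abstract does not state how the pan-specific, peptide-specific, and similarity-based predictions are combined. The Python sketch below shows one plausible scheme, purely as an illustration: it averages the two model scores and scales the result by a similarity score raised to a power. The function name `ensemble_score`, the parameter `alpha`, the averaging rule, and the assumption that all scores lie in [0, 1] are hypothetical and are not taken from the NetTCR 2.2 code.

```python
def ensemble_score(pan_score, peptide_score, similarity, alpha=1.0):
    """Combine two model scores and scale them by a similarity score.

    Hypothetical scheme (not the published method): the mean of the
    pan-specific and peptide-specific scores is multiplied by
    similarity ** alpha. All three scores are assumed to lie in [0, 1].
    """
    inputs = {
        "pan_score": pan_score,
        "peptide_score": peptide_score,
        "similarity": similarity,
    }
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if alpha < 0:
        raise ValueError("alpha must be non-negative")

    # Average the two neural-network style predictions.
    model_mean = (pan_score + peptide_score) / 2.0
    # Down-weight TCRs that are dissimilar to known binders of the peptide.
    return model_mean * similarity ** alpha


if __name__ == "__main__":
    # A TCR scored highly by both models but only moderately similar
    # to known positives is penalized more strongly as alpha grows.
    for a in (0.0, 1.0, 2.0):
        print(a, ensemble_score(0.8, 0.6, 0.5, alpha=a))
```

In this sketch, a larger `alpha` makes the similarity term more influential, which tends to suppress high-scoring candidates that have no close match among known positives. That behavior is in the spirit of the abstract's remark about settings where a low false positive rate is desirable, but the exact weighting used by the authors should be checked against their repository.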