Wan Yat-Tsai Richie, Nielsen Morten
Department of Health Technology, Technical University of Denmark, Kgs Lyngby DK 28002, Denmark.
NAR Genom Bioinform. 2025 May 27;7(2):lqaf065. doi: 10.1093/nargab/lqaf065. eCollection 2025 Jun.
T cells play a vital role in adaptive immunity by targeting pathogen-infected or cancerous cells, but predicting their specificity remains challenging. Encoding T-cell receptor (TCR) sequences into informative feature spaces is therefore crucial for advancing specificity prediction and downstream applications. For this, we developed a variational autoencoder (VAE)-based model trained on paired TCR α-β chain data, incorporating all six complementarity-determining regions. A semi-supervised 'two-stage VAE' framework, integrating cosine triplet loss and a classifier, was found to further refine peptide-specific latent representations, outperforming sequence-based methods in specificity prediction. Clustering analyses leveraging our VAE latent space were evaluated using k-means, agglomerative clustering, and a novel graph-based method. Agglomerative clustering achieved the most biologically relevant results, balancing cluster purity and retention despite noise in TCR specificity annotations. We extended these insights to evaluate TCR repertoire data. Across datasets, VAE-based models outperformed sequence-based methods, particularly in retention metrics, with notable improvements in the SARS-CoV-2 repertoire dataset. Moreover, the cancer repertoire analysis highlighted the generalizability of our approach, where the model displayed high performance despite minimal similarity between the training and test data. Collectively, these results demonstrate the potential of VAE-based latent representations to offer a robust framework for prediction, clustering, and repertoire analysis.
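The abstract names a cosine triplet loss and evaluates clusters by purity and retention, but gives no formulas for either. The sketch below is one common way to write them and is not the authors' implementation. The function names, the margin of 0.2, the treatment of singleton clusters as unclustered, and the size-weighted purity definition are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def cosine_distance(a, b):
    """Row-wise cosine distance, 1 - cos(a, b), for arrays of shape (n, d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a_unit * b_unit, axis=1)


def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge loss pulling same-peptide latents together.

    anchor/positive are assumed to share a peptide label; negative does not.
    The margin value is a hypothetical default, not taken from the paper.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


def purity_and_retention(cluster_ids, labels):
    """Assumed definitions: clusters of size 1 count as unclustered.

    retention = fraction of TCRs that fall in a cluster of size >= 2
    purity    = fraction of retained TCRs carrying their cluster's majority label
    """
    members = {}
    for cid, lab in zip(cluster_ids, labels):
        members.setdefault(cid, []).append(lab)
    retained = [labs for labs in members.values() if len(labs) >= 2]
    n_retained = sum(len(labs) for labs in retained)
    if n_retained == 0:
        return 0.0, 0.0
    n_majority = sum(Counter(labs).most_common(1)[0][1] for labs in retained)
    return n_majority / n_retained, n_retained / len(labels)
```

In a setting like the one described, `cluster_ids` might come from agglomerative clustering of the latent vectors and `labels` from peptide annotations. Sweeping the clustering threshold would then trace out the purity versus retention trade-off mentioned in the abstract.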