Kawasaki Junna, Suzuki Tadaki, Hamada Michiaki
Faculty of Science and Engineering, Waseda University, Tokyo, Japan.
Department of Infectious Disease Pathobiology, Graduate School of Medicine, Chiba University, Chiba, Japan.
Commun Med (Lond). 2025 May 20;5(1):187. doi: 10.1038/s43856-025-00903-w.
Machine learning models have been deployed to assess the zoonotic spillover risk of viruses by predicting their potential to infect humans. However, the lack of comprehensive datasets on viral infectivity remains a major challenge, limiting the range of viruses for which predictions can be made.
In this study, we address this limitation through two key strategies: constructing expansive datasets across 26 viral families and developing the BERT-infect model, which leverages large language models pre-trained on extensive nucleotide sequences.
Here we show that our approach substantially boosts model performance. The improvement is particularly notable for segmented RNA viruses, which are associated with severe zoonoses but have been overlooked because of limited data availability. Our model also maintains high predictive performance on partial viral sequences, such as high-throughput sequencing reads or contigs from de novo sequence assemblies, indicating its applicability to mining zoonotic viruses from viral metagenomic data. Furthermore, models trained on data up to 2018 demonstrate robust predictive capability for most viruses identified post-2018. Nonetheless, high-resolution evaluation based on phylogenetic analysis reveals a general limitation of current machine learning models: the difficulty of flagging human infection risk in specific zoonotic viral lineages, including SARS-CoV-2.
Our study provides a comprehensive benchmark for viral infectivity prediction models and highlights unresolved issues in fully exploiting machine learning to prepare for future zoonotic threats.
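The abstract gives no implementation details, so the following is only a minimal sketch of two evaluation ideas it mentions: holding out viruses identified after 2018, and scoring partial sequences such as reads or contigs. The `ViralRecord` type, its field names, and the `temporal_split` and `fragment` helpers are hypothetical names introduced for illustration; they are not taken from the BERT-infect code, and the window length and stride are arbitrary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ViralRecord:
    """Hypothetical record layout; the real dataset schema is not given in the abstract."""
    accession: str
    sequence: str       # nucleotide sequence
    year: int           # year the virus was identified (assumed field)
    infects_human: bool


def temporal_split(
    records: List[ViralRecord], cutoff_year: int = 2018
) -> Tuple[List[ViralRecord], List[ViralRecord]]:
    """Train on records up to and including cutoff_year, test on later ones."""
    train = [r for r in records if r.year <= cutoff_year]
    test = [r for r in records if r.year > cutoff_year]
    return train, test


def fragment(sequence: str, length: int, stride: int) -> List[str]:
    """Cut a sequence into fixed-length windows to mimic reads or contigs.

    Sequences no longer than `length` are returned whole.
    """
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    if len(sequence) <= length:
        return [sequence]
    return [
        sequence[i:i + length]
        for i in range(0, len(sequence) - length + 1, stride)
    ]
```

With a split like this, a model fitted on the pre-cutoff records can be scored on fragments of the post-cutoff records, which is one way to probe both properties at once.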