Ha Anh D, Aylward Frank O
Department of Biological Sciences, Virginia Tech, Blacksburg, VA, 24061, USA.
Center for Emerging, Zoonotic, and Arthropod-Borne Infectious Disease, Virginia Tech, Blacksburg, VA, 24061, USA.
Npj Viruses. 2024 Mar 8;2(1):9. doi: 10.1038/s44298-024-00021-9.
Viruses of the phylum Nucleocytoviricota, often referred to as "giant viruses," are prevalent in diverse environments around the globe and play significant roles in shaping eukaryotic diversity and activities in global ecosystems. Given the extensive phylogenetic diversity within this viral group and the highly complex composition of their genomes, taxonomic classification of giant viruses, particularly of incomplete metagenome-assembled genomes (MAGs), can present a considerable challenge. Here we developed TIGTOG (Taxonomic Information of Giant viruses using Trademark Orthologous Groups), a machine learning-based approach that predicts the taxonomic classification of novel giant virus MAGs from profiles of protein family content. We applied a random forest algorithm to a training set of 1531 quality-checked, phylogenetically diverse Nucleocytoviricota genomes using pre-selected sets of giant virus orthologous groups (GVOGs). The classification models predicted viral taxonomic assignments with a cross-validation accuracy of 99.6% at the order level and 97.3% at the family level. We found that no individual GVOG or genome feature significantly influenced the algorithm's performance or the models' predictions, indicating that classification predictions were based on a comprehensive genomic signature, which reduces the need for a fixed set of marker genes for taxonomic assignment. We validated the classification models on an independent test set of 823 giant virus genomes of varied genomic completeness and taxonomy, where they achieved accuracies of 98.6% and 95.9% at the order and family levels, respectively. Our results indicate that protein family profiles can be used to accurately classify large DNA viruses at different taxonomic levels and provide a fast and accurate method for the classification of giant viruses. This approach could easily be adapted to other viral groups.
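To make the notion of a "profile of protein family content" concrete, the following is a minimal sketch, not the TIGTOG implementation. It assumes that each genome has already been searched against a set of GVOGs and that the profile is a simple presence/absence vector; the abstract does not state whether presence/absence, counts, or scores are used. The function name `build_profiles`, the GVOG identifiers, and the genome names are all hypothetical.

```python
from typing import Dict, List, Set


def build_profiles(
    genome_hits: Dict[str, Set[str]], gvog_ids: List[str]
) -> Dict[str, List[int]]:
    """Return one 0/1 vector per genome, ordered like gvog_ids.

    A 1 means the genome has at least one protein assigned to that GVOG.
    Presence/absence encoding is an assumption made for illustration.
    """
    return {
        genome: [1 if gvog in hits else 0 for gvog in gvog_ids]
        for genome, hits in genome_hits.items()
    }


# Hypothetical GVOG identifiers and genome names, for illustration only.
gvog_ids = ["GVOG_A", "GVOG_B", "GVOG_C", "GVOG_D"]
genome_hits = {
    "mag_1": {"GVOG_A", "GVOG_C"},
    "mag_2": {"GVOG_B", "GVOG_C", "GVOG_D"},
    "mag_3": set(),  # an incomplete MAG may have few or no hits
}

profiles = build_profiles(genome_hits, gvog_ids)
for genome, vector in profiles.items():
    print(genome, vector)
```

A matrix built this way, with one row per genome and one column per GVOG, is the kind of input a random forest classifier could be trained on, with known order or family labels as the target.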