The effect of taxonomic, host-dependent features and sample bias on virus host prediction using machine learning and short sequence k-mers.

Authors

Perelygin Fedor S, Lukashev Alexander N, Aleshina Yulia A

Affiliations

Martsinovsky Institute of Medical Parasitology, Tropical and Vector Borne Diseases, First Moscow State Medical University (Sechenov University), Moscow, 119435, Russian Federation.

Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, 119234, Russian Federation.

Publication information

Sci Rep. 2025 Aug 27;15(1):31592. doi: 10.1038/s41598-025-17123-w.

Abstract

Metaviromic studies of potential reservoirs of emerging infections have led to the discovery of many novel viruses. Because metaviromes contain viruses from the target host, its food, or other sources, fast and robust approaches are needed to predict the hosts of unknown viruses from their genome data. Four machine learning algorithms (random forest, two gradient boosting machines, and a support vector machine) were used here to predict the hosts of RNA viruses that infect mammals, insects, and plants. Prediction efficiency depended largely on dataset composition. In the more challenging task of predicting hosts of unknown virus genera, a median weighted F1-score of 0.79 was achieved using a support vector machine and 4-mer frequencies, a notable improvement over baseline methods (median weighted F1-scores of 0.68 for the homology-based tBLASTx and 0.72 for ML trained on mono-, di-, and trinucleotide frequencies). More complicated features and feature combinations gave worse results. When predicting hosts from short virus sequence fragments, quality decreased, but training on fragments of the same length instead of full genomes consistently improved prediction quality. Therefore, short k-mers carry sufficient information to predict the hosts of novel RNA virus genera. This algorithm can be useful for rapid analysis of metaviromic data to highlight potential biological threats.
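
To make the feature type named in the abstract concrete, the following is a minimal sketch of how 4-mer frequencies could be computed from a nucleotide sequence, and how a genome could be cut into same-length fragments for training. It is an illustration only, not the authors' pipeline: the function names `kmer_frequencies` and `fragments`, the handling of ambiguous bases, and the `length` and `step` parameters are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return normalized k-mer frequencies as a list of 4**k floats.

    Hypothetical helper: the ordering follows itertools.product over "ACGT".
    """
    # Treat RNA (U) and DNA (T) spellings alike; this is an assumption.
    seq = sequence.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows with ambiguous bases (e.g. N) are not in `kmers` and are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def fragments(sequence, length, step):
    """Cut a sequence into fixed-length fragments (hypothetical helper)."""
    last_start = len(sequence) - length
    return [sequence[i:i + length] for i in range(0, last_start + 1, step)]


if __name__ == "__main__":
    features = kmer_frequencies("ACGUACGU")
    print(len(features))                    # one value per possible 4-mer
    print(fragments("ACGTACGTAC", 4, 3))    # same-length training fragments
```

Each resulting vector has one entry per possible 4-mer (256 in total) and could serve as a fixed-length input to a classifier such as a support vector machine. Normalizing by the total count keeps vectors comparable across sequences of different lengths, which matters when full genomes and short fragments are mixed.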
