Pike Aiden M C, Amal Saeed, Maginnis Melissa S, Wilczek Michael P
Maine Space Grant Consortium, Augusta, ME 04330, USA.
Life Sciences, Health, and Engineering Department, The Roux Institute, Northeastern University, Portland, ME 04101, USA.
Viruses. 2024 Dec 25;17(1):12. doi: 10.3390/v17010012.
JC polyomavirus (JCPyV) establishes a persistent, asymptomatic kidney infection in most of the population. In immunocompromised individuals, however, JCPyV can reactivate and cause progressive multifocal leukoencephalopathy (PML), a fatal demyelinating disease with no approved treatment. Mutations in the hypervariable non-coding control region (NCCR) of the JCPyV genome have been linked to disease outcomes and neuropathogenesis, yet few meta-analyses document these associations. Many online sequence entries, including those in NCBI databases, lack sufficient sample information, which limits large-scale analyses of NCCR sequences. Machine learning techniques, however, can augment the data available for analysis. This study uses a previously compiled dataset of 989 JCPyV NCCR sequences from GenBank, annotated with patient PML status and viral tissue source, to train multilayer perceptrons that predict the information missing from the dataset. The PML status and tissue source models were 100% and 87.8% accurate, respectively. Of the 348 samples with an unconfirmed PML status, 259 were predicted as No PML and 89 as PML. Of the 63 sequences with an unconfirmed tissue source, 8 were predicted as urine, 13 as blood, and 42 as cerebrospinal fluid. These models can improve viral sequence identification and provide insights into viral mutations and pathogenesis.
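The label-imputation workflow described above (train on sequences with known annotations, then predict the unconfirmed ones) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state how sequences were encoded, how the perceptrons were structured, or which software was used. The one-hot encoding, the single hidden layer, the names `one_hot`, `split_by_label`, and `TinyMLP`, and the toy sequences are all hypothetical and are not the authors' data or model.

```python
import math
import random

ALPHABET = "ACGT"

def one_hot(seq, length):
    """Encode a DNA string as a flat 4*length vector, truncated or zero-padded."""
    vec = [0.0] * (4 * length)
    for i, base in enumerate(seq[:length].upper()):
        j = ALPHABET.find(base)
        if j >= 0:  # ambiguous bases such as N stay all-zero
            vec[4 * i + j] = 1.0
    return vec

def split_by_label(records):
    """Separate (sequence, label) pairs with a known label from those labelled None."""
    known = [r for r in records if r[1] is not None]
    unknown = [r for r in records if r[1] is None]
    return known, unknown

class TinyMLP:
    """Illustrative perceptron: one tanh hidden layer, sigmoid output, plain SGD."""

    def __init__(self, n_in, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        z = max(-60.0, min(60.0, z))  # clamp to keep exp() well-behaved
        return h, 1.0 / (1.0 + math.exp(-z))

    def fit(self, xs, ys, epochs=200, lr=0.1):
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                h, p = self._forward(x)
                d_out = p - y  # cross-entropy gradient at the pre-sigmoid output
                for k, hk in enumerate(h):
                    # Backpropagate through tanh before updating w2[k].
                    d_hid = d_out * self.w2[k] * (1.0 - hk * hk)
                    self.w2[k] -= lr * d_out * hk
                    for i, xi in enumerate(x):
                        if xi:
                            self.w1[k][i] -= lr * d_hid * xi
                    self.b1[k] -= lr * d_hid
                self.b2 -= lr * d_out
        return self

    def predict(self, x):
        return 1 if self._forward(x)[1] >= 0.5 else 0

# Hypothetical toy records: 1 = PML, 0 = No PML, None = unconfirmed.
records = [("ACGTAC", 1), ("ACGTAA", 1), ("TTGCAT", 0), ("TTGCAA", 0),
           ("ACGTAT", None)]
known, unknown = split_by_label(records)
model = TinyMLP(n_in=4 * 6).fit([one_hot(s, 6) for s, _ in known],
                                [y for _, y in known])
imputed = [(s, model.predict(one_hot(s, 6))) for s, _ in unknown]
```

A real analysis would more plausibly use an established library and hold out part of the labelled data to estimate accuracy before trusting any imputed label. The sketch only shows the separation between labelled training records and the unconfirmed records that receive predictions.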