Yousef Malik, Khalifa Waleed, Acar İlhan Erkin, Allmer Jens
Community Information Systems, Zefat Academic College, Zefat, 13206, Israel.
Computer Science, The College of Sakhnin, Sakhnin, 30810, Israel.
BMC Bioinformatics. 2017 Mar 14;18(1):170. doi: 10.1186/s12859-017-1584-1.
Post-transcriptional gene dysregulation can be a hallmark of diseases like cancer, and microRNAs (miRNAs) play a key role in modulating translation efficiency. Known pre-miRNAs are listed in miRBase and have been discovered in a variety of organisms, ranging from viruses and microbes to eukaryotic organisms. The computational detection of pre-miRNAs is of great interest, and such approaches usually employ machine learning to discriminate between miRNAs and other sequences. Many features describing pre-miRNAs have been proposed, and we have previously introduced sequence motifs and k-mers as useful ones. Xeno-miRNAs have been reported in next-generation sequencing data; however, they may be contamination. To aid in deciding whether such detections are genuine, we aimed to establish a means to differentiate pre-miRNAs from different species.
To distinguish between species, we used one species' pre-miRNAs as the positive and another species' pre-miRNAs as the negative training and test data to establish machine-learned models with sequence motifs and k-mers as features. This approach yielded higher accuracy for distantly related species and lower accuracy for more closely related species.
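The pairwise setup described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the classifier or the exact feature set, so the sketch assumes normalized k-mer frequencies for k = 1 to 3 and substitutes a simple nearest-centroid rule for the machine-learned model. The function names (`kmer_features`, `train`, `predict`) and the toy sequences in the usage note are hypothetical, and sequence motifs are omitted.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # RNA alphabet; DNA-style input is converted (T -> U)


def kmer_features(seq, k_max=3):
    """Return normalized k-mer frequencies for k = 1..k_max as a flat list."""
    seq = seq.upper().replace("T", "U")
    features = []
    for k in range(1, k_max + 1):
        windows = max(len(seq) - k + 1, 1)  # avoid division by zero
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        # Fixed k-mer order so every sequence maps to the same feature layout.
        for kmer in map("".join, product(ALPHABET, repeat=k)):
            features.append(counts[kmer] / windows)
    return features


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(positive_seqs, negative_seqs, k_max=3):
    """Assumed stand-in model: one centroid per species (class)."""
    pos = centroid([kmer_features(s, k_max) for s in positive_seqs])
    neg = centroid([kmer_features(s, k_max) for s in negative_seqs])
    return pos, neg


def predict(model, seq, k_max=3):
    """Assign the sequence to the class with the nearer centroid."""
    pos, neg = model
    f = kmer_features(seq, k_max)
    if squared_distance(f, pos) <= squared_distance(f, neg):
        return "positive"
    return "negative"
```

As a usage example with made-up toy sequences (not real pre-miRNAs), `train(["AAAAAAACAAAA", "AAAAGAAAAAAA"], ["GGGGCGGGGGGG", "GGGGGGUGGGGG"])` returns a model for which `predict(model, "AAAAAAAAAA")` assigns the query to the positive class. In practice, the positive and negative sets would be the pre-miRNAs of two species, and accuracy would be estimated on held-out sequences.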
We were able to differentiate among species with increasing success as evolutionary distance increased. This conclusion is consistent with previous reports of fast evolutionary changes in miRNAs, since fairly good discrimination was possible even between relatively closely related species.