Yan Wu, Tan Li, Meng-Shan Li, Sheng Sheng, Jun Wang, Fu-An Wu
School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China.
School of Mathematics and Computer Science, Gannan Normal University, Ganzhou, Jiangxi, China.
PeerJ. 2023 Oct 4;11:e16192. doi: 10.7717/peerj.16192. eCollection 2023.
Biological sequence data mining is a hot spot in bioinformatics. A biological sequence can be regarded as a set of characters, and time series resemble biological sequences in both representation and mechanism. In this article, biological sequences are therefore represented as time series to obtain biological time sequences (BTS). A hybrid ensemble learning framework (SaPt-CNN-LSTM-AR-EA) for BTS is proposed. Single-sequence and multi-sequence models are constructed, respectively, with a self-adaption pre-training one-dimensional convolutional recurrent neural network and with an autoregressive fractionally integrated moving average model fused with an evolutionary algorithm. In DNA sequence experiments with six viruses, SaPt-CNN-LSTM-AR-EA achieved good overall prediction performance, with prediction accuracy and correlation reaching 1.7073 and 0.9186, respectively. SaPt-CNN-LSTM-AR-EA was compared with five other benchmark models to verify its effectiveness and stability, and it increased the average accuracy by about 30%. The proposed framework is significant for biology, biomedicine, and computer science, and can be widely applied in sequence splicing, computational biology, bioinformatics, and other fields.
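As a minimal sketch of what representing a biological sequence as a time series could look like, the Python snippet below maps each nucleotide to a number and cuts the resulting series into windows for one-step-ahead prediction. The abstract does not state the encoding the authors used, so the mapping `BASE_TO_VALUE` and the helper names `sequence_to_series` and `sliding_windows` are assumptions made purely for illustration.

```python
# Hypothetical nucleotide-to-number mapping (an assumption; the article's
# actual encoding is not given in the abstract).
BASE_TO_VALUE = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def sequence_to_series(seq):
    """Convert a DNA string into a list of floats, one value per base."""
    return [BASE_TO_VALUE[base] for base in seq.upper()]


def sliding_windows(series, width):
    """Split a series into (input window, next value) training pairs."""
    inputs = [series[i:i + width] for i in range(len(series) - width)]
    targets = series[width:]
    return inputs, targets


if __name__ == "__main__":
    series = sequence_to_series("ACGTAC")
    inputs, targets = sliding_windows(series, width=3)
    for window, target in zip(inputs, targets):
        print(window, "->", target)
```

Each window and target pair produced this way is the kind of input a sequence model such as a CNN-LSTM or an autoregressive model could be trained on. The window width here is an arbitrary choice.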