Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.
These authors contributed equally.
Cell Rep Methods. 2021 Jul 26;1(3). doi: 10.1016/j.crmeth.2021.100014. Epub 2021 Jun 21.
Structure prediction for proteins lacking homologous templates in the Protein Data Bank (PDB) remains a significant unsolved problem. We developed a protocol, C-I-TASSER, to integrate interresidue contact maps from deep neural-network learning with the cutting-edge I-TASSER fragment assembly simulations. Large-scale benchmark tests showed that C-I-TASSER can fold more than twice as many non-homologous proteins as I-TASSER, which does not use contacts. When applied to a folding experiment on 8,266 unsolved Pfam families, C-I-TASSER successfully folded 4,162 domain families, including 504 folds that are not found in the PDB. Furthermore, it created correct folds for 85% of proteins in the SARS-CoV-2 genome, despite the high mutation rate of the virus and sparse sequence profiles. The results demonstrated the critical importance of coupling whole-genome and metagenome-based evolutionary information with optimal structure assembly simulations for solving the problem of non-homologous protein structure prediction.