Mutti Giacomo, Ocaña-Pallarès Eduard, Gabaldón Toni
Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain.
Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona 08028, Spain.
Mol Biol Evol. 2025 Jul 1;42(7). doi: 10.1093/molbev/msaf149.
Recent developments in protein structure prediction have made structural information, previously a limited resource, usable at genome-wide scales. It has been proposed that structures may offer advantages over sequences in phylogenetic reconstruction, because they evolve more slowly and correlate directly with function. Here, we examined how recently developed methods for structure-based homology search and tree reconstruction compare with current state-of-the-art sequence-based methods in reconstructing genome-wide collections of gene phylogenies (i.e. phylomes). While structure-based methods can be useful in specific scenarios, we found that their current performance does not justify using them as a default choice in large-scale phylogenetic studies. On the one hand, the best-performing sequence-based tree reconstruction methods still outperform structure-based methods for this task. On the other hand, structure-based homology detection methods provide larger lists of candidate homologs, as previously reported. However, this comes at the expense of missing hits identified by sequence-based methods and of producing sets of candidate homologs with higher fractions of false positives. These insights help to guide the use of structural data in comparative genomics and highlight the need to continue improving structure-based approaches. Our pipeline is fully reproducible and has been implemented as a Snakemake workflow, which will facilitate continuous assessment of future improvements to structure-based tools in the AlphaFold era.