Swiss Institute of Bioinformatics, Quartier Sorge Batiment Genopode, 1015 Lausanne, Switzerland and Department of Computer Science, ETH Zürich, Universitätstrasse 6, 8092 Zürich, Switzerland.
Nucleic Acids Res. 2013 Sep;41(17):e162. doi: 10.1093/nar/gkt628. Epub 2013 Jul 22.
Tandem repeats (TRs) are often present in proteins with crucial functions, including proteins responsible for resistance and pathogenicity and proteins associated with infectious or neurodegenerative diseases. This motivates numerous studies of TRs and their evolution, which require accurate multiple sequence alignment. TR units may be lost or inserted at any position of a TR region by replication slippage or recombination, yet current methods assume fixed unit boundaries and are nevertheless of high complexity. We present a new global graph-based alignment method that does not restrict TR unit indels to unit boundaries. TR indels are modeled separately and penalized using the phylogeny-aware alignment algorithm. This improves the accuracy of reconstructed alignments, disentangling TRs and measuring indel events and rates in a biologically meaningful way. Our method detects not only duplication events but also all changes in TR regions owing to recombination, strand slippage and other events that insert or delete TR units. We evaluate our method by simulation incorporating TR evolution, either by sampling TRs from a profile hidden Markov model or by mimicking strand slippage with duplications. The new method is illustrated on a family of type III effectors, a pathogenicity determinant in the agriculturally important bacterium Ralstonia solanacearum. We show that TR indel rate variation contributes to the diversification of this protein family.
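The abstract's point that TR unit indels need not respect unit boundaries can be illustrated with a minimal sketch. This is not the authors' simulator or alignment method. The function `slippage_event`, its parameters, and the toy sequence are hypothetical names introduced here, and the uniform choice of position is an assumption made for simplicity.

```python
import random


def slippage_event(seq, region_start, region_end, unit_len, rng, p_dup=0.5):
    """Apply one slippage-like event inside seq[region_start:region_end].

    A window of unit_len residues starting at an arbitrary offset (not
    necessarily a unit boundary) is either duplicated in tandem or deleted.
    Returns the new sequence and the new end index of the TR region.
    Hypothetical illustration only; positions are drawn uniformly.
    """
    if region_end - region_start < unit_len:
        raise ValueError("TR region is shorter than one unit")
    # Any offset is allowed, so the window may straddle two adjacent units.
    pos = rng.randrange(region_start, region_end - unit_len + 1)
    window = seq[pos:pos + unit_len]
    if rng.random() < p_dup:
        # Duplication: insert a copy of the window directly after itself.
        new_seq = seq[:pos + unit_len] + window + seq[pos + unit_len:]
        return new_seq, region_end + unit_len
    # Deletion: remove the window.
    return seq[:pos] + seq[pos + unit_len:], region_end - unit_len


# Toy sequence: two flanking residues on each side of three perfect units.
toy = "MK" + "ABCD" * 3 + "LV"
rng = random.Random(0)
duplicated, dup_end = slippage_event(toy, 2, 14, 4, rng, p_dup=1.0)
deleted, del_end = slippage_event(toy, 2, 14, 4, rng, p_dup=0.0)
print(duplicated, dup_end)
print(deleted, del_end)
```

In this toy case the units are identical, so a duplication or deletion at any offset gives the same sequence as one placed on a unit boundary. The position of the event cannot be recovered from the sequence alone, which is one way to see why fixing indels to unit boundaries is a restrictive assumption once units have diverged.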