Institute for Cardiovascular Regeneration, Goethe University, Frankfurt am Main 60590, Germany.
German Center for Cardiovascular Regeneration, Partner Site Rhein-Main, Frankfurt am Main 60590, Germany.
Bioinformatics. 2020 Mar 1;36(6):1655-1662. doi: 10.1093/bioinformatics/btz855.
A central aim of molecular biology is to identify mechanisms of transcriptional regulation. Transcription factors (TFs), which are DNA-binding proteins, are heavily involved in these processes, so it is crucial to know where TFs interact with DNA and what their DNA-binding motifs are. For that reason, computational tools exist that link DNA-binding motifs to TFs, either without sequence information or based on TF-associated sequences, e.g. sequences identified via a chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiment.
In this paper, we present MASSIF, a novel method to improve the performance of existing tools that link motifs to TFs based on TF-associated sequences. MASSIF is based on the idea that a DNA-binding motif that is correctly linked to a TF should be assigned to a DNA-binding domain (DBD) similar to that of the mapped TF. Because DNA-binding motifs are in general not linked to DBDs, it is not possible to compare the DBD of a TF and the motif directly. Instead, we created a DBD collection, which consists of TFs with a known DBD and an associated motif. This collection enables us to evaluate how likely it is that a linked motif and a TF of interest are associated with the same DBD. We named this similarity measure the domain score and represent it as a P-value. We developed two different ways to improve the performance of existing tools that link motifs to TFs based on TF-associated sequences: (i) using meta-analysis to combine P-values from one or several of these tools with the P-value of the domain score and (ii) filtering out unlikely motifs based on the domain score.
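The two variants, combining P-values and filtering by domain score, can be illustrated with a minimal Python sketch. The abstract does not state which meta-analysis method MASSIF uses, so Fisher's method is assumed here purely for illustration. The function names, the motif labels, the P-values and the 0.05 threshold are likewise hypothetical and are not taken from the MASSIF code base.

```python
import math


def fisher_combine(p_values):
    """Combine independent P-values with Fisher's method (an assumed choice)."""
    if not p_values:
        raise ValueError("need at least one P-value")
    # Clamp to (0, 1] so that log() is always defined.
    clamped = [min(max(p, 1e-300), 1.0) for p in p_values]
    # Fisher's statistic is X = -2 * sum(ln p); we work with X / 2.
    half_stat = -sum(math.log(p) for p in clamped)
    k = len(clamped)
    # Survival function of a chi-square distribution with 2k degrees of
    # freedom has a closed form: exp(-X/2) * sum_{i<k} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def rank_combined(candidates):
    """Variant (i): rank motifs by the combined tool and domain-score P-value.

    candidates maps a motif name to (tool P-value, domain-score P-value).
    """
    combined = {m: fisher_combine([tool_p, dom_p])
                for m, (tool_p, dom_p) in candidates.items()}
    return sorted(combined.items(), key=lambda item: item[1])


def filter_by_domain_score(candidates, threshold=0.05):
    """Variant (ii): keep only motifs whose domain-score P-value passes."""
    return {m: tool_p for m, (tool_p, dom_p) in candidates.items()
            if dom_p <= threshold}


# Hypothetical input: motif_B has the better tool P-value, but its
# domain score suggests it belongs to a different DBD than the TF.
candidates = {
    "motif_A": (0.01, 0.001),
    "motif_B": (0.001, 0.9),
}
ranking = rank_combined(candidates)
kept = filter_by_domain_score(candidates)
```

In this toy input the combined ranking places motif_A ahead of motif_B, and the filter drops motif_B altogether, which is the kind of correction the domain score is meant to provide. Fisher's method assumes the combined P-values are independent; whether that holds for a given tool and the domain score is not addressed in this sketch.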
We demonstrate the functionality of MASSIF on several human ChIP-seq datasets, using either motifs from the HOCOMOCO database or de novo identified motifs as input. In addition, we show that both variants of our method significantly improve the performance of tools that link motifs to TFs based on TF-associated sequences, independent of the DBD type considered.
MASSIF is freely available online at https://github.com/SchulzLab/MASSIF.
Supplementary data are available at Bioinformatics online.