Department of Computer Science, University of Toronto, Toronto, ON, Canada.
Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada.
Genome Biol. 2024 Jan 8;25(1):11. doi: 10.1186/s13059-023-03070-0.
Transcription factors bind DNA in specific sequence contexts. In addition to distinguishing one nucleobase from another, some transcription factors can distinguish between unmodified and modified bases. Current models of transcription factor binding tend not to take DNA modifications into account, and the few recent models that do often have limitations. This makes comprehensive and accurate profiling of transcription factor affinities difficult.
Here, we develop methods to identify transcription factor binding sites in modified DNA. Our models expand the standard A/C/G/T DNA alphabet to include cytosine modifications. We develop Cytomod to create modified genomic sequences and we also enhance the MEME Suite, adding the capacity to handle custom alphabets. We adapt the well-established position weight matrix (PWM) model of transcription factor binding affinity to this expanded DNA alphabet. Using these methods, we identify modification-sensitive transcription factor binding motifs. We confirm established binding preferences, such as the preference of ZFP57 and C/EBPβ for methylated motifs and the preference of c-Myc for unmethylated E-box motifs.
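As a rough illustration of what adapting a PWM to an expanded alphabet can look like, the sketch below builds a log-odds matrix over six symbols and scans a sequence with it. It is a minimal toy under stated assumptions, not the Cytomod or MEME Suite implementation: the symbols `m` and `h` (standing in for 5-methylcytosine and 5-hydroxymethylcytosine), the uniform background, the pseudocount, the function names, and the four example sites are all hypothetical choices made for this example.

```python
import math

# Hypothetical expanded alphabet: the four standard bases plus
# 'm' (5-methylcytosine) and 'h' (5-hydroxymethylcytosine).
ALPHABET = "ACGTmh"


def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds PWM (one dict per position) from aligned sites."""
    if background is None:
        # Assumption: uniform background over the expanded alphabet.
        background = {s: 1.0 / len(ALPHABET) for s in ALPHABET}
    width = len(sites[0])
    pwm = []
    for i in range(width):
        # Start every symbol at the pseudocount so no probability is zero.
        counts = {s: pseudocount for s in ALPHABET}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append(
            {s: math.log2((counts[s] / total) / background[s]) for s in ALPHABET}
        )
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds for a sequence as wide as the PWM."""
    return sum(column[symbol] for column, symbol in zip(pwm, seq))


def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max(
        (score(pwm, seq[i:i + width]), i) for i in range(len(seq) - width + 1)
    )


# Toy aligned sites (invented for illustration): three of the four carry
# a modified cytosine at the second position.
toy_sites = ["AmGT", "AmGT", "AmGA", "ACGT"]
pwm = build_pwm(toy_sites)

print(score(pwm, "AmGT"), score(pwm, "ACGT"))
print(best_hit(pwm, "TTAmGTCC"))
```

Because the modified base is just another column entry, the same scoring code handles modified and unmodified sequences; a modification-sensitive motif shows up as a score difference between otherwise identical sites.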
Using known binding preferences to tune model parameters, we discover novel modified motifs for a wide array of transcription factors. Finally, we validate our binding preference predictions for OCT4 using cleavage under targets and release using nuclease (CUT&RUN) experiments across conventional, methylation-, and hydroxymethylation-enriched sequences. Our approach readily extends to other DNA modifications. As more genome-wide single-base resolution modification data become available, we expect that our method will yield insights into altered transcription factor binding affinities across many different modifications.