Javed Nauman, Weingarten Thomas, Sehanobish Arijit, Roberts Adam, Dubey Avinava, Choromanski Krzysztof, Bernstein Bradley E
The Gene Regulation Observatory, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
Google, Mountain View, CA 94043, USA.
Cell Genom. 2025 Feb 12;5(2):100762. doi: 10.1016/j.xgen.2025.100762. Epub 2025 Jan 29.
Sequence-based deep learning models have emerged as powerful tools for deciphering the cis-regulatory grammar of the human genome but cannot generalize to unobserved cellular contexts. Here, we present EpiBERT, a multi-modal transformer that learns generalizable representations of genomic sequence and cell type-specific chromatin accessibility through a masked accessibility-based pre-training objective. Following pre-training, EpiBERT can be fine-tuned for gene expression prediction, achieving accuracy comparable to the sequence-only Enformer model, while also being able to generalize to unobserved cell states. The learned representations are interpretable and useful for predicting chromatin accessibility quantitative trait loci (caQTLs), regulatory motifs, and enhancer-gene links. Our work represents a step toward improving the generalization of sequence-based deep neural networks in regulatory genomics.
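The masked accessibility-based pre-training objective mentioned above can be illustrated with a toy sketch. It is not the EpiBERT implementation: the function names, the 30% mask fraction, the zero mask value, the mean-squared-error loss, and the mean-of-visible-bins "model" are all assumptions chosen for illustration. It shows only the general idea of hiding part of an accessibility track and scoring predictions on the hidden bins alone.

```python
import random


def mask_accessibility(signal, mask_fraction=0.15, mask_value=0.0, seed=0):
    """Hide a random subset of accessibility bins (hypothetical helper).

    Returns a masked copy of the track and the sorted indices that were hidden.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(signal) * mask_fraction))
    masked_idx = sorted(rng.sample(range(len(signal)), n_masked))
    masked = list(signal)
    for i in masked_idx:
        masked[i] = mask_value
    return masked, masked_idx


def masked_mse(predicted, target, masked_idx):
    """Mean squared error over the hidden bins only (assumed loss)."""
    return sum((predicted[i] - target[i]) ** 2 for i in masked_idx) / len(masked_idx)


# Toy accessibility track: made-up values, one number per genomic bin.
track = [0.1, 0.0, 2.3, 4.1, 0.2, 0.0, 1.7, 0.3, 0.0, 3.2]
masked_track, hidden = mask_accessibility(track, mask_fraction=0.3)

# Stand-in "model": predicts the mean of the visible bins at every position.
visible = [v for i, v in enumerate(track) if i not in hidden]
baseline = sum(visible) / len(visible)
prediction = [baseline] * len(track)

# Only the hidden bins contribute to the loss.
loss = masked_mse(prediction, track, hidden)
print(f"hidden bins: {hidden}, masked loss: {loss:.3f}")
```

In a real setting the stand-in predictor would be replaced by a network that also sees the DNA sequence, so that reconstructing the hidden bins requires combining sequence with the surrounding accessibility context.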