Brown Margaret, Ferrari Alessandro, Dodd Anne, Shi Fang, Kolachala Vasantha L, Kugathasan Subra, Wolfinger Russell D, Gibson Greg
bioRxiv. 2025 Jul 25:2025.07.25.666784. doi: 10.1101/2025.07.25.666784.
Single-cell multi-omic investigation opens up new opportunities to understand mechanisms of gene regulation. Existing methods for inferring transcript abundance from chromatin accessibility fail to prioritize the most relevant peaks and tend to assume positive associations between ATAC peaks and RNA counts. We hypothesize that gene regulation can be modeled as a function of combined positive and negative interactions among peaks and that causal regulatory variants are enriched in the vicinity of the most critical peaks.
A machine learning pipeline leveraging single-nucleus multiomic transcriptome and chromatin accessibility data is developed to model gene expression as a function of ATAC peak intensity. Multiome data were available for 18 immune cell types from 29 donors, 19 with Crohn's disease. The pipeline aggregates results from three machine learning approaches (random forest regression, XGBoost, and LightGBM) as well as linear regression to identify which ATAC peaks contribute to explaining variation among donors and cell types in pseudobulk gene expression. The coefficient of determination with cross-validation was used to identify robust models, which typically explain between 5% and 40% of the variance in transcript abundance, utilizing on average 47% of the ATAC peaks, representing a significant gain in predictive accuracy. The most important peaks are enriched in GWAS variants for inflammatory bowel disease and the autoimmune disease systemic lupus erythematosus, but not for rheumatoid arthritis.
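The idea of scoring a model by its cross-validated coefficient of determination can be illustrated with a minimal sketch. It is not the snATAC-Express implementation: it covers only the linear regression component, uses NumPy alone, and runs on synthetic data in which one hypothetical peak raises expression and another lowers it. The function names, fold count, and simulated effect sizes are all assumptions made for illustration.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]


def cross_validated_r2(X, y, n_folds=5, seed=0):
    """R^2 computed from out-of-fold predictions of a k-fold split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    preds = np.empty(len(y))
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        intercept, coefs = fit_ols(X[train_idx], y[train_idx])
        preds[test_idx] = intercept + X[test_idx] @ coefs
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for pseudobulk data (assumed, not from the study):
# rows are donor-by-cell-type samples, columns are ATAC peak intensities.
rng = np.random.default_rng(42)
n_samples, n_peaks = 200, 6
peaks = rng.normal(size=(n_samples, n_peaks))

# Peak 0 acts positively and peak 1 negatively; the other peaks are noise.
expression = (
    2.0 * peaks[:, 0]
    - 1.5 * peaks[:, 1]
    + rng.normal(scale=1.0, size=n_samples)
)

r2 = cross_validated_r2(peaks, expression)
intercept, coefs = fit_ols(peaks, expression)

print(f"cross-validated R^2: {r2:.2f}")
print("signed coefficients:", np.round(coefs, 2))
```

Because the coefficients are signed, a fit of this kind can recover both positive and negative peak-expression associations. The tree-based learners named in the abstract would instead supply feature importances, which the pipeline aggregates alongside the linear model.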
Atlanta Plots visualize the proportion of ATAC peaks contributing to a predictive model of gene expression as well as the proportion of variance explained by the model. Software implementing our pipeline, "snATAC-Express", is freely available on GitHub.