College of Intelligent Systems Science and Engineering, Harbin Engineering University, Nantong Street, Harbin, China.
College of Science, Heilongjiang Bayi Agricultural University, Xinfeng Road, Daqing, China.
BMC Bioinformatics. 2022 Feb 21;23(1):81. doi: 10.1186/s12859-022-04609-x.
Constructing gene co-expression networks requires evaluating the correlation between different gene expression profiles. However, commonly used correlation metrics, both linear (such as Pearson's correlation) and monotonic (such as Spearman's correlation), cannot fully capture the dependencies found in real biological systems. Hence, introducing a more informative correlation metric for constructing gene co-expression networks remains a topic of interest.
In this paper, we compare distance correlation, a correlation metric that captures both linear and non-linear dependence, with three other typical metrics (Pearson's correlation, Spearman's correlation, and the maximal information coefficient) on four datasets: two microarray datasets (macrophage and liver) and two RNA-seq datasets (cervical cancer and pancreatic cancer). Among these metrics, distance correlation is distribution-free, performs better on complex relationships, and is more robust to outliers. Furthermore, we apply distance correlation to Weighted Gene Co-expression Network Analysis (WGCNA) to construct a gene co-expression network analysis method, which we name Distance Correlation-based Weighted Gene Co-expression Network Analysis (DC-WGCNA). Compared with traditional WGCNA, DC-WGCNA improves the results of enrichment analysis and the stability of modules.
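To make the contrast between metrics concrete, the following is a minimal sketch of the sample distance correlation for two one-dimensional profiles, using the standard double-centered distance-matrix definition. It is an illustration under stated assumptions, not the paper's implementation: the function names `distance_correlation` and `_double_centered_distances` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def _double_centered_distances(v):
    """Pairwise absolute distances of a 1-D sample, double-centered."""
    v = np.asarray(v, dtype=float)
    d = np.abs(v[:, None] - v[None, :])  # n x n distance matrix
    # Subtract row and column means, then add back the grand mean.
    return (d
            - d.mean(axis=0, keepdims=True)
            - d.mean(axis=1, keepdims=True)
            + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D profiles."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = (a * b).mean()      # squared distance covariance
    dvar2_x = (a * a).mean()    # squared distance variance of x
    dvar2_y = (b * b).mean()    # squared distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:            # a constant profile carries no dependence
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


# Illustrative data (hypothetical, not from the paper): a symmetric
# quadratic relationship that Pearson's correlation cannot detect.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2
pearson = float(np.corrcoef(x, y)[0, 1])
dcor = distance_correlation(x, y)
```

On this toy example Pearson's correlation is essentially zero because the relationship is symmetric around zero, while the distance correlation is clearly positive. Note that each profile needs an n × n distance matrix, where n is the number of samples.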
Distance correlation is better than other correlation metrics at revealing complex biological relationships between gene profiles, which contributes to more meaningful modules when analyzing gene co-expression networks. However, distance correlation is computationally expensive, and its implementation requires more computer memory.
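The memory cost mentioned above can be seen in a small sketch of a distance-correlation adjacency matrix. This is a hedged illustration, not the DC-WGCNA code: it assumes that the adjacency is the pairwise distance correlation raised to a soft-thresholding power, mirroring the usual WGCNA construction; the function name `dcor_adjacency` and the default `beta=6` are assumptions made here for illustration.

```python
import numpy as np


def dcor_adjacency(expr, beta=6):
    """Adjacency matrix dcor(i, j) ** beta for a (genes x samples) array.

    Assumption: soft thresholding is applied to distance correlation in
    the same way WGCNA applies it to the absolute Pearson correlation.
    """
    expr = np.asarray(expr, dtype=float)
    # One samples x samples distance matrix per gene. This array holds
    # genes * samples**2 floats, which is where the memory goes.
    d = np.abs(expr[:, :, None] - expr[:, None, :])
    centered = (d
                - d.mean(axis=1, keepdims=True)
                - d.mean(axis=2, keepdims=True)
                + d.mean(axis=(1, 2), keepdims=True))
    g, n, _ = centered.shape
    flat = centered.reshape(g, n * n)
    dcov2 = flat @ flat.T / (n * n)             # squared distance covariances
    dvar = np.sqrt(np.clip(np.diag(dcov2), 0.0, None))
    denom = np.outer(dvar, dvar)
    # Genes with zero distance variance get zero correlation.
    dcor2 = np.divide(dcov2, denom, out=np.zeros_like(dcov2), where=denom > 0)
    dcor = np.sqrt(np.clip(dcor2, 0.0, 1.0))
    return dcor ** beta


# Hypothetical toy expression matrix: 3 genes, 50 samples.
rng = np.random.default_rng(0)
base = rng.normal(size=50)
expr = np.vstack([base, 2.0 * base + 1.0, rng.normal(size=50)])
adj = dcor_adjacency(expr)
```

For realistic numbers of genes, the per-gene distance matrices would have to be processed in blocks rather than held in memory at once; this sketch keeps them all for clarity.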