Department of Computer and Information Science, Fordham University, Lincoln Center, New York, NY 10023, USA.
Department of Public Health, Xi'an Medical University, Xi'an 710021, China.
J Biomed Inform. 2018 Sep;85:80-92. doi: 10.1016/j.jbi.2018.07.016. Epub 2018 Jul 21.
With the surge of next-generation high-throughput technologies, RNA-seq data is playing an increasingly important role in disease diagnosis, in which normalization is assumed to be an essential procedure for producing comparable samples. Recent studies have proposed different normalization methods to remove various technical biases in RNA sequencing. However, no previous studies have evaluated the impact of normalization on RNA-seq disease diagnosis. In this study, we investigate this problem by analyzing structured big data: RNA-seq data acquired from the TCGA portal, chosen for its popularity in RNA-seq disease diagnosis. We propose a novel normalization effect test algorithm, a diagnostic index (d-index), and data entropy to analyze and evaluate the impact of normalization on RNA-seq disease diagnosis using state-of-the-art machine learning models. Furthermore, we present an original visualization analysis to compare the performance of normalized data with that of raw data. We found that normalized data generally yields diagnostic performance equivalent to or even lower than that of the corresponding raw data. Moreover, some normalization approaches (e.g., RPKM) even have negative effects on disease diagnosis. Raw data, on the other hand, seems to have the potential to decipher pathological status better than, or at least comparably to, normalized data. Our visualization analysis also shows that some normalization methods even introduce 'outliers', which unavoidably decrease sample detectability in diagnosis. More importantly, our data entropy analysis shows that normalized data usually has entropy values equivalent to or lower than those of raw data, and data with high entropy values tends to achieve better diagnosis than data with low entropy values. In addition, we found that high-dimensional imbalance (HDI) data is unaffected by any normalization procedure in diagnosis and defeats almost all machine learning models, which recognize only the majority types whether the data is raw or normalized.
Our results suggest that normalized data may not offer statistically significant advantages over its raw form in disease diagnosis. This further implies that normalization, or at least some normalization procedures, may not be indispensable in RNA-seq disease diagnosis. Instead, raw data may perform better because it captures more of the original transcriptome patterns under different pathological conditions.
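To make the quantities named in the abstract concrete, the following is a minimal sketch of RPKM normalization and of one possible entropy measure. The RPKM formula is the standard one (reads per kilobase of transcript per million mapped reads). The abstract does not define its data entropy or d-index, so the histogram-based Shannon entropy here is only an assumed stand-in, and the function names `rpkm` and `shannon_entropy` and the `bins` parameter are hypothetical.

```python
import numpy as np


def rpkm(counts, gene_lengths_bp):
    """Standard RPKM for a samples x genes matrix of raw read counts.

    Assumes every sample has at least one mapped read and every gene
    length is positive (no zero-division handling in this sketch).
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths_bp, dtype=float)
    # Total mapped reads per sample, kept as a column for broadcasting.
    total_reads = counts.sum(axis=1, keepdims=True)
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million).
    return counts * 1e9 / (total_reads * lengths)


def shannon_entropy(values, bins=10):
    """Shannon entropy (in bits) of the histogram of all values.

    ASSUMPTION: this is a generic entropy estimate, not the data
    entropy definition used in the paper.
    """
    values = np.asarray(values, dtype=float).ravel()
    hist, _ = np.histogram(values, bins=bins)
    p = hist[hist > 0] / hist.sum()  # drop empty bins, convert to probabilities
    return float(-(p * np.log2(p)).sum())


# Toy data: 2 samples x 3 genes (illustrative values only).
raw = np.array([[10, 20, 70], [5, 5, 90]])
gene_lengths = np.array([1000, 2000, 1000])
normalized = rpkm(raw, gene_lengths)
raw_entropy = shannon_entropy(raw)
normalized_entropy = shannon_entropy(normalized)
```

Comparing `raw_entropy` with `normalized_entropy` on real expression matrices is one way to explore the kind of raw-versus-normalized comparison the abstract describes, under the assumptions stated above.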