The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.
Department of Statistics, The Hebrew University of Jerusalem, Jerusalem, Israel.
BMC Cancer. 2019 Aug 7;19(1):783. doi: 10.1186/s12885-019-5994-5.
In recent years, research on cancer predisposition germline variants has emerged as a prominent field. Identifying somatic mutations relies on a reliable mapping of the patient's germline variants. In addition, the frequencies of germline variants in healthy individuals and in cancer patients are the basis for seeking candidate cancer predisposition genes. The Cancer Genome Atlas (TCGA) is one of the main sources of such data, providing a diverse collection of molecular data, including deep sequencing, for more than 30 types of cancer from more than 10,000 patients.
We hypothesized that whole-exome sequences from blood samples of cancer patients should not show systematic differences among cancer types. To test this hypothesis, we analyzed common and rare germline variants across six cancer types, covering 2241 samples from TCGA. Our analysis accounted for variables inherent in the data, including the different variant calling protocols, sequencing platforms, and ethnicity.
We report substantial batch effects in germline variants associated with cancer types. We attribute these effects to the specific sequencing centers that produced the data. Specifically, we measured 30% variability in the number of reported germline variants per sample across sequencing centers. The batch effect is also evident in nucleotide composition and variant frequencies. Importantly, it causes substantial differences in germline variant distribution patterns across numerous genes, including prominent cancer predisposition genes such as BRCA1, RET, MAX, and KRAS. For most known cancer predisposition genes, we found a distinct batch-dependent difference in germline variants.
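As a rough illustration of the kind of per-center comparison described above, the following Python sketch groups per-sample germline variant counts by sequencing center and reports the relative spread of the center means. It is not the authors' pipeline: the abstract does not define how the 30% variability was computed, so the `(max - min) / min` measure, the function names, the center labels, and the toy counts are all assumptions made for illustration only.

```python
from statistics import mean


def per_center_means(samples):
    """Average variant count per sequencing center.

    `samples` is an iterable of (center_label, variant_count) pairs;
    this input shape is a hypothetical choice for the sketch.
    """
    by_center = {}
    for center, count in samples:
        by_center.setdefault(center, []).append(count)
    return {center: mean(counts) for center, counts in by_center.items()}


def relative_spread(center_means):
    """Spread of center means as (max - min) / min.

    This is one possible variability measure, assumed here; the
    abstract does not state which definition was used.
    """
    lowest = min(center_means.values())
    highest = max(center_means.values())
    return (highest - lowest) / lowest


# Made-up counts, chosen only so the arithmetic is easy to follow.
toy_samples = [
    ("center_A", 20000),
    ("center_A", 21000),
    ("center_B", 26000),
    ("center_B", 27300),
]

means = per_center_means(toy_samples)
spread = relative_spread(means)
print(means)
print(f"relative spread across centers: {spread:.0%}")
```

With real data, one would also want to stratify by variant caller, sequencing platform, and ethnicity before comparing centers, since the study lists these as variables it accounted for.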
TCGA germline data are subject to strong batch effects, with substantial variability among TCGA sequencing centers. We argue that these batch effects are consequential for numerous TCGA pan-cancer studies. In particular, they may compromise the reliability of, and the power to detect, new cancer predisposition genes. Furthermore, the interpretation of pan-cancer analyses should be revisited in light of the source of the genomic data, after accounting for the reported batch effects.