Pan-cancer analysis of systematic batch effects on somatic sequence variations.

Author information

Choi Ji-Hye, Hong Seong-Eui, Woo Hyun Goo

Affiliations

Department of Physiology, Ajou University School of Medicine, 164 Worldcup-ro, Yeongtong-gu, Suwon, South Korea.

Department of Biomedical Science, Graduate School, Ajou University, Suwon, South Korea.

Publication information

BMC Bioinformatics. 2017 Apr 11;18(1):211. doi: 10.1186/s12859-017-1627-7.

Abstract

BACKGROUND

The Cancer Genome Atlas (TCGA) is a comprehensive database that includes multi-layered cancer genome profiles. Large-scale data collection inevitably generates batch effects, introduced by differences in processing at various stages from sample collection to data generation. However, batch effects on sequence variation and its characteristics have not been studied extensively.

RESULTS

We systematically evaluated batch effects on somatic sequence variations in pan-cancer TCGA data, revealing 999 somatic variants that were batch-biased with statistical significance (P < 0.00001, Fisher's exact test, false discovery rate ≤ 0.0027). Most of the batch-biased variants were associated with specific sample plates. The batch-biased variants, which had a unique mutational spectrum with frequent indel-type mutations, preferentially occurred at sites prone to sequencing errors, e.g., in long homopolymer runs. Non-indel-type batch-biased variants were frequent at splice sites with the unique consensus motif sequence 'TTDTTTAGTT'. Furthermore, some batch-biased variants occurred in known cancer genes, potentially causing misinterpretation of mutation profiles.
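To make the significance test concrete, the following is a minimal sketch of how a single variant could be checked for enrichment in one batch (for example, one sample plate) using Fisher's exact test. It is not the authors' pipeline: the 2x2 table layout, the one-sided alternative, the function names `fisher_exact_greater` and `is_batch_biased`, and the example counts are all assumptions made for illustration. Only the threshold P < 0.00001 comes from the abstract, and the false discovery rate step is not shown.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test, P(X >= a), for the table [[a, b], [c, d]].

    a: samples in the batch that carry the variant
    b: samples in the batch that do not
    c: samples outside the batch that carry the variant
    d: samples outside the batch that do not
    """
    batch_size = a + b
    carriers = a + c
    total = a + b + c + d
    upper = min(batch_size, carriers)
    # Sum hypergeometric probabilities over tables at least as extreme
    # (as many or more carriers inside the batch) as the observed one.
    numerator = sum(
        comb(carriers, k) * comb(total - carriers, batch_size - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(total, batch_size)


def is_batch_biased(p_value: float, alpha: float = 1e-5) -> bool:
    """Apply the P < 0.00001 cut-off quoted in the abstract."""
    return p_value < alpha


# Hypothetical counts: 12 of 90 samples on one plate carry the variant,
# versus 3 of 910 samples on all other plates.
p = fisher_exact_greater(12, 78, 3, 907)
print(p, is_batch_biased(p))
```

In practice the test would be repeated for every variant and batch pair, which is why a multiple-testing correction such as the reported false discovery rate is needed on top of the raw P-values.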

CONCLUSIONS

Our strategy for identifying batch-biased variants and characterising sequence patterns might be useful in eliminating false variants and facilitating correct interpretation of sequence profiles.
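As an illustration of how the sequence patterns described in the Results might be used to flag suspect calls, the sketch below scans the reference sequence around a variant for a long homopolymer run or for the motif 'TTDTTTAGTT'. In IUPAC notation D stands for A, G or T. The function names, the idea of passing a flanking-sequence string, and the minimum run length of 6 are hypothetical choices, not values from the paper.

```python
import re

# IUPAC code D means "A, G or T" (any base except C).
MOTIF = re.compile("TT[AGT]TTTAGTT")


def longest_homopolymer(seq: str) -> int:
    """Return the length of the longest run of one repeated base in seq."""
    longest = 0
    for match in re.finditer(r"(.)\1*", seq.upper()):
        longest = max(longest, len(match.group(0)))
    return longest


def looks_suspicious(context: str, min_run: int = 6) -> bool:
    """Flag a flanking sequence that has a long homopolymer run or the motif.

    min_run is an assumed threshold chosen for illustration only.
    """
    context = context.upper()
    has_long_run = longest_homopolymer(context) >= min_run
    has_motif = MOTIF.search(context) is not None
    return has_long_run or has_motif


print(looks_suspicious("GGTTATTTAGTTCC"))  # contains TTATTTAGTT
print(looks_suspicious("ACGTACGTACGT"))    # no long run, no motif
```

A flag from a filter like this would be a reason to inspect a variant more closely, not proof that it is false.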
