Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA.
Bioinformatics. 2010 Jan 15;26(2):242-9. doi: 10.1093/bioinformatics/btp624. Epub 2009 Nov 11.
Genome-wide association studies (GWAS) are used to discover genes underlying complex, heritable disorders for which less powerful study designs have failed in the past. The number of GWAS has skyrocketed recently, with findings reported in top journals and the mainstream media. Microarrays are the genotype calling technology of choice in GWAS because they permit exploration of more than a million single nucleotide polymorphisms (SNPs) simultaneously. The starting point for the statistical analyses used by GWAS to determine association between loci and disease is making genotype calls (AA, AB or BB). However, the raw data, microarray probe intensities, are heavily processed before arriving at these calls. Various sophisticated statistical procedures have been proposed for transforming raw data into genotype calls. We find that variability in microarray output quality across different SNPs, different arrays and different sample batches has a substantial influence on the accuracy of genotype calls made by existing algorithms. Failure to account for these sources of variability can adversely affect the quality of findings reported by GWAS.
We developed a method based on an enhanced version of the multi-level model used by CRLMM version 1. Two key differences are that we now account for variability across batches and improve the call-specific assessment of each genotype call. The new model permits the development of quality metrics for SNPs, samples and batches of samples. Using three independent datasets, we demonstrate that CRLMM version 2 outperforms both CRLMM version 1 and Birdseed, the algorithm provided by Affymetrix. The main advantage of the new approach is that it enables the identification of low-quality SNPs, samples and batches.
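To illustrate why accounting for batch variability can change a genotype call, the following is a minimal toy sketch, not the CRLMM model itself. It assumes a single SNP summarized by one log-ratio of allele A to allele B intensity, three Gaussian genotype clusters with a common standard deviation, and equal priors. The cluster centers, the standard deviation, and all function names are hypothetical values chosen for illustration only. The confidence returned is simply the posterior probability of the chosen genotype under these assumptions.

```python
import math

GENOTYPES = ("AA", "AB", "BB")

# Hypothetical cluster centers (AA, AB, BB) for one SNP, per batch.
# In this toy example, batch2 is shifted upward relative to batch1.
BATCH_CENTERS = {
    "batch1": (2.0, 0.0, -2.0),
    "batch2": (2.6, 0.6, -1.4),
}


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def call_genotype(log_ratio, centers, sd=0.3):
    """Return (genotype, confidence) using equal priors on AA, AB, BB."""
    weights = [normal_pdf(log_ratio, m, sd) for m in centers]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    best = max(range(len(GENOTYPES)), key=lambda i: posteriors[i])
    return GENOTYPES[best], posteriors[best]


def call_with_batch(log_ratio, batch):
    """Call the genotype using the centers estimated for that batch."""
    return call_genotype(log_ratio, BATCH_CENTERS[batch])


def call_ignoring_batch(log_ratio):
    """Call the genotype using centers averaged over all batches."""
    n = len(BATCH_CENTERS)
    pooled = tuple(
        sum(centers[i] for centers in BATCH_CENTERS.values()) / n
        for i in range(len(GENOTYPES))
    )
    return call_genotype(log_ratio, pooled)


if __name__ == "__main__":
    x = 1.4  # hypothetical log-ratio observed on an array from batch2
    print("batch-aware :", call_with_batch(x, "batch2"))
    print("batch-naive :", call_ignoring_batch(x))
```

In this toy setting, the same observation is assigned to different genotype clusters depending on whether the batch shift is modeled. A low posterior for the chosen genotype would flag a call as unreliable, which is the general idea behind a call-specific quality score.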
Software implementing the method described in this article is available as free and open source code in the crlmm R/Bioconductor package.
Supplementary data are available at Bioinformatics online.