Datta Susmita, Datta Somnath
Department of Mathematics and Statistics, Department of Biology, Georgia State University, Atlanta, 30303, USA.
Bioinformatics. 2005 May 1;21(9):1987-94. doi: 10.1093/bioinformatics/bti301. Epub 2005 Feb 2.
Statistical tests for detecting differentially expressed genes produce a large collection of p-values, one for each gene comparison. Without further adjustment, these p-values may yield a large number of false positives simply because the number of genes tested is huge, which can waste laboratory resources. To account for multiple hypotheses, the p-values are typically adjusted using a single-step or a step-down method in order to control the overall error rate (the so-called familywise error rate). In many applications, this strategy is overly conservative and flags too few genes.
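To make the conservativeness concrete, the following minimal sketch compares a step-down familywise adjustment with an FDR adjustment on a handful of p-values. It assumes Holm's procedure as the step-down method and Benjamini-Hochberg as the FDR procedure; the abstract names neither, and the function names and example p-values are hypothetical.

```python
# Illustrative sketch only: Holm and Benjamini-Hochberg are assumed stand-ins
# for the "step-down method" and "FDR control procedures" mentioned in the text.

def holm_adjust(pvalues):
    """Holm step-down adjusted p-values (controls the familywise error rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity: adjusted values never decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        # The k-th smallest p-value (k = rank + 1) is scaled by m / k.
        candidate = pvalues[idx] * m / (rank + 1)
        # Enforce monotonicity working down from the largest p-value.
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.012, 0.03, 0.2]  # hypothetical per-gene p-values
    alpha = 0.05
    holm_hits = sum(p <= alpha for p in holm_adjust(pvals))
    bh_hits = sum(p <= alpha for p in bh_adjust(pvals))
    print("flagged by Holm:", holm_hits)
    print("flagged by BH:  ", bh_hits)
```

On these five hypothetical p-values the familywise adjustment flags fewer genes than the FDR adjustment at the same nominal level, which is the kind of gap a screening procedure aims to narrow.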
In this paper, we introduce a novel empirical Bayes screening (EBS) technique that inspects a large number of p-values in an effort to detect additional positive cases. In effect, each case borrows strength from an overall picture of the alternative hypotheses computed from all the p-values, while the entire procedure is calibrated by a step-down method so that the familywise error rate under the complete null hypothesis is still controlled. We show that EBS has substantially higher sensitivity than the standard step-down approach to multiple comparison, at the cost of a modest increase in the false discovery rate (FDR). The EBS procedure also compares favorably with existing FDR control procedures for multiple testing. It is particularly useful when it is important to identify all potentially positive cases, which can then be subjected to further confirmatory testing to eliminate the false positives. We illustrate this screening procedure using a data set on human colorectal cancer, where the EBS method detected additional genes related to colon cancer that were missed by other methods. This novel empirical Bayes procedure has the following advantages over our earlier proposed empirical Bayes adjustments: (i) it offers automatic screening of the p-values that the user may obtain from a univariate (i.e., gene-by-gene) analysis package, making it extremely easy for a non-statistician to use; (ii) since it applies to the p-values, the tests do not have to be t-tests; in particular, they could be F-tests, which might arise in certain ANOVA formulations with expression data, or even nonparametric tests; (iii) the empirical Bayes adjustment uses nonparametric function estimation techniques to estimate the marginal density of the transformed p-values rather than a parametric model for the prior distribution, and is therefore robust against model misspecification.
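The idea in point (iii), estimating the marginal density of transformed p-values nonparametrically, can be sketched as follows. This is not the authors' EBS procedure, whose details the abstract does not give. The sketch assumes a probit transform, a Gaussian kernel density estimate with a normal reference bandwidth, and a null-to-marginal density ratio as the screening score, in the style of a local false discovery rate with the null proportion set to 1. All function names and the example data are hypothetical.

```python
# Illustrative sketch only; NOT the published EBS algorithm.
# Assumptions: probit transform, Gaussian kernel, rule-of-thumb bandwidth,
# and a null/marginal density ratio as the per-gene score.
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: density of z under the null


def probit_transform(pvalues, eps=1e-12):
    """Map p-values to z-scores; small p-values become large positive z."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in pvalues]
    return [_STD_NORMAL.inv_cdf(1.0 - p) for p in clipped]


def gaussian_kde(points):
    """Return a Gaussian kernel density estimate built from the given points."""
    n = len(points)
    mean = sum(points) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in points) / (n - 1))
    bandwidth = 1.06 * sd * n ** (-0.2)  # normal reference rule of thumb
    norm_const = n * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                   for xi in points) / norm_const

    return density


def null_to_marginal_ratio(pvalues):
    """Per-gene ratio f0(z) / f(z), capped at 1; small values suggest a signal."""
    z = probit_transform(pvalues)
    marginal = gaussian_kde(z)
    return [min(1.0, _STD_NORMAL.pdf(zi) / marginal(zi)) for zi in z]


if __name__ == "__main__":
    # Hypothetical data: 50 evenly spread "null-like" p-values plus 5 tiny ones.
    pvals = [(i + 0.5) / 50 for i in range(50)] + [1e-6] * 5
    scores = null_to_marginal_ratio(pvals)
    print("score for p = 0.49:", round(scores[24], 3))
    print("score for p = 1e-6:", scores[-1])
```

Because the marginal density is estimated from all genes at once, each gene's score depends on the whole collection of p-values, which is one way to read "borrowing strength". The step-down calibration described in the text is not reproduced here.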
R code for EBS is available from the authors upon request.