Translational Centre for Regenerative Medicine Leipzig, University of Leipzig, Semmelweisstr. 14, Leipzig 04103, Germany.
BMC Bioinformatics. 2013 Mar 2;14:75. doi: 10.1186/1471-2105-14-75.
Microarrays have become a routine tool to address diverse biological questions. As a result, several manufacturers have produced different types and generations of microarrays over time. Likewise, the diversity of raw data deposited in public databases such as NCBI GEO or EBI ArrayExpress has grown enormously. These databases now contain several hundred thousand microarray samples, grouped by species, manufacturer and chip generation. One of the original goals of these databases was to make the data available to other researchers for independent analysis and, where appropriate, integration with their own data, yet current software implementations could not provide that feature. Only data sets generated on the same chip platform can be readily combined, and even then batch effects must be dealt with. A straightforward approach to handling multiple chip types and batch effects has been missing. The software presented here was designed to solve both of these problems in a convenient and user-friendly way.
The virtualArray software package can combine raw data sets from almost any chip type based on current annotations from NCBI GEO or Bioconductor. After establishing congruent annotations for the raw data, virtualArray can directly employ one of seven implemented methods to adjust for batch effects in the data that result from differences between the chip types used. Both steps can be tuned to the preferences of the user. When the run is finished, the whole dataset is presented as a conventional Bioconductor "ExpressionSet" object, which can be used as input to other Bioconductor packages.
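The two steps described above, establishing a shared annotation and then adjusting for batch effects, can be illustrated with a minimal Python sketch. This is not the virtualArray API and not necessarily one of its seven methods. It assumes that probes are collapsed to gene symbols by taking the median, that only genes present on every chip are kept, and that batch effects are removed by simple per-gene mean centering within each batch. All function names and the toy data are hypothetical.

```python
from statistics import mean, median


def collapse_to_genes(expr, probe_to_gene):
    """Collapse probe-level rows to gene-level rows (median per sample).

    expr: {probe_id: [value per sample]}
    probe_to_gene: {probe_id: gene_symbol}; unannotated probes are dropped.
    """
    grouped = {}
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            grouped.setdefault(gene, []).append(values)
    # zip(*rows) iterates sample by sample across all probes of one gene
    return {g: [median(col) for col in zip(*rows)] for g, rows in grouped.items()}


def combine_platforms(datasets):
    """Keep genes shared by all gene-level datasets and concatenate samples.

    Returns the combined {gene: values} table and one batch label per sample.
    """
    common = set(datasets[0])
    for data in datasets[1:]:
        common &= set(data)
    batches = []
    for label, data in enumerate(datasets):
        n_samples = len(next(iter(data.values())))
        batches.extend([label] * n_samples)
    combined = {g: [v for data in datasets for v in data[g]] for g in sorted(common)}
    return combined, batches


def center_batches(combined, batches):
    """Assumed simplification: shift each batch to the overall per-gene mean."""
    adjusted = {}
    for gene, values in combined.items():
        overall = mean(values)
        batch_mean = {
            b: mean([v for v, vb in zip(values, batches) if vb == b])
            for b in set(batches)
        }
        adjusted[gene] = [v - batch_mean[b] + overall for v, b in zip(values, batches)]
    return adjusted


# Toy data: two hypothetical chip types with different probe identifiers.
chip_a = {"a1": [1.0, 2.0], "a2": [3.0, 4.0], "a3": [5.0, 5.0]}
map_a = {"a1": "TP53", "a2": "TP53", "a3": "MYC"}
chip_b = {"b1": [10.0, 12.0, 14.0], "b2": [7.0, 7.0, 7.0]}
map_b = {"b1": "TP53", "b2": "GAPDH"}

combined, batches = combine_platforms(
    [collapse_to_genes(chip_a, map_a), collapse_to_genes(chip_b, map_b)]
)
adjusted = center_batches(combined, batches)
```

In this toy case only TP53 is measured on both chips, so the combined table has a single row. Mean centering removes only additive shifts between batches, so it stands in here for the more elaborate corrections a real analysis would use.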
Using this software package, researchers can easily integrate their own microarray data with data from public repositories or other sources that are based on different microarray chip types. With the default settings, a robust and up-to-date batch effect correction technique is applied to the data.