Institute for Computational Health Sciences, University of California, San Francisco, California 94158, USA.
Department of Neurosurgery, Stanford University School of Medicine, Stanford, California 94305, USA.
Sci Data. 2017 Sep 19;4:170125. doi: 10.1038/sdata.2017.125.
The Gene Expression Omnibus (GEO) contains more than two million digital samples from functional genomics experiments amassed over almost two decades. However, individual sample metadata remain poorly described by unstructured free-text attributes, which prevents large-scale reanalysis. We introduce the Search Tag Analyze Resource for GEO as a web application (http://STARGEO.org) to curate better annotations of sample phenotypes uniformly across different studies, and to use these sample annotations to define robust genomic signatures of disease pathology by meta-analysis. In this paper, we target a small group of biomedical graduate students to show rapid crowd-curation of precise sample annotations across all phenotypes, and we demonstrate the biological validity of these crowd-curated annotations for breast cancer. STARGEO.org makes GEO data findable, accessible, interoperable and reusable (i.e., FAIR) to ultimately facilitate knowledge discovery. Our work demonstrates the utility of crowd-curation and interpretation of open 'big data' under FAIR principles as a first step towards realizing an ideal paradigm of precision medicine.