Marti Jacques, Piquemal David, Manchon Laurent, Commes Thérèse
Institut de Génétique Humaine, UPR CNRS 1142, 141 rue de la Cardonille, 34396 Montpellier.
J Soc Biol. 2002;196(4):303-7.
The availability of whole-genome sequences is changing our understanding of cell biology. Functional genomics refers to the comprehensive analysis, at the protein level (proteome) and at the mRNA level (transcriptome), of all events associated with the expression of whole sets of genes. New methods have been developed for transcriptome analysis. Serial Analysis of Gene Expression (SAGE) is based on the large-scale sequential analysis of short cDNA sequence tags. Each tag is derived from a defined position within a transcript. Its size (14 bp) is sufficient to identify the corresponding gene, and the number of times each tag is observed provides an accurate measurement of its expression level. Since tag populations can be extensively amplified without altering their relative proportions, SAGE may be performed with minute amounts of biological extract. Handling the mass of data generated by SAGE requires computer analysis. Software is required to automatically detect and count tags in sequence files. Criteria for assessing the quality of experimental data can be included at this stage. To identify the corresponding genes, a database is created that registers all virtual tags that could be observed, based on the current state of genome knowledge. Using standard database functions, it is easy to match experimental and virtual tags, thus generating a new database that registers identified tags together with their expression levels. As an open system, SAGE is able to reveal new, as yet unknown, transcripts. Their identification will become increasingly easy as genome annotation progresses. However, their direct characterization can already be attempted, since tag information may be sufficient to design primers for extending unknown sequences. A major advantage of SAGE is that, because expression levels are measured without reference to an arbitrary standard, data are definitively acquired and cumulative.
All publicly available data can thus be stored in a single database, facilitating whole-genome analysis of differential expression between cell types, normal and diseased samples, or samples with and without drug treatment. SAGE data are readily amenable to statistical comparison, which makes it possible to determine the level of confidence of the observed variations. A major limitation of SAGE is that, because each analysis is necessarily performed on the whole set of expressed genes, it is difficult to apply to many samples, for example in kinetic studies or when comparing the effects of large numbers of drugs. To overcome this limitation, high-throughput detection of a subset of mRNAs is more rapidly performed by parallel hybridization of mRNAs on arrays of nucleic acids immobilized on solid supports. From this point of view, a SAGE platform is a powerful instrument for selecting the most informative subset of genes, assembling them to design microarrays dedicated to a specific problem, and calibrating measurements by comparison with a standard cell model for which SAGE data are available. This approach is an attractive alternative to strategies based exclusively on pangenomic arrays. A very large amount of SAGE data is already available, and the problem is now to extract its biological meaning. Knowledge of metabolic pathways is already organized, so that its integration into a SAGE platform can be undertaken. For other cell components and pathways, the problem lies in the lack of a controlled vocabulary to describe gene activities, starting from a clear definition of the concept of biological function itself. Progress in gene and cell ontology is expected to facilitate computer-based extraction of biological knowledge from existing and forthcoming SAGE data.
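The computational steps described above (detecting and counting tags in sequence files, applying quality criteria, and matching experimental tags against a table of virtual tags) can be illustrated with a minimal Python sketch. It is not the authors' software. It rests on several assumptions not stated in the abstract: that the anchoring site is CATG, that the 14 bp tag consists of this 4 bp site plus 10 bp of transcript sequence, that reads are concatemers of ditags 20 to 26 bp long, and that the virtual-tag table is a simple tag-to-gene mapping. All function names and the gene name `GENE_A` are hypothetical.

```python
from collections import Counter

ANCHOR = "CATG"                 # assumed anchoring-enzyme site; not given in the abstract
TAG_BODY = 10                   # assumed: 4 bp anchor + 10 bp = the 14 bp tag
MIN_DITAG, MAX_DITAG = 20, 26   # assumed quality window for ditag length

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_tags(read):
    """Yield the 14 bp tags found in one concatemer read.

    Each ditag (the sequence between two anchor sites) carries two tags:
    one read from its left end, one from its right end on the other strand.
    """
    # Keep only fragments flanked by an anchor site on both sides.
    for ditag in read.upper().split(ANCHOR)[1:-1]:
        # Quality criteria: plausible ditag length, unambiguous bases only.
        if not MIN_DITAG <= len(ditag) <= MAX_DITAG:
            continue
        if set(ditag) - set("ACGT"):
            continue
        yield ANCHOR + ditag[:TAG_BODY]
        yield ANCHOR + reverse_complement(ditag[-TAG_BODY:])


def count_tags(reads):
    """Count every tag observed across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(extract_tags(read))
    return counts


def identify(counts, virtual_tags):
    """Split counted tags into identified (gene, count) and unknown (count)."""
    identified, unknown = {}, {}
    for tag, n in counts.items():
        gene = virtual_tags.get(tag)
        if gene is None:
            unknown[tag] = n          # candidate new transcript
        else:
            identified[tag] = (gene, n)
    return identified, unknown


# Toy example: one read containing two ditags.
ditag_1 = "AAAAAAAAAA" + "CC" + "GGGGGGGGGG"
ditag_2 = "AAAAAAAAAA" + "TTTTTTTTTT"
reads = ["TT" + ANCHOR + ditag_1 + ANCHOR + ditag_2 + ANCHOR + "AA"]

counts = count_tags(reads)
virtual_tags = {"CATGAAAAAAAAAA": "GENE_A"}   # hypothetical virtual-tag table
identified, unknown = identify(counts, virtual_tags)
```

In this toy run, tags with no match in the virtual-tag table are kept separately rather than discarded, reflecting the point that SAGE, as an open system, can reveal transcripts that are not yet annotated.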