School of Biological Sciences, Nanyang Technological University.
National University of Singapore.
Brief Bioinform. 2019 Jan 18;20(1):347-355. doi: 10.1093/bib/bbx128.
Mass spectrometry (MS)-based proteomics has advanced rapidly in recent years, creating challenging problems for bioinformatics. We focus on four aspects in which bioinformatics plays a crucial role and which matter for the clinical application of proteomics: peptide-spectrum matching (PSM) under the new data-independent acquisition (DIA) paradigm, resolving missing proteins (MPs), dealing with biological and technical heterogeneity in data, and statistical feature selection (SFS). DIA is a brute-force strategy that provides greater breadth and depth, but because it captures spectra indiscriminately, signal from multiple peptides is mixed and good PSMs are difficult to obtain. We consider two strategies: simplifying DIA spectra into pseudo-data-dependent acquisition spectra or, alternatively, searching each DIA spectrum by brute force against known reference libraries. The MP problem arises when proteins are never (or only inconsistently) detected by MS. When a protein is observed in at least one sample, imputation methods can be used to estimate its approximate expression level. When it is never observed at all, network/protein complex-based contextualization provides an independent prediction platform. Data heterogeneity is a difficult problem with two dimensions: technical (batch effects), which should be removed, and biological (including demography and disease subpopulations), which should be retained. Simple normalization is seldom sufficient, while batch effect-correction algorithms may introduce errors. Batch effect-resistant normalization methods are a viable alternative. Finally, SFS is vital for practical applications. Many methods exist, but none is best in all cases, and both upstream processing (e.g. normalization) and downstream processing (e.g. multiple-testing correction) confound performance. We also discuss signal detection when class effects are weak.
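The abstract names multiple-testing correction as a downstream step that can confound feature selection, but does not say which procedure is meant. As an illustration only, the sketch below assumes the Benjamini-Hochberg false discovery rate adjustment, a common choice; the function name `benjamini_hochberg` and the example p-values are hypothetical and not taken from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: BH is used here purely as an example of a
    multiple-testing correction; the source does not specify one.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone (never exceed a later one).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical per-protein p-values from a two-class comparison.
    example = [0.01, 0.04, 0.03, 0.20]
    print(benjamini_hochberg(example))
```

Proteins whose adjusted value falls below a chosen threshold would be retained as candidate features. How many survive depends on the correction applied, which is one way a downstream step can change the apparent performance of a feature-selection method.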