Schymanski Emma L, Kondić Todor, Neumann Steffen, Thiessen Paul A, Zhang Jian, Bolton Evan E
Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 avenue du Swing, 4367, Belvaux, Luxembourg.
Bioinformatics and Scientific Data, Leibniz Institute of Plant Biochemistry (IPB Halle), 06120, Halle, Germany.
J Cheminform. 2021 Mar 8;13(1):19. doi: 10.1186/s13321-021-00489-0.
Compound (or chemical) databases are an invaluable resource for many scientific disciplines. Exposomics researchers need to find and identify relevant chemicals that cover the entirety of potential (chemical and other) exposures over entire lifetimes. This daunting task, with over 100 million chemicals in the largest chemical databases, coupled with broadly acknowledged knowledge gaps in these resources, leaves researchers faced with too much, yet not enough, information at the same time to perform comprehensive exposomics research. Furthermore, improvements in analytical technologies and computational mass spectrometry workflows, coupled with the rapid growth in databases and increasing demand for high-throughput "big data" services from the research community, present significant challenges for both data hosts and workflow developers. This article explores how to reduce candidate search spaces in non-target small molecule identification workflows while increasing content usability in the context of environmental and exposomics analyses, so as to profit from the increasing size and information content of large compound databases while also increasing efficiency. These methods are explored using PubChem, the NORMAN Network Suspect List Exchange and the in silico fragmentation approach MetFrag. A subset of the PubChem database relevant for exposomics, PubChemLite, is presented as a database resource that can be (and has been) integrated into current workflows for high resolution mass spectrometry. Benchmarking datasets from earlier publications are used to show how experimental knowledge and existing datasets can be used to detect and fill gaps in compound databases, progressively improving both large resources such as PubChem and topic-specific subsets such as PubChemLite.
PubChemLite is a living collection that updates as annotation content in PubChem is updated, and is exported to allow direct integration into existing workflows such as MetFrag. The source code and files necessary to recreate or adjust this collection are jointly hosted by the research parties (see data availability statement). This effort shows that enhancing the FAIRness (Findability, Accessibility, Interoperability and Reusability) of open resources can mutually enhance several resources for whole community benefit. The authors explicitly welcome additional community input on ideas for future developments.