Babb Larry, Bult Carol, Carey Vincent J, Carroll Robert J, Hitz Benjamin C, Mungall Chris J, Rehm Heidi L, Schatz Michael C, Wagner Alex
Broad Institute of MIT and Harvard, Cambridge, MA.
The Jackson Laboratory, Bar Harbor, ME.
ArXiv. 2025 Aug 19:arXiv:2508.13498v1.
In 2024, individuals funded by NHGRI to support genomic community resources completed a Self-Assessment Tool (SAT) to evaluate their application of the FAIR (Findable, Accessible, Interoperable, and Reusable) principles and to assess their sustainability. Insights from the self-administered questionnaires and from personal interviews provided a valuable perspective on the FAIRness and sustainability of the NHGRI resources. The results highlighted several challenges and key areas that the NHGRI resource community could improve by working together on recommendations.

The next step was the formation of an Organizing Committee to identify which challenges could lead to best practices or guidelines for the community. The workshop's Organizing Committee comprised four members from the NHGRI resource community: Carol Bult, PhD, Chris Mungall, PhD, Heidi Rehm, PhD, and Michael Schatz, PhD. In December 2024, the Organizing Committee engaged with the NHGRI resource community to refine these challenges further, inviting feedback on potential focus areas for a future workshop. This collaborative approach led to two webinars that same month, which highlighted specific challenges in data curation, data processing, metadata tools, and variant identifiers within the NHGRI resources. Throughout the workshop planning process, the four Organizing Committee members worked together to develop themes, design breakout sessions, and build a detailed agenda. The agenda was intentionally structured to ensure participants could generate implementable recommendations for the NHGRI resource community.

The two-day workshop was held in Bethesda, MD, on March 3-4, 2025. The challenges received from NHGRI resources were classified into four key categories, which formed the basis of the workshop: variant identifiers, data processing, data curation, and metadata tools.
These categories are briefly described below, with greater detail on their challenges and recommendations in subsequent sections.

Metadata Tools: While metadata is vital for capturing context in genomic datasets, its usage and relevance can vary by domain, making it difficult to standardize. Various methods exist for annotating and extracting metadata, but incomplete or inconsistent annotations often result in ineffective data sharing and interoperability, further reducing data usability and reproducibility.

Data Curation: Curation of annotations for genomics data is critical for FAIRness. Scalable curation solutions are challenging because curation has multiple components, including harmonizing data sets, data cleaning, and annotation. The workshop focused on identifying which aspects of data curation could be streamlined using computational methods while considering the barriers to increased automation.

Variant Identifiers: Variant identifiers are standardized representations of genetic variants, crucial for sharing and interpreting genomic data in research and clinical work. They ensure consistent referencing and enable data aggregation. Standardizing variant identifiers is difficult due to varied formats, complex data, and distinct environments for generating and disseminating data.

Data Processing: Data processing is a necessary first step in a FAIR environment. As there are many variant workflows, streamlining this process will ensure greater accuracy, reproducibility, interoperability, and FAIRness, driving advancements in clinical research. The workshop focused on addressing these aspects, with a key focus on improvements and best practices around data processing for an NHGRI resource.

Several recommendations were made throughout the workshop's interactive sessions with the resources' participants.
While many recommendations were specific to data processing, data curation, metadata tools, or variant identifiers, they can be grouped into core recommendations addressing common challenges within the NHGRI resource community. These core recommendations highlight the key themes that emerged across sessions and are listed in the nine recommendations below.

1. Increase transparency to enable effective sharing and reproducibility (documenting, benchmarking, publishing, mapping).
2. Develop entity schema and ontology mapping tools (between models, identifiers, etc.).
3. Annotate tools using resources to increase findability and reuse (example: the EDAM ontology of bioscientific data analysis and data management).
4. Use standard nomenclature and identifiers.
5. Make workflows usable by researchers with limited programming expertise.
6. Implement APIs to improve data connectivity.
7. Present data in a human-interpretable manner while keeping it machine-readable.
8. Develop artificial intelligence/machine learning (AI/ML) methods for scaling curation processes.
9. Assess the impact of resources using an independent group that can evaluate return on investment and impact on health and scientific advancement.

An additional key collaborative outcome was the development of Appendix A, which outlines ongoing and future efforts, including additional workshops, webinars, and meetings, through events listed by the NHGRI resource community. We hope that these activities will enable further advances in the implementation of FAIR standards and continue to foster collaboration and exchange across NHGRI resources and the global community.