TCGA-Reports: A machine-readable pathology report resource for benchmarking text-based AI models.

Author information

Kefeli Jenna, Tatonetti Nicholas

Affiliations

Department of Systems Biology, Columbia University, New York, NY 10032, USA.

Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA.

Publication information

Patterns (N Y). 2024 Feb 21;5(3):100933. doi: 10.1016/j.patter.2024.100933. eCollection 2024 Mar 8.

Abstract

In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances suggest the urgent need for a benchmark dataset. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing the data available for analysis. Second, recent improvements in natural language processing (NLP) techniques using artificial intelligence (AI) allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. Finally, we perform a proof-of-principle cancer-type classification across 32 tissues, achieving 0.992 average AU-ROC. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.
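
The abstract reports an average AU-ROC over 32 cancer types but does not say how the average was taken. As a minimal sketch, assuming an unweighted (macro) mean of one-vs-rest AU-ROC values, the metric could be computed as below. The function names, the three-class toy labels, and the scores are illustrative assumptions, not data or code from the paper.

```python
def binary_auroc(labels, scores):
    """AU-ROC for binary labels (1 = positive) via the rank-sum (Mann-Whitney U) identity."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    # Assign 1-based ranks; tied scores share the average of their ranks.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AU-ROC needs at least one positive and one negative example")
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u_statistic = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


def macro_ovr_auroc(true_classes, class_scores):
    """Unweighted mean of one-vs-rest AU-ROC values.

    class_scores holds one dict per report, mapping class name -> predicted score.
    Returns (mean AU-ROC, dict of per-class AU-ROC).
    """
    per_class = {}
    for cls in sorted(set(true_classes)):
        labels = [1 if t == cls else 0 for t in true_classes]
        scores = [s[cls] for s in class_scores]
        per_class[cls] = binary_auroc(labels, scores)
    return sum(per_class.values()) / len(per_class), per_class


# Toy example (assumed values): six reports, three cancer types.
true_classes = ["BRCA", "BRCA", "LUAD", "LUAD", "KIRC", "KIRC"]
class_scores = [
    {"BRCA": 0.8, "LUAD": 0.1, "KIRC": 0.1},
    {"BRCA": 0.3, "LUAD": 0.65, "KIRC": 0.05},
    {"BRCA": 0.2, "LUAD": 0.7, "KIRC": 0.1},
    {"BRCA": 0.3, "LUAD": 0.6, "KIRC": 0.1},
    {"BRCA": 0.1, "LUAD": 0.2, "KIRC": 0.7},
    {"BRCA": 0.2, "LUAD": 0.2, "KIRC": 0.6},
]
mean_auc, per_class = macro_ovr_auroc(true_classes, class_scores)
print(mean_auc, per_class)
```

A macro average weights every cancer type equally regardless of how many reports it has; a prevalence-weighted average would be a different, equally plausible reading of "average AU-ROC".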

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33e0/10935496/f16b5d756e0c/gr1.jpg
