Motzfeldt Jensen Mette, Brix Danielsen Mathias, Riis Johannes, Assifuah Kristjansen Karoline, Andersen Stig, Okubo Yoshiro, Jørgensen Martin Grønbech
Department of Geriatric Medicine, Aalborg University Hospital, Aalborg, Denmark.
Department of Clinical Medicine, Aalborg University, Aalborg, Denmark.
PLoS One. 2025 Jan 7;20(1):e0313401. doi: 10.1371/journal.pone.0313401. eCollection 2025.
Systematic reviews bring clarity to large bodies of evidence and support the transfer of knowledge from clinical trials into guidelines. However, they are time-consuming. Artificial intelligence (AI) tools such as ChatGPT-4o may streamline data extraction, but their efficacy requires validation.
This study aims to (1) evaluate the validity of ChatGPT-4o for data extraction compared to human reviewers, and (2) test the reproducibility of ChatGPT-4o's data extraction.
We conducted a comparative study using papers from an ongoing systematic review on exercise to reduce fall risk. Data extracted by ChatGPT-4o were compared with a reference standard: data extracted by two independent human reviewers. Validity was assessed by categorizing the extracted data into five categories, ranging from completely correct to false data. Reproducibility was evaluated by comparing data extracted in two separate sessions using different ChatGPT-4o accounts.
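The abstract does not state whether extraction was performed through the ChatGPT interface or scripted against an API, nor does it give the prompt or the list of extracted items. Purely as an illustration, a scripted equivalent of this kind of structured extraction could look like the sketch below; the prompt wording, the field list, the function name, and the use of the OpenAI Python SDK with the "gpt-4o" model are all assumptions, not the authors' method.

```python
# Illustrative sketch only: scripted extraction of predefined data points from
# one trial report, loosely mirroring the structured extraction task described
# in the abstract. Prompt text, field list, and API usage are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXTRACTION_PROMPT = """You are assisting with data extraction for a systematic
review on exercise interventions to reduce fall risk. From the article text
below, report each of the following items, or 'not reported' if absent:
study design, sample size, participant age, intervention type,
intervention duration, comparator, fall-related outcomes.

Article text:
{article_text}
"""

def extract_data_points(article_text: str) -> str:
    """Send one article to the model and return its itemized extraction."""
    response = client.chat.completions.create(
        model="gpt-4o",  # API counterpart assumed for 'ChatGPT-4o'
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(article_text=article_text)}],
        temperature=0,  # favor reproducibility across repeated runs
    )
    return response.choices[0].message.content
```

Running the same prompt in two independent sessions, as the study did with separate ChatGPT-4o accounts, is what allows session-to-session agreement to be measured.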
ChatGPT-4o extracted a total of 484 data points across 11 papers. The AI's data extraction was 92.4% accurate (95% CI: 89.5% to 94.5%) and produced false data in 5.2% of cases (95% CI: 3.4% to 7.4%). The reproducibility between the two sessions was high, with an overall agreement of 94.1%. Reproducibility decreased when information was not reported in the papers, with an agreement of 77.2%.
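For orientation, the reported proportions can be approximately reproduced from raw counts. The counts below are back-calculated from the abstract (92.4% of 484 ≈ 447 correct; 5.2% ≈ 25 false), and the Wilson score interval is an assumption, since the abstract does not name the confidence-interval method; exact bounds may differ slightly from those reported.

```python
# Minimal sketch: accuracy with a 95% CI and simple percent agreement.
# Counts are back-calculated from the abstract; CI method is assumed.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - margin) / denom, (centre + margin) / denom

n_points = 484
n_correct = 447          # ~92.4% of extracted data points
low, high = wilson_ci(n_correct, n_points)
print(f"accuracy {n_correct / n_points:.1%} (95% CI {low:.1%} to {high:.1%})")

def percent_agreement(session_a: list[str], session_b: list[str]) -> float:
    """Share of data points extracted identically in two sessions."""
    matches = sum(a == b for a, b in zip(session_a, session_b))
    return matches / len(session_a)
```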
The validity and reproducibility of ChatGPT-4o were high for data extraction in systematic reviews. ChatGPT-4o qualified as a second reviewer for systematic reviews and showed potential for summarizing data as the technology advances.