Motzfeldt Jensen Mette, Brix Danielsen Mathias, Riis Johannes, Assifuah Kristjansen Karoline, Andersen Stig, Okubo Yoshiro, Jørgensen Martin Grønbech
Department of Geriatric Medicine, Aalborg University Hospital, Aalborg, Denmark.
Department of Clinical Medicine, Aalborg University, Aalborg, Denmark.
PLoS One. 2025 Jan 7;20(1):e0313401. doi: 10.1371/journal.pone.0313401. eCollection 2025.
Systematic reviews bring clarity to large bodies of evidence and support the transfer of knowledge from clinical trials into guidelines. However, they are time-consuming. Artificial intelligence (AI) tools such as ChatGPT-4o may streamline data extraction, but their efficacy requires validation.
This study aims to (1) evaluate the validity of ChatGPT-4o for data extraction compared to human reviewers, and (2) test the reproducibility of ChatGPT-4o's data extraction.
We conducted a comparative study using papers from an ongoing systematic review on exercise to reduce fall risk. Data extracted by ChatGPT-4o were compared with a reference standard: data extracted by two independent human reviewers. Validity was assessed by categorizing the extracted data into five categories, ranging from completely correct to false data. Reproducibility was evaluated by comparing data extracted in two separate sessions using different ChatGPT-4o accounts.
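The abstract does not state whether extraction was performed through the ChatGPT interface or scripted against an API, nor does it give the prompt or the list of extracted items. Purely as an illustration, a scripted equivalent of this kind of structured extraction could look like the sketch below; the prompt wording, the field list, the function name, and the use of the OpenAI Python SDK with the "gpt-4o" model are all assumptions, not the authors' method.

```python
# Illustrative sketch only: scripted extraction of predefined data points from
# one trial report, loosely mirroring the structured extraction task described
# in the abstract. Prompt text, field list, and API usage are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXTRACTION_PROMPT = """You are assisting with data extraction for a systematic
review on exercise interventions to reduce fall risk. From the article text
below, report each of the following items, or 'not reported' if absent:
study design, sample size, participant age, intervention type,
intervention duration, comparator, fall-related outcomes.

Article text:
{article_text}
"""

def extract_data_points(article_text: str) -> str:
    """Send one article to the model and return its itemized extraction."""
    response = client.chat.completions.create(
        model="gpt-4o",  # API counterpart assumed for 'ChatGPT-4o'
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(article_text=article_text)}],
        temperature=0,  # favor reproducibility across repeated runs
    )
    return response.choices[0].message.content
```

Running the same prompt in two independent sessions, as the study did with separate ChatGPT-4o accounts, is what allows session-to-session agreement to be measured.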
ChatGPT-4o extracted a total of 484 data points across 11 papers. The AI's data extraction was 92.4% accurate (95% CI: 89.5% to 94.5%) and produced false data in 5.2% of cases (95% CI: 3.4% to 7.4%). The reproducibility between the two sessions was high, with an overall agreement of 94.1%. Reproducibility decreased when information was not reported in the papers, with an agreement of 77.2%.
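For orientation, the reported proportions can be approximately reproduced from raw counts. The counts below are back-calculated from the abstract (92.4% of 484 ≈ 447 correct; 5.2% ≈ 25 false), and the Wilson score interval is an assumption, since the abstract does not name the confidence-interval method; exact bounds may differ slightly from those reported.

```python
# Minimal sketch: accuracy with a 95% CI and simple percent agreement.
# Counts are back-calculated from the abstract; CI method is assumed.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - margin) / denom, (centre + margin) / denom

n_points = 484
n_correct = 447          # ~92.4% of extracted data points
low, high = wilson_ci(n_correct, n_points)
print(f"accuracy {n_correct / n_points:.1%} (95% CI {low:.1%} to {high:.1%})")

def percent_agreement(session_a: list[str], session_b: list[str]) -> float:
    """Share of data points extracted identically in two sessions."""
    matches = sum(a == b for a, b in zip(session_a, session_b))
    return matches / len(session_a)
```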
The validity and reproducibility of ChatGPT-4o were high for data extraction in systematic reviews. ChatGPT-4o qualified as a second reviewer for systematic reviews and showed potential for summarizing data as the technology advances.