Rodrigues Alessi Mateus, Gomes Heitor Augusto, Oliveira Gabriel, Lopes de Castro Matheus, Grenteski Fabiano, Miyashiro Leticia, do Valle Camila, Tozzini Tavares da Silva Leticia, Okamoto Cristina
School of Medicine, Universidade Positivo, R. Prof. Pedro Viriato Parigot de Souza, 5300, Curitiba, 81280-330, Brazil, (41) 3317-3010.
JMIR AI. 2025 May 8;4:e66552. doi: 10.2196/66552.
Artificial intelligence has advanced significantly in various fields, including medicine, where tools like ChatGPT (GPT) have demonstrated remarkable capabilities in interpreting and synthesizing complex medical data. Since its launch in 2019, GPT has evolved, with version 4.0 offering enhanced processing power, image interpretation, and more accurate responses. In medicine, GPT has been used for diagnosis, research, and education, achieving significant milestones like passing the United States Medical Licensing Examination. Recent studies show that GPT 4.0 outperforms earlier versions and even medical students on medical exams.
This study aimed to evaluate and compare the performance of GPT versions 3.5 and 4.0 on Brazilian Progress Tests (PT) from 2021 to 2023, analyzing their accuracy compared to medical students.
A cross-sectional observational study was conducted using 333 multiple-choice questions from the PT, excluding questions with images and those nullified or repeated. All questions were presented sequentially without modification to their structure. The performance of GPT versions was compared using statistical methods and medical students' scores were included for context.
There was a statistically significant difference in total performance scores across the 2021, 2022, and 2023 exams between GPT-3.5 and GPT-4.0 (P=.03); however, this significance did not survive Bonferroni correction. On average, GPT-3.5 scored 68.4%, whereas GPT-4.0 achieved 87.2%, an absolute improvement of 18.8 percentage points and a relative increase of 27.4% in accuracy. Broken down by subject, the average scores for GPT-3.5 and GPT-4.0, respectively, were as follows: surgery (73.5% vs 88.0%, P=.03), basic sciences (77.5% vs 96.2%, P=.004), internal medicine (61.5% vs 75.1%, P=.14), gynecology and obstetrics (64.5% vs 94.8%, P=.002), pediatrics (58.5% vs 80.0%, P=.02), and public health (77.8% vs 89.6%, P=.02). After Bonferroni correction, only basic sciences and gynecology and obstetrics retained statistically significant differences.
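The subject-level Bonferroni outcome can be checked directly from the P values reported above. The sketch below assumes the correction was applied as a simple division of the significance threshold by the six subject comparisons; the subject names and P values are taken from this abstract.

```python
# P values for GPT-3.5 vs GPT-4.0 by subject, as reported in the abstract.
p_values = {
    "surgery": 0.03,
    "basic sciences": 0.004,
    "internal medicine": 0.14,
    "gynecology and obstetrics": 0.002,
    "pediatrics": 0.02,
    "public health": 0.02,
}

alpha = 0.05
# Bonferroni: divide the family-wise threshold by the number of comparisons.
alpha_adj = alpha / len(p_values)  # 0.05 / 6 ≈ 0.00833

significant = [subject for subject, p in p_values.items() if p < alpha_adj]
print(significant)  # only basic sciences and gynecology and obstetrics survive
```

Only the two comparisons with P < .00833 remain significant, matching the result stated above.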
GPT-4.0 demonstrates superior accuracy to its predecessor in answering medical questions on the PT. These results are consistent with those of other studies and suggest that medicine is approaching a new revolution.