Instituto Paulista de Estudos e Pesquisas em Oftalmologia, Vision Institute - São Paulo (SP), Brazil.
Massachusetts Institute of Technology, Institute for Medical Engineering and Science - Cambridge (MA), USA.
Rev Assoc Med Bras (1992). 2023 Sep 25;69(10):e20230848. doi: 10.1590/1806-9282.20230848. eCollection 2023.
This study aimed to evaluate the performance of ChatGPT-4.0 on the 2022 Brazilian National Examination for Medical Degree Revalidation (Revalida) and to assess its use as a tool for providing feedback on the quality of the examination.
Two independent physicians entered all examination questions into ChatGPT-4.0. After comparing the model's outputs with the examination's answer key, they classified each large language model answer as adequate, inadequate, or indeterminate. In cases of disagreement, they adjudicated the item and reached a consensus on the accuracy of the ChatGPT answer. Performance across medical themes, and between nullified and non-nullified questions, was compared using chi-square tests.
On the Revalida examination, ChatGPT-4.0 answered 71 questions (87.7%) correctly and 10 (12.3%) incorrectly. The proportion of correct answers did not differ significantly across medical themes (p=0.4886). The model's accuracy was lower on nullified questions (71.4%), but the difference between the nullified and non-nullified groups was not statistically significant (p=0.241).
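As an illustration of the statistical comparison described above, the sketch below runs a chi-square test on a plausible reconstruction of the nullified versus non-nullified contingency table, in Python with SciPy. The nullified-question counts (5 correct out of 7) are an assumption inferred from the reported 71.4% accuracy, not data published in the abstract, so the computed p-value need not match the reported 0.241.

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table. Rows: correct / incorrect answers.
# Columns: non-nullified / nullified questions.
# Nullified counts (5 of 7 correct) are ASSUMED from the reported 71.4%
# accuracy; the abstract does not give the raw breakdown.
observed = [
    [66, 5],  # correct answers
    [8, 2],   # incorrect answers
]

# chi2_contingency applies Yates' continuity correction by default for 2x2 tables
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, dof = {dof}")
```

Note that with so few nullified items, one expected cell count falls below 5, where Fisher's exact test (scipy.stats.fisher_exact) is often preferred; the published p-value may therefore reflect different counts or a different test variant.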
ChatGPT-4.0 showed satisfactory performance on the 2022 Brazilian National Examination for Medical Degree Revalidation, although the large language model performed worse on subjective questions and public healthcare themes. These results suggest that the overall quality of the Revalida examination questions is satisfactory and corroborate the decisions to nullify the annulled questions.