Güneş Yasin Celal, Cesur Turay, Çamur Eren
Kırıkkale Yüksek İhtisas Hospital, Clinic of Radiology, Kırıkkale, Türkiye.
Mamak State Hospital, Clinic of Radiology, Ankara, Türkiye.
Diagn Interv Radiol. 2025 May 12. doi: 10.4274/dir.2025.253101.
This study aimed to compare six large language models (LLMs) [Chat Generative Pre-trained Transformer (ChatGPT) o1-preview, ChatGPT-4o, ChatGPT-4o with canvas, Google Gemini 1.5 Pro, Claude 3.5 Sonnet, and Claude 3 Opus] in generating radiology references, assessing their accuracy, fabrication rates, and bibliographic completeness.
In this cross-sectional observational study, 120 open-ended questions were administered across eight radiology subspecialties (neuroradiology, abdominal, musculoskeletal, thoracic, pediatric, cardiac, head and neck, and interventional radiology), with 15 questions per subspecialty. Each question prompted the LLMs to provide responses containing four references with in-text citations and complete bibliographic details (authors, title, journal, publication year/month, volume, issue, page numbers, and PubMed identifier). References were verified using Medline, Google Scholar, the Directory of Open Access Journals, and web searches. Each bibliographic element was scored for correctness, and a composite final score (FS; range 0-36) was calculated by summing the correct elements and multiplying that sum by a 5-point (0-4) verification score for content relevance. FS values were then categorized into a 5-point Likert-scale reference accuracy score (RAS: 0 = fabricated; 4 = fully accurate). Non-parametric tests (Kruskal-Wallis, Tamhane's T2, and the Wilcoxon signed-rank test with Bonferroni correction) were used for statistical comparisons.
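To make the scoring procedure concrete, the sketch below implements the FS and RAS computation as described above. It is a minimal illustration, not the authors' code: the nine bibliographic elements and the 0-4 verification score are inferred from the abstract (9 elements × 4 = 36 matches the stated FS range), while the RAS cut-points, function names, and toy data are assumptions.

```python
# Minimal sketch of the composite scoring described above (assumptions noted
# inline); uses scipy.stats.kruskal for the non-parametric comparison named
# in the abstract.
from scipy.stats import kruskal

# Nine bibliographic elements, matching the abstract's list (year/month counted
# separately so that 9 elements x maximum verification score 4 = FS maximum 36).
ELEMENTS = ["authors", "title", "journal", "year", "month",
            "volume", "issue", "pages", "pmid"]

def final_score(correct: dict, verification: int) -> int:
    """FS = (number of correct bibliographic elements) x (0-4 verification score)."""
    assert 0 <= verification <= 4
    return sum(bool(correct.get(e)) for e in ELEMENTS) * verification

def reference_accuracy_score(fs: int) -> int:
    """Map FS (0-36) onto the 5-point RAS (0 = fabricated, 4 = fully accurate).
    The cut-points here are illustrative assumptions; the abstract does not state them."""
    if fs == 0:
        return 0          # a fabricated reference gets verification score 0, so FS = 0
    return min(4, 1 + fs // 9)  # assumed even binning of the 1-36 range

# Toy RAS data for illustration only (not the study's results): compare
# distributions across models with a Kruskal-Wallis test, as in the abstract.
ras_by_model = {
    "claude_3_5_sonnet": [4, 4, 3, 4, 4],
    "chatgpt_4o":        [2, 1, 3, 0, 2],
    "gemini_1_5_pro":    [0, 1, 0, 2, 0],
}
stat, p = kruskal(*ras_by_model.values())
print(f"Kruskal-Wallis H = {stat:.2f}, P = {p:.4f}")
```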
Claude 3.5 Sonnet demonstrated the highest reference accuracy, with 80.8% fully accurate references (RAS 4) and a fabrication rate of 3.1%, significantly outperforming all other models (P < 0.001). Claude 3 Opus ranked second, achieving 59.6% fully accurate references and a fabrication rate of 18.3% (P < 0.001). ChatGPT-based models (ChatGPT-4o, ChatGPT-4o with canvas, and ChatGPT o1-preview) exhibited moderate accuracy, with fabrication rates ranging from 27.7% to 52.9% and fewer than 8% fully accurate references. Google Gemini 1.5 Pro had the lowest performance, achieving only 2.7% fully accurate references and the highest fabrication rate (60.6%; P < 0.001). Reference accuracy also varied by subspecialty, with neuroradiology and cardiac radiology outperforming pediatric and head and neck radiology.
Claude 3.5 Sonnet significantly outperformed all other models in generating verifiable radiology references, while Claude 3 Opus showed moderate performance. In contrast, the ChatGPT models and Google Gemini 1.5 Pro delivered substantially lower accuracy with higher rates of fabricated references, highlighting current limitations in automated academic citation generation.
The high accuracy of Claude 3.5 Sonnet could support radiology literature reviews, research, and education by supplying dependable references. The poor performance of the other models, with high fabrication rates, risks introducing misinformation into clinical and academic settings and highlights the need for further refinement to ensure safe and effective use.