Salbas Ali, Buyuktoka Rasit Eren
Department of Radiology, Izmir Katip Celebi University, Ataturk Training and Research Hospital, Izmir 35150, Turkey.
Department of Radiology, Foca State Hospital, Izmir 35680, Turkey.
Diagnostics (Basel). 2025 Jul 30;15(15):1919. doi: 10.3390/diagnostics15151919.
Background: Multimodal large language models (LLMs) are increasingly used in radiology. However, their ability to recognize fundamental imaging features, including modality, anatomical region, imaging plane, contrast-enhancement status, and particularly specific magnetic resonance imaging (MRI) sequences, remains underexplored. This study aims to evaluate and compare the performance of three advanced multimodal LLMs (ChatGPT-4o, Claude 4 Opus, and Gemini 2.5 Pro) in classifying brain MRI sequences. Methods: A total of 130 brain MRI images from adult patients without pathological findings were used, representing 13 standard MRI series. Models were tested using zero-shot prompts for identifying modality, anatomical region, imaging plane, contrast-enhancement status, and MRI sequence. Accuracy was calculated, and differences among models were analyzed using Cochran's Q test and the McNemar test with Bonferroni correction. Results: ChatGPT-4o and Gemini 2.5 Pro achieved 100% accuracy in identifying the imaging plane and 98.46% in identifying contrast-enhancement status. MRI sequence classification accuracy was 97.7% for ChatGPT-4o, 93.1% for Gemini 2.5 Pro, and 73.1% for Claude 4 Opus (p < 0.001). The most frequent misclassifications involved fluid-attenuated inversion recovery (FLAIR) sequences, often misclassified as T1-weighted or diffusion-weighted sequences. Claude 4 Opus showed lower accuracy in susceptibility-weighted imaging (SWI) and apparent diffusion coefficient (ADC) sequences. Gemini 2.5 Pro exhibited occasional hallucinations, including irrelevant clinical details such as "hypoglycemia" and "Susac syndrome." Conclusions: Multimodal LLMs demonstrate high accuracy in basic MRI recognition tasks but vary significantly in specific sequence classification tasks. Hallucinations emphasize caution in clinical use, underlining the need for validation, transparency, and expert oversight.
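The statistical comparison described in the Methods (Cochran's Q test across the three paired model outcomes, followed by pairwise McNemar tests) can be sketched in plain Python. This is a minimal illustrative implementation of the two test statistics, not the authors' analysis code; the example data below are hypothetical, and in practice p-values would be obtained from a chi-square distribution (with Bonferroni correction for the three pairwise comparisons) via a statistics library.

```python
def cochrans_q(rows):
    """Cochran's Q statistic for k related binary outcomes.
    rows: list of per-image tuples, one 0/1 (incorrect/correct) per model.
    Compare the result to a chi-square distribution with k-1 df."""
    k = len(rows[0])
    col_tot = [sum(r[j] for r in rows) for j in range(k)]  # per-model successes
    row_tot = [sum(r) for r in rows]                       # per-image successes
    n = sum(row_tot)
    num = (k - 1) * (k * sum(c * c for c in col_tot) - n * n)
    den = k * n - sum(t * t for t in row_tot)
    return num / den

def mcnemar_chi2(a, b):
    """Continuity-corrected McNemar chi-square (1 df) for two paired
    binary outcome vectors a, b (discordant pairs only)."""
    n01 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n10 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Hypothetical toy data: 10 images scored by 3 models (1 = correct).
scores = [(1, 1, 0)] * 6 + [(1, 1, 1)] * 4
q = cochrans_q(scores)  # exceeds the chi-square(2 df) 0.05 cutoff of 5.99
chi2 = mcnemar_chi2([r[0] for r in scores], [r[2] for r in scores])
```

With 3 models there are 3 pairwise McNemar comparisons, so a Bonferroni-corrected significance threshold of 0.05 / 3 would apply to each.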