Li Yuheng, Dong Jiaqi, Liu Dongdong, Huang Yuqing, Jiang Yan, Chen Liangchao, Gong Qiming
Department of General Surgery, Luzhou People's Hospital, Luzhou, China.
Department of Gastroenterology, Deyang People's Hospital, Deyang, China.
Discov Oncol. 2025 Jul 1;16(1):1227. doi: 10.1007/s12672-025-02911-7.
We compared three language models (DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5) on their ability to address programmed cell death mechanisms in gastric cancer, aiming to establish which model most accurately reflects clinical standards and guidelines.
Fifty-five frequently asked questions and twenty guideline-oriented queries on cell death processes were collected from recognized gastroenterology and oncology resources, and each model received every question individually. Six independent specialists, each from a distinct hospital, rated each response on a scale of 1 to 10, and their scores were summed to a 60-point total. Answers with totals above 45 were classified as "good," 30 to 45 as "moderate," and below 30 as "poor." Models that delivered "poor" replies received additional prompts for self-correction, and the revised answers underwent the same review.
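For clarity, the rating rubric can be expressed as a simple classification function. The following Python sketch reproduces the thresholds described above; the function and variable names are illustrative, not taken from the study.

```python
# Minimal sketch of the study's scoring rubric; names are illustrative.

def classify_response(expert_scores: list[int]) -> tuple[int, str]:
    """Sum six expert ratings (1-10 each) and map the 60-point
    total onto the three-tier rubric described in the methods."""
    assert len(expert_scores) == 6, "the study used six independent raters"
    assert all(1 <= s <= 10 for s in expert_scores), "each rating is 1-10"
    total = sum(expert_scores)
    if total > 45:
        rating = "good"       # totals above 45
    elif total >= 30:
        rating = "moderate"   # totals of 30 to 45
    else:
        rating = "poor"       # totals below 30; triggers a re-prompt
    return total, rating

# Example: ratings of 8, 7, 9, 8, 6, 8 sum to 46, i.e. "good".
print(classify_response([8, 7, 9, 8, 6, 8]))  # (46, 'good')
```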
DeepSeek‑R1 achieved higher total scores than the other two models on almost every topic, particularly surgical protocols and multimodal therapies. Claude 3.5 ranked second, providing mostly coherent coverage but occasionally omitting recent guideline updates, while DeepSeek‑V3 struggled with intricate guideline-based material. When prompted to revise "poor" responses, DeepSeek‑R1 corrected its errors markedly and rose to a "good" rating on re-evaluation, DeepSeek‑V3 improved only marginally, and Claude 3.5 consistently lifted its "poor" answers into the moderate range.
DeepSeek‑R1 demonstrated the strongest performance for clinical content linked to programmed cell death in gastric cancer, while Claude 3.5 performed moderately well. DeepSeek‑V3 proved adequate for more basic queries but lacked sufficient detail for advanced guideline-based scenarios. These findings highlight the potential and limitations of such automated models when applied in complex oncologic contexts.