Li Yuheng, Dong Jiaqi, Liu Dongdong, Huang Yuqing, Jiang Yan, Chen Liangchao, Gong Qiming
Department of General Surgery, Luzhou People's Hospital, Luzhou, China.
Department of Gastroenterology, Deyang People's Hospital, Deyang, China.
Discov Oncol. 2025 Jul 1;16(1):1227. doi: 10.1007/s12672-025-02911-7.
We compared three language models (DeepSeek‑V3, DeepSeek‑R1, and Claude 3.5) on their ability to address programmed cell death mechanisms in gastric cancer, aiming to establish which model most accurately reflects clinical standards and guidelines.
Fifty-five frequently asked questions and twenty guideline-oriented queries on cell death processes were collected from recognized gastroenterology and oncology resources, and each model received every question individually. Six independent specialists, each from a distinct hospital, rated each response on a scale of 1 to 10, and their scores were summed to a 60-point total. Answers with totals above 45 were classified as "good," 30 to 45 as "moderate," and below 30 as "poor." Models that delivered "poor" replies received additional prompts for self-correction, and the revised answers underwent the same review.
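For clarity, the rating rubric can be expressed as a simple classification function. The following Python sketch reproduces the thresholds described above; the function and variable names are illustrative, not taken from the study.

```python
# Minimal sketch of the study's scoring rubric; names are illustrative.

def classify_response(expert_scores: list[int]) -> tuple[int, str]:
    """Sum six expert ratings (1-10 each) and map the 60-point
    total onto the three-tier rubric described in the methods."""
    assert len(expert_scores) == 6, "the study used six independent raters"
    assert all(1 <= s <= 10 for s in expert_scores), "each rating is 1-10"
    total = sum(expert_scores)
    if total > 45:
        rating = "good"       # totals above 45
    elif total >= 30:
        rating = "moderate"   # totals of 30 to 45
    else:
        rating = "poor"       # totals below 30; triggers a re-prompt
    return total, rating

# Example: ratings of 8, 7, 9, 8, 6, 8 sum to 46, i.e. "good".
print(classify_response([8, 7, 9, 8, 6, 8]))  # (46, 'good')
```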
DeepSeek‑R1 achieved higher total scores than the other two models on almost every topic, particularly surgical protocols and multimodal therapies. Claude 3.5 ranked second, providing mostly coherent coverage but occasionally omitting recent guideline updates, while DeepSeek‑V3 struggled with intricate guideline-based material. When prompted to revise "poor" responses, DeepSeek‑R1 corrected its errors markedly and rose to a "good" rating on re-evaluation, DeepSeek‑V3 improved only marginally, and Claude 3.5 consistently lifted its "poor" answers into the moderate range.
DeepSeek‑R1 demonstrated the strongest performance for clinical content linked to programmed cell death in gastric cancer, while Claude 3.5 performed moderately well. DeepSeek‑V3 proved adequate for more basic queries but lacked sufficient detail for advanced guideline-based scenarios. These findings highlight the potential and limitations of such automated models when applied in complex oncologic contexts.