Schweikhard Frank Philipp, Kosanke Anika, Lange Sandra, Kromrey Marie-Luise, Mankertz Fiona, Gamain Julie, Kirsch Michael, Rosenberg Britta, Hosten Norbert
Institute for Diagnostic Radiology and Neuroradiology, University Medicine of Greifswald, 17475 Greifswald, Germany.
Institute for Psychology, University of Greifswald, 17489 Greifswald, Germany.
Healthcare (Basel). 2024 Mar 23;12(7):706. doi: 10.3390/healthcare12070706.
This retrospective study evaluated commercial deep learning (DL) software for chest radiographs and explored its performance in different scenarios. A total of 477 patients (284 male, 193 female, mean age 61.4 (44.7-78.1) years) were included. For the reference standard, two radiologists performed independent readings on seven diseases, reporting 226 findings in 167 patients. An autonomous DL reading was performed separately and evaluated against this reference standard for accuracy, sensitivity and specificity using ROC analysis. The overall average AUC was 0.84 (95% CI 0.76-0.92), with an optimized DL sensitivity of 85% and specificity of 75.4%. The best results were seen in pleural effusion, with an AUC of 0.92 (0.885-0.955) and sensitivity and specificity of 86.4% each. The data also showed a significant influence of sex, age, and comorbidity on the level of agreement between the reference standard and the DL reading. In the exploratory analysis, about 40% of cases could be correctly ruled out when screening for a single specific disease at a sensitivity above 95%. For the combined reading of all abnormalities at once, only a marginal workload reduction could be achieved because of insufficient specificity. DL applications like this one bear the prospect of autonomous comprehensive reporting on chest radiographs but for now require human supervision. Radiologists need to consider possible bias in certain patient groups, e.g., elderly patients and women. With adjusted threshold values, commercial DL applications could already be deployed for a variety of tasks, such as ruling out certain conditions in screening scenarios, where they offer high potential for workload reduction.
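The threshold adjustment described in the abstract can be illustrated with a minimal sketch. It assumes the DL software outputs one abnormality score per case for a single disease and that a case is called positive when its score is at or above the threshold. The function name `rule_out_threshold`, the scores, and the labels are all hypothetical toy values for illustration; they are not the study's data or the vendor's method.

```python
from typing import List, Tuple


def rule_out_threshold(
    scores: List[float], labels: List[int], target_sens: float = 0.95
) -> Tuple[float, float, float, float]:
    """Return (threshold, sensitivity, specificity, correctly_ruled_out).

    A case is positive when score >= threshold. Candidate thresholds are
    the observed scores; the highest one whose sensitivity still meets
    target_sens is chosen, so that as many cases as possible fall below it.
    correctly_ruled_out is the share of all cases that are true negatives.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both positive and negative cases")
    # Sensitivity can only grow as the threshold is lowered, so the first
    # threshold that reaches the target is the highest admissible one.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        if tp / n_pos >= target_sens:
            tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
            return t, tp / n_pos, tn / n_neg, tn / len(scores)
    raise ValueError("target_sens must not exceed 1.0")


# Hypothetical toy data: ten cases, four with the disease (label 1).
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]

threshold, sens, spec, ruled_out = rule_out_threshold(scores, labels)
print(f"threshold={threshold:.2f} sensitivity={sens:.2f} "
      f"specificity={spec:.2f} correctly ruled out={ruled_out:.0%}")
```

Lowering the threshold trades specificity for sensitivity. In a rule-out setting, the share of true negatives below the threshold is what translates into workload reduction, which is why insufficient specificity limits the gain when all abnormalities are read at once.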