National Center for Tumor Diseases, German Cancer Research Center, Heidelberg, Germany.
Department of Dermatology, University Hospital Kiel, Kiel, Germany.
Eur J Cancer. 2019 Sep;119:57-65. doi: 10.1016/j.ejca.2019.06.013. Epub 2019 Aug 14.
Recently, convolutional neural networks (CNNs) systematically outperformed dermatologists in distinguishing dermoscopic melanoma and nevi images. However, such a binary classification does not reflect the clinical reality of skin cancer screenings in which multiple diagnoses need to be taken into account.
Using 11,444 dermoscopic images, which covered the dermatologic diagnoses comprising the majority of pigmented skin lesions commonly encountered in skin cancer screenings, a CNN was trained through novel deep learning techniques. A test set of 300 biopsy-verified images was used to compare the classifier's performance with that of 112 dermatologists from 13 German university hospitals. The primary end-point was the correct classification of the different lesions into benign and malignant. The secondary end-point was the correct classification of the images into one of the five diagnostic categories.
Sensitivity and specificity of dermatologists for the primary end-point were 74.4% (95% confidence interval [CI]: 67.0-81.8%) and 59.8% (95% CI: 49.8-69.8%), respectively. At equal sensitivity, the algorithm achieved a specificity of 91.3% (95% CI: 85.5-97.1%). For the secondary end-point, the mean sensitivity and specificity of the dermatologists were 56.5% (95% CI: 42.8-70.2%) and 89.2% (95% CI: 85.0-93.3%), respectively. At equal sensitivity, the algorithm achieved a specificity of 98.8%. Two-sided McNemar tests revealed a significant difference for the primary end-point (p < 0.001). For the secondary end-point, the algorithm outperformed the dermatologists (p < 0.001) in all diagnostic categories except basal cell carcinoma, where performance was on par.
Our findings show that automated classification of dermoscopic melanoma and nevi images is extendable to a multiclass classification problem, thus better reflecting clinical differential diagnoses, while still significantly outperforming dermatologists (p < 0.001).
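The evaluation metrics above (sensitivity, specificity, and the two-sided McNemar test on paired classifier-versus-dermatologist decisions) can be sketched in a few lines of Python. The data below are hypothetical toy counts for illustration, not the study's actual predictions; the McNemar implementation uses the standard chi-square approximation with continuity correction and the stdlib identity that the chi-square survival function with one degree of freedom equals `erfc(sqrt(x/2))`.

```python
import math

def sens_spec(y_true, y_pred):
    """Sensitivity and specificity for binary labels (1 = malignant, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def mcnemar_p(b, c):
    """Two-sided McNemar test on paired ratings of the same images.

    b = images the algorithm got right but the dermatologist got wrong,
    c = images the dermatologist got right but the algorithm got wrong.
    Uses the chi-square approximation with continuity correction (1 df).
    """
    if b + c == 0:
        return 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # survival function of chi-square with 1 df: P(X > stat) = erfc(sqrt(stat/2))
    return math.erfc(math.sqrt(stat / 2))

# Toy example: 8 lesions, ground truth vs. one rater's benign/malignant calls.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec = sens_spec(truth, preds)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")

# Strongly asymmetric discordant pairs give a significant p-value ...
print(f"p={mcnemar_p(40, 5):.2e}")
# ... while balanced discordant pairs do not.
print(f"p={mcnemar_p(10, 10):.3f}")
```

Only the discordant pairs (b, c) enter McNemar's statistic, which is why it is the appropriate test for comparing two raters on the same 300-image test set rather than an unpaired proportion test.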