Shi Haoming, Book Wendy M, Ivey Lindsey C, Rodriguez Fred H, Raskind-Hood Cheryl, Downing Karrie F, Farr Sherry L, McCracken Courtney E, Leedom Vinita O, Haynes Susan E, Amouzou Sandra, Sameni Reza, Kamaleswaran Rishikesan
Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.
Department of Biomedical Engineering, Duke University, Durham, North Carolina, USA.
Birth Defects Res. 2025 Feb;117(2):e2440. doi: 10.1002/bdr2.2440.
International Classification of Diseases (ICD) codes utilized for congenital heart defect (CHD) case identification in datasets have substantial false-positive (FP) rates. Incorporating machine learning (ML) algorithms following case selection by ICD codes may improve the accuracy of CHD identification, enhancing surveillance efforts.
Traditional ML methods were applied to four encounter-level datasets, 2010-2019, for 3334 patients with validated diagnoses and with at least one CHD ICD code identified. A 5-fold cross-validation approach was applied to the dataset to determine the set of overlapping important features best classifying CHD cases. Training and testing combinations were explored to determine the approach yielding the most accurate CHD classification.
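One way to read "overlapping important features" is as the features that rank among the most important in every cross-validation fold. The sketch below illustrates that reading only; the abstract does not say how the overlap was computed. The helper names, the top-k cutoff, the feature names, and the importance values are all hypothetical.

```python
def top_k_features(importances, k):
    """Return the k feature names with the highest importance in one fold."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])


def overlapping_features(fold_importances, k):
    """Keep only the features that appear in the top k of every fold."""
    per_fold = [top_k_features(imp, k) for imp in fold_importances]
    return set.intersection(*per_fold)


# Hypothetical importances from 5 folds (illustrative values, not study data).
folds = [
    {"ccs_a": 0.50, "ccs_b": 0.30, "ccs_c": 0.15, "ccs_d": 0.05},
    {"ccs_a": 0.40, "ccs_b": 0.35, "ccs_d": 0.20, "ccs_c": 0.05},
    {"ccs_b": 0.45, "ccs_a": 0.30, "ccs_c": 0.20, "ccs_d": 0.05},
    {"ccs_a": 0.50, "ccs_b": 0.25, "ccs_d": 0.15, "ccs_c": 0.10},
    {"ccs_a": 0.60, "ccs_b": 0.20, "ccs_c": 0.15, "ccs_d": 0.05},
]

stable = overlapping_features(folds, k=3)
print(sorted(stable))
```

Taking the intersection favors features whose importance is stable across folds rather than features that rank highly in a single split.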
CHD ICD positive predictive values (PPVs) by site ranged from 53.2% to 84.0%. The ML algorithm achieved a PPV of 95% (1273/1340) for the four-site dataset with a false-negative (FN) rate of 33% (639/1912) by choosing an operating point prioritizing PPV from the PPV-FN rate curve. XGBoost reduced 2105 Clinical Classification Software (CCS) features to 137 that distinguished true-positive (TP) from false-positive (FP) CHD classifications.
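The reported PPV and FN rate follow directly from the counts given above: 1273 TPs among 1340 predicted positives, and 639 missed cases among 1912 true CHD cases. The sketch below reproduces that arithmetic and then shows one possible way to pick an operating point that prioritizes PPV, namely the threshold with the lowest FN rate among those meeting a minimum PPV. The selection rule, the function names, and the toy scores are assumptions for illustration, not the authors' procedure.

```python
def ppv(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)


def fn_rate(fn, tp):
    """False-negative rate: FN / (FN + TP)."""
    return fn / (fn + tp)


# Counts taken from the abstract.
tp = 1273
fp = 1340 - tp   # predicted positives that were not CHD
fn = 1912 - tp   # true CHD cases the model missed
print(f"PPV = {ppv(tp, fp):.3f}, FN rate = {fn_rate(fn, tp):.3f}")


def pick_operating_point(scores, labels, min_ppv):
    """Among thresholds with PPV >= min_ppv, return the one with the lowest
    FN rate as (threshold, ppv, fn_rate), or None if no threshold qualifies."""
    best = None
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        n_tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        n_fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        n_fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
        if n_tp + n_fp == 0 or n_tp + n_fn == 0:
            continue
        point = (threshold, ppv(n_tp, n_fp), fn_rate(n_fn, n_tp))
        if point[1] >= min_ppv and (best is None or point[2] < best[2]):
            best = point
    return best


# Toy scores and labels (hypothetical, not study data).
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 1, 1, 0, 0]
choice = pick_operating_point(toy_scores, toy_labels, min_ppv=0.9)
print(choice)
```

Raising the minimum PPV moves the operating point toward higher thresholds, which trades missed cases (higher FN rate) for fewer FPs, the same trade-off reflected in the 95% PPV and 33% FN rate reported here.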
Applying ML algorithms following case selection by CHD-related ICD codes improved the accuracy of identifying true-positive (TP) CHD cases.