Guo Yuting, Shi Haoming, Book Wendy M, Ivey Lindsey Carrie, Rodriguez Fred H, Sameni Reza, Raskind-Hood Cheryl, Robichaux Chad, Downing Karrie F, Sarker Abeed
Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, Georgia, USA.
Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.
Birth Defects Res. 2025 Mar;117(3):e2451. doi: 10.1002/bdr2.2451.
International Classification of Diseases (ICD) codes can accurately identify patients with certain congenital heart defects (CHDs). In ICD-defined CHD data sets, the code for secundum atrial septal defect (ASD) is the most common, but it has a low positive predictive value (PPV) for CHD, which can lead to erroneous conclusions being drawn from such data sets. Public health surveillance needs methods that reduce the false positive rate for CHD among individuals captured with the ASD ICD code.
We propose a two-level classification system, comprising a CHD classification model and an ASD classification model, to categorize cases with an ASD ICD code into three groups: ASD, other CHD, or no CHD (including patent foramen ovale). In the proposed approach, a machine learning model that leverages structured data is combined with a text classification system. We compare the performance of three text classification strategies: support vector machines (SVMs) using text-based features, a robustly optimized Transformer-based model (RoBERTa), and a scalable tree boosting system using non-text-based features (XGBoost).
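The two-level design can be sketched as a simple cascade. This is a minimal illustration, not the authors' implementation: it assumes that the first level decides CHD versus no CHD and that the second level separates ASD from other CHD among the cases flagged as CHD. The names `classify_case`, `toy_chd_model`, and `toy_asd_model` are hypothetical, and the keyword rules are stand-ins for trained models such as an SVM, RoBERTa, or XGBoost.

```python
from typing import Callable, Dict, List

# Hypothetical case record: free-text notes plus any structured fields.
Case = Dict[str, object]
BinaryModel = Callable[[Case], bool]


def classify_case(case: Case, has_chd: BinaryModel, has_asd: BinaryModel) -> str:
    """Assumed cascade: level 1 screens for any CHD, level 2 splits ASD from other CHD."""
    if not has_chd(case):
        return "no CHD"  # includes patent foramen ovale in the abstract's grouping
    return "ASD" if has_asd(case) else "other CHD"


# Toy keyword rules standing in for trained classifiers (illustration only).
def toy_chd_model(case: Case) -> bool:
    text = str(case.get("notes", "")).lower()
    return "septal defect" in text or "tetralogy" in text


def toy_asd_model(case: Case) -> bool:
    return "secundum" in str(case.get("notes", "")).lower()


cases: List[Case] = [
    {"notes": "Echo shows secundum atrial septal defect."},
    {"notes": "Patent foramen ovale, otherwise normal."},
    {"notes": "History of tetralogy of Fallot repair."},
]

labels = [classify_case(c, toy_chd_model, toy_asd_model) for c in cases]
print(labels)
```

Because each level is passed in as a plain callable, different model families can be paired across the two levels, which mirrors the mixed configurations compared in the study (for example, one model type for CHD and another for ASD).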
Using SVM for both CHD and ASD classification resulted in the best performance for the ASD and no CHD groups, achieving F scores of 0.53 (±0.05) and 0.78 (±0.02), respectively. XGBoost for CHD classification combined with SVM for ASD classification performed best for the other CHD group (F score: 0.39 [±0.03]).
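For readers unfamiliar with per-group F scores, the sketch below shows how such a score could be computed for each of the three groups. It assumes the F score is the F1 score, the harmonic mean of precision (equivalent to PPV) and recall; the abstract does not state which F variant was used. The function name `per_class_f1` and the toy labels are hypothetical and are not data from the study.

```python
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Compute one-vs-rest F1 for every label seen in the truth or the predictions."""
    scores: Dict[str, float] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0  # precision is the PPV
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Toy labels for illustration only.
y_true = ["ASD", "ASD", "other CHD", "no CHD", "no CHD", "no CHD"]
y_pred = ["ASD", "no CHD", "other CHD", "no CHD", "no CHD", "ASD"]

scores = per_class_f1(y_true, y_pred)
print(scores)
```

Reporting the score separately for each group, rather than as a single average, makes it visible when a rarer group such as other CHD is classified less reliably than the larger groups.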
This study demonstrates that it is feasible to use patients' clinical notes and machine learning to classify cases at a finer granularity than ICD codes alone allow, particularly with a higher positive predictive value for CHD. The proposed approach can improve CHD surveillance.