Center for Data Science and Outcomes Research, Veterans Affairs Medical Center, Washington, DC, USA.
George Washington University, Washington, DC, USA.
ESC Heart Fail. 2024 Oct;11(5):3155-3166. doi: 10.1002/ehf2.14787. Epub 2024 Jun 14.
Heart failure (HF) is a clinical syndrome with no definitive diagnostic test. HF registries are often based on manual reviews of the medical records of hospitalized HF patients identified using International Classification of Diseases (ICD) codes. However, most HF patients are not hospitalized, and manual review of large electronic health record (EHR) datasets is not practical. The US Department of Veterans Affairs (VA) has the largest integrated healthcare system in the nation, and an estimated 1.5 million patients have ICD codes for HF in their VA EHR (the 'HF ICD-code universe'). The objective of our study was to develop artificial intelligence (AI) models to phenotype HF in these patients.
The model development cohort (n = 20 000: training, 16 000; validation, 2000; testing, 2000) included 10 000 patients with HF and 10 000 without HF who were matched by age, sex, race, inpatient/outpatient status, hospital, and encounter date (within 60 days). HF status was ascertained by manual chart reviews in VA's External Peer Review Program for HF (EPRP-HF), and non-HF status was ascertained by the absence of ICD codes for HF in the VA EHR. Two clinicians annotated 1000 random text snippets containing HF-related keywords and labelled 436 as HF; the annotated snippets were then used to train and test a natural language processing (NLP) model to classify HF (positive predictive value or PPV, 0.81; sensitivity, 0.77). A machine learning (ML) model using a linear support vector machine architecture was trained and tested to classify HF using EPRP-HF patients as cases (PPV, 0.86; sensitivity, 0.86). From the 'HF ICD-code universe', we randomly selected 200 patients (gold standard cohort), and two clinicians manually adjudicated HF (gold standard HF) in 145 of those patients by chart reviews. We calculated NLP, ML, and NLP + ML scores and used weighted F scores to derive their optimal threshold values for HF classification, which resulted in PPVs of 0.83, 0.77, and 0.85 and sensitivities of 0.86, 0.88, and 0.83, respectively. HF patients classified by the NLP + ML model had characteristics and prognosis similar to those of patients with gold standard HF. All three models outperformed ICD-code approaches, which had high PPV but low sensitivity: one principal hospital discharge diagnosis code for HF (PPV, 0.97; sensitivity, 0.21) or two primary outpatient encounter diagnosis codes for HF (PPV, 0.88; sensitivity, 0.54).
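The threshold-selection step described above can be illustrated with a minimal sketch. It assumes each patient has a continuous model score and a binary gold standard label, and it sweeps candidate cut-offs to find the one that maximizes a weighted F score (F-beta). The abstract does not report the weight used, so `beta` here is an assumption, and the scores, labels, and function names are hypothetical toy values rather than study data or the authors' implementation.

```python
from typing import List, Tuple

def ppv_sensitivity(scores: List[float], labels: List[int],
                    threshold: float) -> Tuple[float, float]:
    """Return (PPV, sensitivity) when score >= threshold is classified as HF."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    return ppv, sens

def f_beta(ppv: float, sens: float, beta: float = 1.0) -> float:
    """Weighted F score: beta < 1 favours PPV, beta > 1 favours sensitivity."""
    if ppv == 0.0 and sens == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * ppv * sens / (b2 * ppv + sens)

def best_threshold(scores: List[float], labels: List[int],
                   beta: float = 1.0) -> float:
    """Pick the observed score that maximizes F-beta when used as the cut-off."""
    candidates = sorted(set(scores))
    return max(
        candidates,
        key=lambda t: f_beta(*ppv_sensitivity(scores, labels, t), beta),
    )

# Hypothetical toy data (NOT from the study): model scores and gold standard labels.
TOY_SCORES = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
TOY_LABELS = [1, 1, 1, 0, 1, 0, 0, 0]

if __name__ == "__main__":
    for beta in (1.0, 0.5):  # beta values are assumptions for illustration
        t = best_threshold(TOY_SCORES, TOY_LABELS, beta)
        ppv, sens = ppv_sensitivity(TOY_SCORES, TOY_LABELS, t)
        print(f"beta={beta}: threshold={t}, PPV={ppv:.2f}, sensitivity={sens:.2f}")
```

On this toy data, weighting PPV more heavily (a smaller `beta`) moves the chosen cut-off upward, trading sensitivity for PPV. This is the same trade-off visible in the ICD-code comparisons, where a strict definition gives high PPV at the cost of low sensitivity.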
These findings suggest that NLP and ML models are efficient AI tools for phenotyping HF in large EHR datasets and can be used to create contemporary HF registries for clinical studies of effectiveness, quality improvement, and hypothesis generation.