Department of Population Medicine, Harvard Medical School & Harvard Pilgrim Health Care Institute, Boston, MA.
Computational Health Informatics Program, Boston Children's Hospital, Boston, MA.
J Am Heart Assoc. 2020 Oct 20;9(19):e016648. doi: 10.1161/JAHA.120.016648. Epub 2020 Sep 29.
Background: Real-world healthcare data are an important resource for epidemiologic research. However, accurate identification of patient cohorts, a crucial first step underpinning the validity of research results, remains a challenge. We developed and evaluated claims-based case ascertainment algorithms for pulmonary hypertension (PH), comparing conventional decision rules with state-of-the-art machine-learning approaches.
Methods and Results: We analyzed a linked electronic health record and Medicare database from 2 large academic tertiary care hospitals (years 2007-2013). Electronic health record charts were reviewed to form a gold standard cohort of patients with PH (n=386) and without PH (n=164). Using health encounter data captured in Medicare claims (including patients' demographics, diagnoses, medications, and procedures), we developed and compared 2 approaches for identifying patients with PH: decision rules, and machine-learning algorithms using penalized lasso regression, random forest, and gradient boosting machine. The best-performing rule-based algorithm, which required ≥3 PH-related healthcare encounters and a right heart catheterization, attained an area under the receiver operating characteristic curve (AUC) of 0.64 (sensitivity, 0.75; specificity, 0.48). All 3 machine-learning algorithms outperformed the best-performing rule-based algorithm (P<0.001). A model derived from the random forest algorithm achieved an AUC of 0.88 (sensitivity, 0.87; specificity, 0.70), and gradient boosting machine achieved comparable results (AUC, 0.85; sensitivity, 0.87; specificity, 0.70). Penalized lasso regression achieved an AUC of 0.73 (sensitivity, 0.70; specificity, 0.68).
Conclusions: Research-grade case identification algorithms for PH can be derived and rigorously validated using machine-learning algorithms. Simple decision rules commonly applied in published literature performed poorly; more complex rule-based algorithms may address the limitations of this approach. PH research using claims data would be considerably strengthened through the use of validated algorithms for cohort ascertainment.
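As a rough illustration of how a rule-based case definition of this kind can be scored against a chart-review gold standard, the sketch below applies the rule described in the abstract (≥3 PH-related encounters and a right heart catheterization) to a small made-up cohort and computes sensitivity and specificity. The `Patient` fields, the function names, and the toy records are all hypothetical and are not taken from the study; the actual claims variables and code lists used by the authors are not given in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical, simplified claims summary for one patient.
    ph_encounters: int  # count of PH-related healthcare encounters in claims
    had_rhc: bool       # right heart catheterization recorded in claims
    has_ph: bool        # gold-standard label from chart review


def rule_positive(p: Patient, min_encounters: int = 3) -> bool:
    """Rule from the abstract: >=3 PH-related encounters AND a right heart cath."""
    return p.ph_encounters >= min_encounters and p.had_rhc


def sensitivity_specificity(patients: list[Patient]) -> tuple[float, float]:
    """Score the rule against the gold-standard labels."""
    tp = fn = tn = fp = 0
    for p in patients:
        predicted = rule_positive(p)
        if p.has_ph and predicted:
            tp += 1
        elif p.has_ph and not predicted:
            fn += 1
        elif not p.has_ph and predicted:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("cohort needs at least one case and one non-case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy cohort, invented purely for illustration (not study data).
toy_cohort = [
    Patient(5, True, True),
    Patient(3, True, True),
    Patient(2, True, True),     # too few encounters: missed case
    Patient(4, False, True),    # no catheterization: missed case
    Patient(3, True, False),    # meets the rule but has no PH: false positive
    Patient(1, False, False),
    Patient(0, False, False),
    Patient(6, False, False),
]

sens, spec = sensitivity_specificity(toy_cohort)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

The same scoring function could in principle be applied to thresholded predictions from a machine-learning model, which is what allows rule-based and model-based case definitions to be compared on a common footing.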