Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
Education Authority, Chaim Sheba Medical Center, Faculty of Health Science and Medicine, Tel-Aviv University, Tel-Aviv, Israel.
J Med Internet Res. 2024 Jul 30;26:e48595. doi: 10.2196/48595.
Under- or late identification of pulmonary embolism (PE), a thrombosis of 1 or more pulmonary arteries that seriously threatens patients' lives, is a major challenge confronting modern medicine.
We aimed to establish accurate and informative machine learning (ML) models to identify patients at high risk for PE as they are admitted to the hospital, before their initial clinical checkup, by using only the information in their medical records.
We collected demographic, comorbidity, and medication data for 2568 patients with PE and 52,598 control patients. We focused on data available prior to emergency department admission, as these are the most universally accessible data. We trained an ML random forest algorithm to detect PE at the earliest possible time during a patient's hospitalization, namely at the time of admission. We developed and applied 2 ML-based methods specifically to address the data imbalance between PE and non-PE patients, which causes misdiagnosis of PE.
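To give a sense of the scale of the imbalance described above, the following minimal sketch computes inverse-frequency class weights from the reported cohort sizes. It is an illustration only: the abstract does not describe the 2 ML-based methods the authors developed, and the weighting heuristic and the function name `balanced_class_weights` are assumptions introduced here, not the authors' approach.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    A common inverse-frequency heuristic (assumed here for illustration);
    rarer classes receive proportionally larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Cohort sizes reported in the abstract: 2568 patients with PE (label 1)
# and 52,598 control patients (label 0).
labels = [1] * 2568 + [0] * 52598
weights = balanced_class_weights(labels)

# Ratio of the PE weight to the control weight equals the control-to-PE ratio.
weight_ratio = weights[1] / weights[0]
print(f"PE weight: {weights[1]:.2f}, control weight: {weights[0]:.2f}")
print(f"Each PE case counts about {weight_ratio:.1f} times as much as a control")
```

With roughly 20 controls for every patient with PE, an unweighted classifier can reach high overall accuracy while missing most PE cases, which is why some form of rebalancing is needed.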
The resulting models predicted PE based on age, sex, BMI, past clinical PE events, chronic lung disease, past thrombotic events, and usage of anticoagulants, obtaining an 80% geometric mean value for the PE and non-PE classification accuracies. Although on hospital admission only 4% (1942/46,639) of the patients had a diagnosis of PE, we identified 2 clustering schemes comprising subgroups in which more than 61% (705/1120 in clustering scheme 1; 427/701 and 340/549 in clustering scheme 2) of patients were positive for PE. One subgroup in the first clustering scheme included 36% (705/1942) of all patients with PE; compared with patients of the other subgroups in this scheme, these patients were characterized by a definite past PE diagnosis, a 6-fold higher prevalence of deep vein thrombosis, and a 3-fold higher prevalence of pneumonia. In the second clustering scheme, 2 subgroups (1 of only men and 1 of only women) included patients who all had a past PE diagnosis and a relatively high prevalence of pneumonia, and a third subgroup included only those patients with a past diagnosis of pneumonia.
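The geometric mean reported above can be read as the square root of the product of the accuracy on PE patients (sensitivity) and the accuracy on non-PE patients (specificity). The sketch below shows this calculation under that reading; the confusion-matrix counts are hypothetical and chosen only to illustrate the metric, and they are not results from the study.

```python
import math


def geometric_mean_accuracy(tp, fn, tn, fp):
    """Geometric mean of the per-class accuracies.

    sensitivity = accuracy on positive (PE) patients
    specificity = accuracy on negative (non-PE) patients
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts (not from the study): 100 PE and 2400 non-PE patients.

# A model that labels everyone as non-PE is 96% accurate overall,
# yet its geometric mean is 0 because it finds no PE patient.
plain_accuracy_all_negative = 2400 / (100 + 2400)
g_all_negative = geometric_mean_accuracy(tp=0, fn=100, tn=2400, fp=0)

# A model that is right on 80% of each class scores about 0.8.
g_balanced = geometric_mean_accuracy(tp=80, fn=20, tn=1920, fp=480)

print(plain_accuracy_all_negative, g_all_negative, round(g_balanced, 3))
```

Unlike overall accuracy, this metric cannot be inflated by performing well only on the majority class, which makes it a natural choice when only about 4% of patients are positive.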
This study established an ML tool for early diagnosis of PE almost immediately upon hospital admission. Despite a highly imbalanced scenario that undermines accurate PE prediction, and despite using only information available from the patient's medical history, our models were both accurate and informative, enabling the identification of patients already at high risk for PE upon hospital admission, even before the initial clinical checkup was performed. The fact that we did not restrict our patients to those at high risk for PE according to previously published scales (eg, Wells or revised Geneva scores) enabled us to accurately assess the application of ML on raw medical data and identify new, previously unidentified risk factors for PE, such as previous pulmonary disease, in general populations.