Department of Primary- and Long-Term Care, University Medical Center Groningen, Groningen, Netherlands.
Data Science Center in Health, University Medical Center Groningen, Groningen, Netherlands.
J Med Internet Res. 2023 Oct 4;25:e49944. doi: 10.2196/49944.
Natural language processing (NLP) models such as bidirectional encoder representations from transformers (BERT) hold promise for revolutionizing disease identification from electronic health records (EHRs) by potentially enhancing efficiency and accuracy. However, applying them in practice demands a comprehensive, multidisciplinary approach to development and validation. The COVID-19 pandemic exposed difficulties in disease identification that stemmed from limited testing availability and from the handling of unstructured data. In the Netherlands, where general practitioners (GPs) serve as the first point of contact for health care, the EHRs generated by these primary care providers contain a wealth of potentially valuable information. Nonetheless, the unstructured nature of free-text EHR entries makes it hard to identify trends, detect disease outbreaks, or accurately pinpoint COVID-19 cases.
This study aims to develop and validate a BERT model for detecting COVID-19 consultations in general practice EHRs in the Netherlands.
The BERT model was pretrained on Dutch language data and fine-tuned using a comprehensive EHR data set comprising confirmed COVID-19 GP consultations and non-COVID-19-related consultations. The data set was partitioned into a training set and a development set, and the model's performance on an independent test set served as the primary measure of its effectiveness in COVID-19 detection. The final model was then validated through 3 approaches. First, external validation was performed on an EHR data set from a different geographic region in the Netherlands. Second, validation was conducted against polymerase chain reaction (PCR) test results obtained from municipal health services. Lastly, the correlation between predicted outcomes and COVID-19-related hospitalizations in the Netherlands was assessed, encompassing the period around the outbreak of the pandemic in the Netherlands, that is, the period before widespread testing.
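The partitioning step described above can be illustrated with a minimal sketch. It assumes a stratified split with 80/10/10 fractions and uses synthetic placeholder records; neither the fractions, the stratification, nor the function name `stratified_split` come from the study, and the actual procedure may differ.

```python
import random
from collections import defaultdict


def stratified_split(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (text, label) records into train/dev/test sets.

    Hypothetical helper: the fractions and the stratification by label
    are illustrative assumptions, not the study's documented procedure.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in records:
        by_label[record[1]].append(record)

    train, dev, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_dev = int(len(group) * fractions[1])
        train.extend(group[:n_train])
        dev.extend(group[n_train:n_train + n_dev])
        # Whatever remains after train and dev goes to the held-out test set.
        test.extend(group[n_train + n_dev:])
    return train, dev, test


# Synthetic stand-in data: 1 = COVID-19 consultation, 0 = other consultation.
records = [(f"consultation {i}", 1 if i % 10 == 0 else 0) for i in range(1000)]
train, dev, test = stratified_split(records)
```

Splitting within each label keeps the share of COVID-19 consultations similar across the three sets, which matters when the positive class is much rarer than the negative class.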
Model development used 300,359 GP consultations. We developed a highly accurate model for COVID-19 consultations (accuracy 0.97, F-score 0.90, precision 0.85, recall 0.85, specificity 0.99). External validation showed comparably high performance. Validation on PCR test data showed high recall but low precision and specificity. Validation using hospital data showed a significant correlation between the model's COVID-19 predictions and COVID-19-related hospitalizations (F statistic 96.8; P<.001; R=0.69). Most importantly, the model was able to predict COVID-19 cases weeks before the first confirmed case in the Netherlands.
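For readers less familiar with the reported measures, the sketch below shows how accuracy, precision, recall, specificity, and the F-score follow from the four cells of a binary confusion matrix. The function name `classification_metrics` and the counts in the usage line are illustrative assumptions; they are not the study's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts.

    tp/fp/tn/fn are true positives, false positives, true negatives,
    and false negatives. Assumes every denominator is nonzero.
    """
    precision = tp / (tp + fp)      # share of predicted positives that are correct
    recall = tp / (tp + fn)         # share of actual positives that are found
    specificity = tn / (tn + fp)    # share of actual negatives that are found
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
```

With a rare positive class, accuracy and specificity can stay high even when precision drops, which is one reason all five measures are reported side by side.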
The developed BERT model accurately identified COVID-19 cases among GP consultations, even before the first confirmed cases. The validated efficacy of our BERT model highlights the potential of NLP models to identify disease outbreaks early and exemplifies the power of multidisciplinary efforts in harnessing technology for disease identification. Moreover, the implications of this study extend beyond COVID-19 and offer a blueprint for the early recognition of various illnesses, suggesting that such models could revolutionize disease surveillance.