Department of Computer Science, Loyola University Chicago, Chicago, IL, USA.
Center for Health Outcomes and Informatics Research, Loyola University Chicago, 2160 S. First Avenue, Maywood, IL, 60156, USA.
BMC Med Inform Decis Mak. 2020 Apr 29;20(1):79. doi: 10.1186/s12911-020-1099-y.
Automated de-identification methods for removing protected health information (PHI) from the source notes of the electronic health record (EHR) rely on building systems to recognize mentions of PHI in text, but they remain inadequate at ensuring perfect PHI removal. As an alternative to relying on de-identification systems, we propose the following solutions: (1) mapping the corpus of documents to a standardized medical vocabulary (concept unique identifier [CUI] codes mapped from the Unified Medical Language System), thus eliminating PHI as inputs to a machine learning model; and (2) training character-based machine learning models that obviate the need for a dictionary containing input words/n-grams. We aim to test the performance of models with and without PHI in a use-case for an opioid misuse classifier.
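The two PHI-free input representations described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the alphabet, the padding and unknown ids, the function names `encode_chars` and `encode_cuis`, and the CUI codes shown are all placeholder assumptions, and the step that extracts CUIs from raw notes is assumed to happen upstream.

```python
import string

# Assumed fixed alphabet; id 0 is reserved for padding, id 1 for unknown characters.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
PAD_ID, UNK_ID = 0, 1
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_chars(text, max_len):
    """Map a note to a fixed-length list of character ids.

    No word or n-gram dictionary is stored, so no PHI token is kept as a feature name.
    """
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in text.lower()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


def encode_cuis(cuis, cui_vocab):
    """Binary bag-of-CUIs vector over a fixed, ordered list of CUI codes."""
    present = set(cuis)
    return [1 if code in present else 0 for code in cui_vocab]


# Placeholder codes for illustration only; they are not real UMLS identifiers.
cui_vocab = ["C0000001", "C0000002", "C0000003"]
char_vector = encode_chars("Pt reports heroin use", max_len=32)
cui_vector = encode_cuis(["C0000002"], cui_vocab)
```

Either vector could then be passed to a classifier; only the CUI vector is restricted to a standardized vocabulary by construction.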
An observational cohort was sampled from adult hospital inpatient encounters at a health system between 2007 and 2017. Case-control stratified sampling (n = 1000) was performed to build an annotated dataset as a reference standard of cases and non-cases of opioid misuse. Features for training and testing included CUI codes, characters, and n-grams. The models applied were machine learning models (neural networks and logistic regression) and an expert-consensus rule-based model for opioid misuse. The areas under the receiver operating characteristic curve (AUROC) were compared between models for discrimination. The Hosmer-Lemeshow test and visual plots were used to assess model fit and calibration.
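For readers unfamiliar with the two evaluation measures named above, the following is a minimal pure-Python sketch of how discrimination (AUROC) and the Hosmer-Lemeshow statistic can be computed from labels and predicted probabilities. The function names, the default of ten equal-size risk groups, and the toy data are assumptions for illustration and do not come from the study.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Hosmer-Lemeshow chi-square statistic over equal-size risk groups."""
    pairs = sorted(zip(y_prob, y_true))  # order encounters by predicted risk
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)  # expected number of cases
        observed = sum(y for _, y in chunk)  # observed number of cases
        # Skip degenerate groups whose variance term would be zero.
        if 0 < expected < size:
            stat += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return stat


# Toy data for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))
print(hosmer_lemeshow(labels, scores, groups=2))
```

The statistic is conventionally compared against a chi-square distribution with `groups - 2` degrees of freedom; a small value indicates that observed and expected case counts agree within risk groups.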
Machine learning models with CUI codes performed similarly to n-gram models with PHI. The top-performing models, with AUROCs > 0.90, used CUI codes as inputs to a convolutional neural network, a max pooling network, and a logistic regression model. The best-calibrated models with the best model fit were the CUI-based convolutional neural network and max pooling network. The top-weighted CUI codes in logistic regression had the related terms 'Heroin' and 'Victim of abuse'.
We demonstrate good test characteristics for an opioid misuse computable phenotype that is free of PHI and performs similarly to models that use PHI. We share a PHI-free, trained opioid misuse classifier for other researchers and health systems to use and benchmark, to help overcome privacy and security concerns.