Department of Electrical Engineering, Vanderbilt University, Nashville, TN, USA.
Department of Computer Science, Vanderbilt University, Nashville, TN, USA.
Neuroinformatics. 2022 Apr;20(2):483-505. doi: 10.1007/s12021-021-09553-4. Epub 2022 Jan 3.
As electronic medical record (EMR) data become increasingly available, phenome-wide association studies (PheWAS) and phenome-disease association studies (PheDAS) have become a prominent first-line method for analyzing EMR data. Despite this growth, approachable software tools for conducting these analyses on large-scale EMR cohorts are lacking. In this article, we introduce pyPheWAS, an open-source Python package for conducting PheDAS and related analyses. The toolkit includes 1) data preparation, such as cohort censoring and age-matching; 2) traditional PheDAS analysis of ICD-9 and ICD-10 billing codes; 3) PheDAS analysis applied to a novel EMR phenotype mapping: Current Procedural Terminology (CPT) codes; and 4) novelty analysis of significant disease-phenotype associations found through PheDAS. The pyPheWAS toolkit is approachable and comprehensive, covering data preparation through result visualization within a simple command-line interface. The toolkit is designed for the ever-growing scale of available EMR data and can analyze cohorts of more than 100,000 patients in less than 2 hours. Through a case study of Down Syndrome and other intellectual developmental disabilities, we demonstrate the ability of pyPheWAS to discover both known and potentially novel disease-phenotype associations across different experiment designs and disease groups. The software and user documentation are available as open source at https://github.com/MASILab/pyPheWAS.
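To make the idea of a PheDAS concrete, the following is a minimal sketch of a phenotype-by-phenotype association scan. It is not the pyPheWAS API and does not reproduce the package's statistical model, which this abstract does not describe. The function names (`fisher_exact_two_sided`, `phedas_sketch`), the input layout (a dictionary from patient ID to a set of phenotype codes), the use of Fisher's exact test, the 0.5 continuity correction, and the Bonferroni threshold are all assumptions chosen for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def phedas_sketch(records, cases, alpha=0.05):
    """Test each phenotype code for association with case status.

    records: dict mapping patient ID -> set of phenotype codes (hypothetical layout)
    cases:   set of patient IDs that have the target disease
    """
    controls = set(records) - set(cases)
    codes = sorted(set().union(*records.values()))
    threshold = alpha / len(codes)  # Bonferroni correction (an assumption)
    results = []
    for code in codes:
        a = sum(1 for p in cases if code in records[p])      # cases with code
        b = len(cases) - a                                    # cases without code
        c = sum(1 for p in controls if code in records[p])   # controls with code
        d = len(controls) - c                                 # controls without code
        # Adding 0.5 to every cell keeps the odds ratio finite when a cell is zero.
        odds_ratio = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
        p_value = fisher_exact_two_sided(a, b, c, d)
        results.append({"code": code, "odds_ratio": odds_ratio,
                        "p_value": p_value, "significant": p_value < threshold})
    return sorted(results, key=lambda r: r["p_value"])


if __name__ == "__main__":
    # Toy cohort with made-up codes "X" and "Y"; not real EMR data.
    toy = {"p1": {"X", "Y"}, "p2": {"X"}, "p3": {"X"}, "p4": {"X"},
           "p5": {"Y"}, "p6": set(), "p7": set(), "p8": set()}
    for row in phedas_sketch(toy, cases={"p1", "p2", "p3", "p4"}):
        print(row)
```

A per-code exact test is used here only because it needs nothing beyond the standard library. A regression model with covariates such as age or sex would be the more typical choice for a real cohort, and the censoring and matching steps listed above would run before any such scan.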