Anibal James, Landa Adam, Nguyen Hang, Daoud Veronica, Le Tram, Huth Hannah, Song Miranda, Peltekian Alec, Shin Ashley, Hazen Lindsey, Christou Anna, Rivera Jocelyne, Morhard Robert, Brenner Jacqueline, Bagci Ulas, Li Ming, Bensoussan Yael, Clifton David, Wood Bradford
Center for Interventional Oncology, Radiology and Imaging Sciences, NIH Clinical Center, Bethesda, USA.
Computational Health Informatics Lab, Oxford Institute of Biomedical Engineering, University of Oxford, Oxford, UK.
Npj Health Syst. 2025;2(1):19. doi: 10.1038/s44401-025-00022-7. Epub 2025 Jun 2.
In this study, transcribed videos about personal experiences with COVID-19 were used for variant classification. The o1 LLM was used to summarize the transcripts, excluding references to dates, vaccinations, testing methods, and other variables that were correlated with specific variants but unrelated to changes in the disease. This step was necessary to effectively simulate model deployment in the early days of a pandemic when subtle changes in symptomatology may be the only viable biomarkers of disease mutations. The embedded summaries were used for training a neural network to predict the variant status of the speaker as "Omicron" or "Pre-Omicron", resulting in an AUROC score of 0.823. This was compared to a neural network model trained on binary symptom data, which obtained a lower AUROC score of 0.769. Results of the study illustrated the future value of LLMs and audio data in the design of pandemic management tools for health systems.
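The two models above are compared by AUROC, which is the probability that a randomly chosen positive case (here, "Omicron") receives a higher predicted score than a randomly chosen negative case ("Pre-Omicron"). The abstract does not say how the metric was computed, so the following is only a minimal sketch of the standard pairwise definition. The function name `auroc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) AUROC for binary labels.

    labels: iterable of 0/1 values (1 = "Omicron", 0 = "Pre-Omicron" in this sketch)
    scores: iterable of predicted scores, higher meaning more likely positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")

    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy values for illustration only (hypothetical, not from the study).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

if __name__ == "__main__":
    print(f"AUROC on toy data: {auroc(toy_labels, toy_scores):.3f}")
```

Under this definition a score of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported 0.823 and 0.769 can be read. The double loop is quadratic in the number of cases and is meant for clarity; a rank-based implementation would be preferable for large datasets.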