Prescott Maximo R, Yeager Samantha, Ham Lillian, Rivera Saldana Carlos D, Serrano Vanessa, Narez Joey, Paltin Dafna, Delgado Jorge, Moore David J, Montoya Jessica
HIV Neurobehavioral Research Program, University of California, San Diego, San Diego, CA, United States.
San Diego State University/University of California San Diego Joint Doctoral Program in Clinical Psychology, San Diego, CA, United States.
JMIR AI. 2024 Aug 2;3:e54482. doi: 10.2196/54482.
Qualitative methods are highly valuable for the dissemination and implementation of new digital health interventions; however, they can be time intensive and can slow dissemination when timely knowledge from data sources is needed in ever-changing health systems. Recent advances in generative artificial intelligence (GenAI) and the large language models (LLMs) that underlie it may offer a promising opportunity to expedite the qualitative analysis of textual data, but their efficacy and reliability remain unknown.
The primary objectives of our study were to evaluate the consistency of themes, the reliability of coding, and the time required for inductive and deductive thematic analyses conducted by GenAI (ie, ChatGPT and Bard) versus human coders.
The qualitative data for this study consisted of 40 brief SMS text message reminder prompts used in a digital health intervention to promote antiretroviral medication adherence among people with HIV who use methamphetamine. Two independent teams of human coders conducted inductive and deductive thematic analyses of these SMS text messages. An independent human analyst conducted analyses following both approaches using ChatGPT and Bard. The consistency of themes (ie, the extent to which the themes were the same) and the reliability of coding (ie, agreement in the coding of themes) were compared across methods.
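The abstract does not report the exact prompts or interface the analyst used with ChatGPT and Bard, so the following is only a rough sketch of how a deductive codebook could, in principle, be applied to individual SMS prompts programmatically. It assumes the OpenAI Python SDK, a hypothetical three-theme codebook, and placeholder prompt wording; none of these details come from the study itself.

```python
# Minimal sketch of deductive coding with an LLM via the OpenAI Python SDK.
# The codebook themes and prompt wording below are illustrative placeholders,
# not the ones used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CODEBOOK = [
    "social support",          # hypothetical theme labels
    "health consequences",
    "self-efficacy",
]

def code_message(sms_text: str) -> list[str]:
    """Ask the model which codebook themes apply to a single SMS prompt."""
    prompt = (
        "You are assisting with a deductive thematic analysis. "
        f"Codebook themes: {', '.join(CODEBOOK)}. "
        "List, comma separated, every theme that applies to this SMS reminder "
        f"(or 'none'):\n\n{sms_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the coding as deterministic as possible
    )
    answer = response.choices[0].message.content.lower()
    # Return every codebook theme the model named in its reply
    return [theme for theme in CODEBOOK if theme in answer]
```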
The themes generated by GenAI (both ChatGPT and Bard) were consistent with 71% (5/7) of the themes identified by human analysts following inductive thematic analysis. The consistency in themes was lower between humans and GenAI following a deductive thematic analysis procedure (ChatGPT: 6/12, 50%; Bard: 7/12, 58%). The percentage agreement (or intercoder reliability) for these congruent themes between human coders and GenAI ranged from fair to moderate (ChatGPT, inductive: 31/66, 47%; ChatGPT, deductive: 22/59, 37%; Bard, inductive: 20/54, 37%; Bard, deductive: 21/58, 36%). In general, ChatGPT and Bard performed similarly to each other across both types of qualitative analyses in terms of consistency of themes (inductive: 6/6, 100%; deductive: 5/6, 83%) and reliability of coding (inductive: 23/62, 37%; deductive: 22/47, 47%). On average, GenAI required significantly less overall time than human coders when conducting qualitative analysis (mean 20, SD 3.5 min vs mean 567, SD 106.5 min).
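The percentage agreement figures above are ratios of agreed coding decisions to total coding decisions (eg, 31/66, 47%). As a rough illustration of that calculation, the sketch below compares two coders' theme assignments across a small set of messages; the message IDs, theme labels, and unit of agreement are hypothetical and may not match how the study tallied its counts.

```python
# Illustrative percentage agreement (intercoder reliability) between two coders.
# Each coder's output maps a message ID to the set of themes they assigned;
# the message IDs and theme labels here are made up for the example.
human_codes = {
    "sms_01": {"social support"},
    "sms_02": {"health consequences", "self-efficacy"},
    "sms_03": {"self-efficacy"},
}
genai_codes = {
    "sms_01": {"social support"},
    "sms_02": {"health consequences"},
    "sms_03": {"social support"},
}

def percentage_agreement(coder_a: dict, coder_b: dict, themes: list[str]) -> float:
    """Fraction of (message, theme) decisions on which the two coders agree."""
    agreements = 0
    decisions = 0
    for msg in coder_a:
        for theme in themes:
            decisions += 1
            # Agreement means both coders applied the theme or both did not
            if (theme in coder_a[msg]) == (theme in coder_b.get(msg, set())):
                agreements += 1
    return agreements / decisions

themes = ["social support", "health consequences", "self-efficacy"]
print(f"Agreement: {percentage_agreement(human_codes, genai_codes, themes):.0%}")
```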
The consistency between the themes generated by human coders and GenAI suggests that these technologies could reduce the resource intensiveness of qualitative thematic analysis; however, the comparatively low reliability of coding between them suggests that hybrid approaches are necessary. Human coders appeared to be better than GenAI at identifying nuanced and interpretative themes. Future studies should consider how these powerful technologies can best be used in collaboration with human coders to improve the efficiency of qualitative research in hybrid approaches while also mitigating the potential ethical risks they may pose.