Digital Epidemiology of Prescription Drug References on X (Formerly Twitter): Neural Network Topic Modeling and Sentiment Analysis.

Author information

Department of Epidemiology & Biostatistics, School of Public Health Bloomington, Indiana University Bloomington, Bloomington, IN, United States.

Department of Applied Health Science, School of Public Health Bloomington, Indiana University Bloomington, Bloomington, IN, United States.

Publication information

J Med Internet Res. 2024 Aug 23;26:e57885. doi: 10.2196/57885.

Abstract

BACKGROUND

Data from the social media platform X (formerly Twitter) can provide insights into the types of language that are used when discussing drug use. In past research using latent Dirichlet allocation (LDA), we found that tweets containing "street names" of prescription drugs were difficult to classify due to the similarity to other colloquialisms and lack of clarity over how the terms were used. Conversely, "brand name" references were more amenable to machine-driven categorization.

OBJECTIVE

This study sought to use next-generation techniques (beyond LDA) from natural language processing to reprocess X data and automatically cluster groups of tweets into topics to differentiate between street- and brand-name data sets. We also aimed to analyze the differences in emotional valence between the 2 data sets to study the relationship between engagement on social media and sentiment.

METHODS

We used the Twitter application programming interface to collect tweets that contained the street or brand name of a prescription drug. Using BERTopic in combination with Uniform Manifold Approximation and Projection and k-means, we generated topics for the street-name corpus (n=170,618) and brand-name corpus (n=245,145). Valence Aware Dictionary and Sentiment Reasoner (VADER) scores were used to classify whether tweets within the topics had positive, negative, or neutral sentiments. Two different logistic regression classifiers were used to predict the sentiment label within each corpus. The first model used a tweet's engagement metrics and topic ID to predict the label, while the second model used those features in addition to the top 5000 tweets with the largest term frequency-inverse document frequency (TF-IDF) score.
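
The sentiment-labeling step can be illustrated with a minimal sketch. It assumes the cutoffs of +0.05 and -0.05 on the VADER compound score that the VADER documentation commonly suggests; the abstract does not state which thresholds the authors used. The function names, the row layout, and the example scores are hypothetical and are not data from the study.

```python
from collections import Counter

# Assumed cutoffs: the VADER README suggests +/-0.05 on the compound score.
# The abstract does not say which thresholds the authors applied.
POS_CUTOFF = 0.05
NEG_CUTOFF = -0.05


def sentiment_label(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a three-way label."""
    if compound >= POS_CUTOFF:
        return "positive"
    if compound <= NEG_CUTOFF:
        return "negative"
    return "neutral"


def label_counts_by_topic(rows):
    """Count sentiment labels per topic ID.

    rows: iterable of (topic_id, compound_score) pairs (hypothetical layout).
    """
    counts = {}
    for topic_id, compound in rows:
        counts.setdefault(topic_id, Counter())[sentiment_label(compound)] += 1
    return counts


# Made-up scores for illustration only; not results from the paper.
example_rows = [(0, 0.62), (0, -0.40), (0, 0.01), (1, 0.30), (1, 0.05)]
summary = label_counts_by_topic(example_rows)
```

In a pipeline like the one described, the compound scores would come from a VADER analyzer run on each tweet, and the resulting label would serve as the target for the logistic regression classifiers.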

RESULTS

Using BERTopic, we identified 40 topics for the street-name data set and 5 topics for the brand-name data set, which we generalized into 8 and 5 topics of discussion, respectively. Four of the general themes of discussion in the brand-name corpus referenced drug use, while 2 themes of discussion in the street-name corpus referenced drug use. From the VADER scores, we found that both corpora were inclined toward positive sentiment. Adding the vectorized tweet text increased the accuracy of our models by around 40% compared with the models that did not incorporate the tweet text in both corpora.

CONCLUSIONS

BERTopic was able to classify tweets well. As with LDA, the discussion using brand names was more similar between tweets than the discussion using street names. VADER scores could only be logically applied to the brand-name corpus because of the high prevalence of non-drug-related topics in the street-name data. Brand-name tweets discussed drugs either positively or negatively, with few posts being neutral in sentiment. From our machine learning models, engagement alone was not enough to predict the sentiment label; the added context from the tweets was needed to understand the emotionality of a tweet.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2de9/11380061/06f89d9c3ecf/jmir_v26i1e57885_fig1.jpg
