Bayani Azadeh, Ayotte Alexandre, Nikiema Jean Noel
Laboratoire Transformation Numérique en Santé, LabTNS, Montreal, QC, Canada.
Centre de recherche en santé publique, Université de Montréal et CIUSSS du Centre-Sud-de-l'Île-de-Montréal, Montreal, QC, Canada.
JMIR Infodemiology. 2025 Feb 21;5:e56831. doi: 10.2196/56831.
Many people seek health-related information online. The potential dangers of misinformation have made the importance of reliable information particularly evident, and discerning true and reliable information from false information has become increasingly challenging.
This pilot study aimed to introduce a novel approach to automating the fact-checking process, leveraging PubMed resources as a source of truth and using natural language processing transformer models to enhance the process.
A total of 538 health-related web pages, covering 7 different disease subjects, were manually selected by Factually Health Company. The process included the following steps: (1) the transformer models bidirectional encoder representations from transformers (BERT), BioBERT, and SciBERT, as well as the traditional models random forest and support vector machine, were used to classify the contents of web pages into 3 thematic categories (semiology, epidemiology, and management); (2) for each category in a web page, a PubMed query was automatically produced using a combination of the "WellcomeBertMesh" and "KeyBERT" models; (3) the 20 most related articles were automatically extracted from PubMed; and (4) the similarity-checking techniques of cosine similarity and Jaccard distance were applied to compare the content of the extracted articles with that of the web pages.
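A minimal sketch of steps 1-3 of this pipeline is given below, assuming the Hugging Face transformers, keybert, and requests packages. The classifier checkpoint, the keyword-to-query construction, and the E-utilities publication-type filters are illustrative assumptions rather than the authors' exact configuration, and the WellcomeBertMesh MeSH-tagging step is omitted for brevity.

```python
# Illustrative sketch of the described pipeline (not the authors' exact code).
from transformers import pipeline
from keybert import KeyBERT
import requests

CATEGORIES = ["semiology", "epidemiology", "management"]

# Step 1: classify web page content into the 3 thematic categories.
# "bert-base-uncased" is a placeholder; it would need fine-tuning on labeled
# web page content before its predictions are meaningful.
classifier = pipeline("text-classification", model="bert-base-uncased")

# Step 2: build a PubMed query from a category's text using KeyBERT keywords.
kw_model = KeyBERT()

def build_pubmed_query(category_text: str, n_keywords: int = 5) -> str:
    keywords = kw_model.extract_keywords(category_text, top_n=n_keywords)
    # Join the extracted keyword strings into a boolean PubMed query.
    return " AND ".join(term for term, _score in keywords)

# Step 3: retrieve the 20 most related articles via the NCBI E-utilities API,
# restricted to systematic reviews and meta-analyses (filter syntax illustrative).
def search_pubmed(query: str, retmax: int = 20) -> list[str]:
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={
            "db": "pubmed",
            "term": f"({query}) AND (systematic review[pt] OR meta-analysis[pt])",
            "retmax": retmax,
            "retmode": "json",
        },
        timeout=30,
    )
    return resp.json()["esearchresult"]["idlist"]  # PubMed IDs of candidates
```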
The BERT model for the categorization of web page contents performed well, with F-scores and recall of 93% and 94% for semiology and epidemiology, respectively, and 96% for both recall and F-score for management. For each of the 3 categories in a web page, 1 PubMed query was generated, and for each query, the 20 most related open-access articles within the category of systematic reviews and meta-analyses were extracted. Less than 10% of the extracted articles were irrelevant, and these were removed. For each web page, an average of 23% of the sentences were found to be very similar to the literature. Moreover, the evaluation showed that cosine similarity outperformed the Jaccard distance measure when comparing the similarity between sentences from web pages and academic papers vectorized by BERT. However, false positives were a significant issue among the retrieved sentences: some sentence pairs had a similarity score exceeding 80% yet could not be considered similar.
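As an illustration of the similarity comparison reported above, the following sketch contrasts cosine similarity over sentence embeddings with token-level Jaccard distance. The sentence-transformers encoder stands in for the BERT vectorization described in the study, and the example sentences and the 80% threshold are purely illustrative assumptions.

```python
# Illustrative comparison of the two similarity measures (not the authors' code).
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a BERT encoder

def jaccard_distance(a: str, b: str) -> float:
    """Token-level Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)

web_sentence = "Vitamin D supplements prevent most respiratory infections."
paper_sentence = "Vitamin D supplementation did not reduce respiratory infections."

emb = model.encode([web_sentence, paper_sentence])
cos = cosine_similarity([emb[0]], [emb[1]])[0][0]
jac = jaccard_distance(web_sentence, paper_sentence)

# Sentences with heavy lexical overlap can score high on embedding-based cosine
# similarity even when their claims differ, which is the false-positive problem
# noted above when a fixed threshold (eg, 80%) is used to flag "similar" pairs.
print(f"cosine similarity: {cos:.2f}, Jaccard distance: {jac:.2f}")
```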
In this pilot study, we proposed an approach to automating the fact-checking of health-related online information. Incorporating content from PubMed or other scientific article databases as a trustworthy resource can automate the discovery of similarly credible information in the health domain.