Bertke S J, Meyers A R, Wurzelbacher S J, Measure A, Lampl M P, Robins D
National Institute for Occupational Safety and Health, Division of Surveillance, Hazard Evaluations, and Field Studies, Industrywide Studies Branch, 1090 Tusculum Ave, Cincinnati, OH 45226, United States.
National Institute for Occupational Safety and Health, Division of Surveillance, Hazard Evaluations, and Field Studies, Industrywide Studies Branch, Center for Workers' Compensation Studies, 1090 Tusculum Ave, Cincinnati, OH 45226, United States.
Accid Anal Prev. 2016 Mar;88:117-23. doi: 10.1016/j.aap.2015.12.006. Epub 2015 Dec 30.
Manually reading free-text narratives in large databases to identify the cause of an injury can be very time-consuming, and much recent work has aimed to automate this process. In particular, variations of the naïve Bayes model have been used successfully to auto-code free-text narratives describing the event or exposure that led to the injury in a workers' compensation claim. This paper compared the naïve Bayes model with an alternative logistic model and found that the logistic model outperformed the naïve Bayes model. Further modest improvements were found by adding sequences of keywords to the models rather than considering only single keywords. The programs and weights used in this paper are available upon request to researchers who lack a training set and wish to automatically assign event codes to large data sets of text narratives. The utility of sharing this program was tested on an outside set of injury narratives provided by the Bureau of Labor Statistics, with promising results.
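The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' program or weights: the narratives, the event labels (`fall`, `struck_by`, `overexertion`), the function names, and the training settings below are all invented for illustration. It assumes a multinomial naïve Bayes model with add-one smoothing and a softmax logistic model fitted by plain gradient ascent, both using single keywords plus two-word keyword sequences as features.

```python
import math
import re
from collections import Counter, defaultdict


def features(narrative, max_n=2):
    """Single keywords plus sequences of up to max_n consecutive keywords."""
    words = re.findall(r"[a-z']+", narrative.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def train_naive_bayes(rows, max_n=2):
    """Count how often each feature occurs under each event code."""
    class_counts, feat_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, code in rows:
        feats = features(text, max_n)
        class_counts[code] += 1
        feat_counts[code].update(feats)
        vocab.update(feats)
    return class_counts, feat_counts, vocab


def predict_naive_bayes(model, text, max_n=2):
    """Pick the code with the highest log prior plus log likelihood."""
    class_counts, feat_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for code, count in class_counts.items():
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(feat_counts[code].values()) + len(vocab)
        scores[code] = math.log(count / total) + sum(
            math.log((feat_counts[code][f] + 1) / denom)
            for f in features(text, max_n) if f in vocab)
    return max(scores, key=scores.get)


def softmax(scores):
    """Turn per-code scores into probabilities (shifted for stability)."""
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def train_logistic(rows, max_n=2, epochs=200, lr=0.1):
    """Fit one weight per (code, feature) pair by stochastic gradient ascent."""
    codes = sorted({code for _, code in rows})
    weights = {c: defaultdict(float) for c in codes}
    for _ in range(epochs):
        for text, code in rows:
            feats = set(features(text, max_n)) | {"<bias>"}
            probs = softmax({c: sum(weights[c][f] for f in feats) for c in codes})
            for c in codes:
                step = lr * ((1.0 if c == code else 0.0) - probs[c])
                for f in feats:
                    weights[c][f] += step
    return weights


def predict_logistic(weights, text, max_n=2):
    """Pick the code whose summed feature weights are largest."""
    feats = set(features(text, max_n)) | {"<bias>"}
    scores = {c: sum(w.get(f, 0.0) for f in feats) for c, w in weights.items()}
    return max(scores, key=scores.get)


# Hypothetical training narratives and event labels, invented for this sketch.
TRAINING = [
    ("slipped on wet floor and fell", "fall"),
    ("fell from ladder while painting", "fall"),
    ("struck by falling box in warehouse", "struck_by"),
    ("hit by forklift in warehouse", "struck_by"),
    ("strained back lifting heavy box", "overexertion"),
    ("pulled shoulder lifting patient", "overexertion"),
]

nb_model = train_naive_bayes(TRAINING)
lr_model = train_logistic(TRAINING)
print(predict_naive_bayes(nb_model, "fell off ladder"))
print(predict_logistic(lr_model, "fell off ladder"))
```

Setting `max_n=1` in every call restricts both models to single keywords, which is one way to reproduce the single-keyword versus keyword-sequence contrast on a labelled data set. A toy set this small says nothing about which model performs better; that comparison needs a held-out set of coded narratives.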