

Interrater reliability estimators tested against true interrater reliabilities.

Affiliations

Department of Communication, Faculty of Social Sciences, University of Macau, Taipa, Macao.

Publication Information

BMC Med Res Methodol. 2022 Aug 29;22(1):232. doi: 10.1186/s12874-022-01707-5.

Abstract

BACKGROUND

Interrater reliability, also known as intercoder reliability, is defined as true agreement between raters (coders) with chance agreement removed. It is used across many disciplines, including medical and health research, to measure the quality of ratings, coding, diagnoses, or other observations and judgements. While numerous indices of interrater reliability are available, experts disagree on which ones are legitimate or more appropriate. Almost all agree that percent agreement (a), the oldest and simplest index, is also the most flawed, because it fails to estimate and remove chance agreement, which is produced by raters' random rating. The experts, however, disagree on which chance estimators are legitimate or better. They also disagree on which of three factors, rating category, distribution skew, or task difficulty, an index should rely on to estimate chance agreement, and on which factors the known indices in fact rely. The most popular chance-adjusted indices, according to a functionalist view of mathematical statistics, assume that all raters conduct intentional and maximum random rating, whereas typical raters conduct involuntary and reluctant random rating. The mismatches between the assumed and the actual rater behaviors cause the indices to rely on mistaken factors to estimate chance agreement, leading to the numerous paradoxes, abnormalities, and other misbehaviors of the indices identified by prior studies.
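For reference, the chance-adjusted indices discussed here share one structure and differ only in how they estimate chance agreement. The forms below are the standard textbook two-rater, nominal-scale definitions (not reproduced from the article itself); percent agreement a corresponds to setting the chance term to zero, and Perreault and Leigh's I uses a separate square-root form.

```latex
% Shared chance-corrected form: a_o = observed agreement, a_e = estimated chance agreement
\[ R \;=\; \frac{a_o - a_e}{1 - a_e} \]

% Chance estimators for two raters and k nominal categories, where p_{i\cdot} and p_{\cdot i}
% are the two raters' marginal proportions for category i and \bar{p}_i = (p_{i\cdot}+p_{\cdot i})/2
\begin{align*}
a_e^{\kappa} &= \sum_{i=1}^{k} p_{i\cdot}\, p_{\cdot i}
  && \text{Cohen's } \kappa \\
a_e^{\pi}    &= \sum_{i=1}^{k} \bar{p}_i^{\,2}
  && \text{Scott's } \pi \text{ (Krippendorff's } \alpha \text{ adds a small-sample correction)} \\
a_e^{S}      &= \frac{1}{k}
  && \text{Bennett et al.'s } S \\
a_e^{AC_1}   &= \frac{1}{k-1}\sum_{i=1}^{k} \bar{p}_i\,(1-\bar{p}_i)
  && \text{Gwet's } AC_1
\end{align*}
```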

METHODS

We conducted a 4 × 8 × 3 between-subject controlled experiment with 4 subjects per cell. Each subject was a rating session with 100 pairs of ratings by two raters, for a total of 384 rating sessions as the experimental subjects. The experiment tested the seven best-known indices of interrater reliability against the observed reliabilities and chance agreements. The impacts of the three factors, i.e., rating category, distribution skew, and task difficulty, on the indices were tested.
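As a rough illustration of how such a comparison can be set up, the sketch below computes several of the tested indices for one two-rater session of nominal ratings and scores a candidate index against observed reliability using the two criteria the abstract reports: correlation (prediction) and average over- or underestimation in percentage points (approximation). This is a minimal sketch under the standard index definitions, not the authors' code; all function and variable names are illustrative, observed ("true") reliability is treated as an external input supplied by the experimental design, and Krippendorff's α (Scott's π with a small-sample correction for two raters and nominal data) is omitted.

```python
# Minimal sketch, not the authors' code: standard two-rater, nominal-scale index
# formulas plus the two comparison criteria described in the abstract.
from math import sqrt
from statistics import correlation, mean  # correlation() requires Python 3.10+

def chance_corrected(a_o, a_e):
    """Shared form of the chance-adjusted indices: (a_o - a_e) / (1 - a_e)."""
    return (a_o - a_e) / (1 - a_e)

def indices(r1, r2, k):
    """r1, r2: equal-length lists of category labels from two raters; k: number of categories."""
    n = len(r1)
    a_o = sum(x == y for x, y in zip(r1, r2)) / n              # percent agreement a
    cats = sorted(set(r1) | set(r2))
    p1 = {c: r1.count(c) / n for c in cats}                    # rater 1 marginals
    p2 = {c: r2.count(c) / n for c in cats}                    # rater 2 marginals
    pbar = {c: (p1[c] + p2[c]) / 2 for c in cats}              # averaged marginals
    return {
        "a":     a_o,
        "kappa": chance_corrected(a_o, sum(p1[c] * p2[c] for c in cats)),                       # Cohen's kappa
        "pi":    chance_corrected(a_o, sum(pbar[c] ** 2 for c in cats)),                        # Scott's pi
        "S":     chance_corrected(a_o, 1 / k),                                                  # Bennett et al.'s S
        "AC1":   chance_corrected(a_o, sum(pbar[c] * (1 - pbar[c]) for c in cats) / (k - 1)),   # Gwet's AC1
        "I":     sqrt(max(a_o - 1 / k, 0) * k / (k - 1)),                                       # Perreault & Leigh's I
    }

def evaluate(index_values, observed_reliabilities):
    """Score one index across many sessions: prediction (Pearson r) and
    approximation (mean over/underestimation in percentage points)."""
    r = correlation(index_values, observed_reliabilities)
    bias = mean(i - o for i, o in zip(index_values, observed_reliabilities)) * 100
    return r, bias
```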

RESULTS

The most criticized index, percent agreement (a), proved to be the most accurate predictor of reliability, reporting directional r = .84. It was also the third-best approximator, overestimating observed reliability by 13 percentage points on average. The three most acclaimed and most popular indices, Scott's π, Cohen's κ, and Krippendorff's α, underperformed all other indices, reporting directional r = .312 and underestimating reliability by 31.4 to 31.8 points. The newest index, Gwet's AC, emerged as the second-best predictor and the most accurate approximator. Bennett et al.'s S ranked behind AC, and Perreault and Leigh's I ranked fourth for both prediction and approximation. The six chance-adjusted indices' reliance on category and skew, and their failure to rely on difficulty, explain why they often underperformed a, which they were created to outperform. The evidence corroborated the notion that the chance-adjusted indices assume intentional and maximum random rating, while the raters instead exhibited involuntary and reluctant random rating.
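The skew sensitivity behind this ranking can be seen in a standard worked example (numbers chosen for illustration, not taken from this study). With k = 2 categories, suppose two raters code 100 items, agree on 90 items in the majority category, and split the remaining 10 disagreements evenly, so both raters' marginals are .95/.05:

```latex
% Both raters' marginals are 0.95 / 0.05; observed agreement is 0.90.
\begin{align*}
a_o &= 0.90 \\
a_e^{\kappa} = a_e^{\pi} &= 0.95^2 + 0.05^2 = 0.905
  &&\Rightarrow\quad \kappa \approx \pi \approx \tfrac{0.90 - 0.905}{1 - 0.905} \approx -0.05 \\
a_e^{S} &= \tfrac{1}{2}
  &&\Rightarrow\quad S = \tfrac{0.90 - 0.5}{1 - 0.5} = 0.80 \\
a_e^{AC_1} &= 2 \times 0.95 \times 0.05 = 0.095
  &&\Rightarrow\quad AC_1 = \tfrac{0.90 - 0.095}{1 - 0.095} \approx 0.89
\end{align*}
```

Observed agreement is high, yet π, κ, and α report near-zero reliability because the skewed marginals inflate their chance estimates, while a, S, and AC1 do not collapse; this is the pattern the abstract attributes to reliance on distribution skew rather than on task difficulty.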

CONCLUSION

The authors call for more empirical studies and especially more controlled experiments to falsify or qualify this study. If the main findings are replicated and the underlying theories supported, new thinking and new indices may be needed. Index designers may need to refrain from assuming intentional and maximum random rating, and instead assume involuntary and reluctant random rating. Accordingly, the new indices may need to rely on task difficulty, rather than distribution skew or rating category, to estimate chance agreement.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f937/9426226/b2ff6c2d561d/12874_2022_1707_Fig1_HTML.jpg
