

Interrater reliability estimators tested against true interrater reliabilities.

Affiliations

Department of Communication, Faculty of Social Sciences, University of Macau, Taipa, Macao.

Publication Information

BMC Med Res Methodol. 2022 Aug 29;22(1):232. doi: 10.1186/s12874-022-01707-5.

Abstract

BACKGROUND

Interrater reliability, also known as intercoder reliability, is defined as true agreement between raters (coders) with chance agreement removed. It is used across many disciplines, including medical and health research, to measure the quality of ratings, coding, diagnoses, or other observations and judgements. While numerous indices of interrater reliability are available, experts disagree on which ones are legitimate or more appropriate. Almost all agree that percent agreement (a), the oldest and simplest index, is also the most flawed, because it fails to estimate and remove chance agreement, which is produced by raters' random rating. The experts, however, disagree on which chance estimators are legitimate or better. They also disagree on which of three factors, rating category, distribution skew, or task difficulty, an index should rely on to estimate chance agreement, and on which factors the known indices in fact rely. The most popular chance-adjusted indices, according to a functionalist view of mathematical statistics, assume that all raters conduct intentional and maximum random rating, whereas typical raters conduct involuntary and reluctant random rating. The mismatches between the assumed and the actual rater behaviors cause the indices to rely on mistaken factors to estimate chance agreement, leading to the numerous paradoxes, abnormalities, and other misbehaviors of the indices identified by prior studies.
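For reference, the chance-adjusted indices discussed here share one structure and differ only in how they estimate chance agreement. The forms below are the standard textbook two-rater, nominal-scale definitions (not reproduced from the article itself); percent agreement a corresponds to setting the chance term to zero, and Perreault and Leigh's I uses a separate square-root form.

```latex
% Shared chance-corrected form: a_o = observed agreement, a_e = estimated chance agreement
\[ R \;=\; \frac{a_o - a_e}{1 - a_e} \]

% Chance estimators for two raters and k nominal categories, where p_{i\cdot} and p_{\cdot i}
% are the two raters' marginal proportions for category i and \bar{p}_i = (p_{i\cdot}+p_{\cdot i})/2
\begin{align*}
a_e^{\kappa} &= \sum_{i=1}^{k} p_{i\cdot}\, p_{\cdot i}
  && \text{Cohen's } \kappa \\
a_e^{\pi}    &= \sum_{i=1}^{k} \bar{p}_i^{\,2}
  && \text{Scott's } \pi \text{ (Krippendorff's } \alpha \text{ adds a small-sample correction)} \\
a_e^{S}      &= \frac{1}{k}
  && \text{Bennett et al.'s } S \\
a_e^{AC_1}   &= \frac{1}{k-1}\sum_{i=1}^{k} \bar{p}_i\,(1-\bar{p}_i)
  && \text{Gwet's } AC_1
\end{align*}
```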

METHODS

We conducted a 4 × 8 × 3 between-subject controlled experiment with 4 subjects per cell. Each subject was a rating session with 100 pairs of ratings by two raters, for a total of 384 rating sessions as the experimental subjects. The experiment tested the seven best-known indices of interrater reliability against the observed reliabilities and chance agreements. The impacts of the three factors, i.e., rating category, distribution skew, and task difficulty, on the indices were tested.
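As a rough illustration of how such a comparison can be set up, the sketch below computes several of the tested indices for one two-rater session of nominal ratings and scores a candidate index against observed reliability using the two criteria the abstract reports: correlation (prediction) and average over- or underestimation in percentage points (approximation). This is a minimal sketch under the standard index definitions, not the authors' code; all function and variable names are illustrative, observed ("true") reliability is treated as an external input supplied by the experimental design, and Krippendorff's α (Scott's π with a small-sample correction for two raters and nominal data) is omitted.

```python
# Minimal sketch, not the authors' code: standard two-rater, nominal-scale index
# formulas plus the two comparison criteria described in the abstract.
from math import sqrt
from statistics import correlation, mean  # correlation() requires Python 3.10+

def chance_corrected(a_o, a_e):
    """Shared form of the chance-adjusted indices: (a_o - a_e) / (1 - a_e)."""
    return (a_o - a_e) / (1 - a_e)

def indices(r1, r2, k):
    """r1, r2: equal-length lists of category labels from two raters; k: number of categories."""
    n = len(r1)
    a_o = sum(x == y for x, y in zip(r1, r2)) / n              # percent agreement a
    cats = sorted(set(r1) | set(r2))
    p1 = {c: r1.count(c) / n for c in cats}                    # rater 1 marginals
    p2 = {c: r2.count(c) / n for c in cats}                    # rater 2 marginals
    pbar = {c: (p1[c] + p2[c]) / 2 for c in cats}              # averaged marginals
    return {
        "a":     a_o,
        "kappa": chance_corrected(a_o, sum(p1[c] * p2[c] for c in cats)),                       # Cohen's kappa
        "pi":    chance_corrected(a_o, sum(pbar[c] ** 2 for c in cats)),                        # Scott's pi
        "S":     chance_corrected(a_o, 1 / k),                                                  # Bennett et al.'s S
        "AC1":   chance_corrected(a_o, sum(pbar[c] * (1 - pbar[c]) for c in cats) / (k - 1)),   # Gwet's AC1
        "I":     sqrt(max(a_o - 1 / k, 0) * k / (k - 1)),                                       # Perreault & Leigh's I
    }

def evaluate(index_values, observed_reliabilities):
    """Score one index across many sessions: prediction (Pearson r) and
    approximation (mean over/underestimation in percentage points)."""
    r = correlation(index_values, observed_reliabilities)
    bias = mean(i - o for i, o in zip(index_values, observed_reliabilities)) * 100
    return r, bias
```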

RESULTS

The most criticized index, percent agreement (a), proved to be the most accurate predictor of reliability, reporting directional r = .84. It was also the third-best approximator, overestimating observed reliability by 13 percentage points on average. The three most acclaimed and most popular indices, Scott's π, Cohen's κ, and Krippendorff's α, underperformed all other indices, reporting directional r = .312 and underestimating reliability by 31.4 to 31.8 points. The newest index, Gwet's AC, emerged as the second-best predictor and the most accurate approximator. Bennett et al.'s S ranked behind AC, and Perreault and Leigh's I ranked fourth for both prediction and approximation. The six chance-adjusted indices' reliance on category and skew, and their failure to rely on difficulty, explain why they often underperformed a, which they were created to outperform. The evidence corroborated the notion that the chance-adjusted indices assume intentional and maximum random rating, while the raters instead exhibited involuntary and reluctant random rating.
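The skew sensitivity behind this ranking can be seen in a standard worked example (numbers chosen for illustration, not taken from this study). With k = 2 categories, suppose two raters code 100 items, agree on 90 items in the majority category, and split the remaining 10 disagreements evenly, so both raters' marginals are .95/.05:

```latex
% Both raters' marginals are 0.95 / 0.05; observed agreement is 0.90.
\begin{align*}
a_o &= 0.90 \\
a_e^{\kappa} = a_e^{\pi} &= 0.95^2 + 0.05^2 = 0.905
  &&\Rightarrow\quad \kappa \approx \pi \approx \tfrac{0.90 - 0.905}{1 - 0.905} \approx -0.05 \\
a_e^{S} &= \tfrac{1}{2}
  &&\Rightarrow\quad S = \tfrac{0.90 - 0.5}{1 - 0.5} = 0.80 \\
a_e^{AC_1} &= 2 \times 0.95 \times 0.05 = 0.095
  &&\Rightarrow\quad AC_1 = \tfrac{0.90 - 0.095}{1 - 0.095} \approx 0.89
\end{align*}
```

Observed agreement is high, yet π, κ, and α report near-zero reliability because the skewed marginals inflate their chance estimates, while a, S, and AC1 do not collapse; this is the pattern the abstract attributes to reliance on distribution skew rather than on task difficulty.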

CONCLUSION

The authors call for more empirical studies and especially more controlled experiments to falsify or qualify this study. If the main findings are replicated and the underlying theories supported, new thinking and new indices may be needed. Index designers may need to refrain from assuming intentional and maximum random rating, and instead assume involuntary and reluctant random rating. Accordingly, the new indices may need to rely on task difficulty, rather than distribution skew or rating category, to estimate chance agreement.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f937/9426226/b2ff6c2d561d/12874_2022_1707_Fig1_HTML.jpg
