Bi Yilin, Jiao Xinshan, Lee Yan-Li, Zhou Tao
CompleX Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China.
School of Computer and Software Engineering, Xihua University, Chengdu 610039, China.
PNAS Nexus. 2024 Nov 6;3(11):pgae498. doi: 10.1093/pnasnexus/pgae498. eCollection 2024 Nov.
Link prediction is a paradigmatic and challenging problem in network science, which aims to predict missing links, future links, and temporal links based on known topology. Along with the increasing number of link prediction algorithms, a critical yet previously ignored risk is that the evaluation metrics for algorithm performance are usually chosen at will. This paper implements extensive experiments on hundreds of real networks and 26 well-known algorithms, revealing significant inconsistency among evaluation metrics, namely different metrics probably produce remarkably different rankings of algorithms. Therefore, we conclude that any single metric cannot comprehensively or credibly evaluate algorithm performance. In terms of information content, we suggest the usage of at least two metrics: one is the area under the receiver operating characteristic curve, and the other is one of the following three candidates, say the area under the precision-recall curve, the area under the precision curve, and the normalized discounted cumulative gain. When the data are imbalanced, say the number of negative samples significantly outweighs the number of positive samples, the area under the generalized Receiver Operating Characteristic curve should also be used. In addition, as we have proved the essential equivalence of threshold-dependent metrics, if in a link prediction task, some specific thresholds are meaningful, we can consider any one threshold-dependent metric with those thresholds. This work completes a missing part in the landscape of link prediction, and provides a starting point toward a well-accepted criterion or standard to select proper evaluation metrics for link prediction.
链路预测是网络科学中一个典型且具有挑战性的问题,其目的是基于已知拓扑结构预测缺失的链路、未来的链路以及时间链路。随着链路预测算法数量的不断增加,一个关键但此前被忽视的风险是,算法性能的评估指标通常是随意选择的。本文对数百个真实网络和26种知名算法进行了广泛的实验,揭示了评估指标之间存在显著的不一致性,即不同的指标可能会产生截然不同的算法排名。因此,我们得出结论,任何单一指标都无法全面或可靠地评估算法性能。在信息内容方面,我们建议至少使用两个指标:一个是接收器操作特征曲线下的面积,另一个是以下三个候选指标之一,即精确率-召回率曲线下的面积、精确率曲线下的面积以及归一化折损累计增益。当数据不平衡时,即负样本数量显著超过正样本数量时,还应使用广义接收器操作特征曲线下的面积。此外,由于我们已经证明了依赖阈值的指标本质上是等价的,如果在链路预测任务中某些特定阈值是有意义的,我们可以考虑任何一个带有这些阈值的依赖阈值的指标。这项工作填补了链路预测领域中缺失的一部分,并为选择链路预测的合适评估指标提供了一个被广泛接受的标准或准则的起点。