Yan Xiaoran, Shalizi Cosma, Jensen Jacob E, Krzakala Florent, Moore Cristopher, Zdeborová Lenka, Zhang Pan, Zhu Yaojia
Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA.
Statistics Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
J Stat Mech. 2014 May;2014(5). doi: 10.1088/1742-5468/2014/05/P05007.
The proliferation of models for networks raises challenging problems of model selection: the data are sparse and globally dependent, and models are typically high-dimensional and have large numbers of latent variables. Together, these issues mean that the usual model-selection criteria do not work properly for networks. We illustrate these challenges, and show one way to resolve them, by considering the key network-analysis problem of dividing a graph into communities or blocks of nodes with homogeneous patterns of links to the rest of the network. The standard tool for undertaking this is the stochastic block model, under which the probability of a link between two nodes is a function solely of the blocks to which they belong. This imposes a homogeneous degree distribution within each block; this can be unrealistic, so degree-corrected block models add a parameter for each node, modulating its overall degree. The choice between ordinary and degree-corrected block models matters because they make very different inferences about communities. We present the first principled and tractable approach to model selection between standard and degree-corrected block models, based on new large-graph asymptotics for the distribution of log-likelihood ratios under the stochastic block model, finding substantial departures from classical results for sparse graphs. We also develop linear-time approximations for log-likelihoods under both the stochastic block model and the degree-corrected model, using belief propagation. Applications to simulated and real networks show excellent agreement with our approximations. Our results thus both solve the practical problem of deciding on degree correction and point to a general approach to model selection in network analysis.
数据稀疏且全局相关,模型通常是高维的且有大量潜在变量。这些问题共同意味着常用的模型选择标准在网络中不能正常工作。我们通过考虑将图划分为具有与网络其余部分链接的同质模式的社区或节点块这一关键网络分析问题,来说明这些挑战,并展示解决这些挑战的一种方法。进行此操作的标准工具是随机块模型,在该模型下,两个节点之间链接的概率仅是它们所属块的函数。这在每个块内强加了一个同质的度分布;这可能不现实,因此度校正块模型为每个节点添加一个参数,以调节其总体度。普通块模型和度校正块模型之间的选择很重要,因为它们对社区的推断非常不同。我们基于随机块模型下对数似然比分布的新的大图渐近性,提出了一种在标准块模型和度校正块模型之间进行模型选择的第一个有原则且易于处理的方法,发现与稀疏图的经典结果有很大偏差。我们还使用信念传播为随机块模型和度校正模型下的对数似然开发了线性时间近似。对模拟网络和真实网络的应用表明与我们的近似结果非常吻合。因此,我们的结果既解决了决定度校正的实际问题,又指出了网络分析中模型选择的一般方法。