Canonical PSO Based K-Means Clustering Approach for Real Datasets.

Author information

Dey Lopamudra, Chakraborty Sanjay

Affiliations

Heritage Institute of Technology, Kolkata, West Bengal 700 107, India.

Institute of Engineering & Management, Kolkata, West Bengal 700 091, India.

Publication information

Int Sch Res Notices. 2014 Nov 12;2014:414013. doi: 10.1155/2014/414013. eCollection 2014.

DOI:10.1155/2014/414013

PMID:27355083

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4897525/

Abstract

Clustering is a technique whose significance and applications span many fields. Because clustering is an unsupervised process in data mining, properly evaluating its results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measurement. Different types of indices are used to solve different types of problems, and the choice of index depends on the kind of data available. This paper first proposes a Canonical PSO based K-means clustering algorithm and analyses some important clustering indices (intercluster, intracluster). It then evaluates the effects of those indices on a real-time air pollution database and on wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and hierarchical clustering algorithms. The paper also describes the nature of the clusters and compares the performance of these clustering algorithms according to the validity assessment. It identifies which of these algorithms is most desirable for producing properly compact clusters on these particular real-life datasets. Overall, it examines the behaviour of these clustering algorithms with respect to the validation indices and presents the evaluation results in mathematical and graphical form.
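
To make the intracluster and intercluster indices mentioned in the abstract concrete, here is a minimal Python sketch. The abstract does not give the formulas, so the definitions below are assumptions taken from one common convention: intracluster distance as the mean squared distance of each point to its assigned centroid, and intercluster distance as the smallest squared distance between any two centroids. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from math import dist  # Euclidean distance, Python 3.8+


def intracluster_distance(points, labels, centroids):
    """Mean squared distance from each point to its assigned centroid.

    Smaller values indicate more compact clusters (assumed definition).
    """
    total = sum(dist(p, centroids[k]) ** 2 for p, k in zip(points, labels))
    return total / len(points)


def intercluster_distance(centroids):
    """Smallest squared distance between any two centroids.

    Larger values indicate better separated clusters (assumed definition).
    """
    return min(
        dist(a, b) ** 2
        for i, a in enumerate(centroids)
        for b in centroids[i + 1:]
    )


# Hypothetical toy data: two well separated clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 1.0)]

intra = intracluster_distance(points, labels, centroids)
inter = intercluster_distance(centroids)
validity = intra / inter  # lower is better under this assumed ratio
```

Under these assumed definitions, a clustering that is both compact and well separated yields a small intra/inter ratio, which is one way the outputs of different algorithms on the same dataset could be compared.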
