A GA-Based Clustering Algorithm for Large Data Sets with Mixed Numerical and Categorical Values
Abstract: Data mining tasks often require cluster analysis of large data sets that contain both numerical and categorical attributes. Most existing clustering algorithms, however, handle only numerical data or only categorical data and cannot analyze data with mixed attributes. This paper presents a new GA-based fuzzy clustering algorithm for such data. It modifies the common clustering cost function, the trace of the within-cluster dispersion matrix, so that numerical and categorical features are combined in a single objective. A genetic algorithm (GA) then optimizes the new cost function; it quickly reaches the global optimum and does not depend on prototype initialization. Experimental results show that the GA-based algorithm is effective for clustering large data sets with mixed numerical and categorical values.
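The abstract does not state the modified cost function, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes a k-prototypes-style cost (squared Euclidean distance to the cluster mean for numerical features, plus a weight `gamma` times the number of mismatches against the cluster mode for categorical features) and uses hard cluster labels rather than the fuzzy memberships the paper describes. The names `mixed_cost`, `ga_cluster`, and `gamma`, and all GA settings, are hypothetical and not taken from the paper.

```python
import random


def mixed_cost(numeric, categorical, labels, k, gamma=1.0):
    """Assumed mixed-attribute cost: numeric dispersion + gamma * categorical mismatches."""
    total = 0.0
    n_num, n_cat = len(numeric[0]), len(categorical[0])
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        # Numeric prototype: per-feature mean of the cluster members.
        mean = [sum(numeric[i][d] for i in members) / len(members)
                for d in range(n_num)]
        # Categorical prototype: per-feature mode of the cluster members.
        mode = []
        for d in range(n_cat):
            values = [categorical[i][d] for i in members]
            mode.append(max(set(values), key=values.count))
        for i in members:
            total += sum((numeric[i][d] - mean[d]) ** 2 for d in range(n_num))
            total += gamma * sum(categorical[i][d] != mode[d] for d in range(n_cat))
    return total


def ga_cluster(numeric, categorical, k, pop_size=30, generations=60,
               mutation_rate=0.05, gamma=1.0, seed=0):
    """Simple GA over label vectors; each chromosome assigns one cluster per point."""
    rng = random.Random(seed)
    n = len(numeric)

    def cost(labels):
        return mixed_cost(numeric, categorical, labels, k, gamma)

    # Random initial population, so no prototype initialization is needed.
    population = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    best = min(population, key=cost)
    for _ in range(generations):
        next_gen = [best[:]]  # elitism: the best chromosome always survives
        while len(next_gen) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            p1 = min(rng.sample(population, 3), key=cost)
            p2 = min(rng.sample(population, 3), key=cost)
            # Uniform crossover followed by per-gene random mutation.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [rng.randrange(k) if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
        best = min(population, key=cost)
    return best, cost(best)


# Toy data: two well-separated groups, one numeric and one categorical feature.
numeric = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
categorical = [["a"], ["a"], ["a"], ["b"], ["b"], ["b"]]
labels, best_cost = ga_cluster(numeric, categorical, k=2)
print(labels, best_cost)
```

Encoding each chromosome as a label vector keeps the sketch short, but it scales poorly to large data sets; encoding cluster prototypes instead would be a natural alternative, and the abstract does not say which encoding the paper uses.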