Clustering Algorithms Used in Data Mining

姜园; 张朝阳; 仇佩亮; 周东方


Publication history
- Received: 2003-12-22
- Revised: 2004-04-26
- Published: 2005-04-19

Clustering Algorithms Used in Data Mining

Abstract

Abstract: Data mining extracts information of interest from Very Large DataBases (VLDB). Clustering is an important tool in data mining: it partitions a database into groups according to the similarity between data items, so that the items within each group are as similar as possible. From a machine learning perspective, clusters correspond to hidden patterns, and the search for clusters is an unsupervised learning process. Dozens of clustering algorithms are now in use across fields such as statistics, pattern recognition, and machine learning. This paper surveys and classifies the clustering algorithms used in data mining, summarizes seven classes of algorithms, and analyzes their performance characteristics.
Keywords: data mining; clustering; hierarchical clustering; partitioning clustering; K-Means
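
To make the idea of grouping data by similarity concrete, the following is a minimal sketch of a textbook K-Means (Lloyd-style) loop, one of the keywords listed for this paper. It is not taken from the paper itself. The function name `kmeans`, the sample `data`, Euclidean distance as the similarity measure, and the choice to seed centroids with the first `k` points are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, max_iter=100):
    """Textbook Lloyd-style K-Means on a list of equal-length numeric tuples.

    Illustrative assumption: the first k points seed the centroids, which
    keeps the run deterministic but is not a recommended initialization.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return labels, centroids


# Hypothetical sample: two well-separated groups of 2-D points.
data = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
labels, centroids = kmeans(data, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

No class labels are supplied to the function, which is what makes the procedure unsupervised: the groups emerge only from the distances between points. The result depends on the initial centroids, so a practical implementation would likely use a more careful seeding strategy than the one assumed here.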


