Robust Fuzzy C-Means Based on Adaptive Relaxation
-
摘要:
噪声是影响聚类结果的最重要的因素之一,现有的模糊聚类算法主要通过对隶属度约束进行松弛的方式来降低噪声样本的影响。这种方式仍然存在两个基本问题需要解决:第一,如何评估一个样本是噪声的可能性;第二,如何在抑制噪声样本影响力的同时,保留正常样本的作用力。针对这两问题,该文提出了基于自适应松弛的鲁棒模糊C均值聚类算法(AR-RFCM)。新模型基于K最近邻的方式(KNN)来估计样本的可靠性,自适应地调整松弛参数,从而实现在降低噪声样本影响力的同时,保留可靠样本的作用力。此外,AR-RFCM利用了C均值聚类模型中隶属度的稀疏性来提高可靠样本的作用力,从而提高数据簇的内聚程度,进而降低噪声样本的影响。实验表明,AR-RFCM不仅在处理噪声样本时具有良好的鲁棒性,同时在25个UCI 数据集实验中,分类正确率(兰德指数)平均高于FCM算法7.7864%。
Abstract:Noise is one of the most important influences for clustering. Existing fuzzy clustering methods try to reduce the impact of noise by relaxing the constraint condition of membership. But there are still two basic problems to be solved. The first is how to evaluate the probability that a sample point is a noise. The second is how to retain the effect of normal points while suppressing the impact of noise. To solve these two problems, Robust Fuzzy C-Means based on Adaptive Relaxation (AR-RFCM) is proposed. The new model estimates the reliability of sample points by the method of the K-Nearest Neighbor (KNN). It adjusts adaptively the relaxation parameters to reduce the impact of noise, and keeps the effect of reliable sample points at the same time. In addition, AR-RFCM utilizes the sparsity of membership in K-means to improve the effect of reliable sample points. Therefore, the compactness of clusters is improved and the impact of noise is suppressed. Experiments demonstrate that AR-RFCM has a good robustness for noise, and also achieves higher rand index in all 25 UCI data sets, even averagely higher than FCM 7.7864%.
-
Key words:
- Noise /
- Clustering /
- Fuzzy C-Means (FCM) /
- Adaptive /
- Relaxation
-
表 1 各算法求得的隶属度值
样本编号 FCM PCM NC REFCM FDCM_SSR AR-RFCM C#1 C#2 C#1 C#2 C#1 C#2 C#1 C#2 C#1 C#2 C#1 C#2 1 0.054 0.946 0.015 0.539 0.004 0.332 0 0 0.095 0.905 0 0.984 2 0.008 0.992 0.018 0.546 0.005 0.332 0 0 0.060 0.940 0 0.985 3 0.040 0.960 0.019 0.999 0 1.000 0 0.992 0.090 0.910 0 1.000 4 0.093 0.907 0.018 0.545 0.005 0.332 0 0 0.137 0.863 0 0.985 5 0.055 0.945 0.024 0.552 0.007 0.331 0 0 0.114 0.886 0 0.985 6 0.500 0.500 0.003 0.003 0.001 0.001 0 0 0.500 0.500 0 0 7 0.500 0.500 0.010 0.010 0.004 0.004 0 0 0.500 0.500 0 0 8 0.945 0.055 0.552 0.024 0.331 0.007 0 0 0.886 0.114 0.985 0 9 0.992 0.008 0.546 0.018 0.332 0.005 0 0 0.941 0.059 0.985 0 10 0.960 0.040 0.999 0.019 1.000 0 0.992 0 0.910 0.090 1.000 0 11 0.907 0.093 0.545 0.018 0.332 0.005 0 0 0.863 0.137 0.985 0 12 0.946 0.054 0.539 0.015 0.332 0.004 0 0 0.905 0.095 0.984 0 表 2 各算法所得的聚类簇中心
簇中心1 簇中心2 FCM $\left[ \begin{aligned} & { {\rm{3} }{\rm{.9870} } } \\ & { {\rm{0} }{\rm{.0011} } } \end{aligned} \right]$ $\left[ \begin{aligned}& { {\rm{ - 3} }{\rm{.9870} } } \\ &\ \ \, { {\rm{0} }{\rm{.0011} } } \end{aligned} \right]$ PCM $\left[ \begin{aligned}& { {\rm{3} }{\rm{.9870} } } \\ & { {\rm{0} }{\rm{.0011} } } \end{aligned} \right]$ $\left[ \begin{aligned}& { {\rm{ - 3} }{\rm{.9870} } } \\ & \ \ \, { {\rm{0} }{\rm{.0011} } } \end{aligned} \right]$ NC $\left[ \begin{aligned}& { {\rm{3} }{\rm{.9996} } } \\ & { {\rm{0} }{\rm{.0002} } } \end{aligned} \right]$ $\left[ \begin{aligned}& { {\rm{ - 3} }{\rm{.9996} } } \\ & \ \ \, { {\rm{0} }{\rm{.0002} } } \end{aligned} \right]$ REFCM $\left[ \begin{aligned} & { {\rm{4} }{\rm{.000} }0} \\ & { {\rm{0} }{\rm{.0000} } } \end{aligned} \right]$ $\left[ \begin{aligned}& { {\rm{ - 4} }{\rm{.000} }0} \\ &\ \ \, { {\rm{0} }{\rm{.0000} } } \end{aligned} \right]$ FDCM_SSR $\left[ \begin{aligned}& { {\rm{3} }{\rm{.4833} } } \\ & { {\rm{1} }{\rm{.6530} } } \end{aligned} \right]$ $\left[ \begin{aligned}& { {\rm{ - 3} }{\rm{.4833} } } \\ &\ \ \, { {\rm{1} }{\rm{.6530} } } \end{aligned} \right]$ AR-RFCM $\left[ \begin{aligned}& { {\rm{3} }{\rm{.9999} } } \\ & { {\rm{0} }{\rm{.0000} } } \end{aligned} \right]$ $\left[ \begin{aligned}& { {\rm{ - 3} }{\rm{.9999} } } \\ &\ \ \, { {\rm{0} }{\rm{.0000} } } \end{aligned} \right]$ 表 3 噪声数据集信息
均值 协方差 簇1 [1 2] $\left[ {\begin{array}{*{20}{c}}2&{0.2}\\{0.2}&2\end{array}} \right]$ 簇2 [–1 –2] $\left[ {\begin{array}{*{20}{c}}2&0\\0&1\end{array}} \right]$ 簇3 [–3 –5] $\left[ {\begin{array}{*{20}{c}}3&0\\0&1\end{array}} \right]$ 噪声 [–3 5] $\left[ {\begin{array}{*{20}{c}}{10}&0\\0&2\end{array}} \right]$ 表 4 噪声数据集实验结果
FCM PCM NC REFCM FDCM_SSR AR-RFCM 准确率 0.8683 0.8683 0.8767 0.8583 0.8667 0.8833 精确度 0.6775 0.6775 0.6900 0.6625 0.6750 0.7000 灵敏度 0.9033 0.9033 0.9200 0.8833 0.9000 0.9333 特异度 0.8567 0.8567 0.8622 0.8500 0.8556 0.8667 表 5 不同聚类算法在UCI数据集的聚类结果的RI(%)
FCM PCM NC REFCM FDCM_SSR AR-RFCM Ecoli 79.66±0.47 78.60±3.28 78.25±1.11 80.12±1.13 79.83±0.45 88.40±1.62 Auto-mpg 76.23±0 75.21±4.02 77.27±0.46 75.64±0 85.62±0.30 77.64±2.09 Dermatology 83.49±1.18 81.17±3.26 69.82±6.61 81.33±5.49 84.88±1.59 92.55±0 Iris 83.68±0 79.33±7.90 82.27±4.23 85.68±0 87.37±0 85.68±0 Zoo 87.91±1.76 87.56±6.15 88.90±5.82 85.76±5.22 89.78±2.33 96.65±0 Transfusion 53.10±0 55.47±5.59 56.30±3.75 53.16±0 63.90±0.09 63.68±0 Parkinsons 52.19±0 58.86±5.86 64.40±5.31 60.27±7.74 63.12±0.57 73.83±0 Banknote 51.52±0 59.55±9.75 61.32±6.89 59.68±12.12 52.44±1.73 66.32±2.79 Creadit-approval 67.57±0.37 55.78±5.82 64.89±0 53.96±6.53 67.51±0 68.06±0 Breast-cancer 90.53±0 78.42±14.25 86.24±16.97 91.59±0 94.31±0 94.86±0 Wine 95.43±0 73.35±4.57 91.85±6.62 90.36±0 95.43±0 96.40±0.73 Automobile 71.83±0.44 70.81±1.78 72.08±1.62 71.79±0.38 71.99±0.29 74.24±0.29 Messidor Features 50.49±0 50.62±0.35 50.63±0.01 50.86±0 50.65±0.53 51.80±0.61 Fertility 50.00±0 65.74±12.93 50.08±0.18 57.99±6.55 79.13±0.71 78.85±1.75 Seeds 89.92±0 78.70±5.28 88.53±0.52 87.06±0 89.92±0 91.26±0.52 Blance 58.82±4.58 58.49±5.70 62.66±4.69 60.55±5.09 60.56±4.07 65.97±3.54 House Votes 77.52±0 71.46±8.61 75.49±6.07 72.41±6.76 77.86±0 78.90±0 Vowel 67.15±2.47 82.68±1.61 53.52±4.44 82.91±0.94 66.81±1.81 85.47±0.20 Glass 71.31±0.45 69.29±1.19 71.58±0.98 71.07±0.86 71.59±0.39 72.45±0.77 Mammographic 67.96±0 64.00±7.05 67.98±0.04 62.35±6.02 68.55±0 68.11±0 Pima Indians Diabetes 59.07±0 56.23±3.27 52.41±0.13 58.09±0 59.06±0.03 59.18±0.29 Qualitative Bankruptcy 94.53±0 74.58±16.72 97.62±0 94.53±0 94.53±0 97.62±0 Seismic Bumps 51.88±0 71.15±12.53 56.19±10.29 87.70±0 87.69±0.10 87.70±0 Phishing Data 68.72±0.29 60.73±5.18 69.13±0.28 62.71±0 68.93±0.39 69.67±0.09 Yeast 65.83±1.61 72.95±1.21 63.44±2.90 72.95±2.06 66.58±1.58 75.71±0.03 -
JAIN A K. Data clustering: 50 years beyond k-means[J]. Pattern Recognition Letters, 2010, 31(8): 651–666. doi: 10.1016/j.patrec.2009.09.011 DENG Zhaohong, JIANG Yizhang, CHUNG Fulai, et al. Transfer prototype-based fuzzy clustering[J]. IEEE Transactions on Fuzzy Systems, 2016, 24(5): 1210–1232. doi: 10.1109/TFUZZ.2015.2505330 张洁玉, 李佐勇. 基于核空间的加权邻域约束直觉模糊聚类算法[J]. 电子与信息学报, 2017, 39(9): 2162–2168. doi: 10.11999/JEIT161317ZHANG Jieyu and LI Zuoyong. Kernel-based algorithm with weighted spatial information intuitionistic fuzzy c-means[J]. Journal of Electronics &Information Technology, 2017, 39(9): 2162–2168. doi: 10.11999/JEIT161317 LV Yinghua, MA Tinghuai, TANG Meili, et al. An efficient and scalable density-based clustering algorithm for datasets with complex structures[J]. Neurocomputing, 2016, 171: 9–22. doi: 10.1016/j.neucom.2015.05.109 QIN Xiaoyu, TING Kaiming, ZHU Ye, et al. Nearest-neighbour-induced isolation similarity and its impact on density-based clustering[C]. The 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA, 2019: 4755–4762. doi: 10.1609/aaai.v33i01.33014755. 赵小强, 刘晓丽. 基于公理化模糊子集的改进谱聚类算法[J]. 电子与信息学报, 2018, 40(8): 1904–1910. doi: 10.11999/IEIT170904ZHAO Xiaoqiang and LIU Xiaoli. An improved spectral clustering algorithm based on axiomatic fuzzy set[J]. Journal of Electronics &Information Technology, 2018, 40(8): 1904–1910. doi: 10.11999/IEIT170904 YIN Hao, BENSON A R, LESKOVEC J, et al. Local higher-order graph clustering[C]. The 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, 2017: 555–564. doi: 10.1145/3097983.3098069. MOSLEHI Z, TAHERI M, MIRZAEI A, et al. Discriminative fuzzy c-means as a large margin unsupervised metric learning algorithm[J]. IEEE Transactions on Fuzzy Systems, 2018, 26(6): 3534–3544. doi: 10.1109/TFUZZ.2018.2836338 刘解放, 王士同, 王骏, 等. 一种具有最优保证特性的贝叶斯可能性聚类方法[J]. 电子与信息学报, 2017, 39(7): 1554–1562. doi: 10.11999/JEIT160908LIU Jiefang, WANG Shitong, WANG Jun, et al. Bayesian possibilistic clustering method with optimality guarantees[J]. Journal of Electronics &Information Technology, 2017, 39(7): 1554–1562. doi: 10.11999/JEIT160908 MACQUEEN J. Some methods for classification and analysis of multivariate observations[C]. The 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, USA, 1965: 281–297. BEZDEK J C, EHRLICH R, and FULL W. FCM: The fuzzy c-means clustering algorithm[J]. Computers & Geosciences, 1984, 10(2/3): 191–203. DE OLIVEIRA J V and PEDRYCZ W. Advances in Fuzzy Clustering and its Applications[M]. Chichester: John Wiley & Sons, Ltd., 2007. doi: 10.1002/9780470061190. DAVE R N. Characterization and detection of noise in clustering[J]. Pattern Recognition, 1991, 12(11): 657–664. doi: 10.1016/0167-8655(91)90002-4 KRISHNAPURAM R and KELLER J M. A possibilistic approach to clustering[J]. IEEE Transactions on Fuzzy Systems, 1993, 1(2): 98–110. doi: 10.1109/91.227387 ZARINBAL M, ZARANDI M H F, and TURKSEN I B. Relative entropy fuzzy c-means clustering[J]. Information Sciences, 2014, 260: 74–97. doi: 10.1016/j.ins.2013.11.004 GU Jing, JIAO Licheng, YANG Shuyuan, et al. Fuzzy double c-means clustering based on sparse self-representation[J]. IEEE Transactions on Fuzzy Systems, 2018, 26(2): 612–626. doi: 10.1109/TFUZZ.2017.2686804