Large Scale Spectral Clustering Based on Fast Landmark Sampling

YE Mao, LIU Wenfen

Citation: YE Mao, LIU Wenfen. Large Scale Spectral Clustering Based on Fast Landmark Sampling[J]. Journal of Electronics & Information Technology, 2017, 39(2): 278-284. doi: 10.11999/JEIT160260

doi: 10.11999/JEIT160260

Funds: The National 973 Program of China (2012CB315905), The National Natural Science Foundation of China (61502527, 61379150)

  • Abstract: To avoid the high computational complexity that limits the application of traditional spectral clustering, landmark-based spectral clustering uses a similarity matrix between a set of landmark points and all data points, which effectively reduces the cost of the spectral embedding. On large data sets, however, the existing strategy of selecting landmarks at random makes the clustering results unstable, while the strategy of using k-means centers suffers from an unknown convergence time and repeated passes over the data. This paper applies approximate Singular Value Decomposition (SVD) to landmark-based spectral clustering and designs a fast landmark-sampling algorithm. The algorithm samples landmarks with probabilities computed from the lengths of the row vectors of the approximate singular vector matrix. Compared with random sampling, it guarantees the stability and accuracy of the clustering results; compared with the k-means-centers strategy, it reduces the computational complexity. The information about the original data preserved by the sampling is analyzed theoretically, and the performance of the algorithm is verified experimentally.
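To make the sampling idea in the abstract concrete, the following is a minimal Python sketch, not the paper's exact algorithm: it approximates the top-k left singular vectors with a randomized range finder, samples landmarks with probability proportional to the squared lengths of the rows of that approximate singular vector matrix, and then performs landmark-based spectral clustering. The function names, the Gaussian bandwidth sigma, the oversampling parameter, and the use of a dense point-to-landmark affinity matrix are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans


def sample_landmarks(X, p, k, oversample=10, seed=0):
    """Pick p landmark rows of X (n x d) with probabilities proportional to
    the squared lengths of the rows of an approximate top-k left singular
    vector matrix obtained via a randomized range finder (illustrative)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Sketch the column space of X with k + oversample random projections.
    Omega = rng.standard_normal((d, k + oversample))
    Q, _ = np.linalg.qr(X @ Omega)                   # orthonormal basis, n x (k+q)
    Ub, _, _ = np.linalg.svd(Q.T @ X, full_matrices=False)
    U_approx = Q @ Ub[:, :k]                         # approximate left singular vectors
    probs = (U_approx ** 2).sum(axis=1)              # row lengths -> sampling weights
    probs /= probs.sum()
    idx = rng.choice(n, size=p, replace=False, p=probs)
    return X[idx]


def landmark_spectral_clustering(X, n_clusters, p=500, sigma=1.0, seed=0):
    """Build an n x p point-to-landmark affinity matrix Z, then run k-means
    on the top left singular vectors of the row-normalized Z."""
    landmarks = sample_landmarks(X, p, n_clusters, seed=seed)
    d2 = cdist(X, landmarks, metric="sqeuclidean")   # squared distances, n x p
    Z = np.exp(-d2 / (2.0 * sigma ** 2))             # Gaussian affinities
    Z /= Z.sum(axis=1, keepdims=True)                # row-normalize
    U, _, _ = np.linalg.svd(Z, full_matrices=False)  # cheap: Z has only p columns
    emb = U[:, :n_clusters]
    emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(emb)
```

Because the expensive eigendecomposition of the full n x n affinity matrix is replaced by the SVD of an n x p matrix, the cost of the spectral embedding grows linearly in n for a fixed number of landmarks p.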
Publication History
  • Received: 2016-03-21
  • Revised: 2016-07-18
  • Published: 2017-02-19
