Knowledge Clustering and Statistics Based on MapReduce

XU Xiaolong; LI Yongping

doi: 10.11999/JEIT150247 cstr: 32379.14.JEIT150247

Funds:

The National Natural Science Foundation of China (61202004, 61472192), The Special Fund for Fast Sharing of Science Paper in Net Era by CSTD (2013116), The Natural Science Fund of Higher Education of Jiangsu Province (14KJB520014)

Publication history
- Received: 2015-02-12
- Revised: 2015-10-08
- Published: 2016-01-19

Abstract: The sheer volume of resources in online literature knowledge bases, together with their coarse classification granularity, easily leads to disorientation and knowledge overload when learners retrieve and read papers. This paper proposes a knowledge clustering and statistics mechanism based on MapReduce. First, a MapReduce-based co-occurrence matrix building algorithm (MR-CoMatrix) is presented. Second, the co-occurrence matrix is combined with a similarity coefficient to build a similarity matrix. Third, the similarity matrix is standardized with Z scores. Finally, the similarity matrix is clustered with Ward's method (minimum sum of squared deviations), producing a tree-shaped knowledge clustering dendrogram. Based on the clustering result, a MapReduce-based knowledge statistics algorithm (MR-Statistics) is introduced to compute statistics on the knowledge attributes of each cluster. Experimental results show that applying MR-CoMatrix and MR-Statistics to an online literature knowledge base achieves satisfactory clustering accuracy and computational efficiency, realizing fine-grained knowledge clustering and multi-dimensional statistics while reducing time overhead.

Keywords:
- Data mining
- Clustering
- Knowledge
- Co-occurrence matrix
- Statistics
- MapReduce
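
The first three steps of the pipeline summarized in the abstract can be illustrated with a small single-process sketch. It is not the authors' MR-CoMatrix implementation: the mapper and reducer are plain Python functions that imitate the MapReduce pattern, the keyword lists in `docs` are invented, the Ochiai coefficient is only an assumed choice because the abstract does not name the similarity coefficient, and row-wise standardization with the population standard deviation is likewise an assumption.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def map_phase(doc_keywords):
    """Mapper: emit ((a, b), 1) for every unordered keyword pair of one document.

    ((a, a), 1) is also emitted so the diagonal holds each keyword's document count.
    """
    keywords = sorted(set(doc_keywords))
    for k in keywords:
        yield (k, k), 1
    for a, b in combinations(keywords, 2):
        yield (a, b), 1


def reduce_phase(pairs):
    """Reducer: sum all values that share the same key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def cooccurrence(docs):
    """Run the map phase over all documents, then reduce to co-occurrence counts."""
    emitted = (kv for doc in docs for kv in map_phase(doc))
    return reduce_phase(emitted)


def ochiai_similarity(counts):
    """Similarity s(a, b) = c(a, b) / sqrt(c(a, a) * c(b, b)) (assumed coefficient)."""
    terms = sorted({k for pair in counts for k in pair})
    sim = {}
    for a in terms:
        for b in terms:
            key = (a, b) if a <= b else (b, a)
            sim[(a, b)] = counts.get(key, 0) / sqrt(counts[(a, a)] * counts[(b, b)])
    return terms, sim


def zscore_rows(terms, sim):
    """Standardize each row: z = (x - row mean) / row std; constant rows map to 0."""
    z = {}
    for a in terms:
        row = [sim[(a, b)] for b in terms]
        mu, sigma = mean(row), pstdev(row)
        for b, x in zip(terms, row):
            z[(a, b)] = (x - mu) / sigma if sigma else 0.0
    return z


# Hypothetical keyword lists, one per paper.
docs = [
    ["clustering", "mapreduce"],
    ["clustering", "mapreduce", "hadoop"],
    ["hadoop", "hdfs"],
]
counts = cooccurrence(docs)
terms, sim = ochiai_similarity(counts)
z = zscore_rows(terms, sim)
```

The rows of the standardized matrix could then be passed to a hierarchical clustering routine that supports Ward's criterion, for example `scipy.cluster.hierarchy.linkage` with `method="ward"`, to obtain the dendrogram. On a real cluster the mapper and reducer would run as distributed tasks rather than in-process generators.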
