An Automatic Chinese Text Classification System Based on a Rough Set Reduction Algorithm
Automatic Classification of Chinese Documents Based on Rough Set and Improved Quick-Reduce Algorithm
Abstract: Most existing automatic text classification (TC) methods rely on constructing document vectors whose components correspond to the feature terms in a document. Such vectors typically have thousands or even tens of thousands of dimensions, which makes computation expensive, so the vectors need to be reduced. Traditional frequency-based threshold filtering, however, often discards useful information and lowers classification accuracy. This paper introduces rough set theory into automatic classification and proposes a new document vector reduction algorithm. Experiments show that the algorithm not only reduces the size of the document vectors effectively, but also loses less information and achieves higher accuracy than the traditional threshold method.
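To make the idea of rough-set attribute reduction concrete, the following is a minimal sketch of the standard greedy QuickReduct heuristic from the rough-set literature. It is not the improved variant proposed in this paper, whose details are not given in the abstract. The function names `dependency` and `quick_reduct`, the binary term-presence encoding, and the toy documents and labels are all assumptions made for illustration.

```python
from collections import defaultdict


def dependency(rows, labels, attrs):
    """Rough-set dependency degree of the labels on `attrs`.

    Rows with identical values on `attrs` form an equivalence class.
    The result is the fraction of rows that fall in a class whose
    members all share one label (the positive region).
    """
    seen_labels = defaultdict(set)
    sizes = defaultdict(int)
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        seen_labels[key].add(label)
        sizes[key] += 1
    positive = sum(sizes[k] for k, labs in seen_labels.items() if len(labs) == 1)
    return positive / len(rows)


def quick_reduct(rows, labels):
    """Greedily add the attribute that raises the dependency degree most,
    stopping once the subset is as discriminating as the full attribute set."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    reduct = []
    while dependency(rows, labels, reduct) < target:
        candidates = [a for a in all_attrs if a not in reduct]
        # One attribute is always added, so the loop ends after at most
        # len(all_attrs) iterations.
        best = max(candidates, key=lambda a: dependency(rows, labels, reduct + [a]))
        reduct.append(best)
    return reduct


# Hypothetical toy data: each row marks the presence (1) or absence (0)
# of four terms in a document; labels are made-up categories.
rows = [
    (1, 0, 1, 0),
    (1, 1, 0, 0),
    (0, 1, 1, 0),
    (0, 0, 1, 1),
]
labels = ["sports", "sports", "finance", "finance"]

selected = quick_reduct(rows, labels)
print(selected)
```

In this toy table the first term alone separates the two categories, so the sketch keeps one dimension out of four. A frequency threshold, by contrast, ranks terms by how often they occur rather than by how well they distinguish the classes.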