Feature Selection Based on Class Distribution Difference and VPRS for Text Classification

Wu Di, Zhang Ya-ping, Yin Fu-liang, Li Ming

Citation: Wu Di, Zhang Ya-ping, Yin Fu-liang, Li Ming. Feature Selection Based on Class Distribution Difference and VPRS for Text Classification[J]. Journal of Electronics & Information Technology, 2007, 29(12): 2880-2884. doi: 10.3724/SP.J.1146.2006.02073

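  • Abstract: Term weighting and feature dimensionality reduction are two important steps that affect the accuracy and efficiency of text classification. This paper first filters features according to the difference in each feature term's distribution across classes. It then analyzes the shortcomings of the traditional TF-IDF weighting formula and adopts an improved weighting formula, abbreviated TF-CDF, which is used to compute the weight of every feature term and to generate the vector space model (VSM) of the document set. Next, a feature selection method based on variable precision rough set (VPRS) theory is proposed to further select the features that contribute most to classification, and it is implemented in SQL. Finally, experiments are carried out with the LibSVM support vector machine classifier. The results show that the feature filtering and selection methods and the TF-CDF weighting formula help improve both classification accuracy and classification efficiency.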
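The abstract names VPRS-based feature selection but this page does not give the paper's algorithm or its SQL implementation. As a rough illustration only, the sketch below computes a β-dependency degree in the commonly used majority-inclusion formulation of variable precision rough sets: objects are grouped by their values on a chosen set of condition attributes, and a group counts toward the β-positive region when at least a fraction β of its members share one decision class. The function name `beta_dependency`, the toy table `rows`, and the threshold value 0.8 are assumptions made for this example, not taken from the paper.

```python
from collections import Counter, defaultdict


def beta_dependency(rows, condition_idx, decision_idx, beta=0.8):
    """Share of objects lying in the beta-positive region (illustrative sketch).

    rows          -- sequence of tuples, one per object (here: one per document)
    condition_idx -- column indices of the condition attributes (candidate features)
    decision_idx  -- column index of the decision attribute (class label)
    beta          -- required majority share within a group, 0.5 < beta <= 1
    """
    if not 0.5 < beta <= 1.0:
        raise ValueError("beta must lie in (0.5, 1]")

    # Group objects that are indiscernible on the chosen condition attributes.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in condition_idx)
        groups[key].append(row[decision_idx])

    # A group is accepted when its most frequent label reaches the beta share.
    positive = 0
    for labels in groups.values():
        majority = Counter(labels).most_common(1)[0][1]
        if majority / len(labels) >= beta:
            positive += len(labels)

    return positive / len(rows)


# Hypothetical toy table: (term_a present, term_b present, class label).
rows = [
    (1, 0, "sports"), (1, 0, "sports"), (1, 0, "sports"), (1, 0, "sports"),
    (1, 0, "finance"),
    (0, 1, "finance"), (0, 1, "finance"), (0, 1, "finance"),
    (0, 0, "finance"), (0, 0, "sports"),
]

both_terms = beta_dependency(rows, [0, 1], 2, beta=0.8)
strict = beta_dependency(rows, [0, 1], 2, beta=1.0)
term_a_only = beta_dependency(rows, [0], 2, beta=0.8)
```

Setting β = 1 recovers the classical rough-set dependency degree, which tolerates no mislabeled object inside a group. Lower values of β accept groups with some label noise, which is the usual motivation for applying VPRS to text data. A feature selection procedure could compare such scores across candidate attribute subsets, though how the paper ranks or searches subsets is not stated here.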
Publication history
  • Received:  2006-12-28
  • Revised:  2007-07-01
  • Published:  2007-12-19
