Volume 29 Issue 12
Wu Di, Zhang Ya-ping, Yin Fu-liang, Li Ming. Feature Selection Based on Class Distribution Difference and VPRS for Text Classification[J]. Journal of Electronics & Information Technology, 2007, 29(12): 2880-2884. doi: 10.3724/SP.J.1146.2006.02073

Feature Selection Based on Class Distribution Difference and VPRS for Text Classification

doi: 10.3724/SP.J.1146.2006.02073
  • Received Date: 2006-12-28
  • Rev Recd Date: 2007-07-01
  • Publish Date: 2007-12-19
  • Weight calculation and feature reduction are key preprocessing steps in text classification. First, features that are useless for classifying texts are filtered out according to the difference in each feature's document frequency distribution across categories. Then, to overcome the limitations of the TF-IDF weighting formula, a novel weighting formula called TF-CDF is presented. The weight of each feature is calculated with TF-CDF, and a Vector Space Model (VSM) is built for the entire corpus. To select significant features, a feature selection approach based on the Variable Precision Rough Set (VPRS) model is also proposed and implemented with SQL statements, combining the definitions of VPRS with the advantages of SQL. Finally, experiments with different weighting formulas and feature selection methods are conducted using libSVM as the text classifier. The experimental results show that the proposed feature filtering, weighting formula, and feature selection method improve the performance of text classification.
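The VPRS-based selection step mentioned in the abstract rests on the β-positive region: an equivalence class induced by a set of condition attributes is counted as positive when at least a fraction β of its objects share one decision class. The following is a minimal Python sketch of that standard definition (majority-inclusion form, with β in (0.5, 1]), not the authors' SQL implementation. The function name `vprs_dependency`, the toy rows, and the assumption that features are already discretized are all illustrative.

```python
from collections import Counter, defaultdict


def vprs_dependency(rows, labels, attrs, beta=0.8):
    """Return |beta-positive region| / |U| for the attribute subset `attrs`.

    rows   -- list of tuples of discretized condition-attribute values
    labels -- list of decision classes (text categories), one per row
    attrs  -- indices of the condition attributes to evaluate
    beta   -- inclusion threshold, assumed here to lie in (0.5, 1]
    """
    if not 0.5 < beta <= 1.0:
        raise ValueError("beta must be in (0.5, 1]")
    if not rows:
        raise ValueError("rows must not be empty")

    # Group objects into equivalence classes: same values on every attribute in attrs.
    blocks = defaultdict(list)
    for row, label in zip(rows, labels):
        blocks[tuple(row[a] for a in attrs)].append(label)

    # A block joins the positive region if its majority class covers at least beta of it.
    positive = 0
    for block_labels in blocks.values():
        majority_count = Counter(block_labels).most_common(1)[0][1]
        if majority_count / len(block_labels) >= beta:
            positive += len(block_labels)
    return positive / len(rows)


# Hypothetical toy corpus: two binary features, two categories.
rows = [(1, 0), (1, 0), (1, 0), (1, 0), (0, 1), (0, 1), (0, 0), (0, 0)]
labels = ["a", "a", "a", "b", "b", "b", "a", "b"]

print(vprs_dependency(rows, labels, attrs=(0, 1), beta=0.7))  # 0.75
print(vprs_dependency(rows, labels, attrs=(0, 1), beta=0.8))  # 0.25
```

A feature subset can then be compared against the full feature set by its dependency value at a chosen β. Lowering β admits blocks with some label noise, which is the property that distinguishes VPRS from the classical rough set model (β = 1).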