A Context Tree Kernel Based on Latent Semantic Topic

Xu Chao; Zhou Yi-min; Shen Lei

doi:10.3724/SP.J.1146.2009.01493

Publication history
- Received: 2009-11-20
- Revised: 2010-03-09
- Published: 2010-11-19

Abstract

Abstract: Context tree kernels lack semantic information when they are used for text representation. To address this problem, this paper proposes a method for constructing a context tree kernel over latent topics. First, Latent Dirichlet Allocation (LDA) maps the words of a text to a latent topic space. Then, context tree models are built with latent topics as the basic unit. Finally, the context tree kernel is constructed from the mutual information between the models. The method defines the generative model of a document in terms of the semantic classes of words rather than the words themselves, which addresses the sparsity of statistical data encountered in word-based text modeling. Clustering experiments on text data sets show that the proposed context tree kernel is a better measure of topic similarity between documents and improves text clustering performance.

Keywords:
- Text clustering
- Context tree kernel
- Statistical language models
- Latent Dirichlet Allocation (LDA)
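
The three steps in the abstract can be illustrated with a much simplified sketch. It is not the paper's method: the word-to-topic table `WORD_TOPIC` is a hypothetical stand-in for a trained LDA model, fixed-length contexts replace a full context tree model, and a Jensen-Shannon-based similarity score is swapped in for the mutual information kernel, whose exact form the abstract does not give. All function names are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical word-to-topic assignment; in the paper this mapping comes from LDA.
WORD_TOPIC = {
    "goal": 0, "match": 0, "team": 0,
    "stock": 1, "market": 1, "bank": 1,
}

def to_topics(tokens, word_topic):
    """Replace each known word by its topic id; unknown words are dropped."""
    return [word_topic[w] for w in tokens if w in word_topic]

def context_counts(topics, depth=1):
    """Count (context, next_topic) pairs for contexts of fixed length `depth`."""
    counts = Counter()
    for i in range(depth, len(topics)):
        counts[(tuple(topics[i - depth:i]), topics[i])] += 1
    return counts

def normalize(counts):
    """Turn raw counts into a probability distribution over (context, topic) pairs."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def js_similarity(p, q):
    """Return 1 minus the base-2 Jensen-Shannon divergence; 0.0 if either side is empty."""
    if not p or not q:
        return 0.0
    div = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            div += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            div += 0.5 * qk * math.log2(qk / mk)
    return 1.0 - div

def topic_similarity(doc_a, doc_b, word_topic=WORD_TOPIC, depth=1):
    """Compare two tokenized documents through their topic-level context statistics."""
    pa = normalize(context_counts(to_topics(doc_a, word_topic), depth))
    pb = normalize(context_counts(to_topics(doc_b, word_topic), depth))
    return js_similarity(pa, pb)
```

Because the statistics are collected over topics rather than words, two documents that use different words from the same topic still receive a high score, which is the sparsity argument made in the abstract. For instance, `topic_similarity("team goal match goal".split(), "match team goal team".split())` compares two documents that differ in wording but map entirely to topic 0.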
