信息检索中的聚类分析技术
The Clustering Analysis Technology for Information Retrieval
-
摘要: 信息检索/搜索引擎技术的快速发展使得信息的查全率有较大提高,而查准率以及人们获取信息的效率改善却不明显。文本聚类和多文档关键词的自动生成技术将有助于解决这一问题。其基本思想是对检索到的部分文档进行聚类处理,并对每类文档自动生成关键词,从而帮助用户判断各个类别的文档和检索需求是否相关。该文提出文档相关度和类别相关度的概念,并利用词频信息以及知网(HOWNET)中词的概念计算模型计算类别相关度,将其作为聚类合并的依据。信息获取的仿真实验表明文档检索效率有较大提高。Abstract: The rapid development of Information Retrieval(IR) and search engine improves recall rate greatly, whereas the enhancement on both precision rate and information retrieval efficiency is not clear. The research on document clustering and multi-document keyword extraction will help solve this problem. The basic idea is to cluster part of the documents returned by search engine, and automatically extract some keywords for each cluster. Thus user can judge whether the documents in each cluster are relevant to his need. In this paper the concept of document relevancy and cluster relevancy are proposed, and both word frequency and the concept relevancy model of HOWNET are used to compute cluster relevancy, which is used to guide the merging process of clusters. The experimental results show that the IR efficiency has improved greatly.
计量
- 文章访问数: 2161
- HTML全文浏览量: 135
- PDF下载量: 810
- 被引次数: 0