An EMD-Based Metric for Document Semantic Similarity
doi: 10.3724/SP.J.1146.2007.00177
Abstract: Existing measures of document semantic similarity based on the Earth Mover's Distance (EMD) do not satisfy the metric axioms, which makes them hard to apply widely in information retrieval and data mining. To address this, the paper proposes Mdss_EMD (Metric for Document Semantic Similarity based on EMD), a new EMD-based metric for document semantic similarity. First, from an analysis of the shortcomings of EMD and its existing modifications, the concepts of document width and virtual term are introduced. Next, virtual terms are added to the document vectors to align their total weights, so that all metric axioms are satisfied. Finally, to improve the adaptability and processing speed of the metric, the similarity distance of the virtual term is designed to be elastic and the EMD algorithm is simplified. The approach extends EMD to metric spaces and substantially improves its indexing capability and accuracy. Preliminary experiments indicate that Mdss_EMD outperforms the original EMD and other similar existing methods overall.
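The padding idea in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not define document width, the elastic virtual-term distance, or the simplified EMD procedure, so everything below is an assumption made for illustration. Term weights are assumed to be small integer counts, `width` is assumed to be a common total weight chosen by the caller, the names `VIRTUAL`, `pad_to_width`, `padded_emd`, `toy_dist` and `virtual_dist` are hypothetical, and the optimal transport is found by brute force over one-to-one matchings, which only works for very small documents.

```python
from itertools import permutations
from typing import Callable, Dict, List

VIRTUAL = "<virtual>"  # hypothetical placeholder name for the padding term


def pad_to_width(doc: Dict[str, int], width: int) -> List[str]:
    """Expand integer term counts into unit tokens, then pad with virtual terms."""
    tokens = [term for term, count in doc.items() for _ in range(count)]
    if len(tokens) > width:
        raise ValueError("document weight exceeds the chosen width")
    return tokens + [VIRTUAL] * (width - len(tokens))


def padded_emd(
    a: Dict[str, int],
    b: Dict[str, int],
    width: int,
    dist: Callable[[str, str], float],
    virtual_dist: float,
) -> float:
    """Average cost of the cheapest one-to-one matching between padded documents.

    With equal total weight on both sides and unit-weight tokens, the
    transportation problem behind EMD reduces to an assignment problem,
    solved here by brute force (assumption: tiny inputs only).
    """

    def ground(s: str, t: str) -> float:
        if s == t:  # identical terms, including virtual-to-virtual
            return 0.0
        if VIRTUAL in (s, t):  # real term matched to the virtual term
            return virtual_dist
        return dist(s, t)

    xs, ys = pad_to_width(a, width), pad_to_width(b, width)
    best = min(
        sum(ground(s, t) for s, t in zip(xs, perm)) for perm in permutations(ys)
    )
    return best / width


# Toy term distances, invented for this example only.
TOY = {
    frozenset(("cat", "dog")): 0.4,
    frozenset(("cat", "car")): 0.9,
    frozenset(("dog", "car")): 1.0,
}


def toy_dist(s: str, t: str) -> float:
    return 0.0 if s == t else TOY[frozenset((s, t))]


doc_a = {"cat": 2, "dog": 1}
doc_b = {"cat": 1, "car": 1}
print(padded_emd(doc_a, doc_b, width=4, dist=toy_dist, virtual_dist=0.5))
```

In this sketch the padded ground distance stays a metric only if `virtual_dist` is at least half of the largest term-to-term distance, since any two terms must satisfy the triangle inequality through the virtual term. A practical implementation would replace the brute-force matching with a transportation or min-cost-flow solver.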