一种基于N-gram模型和机器学习的汉语分词算法

吴应良; 韦岗; 李海洲

一种基于N-gram模型和机器学习的汉语分词算法

吴应良^①;韦岗,
韦岗,
李海洲

计量
- 文章访问数: 4400
- HTML全文浏览量: 242
- PDF下载量: 1297
- 被引次数: 0
出版历程
- 收稿日期: 1999-09-29
- 修回日期: 2000-04-06
- 刊出日期: 2001-11-19

A WORD SEGMENTATION ALGORITHM FOR CHINESE LANGUAGE BASED ON N-GRAM MODELS AND MACHINE LEARNING

摘要

摘要: 汉语的自动分词,是计算机中文信息处理领域中一个基础而困难的课题。该文提出了一种将汉语文本句子切分成词的新方法,这种方法以N-gram模型为基础,并结合有效的Viterbi搜索算法来实现汉语句子的切词。由于采用了基于机器学习的自组词算法,无需人工编制领域词典。该文还讨论了评价分词算法的两个定量指标,即查准率和查全率的定义,在此基础上,用封闭语料库和开放语料库对该文提出的汉语分词模型进行了实验测试,表明该模型和算法具有较高的查准率和查全率。
- 汉语分词; N-gram模型; 机器学习; 查准率; 查全率
Abstract: Automatic word segmentation for the Chinese language is a fundamental and difficult problem in the field of computer Chinese language information processing. This paper presents a new method for segmenting the input Chinese language text sentence into words, which consists of a character-based N-gram model and an efficient Viterbi search algorithm. In addition, two performance evaluation ration targets, i.e. Recall and Precision for word segmentation algorithm are discussed, The effectiveness has been confirmed by evaluation experiments using the closed texts and open texts corpus.

HTML全文

参考文献(1)

梁南元,汉语计算机自动分词知识,中文信息学报,1989,4(2),29-33.[2]王德春,应用语言学概论,上海,上海外语教育出版社,1997年12月第1版,88-120.[3]E. Charniak, C. Hendrickson, N. Jacoboson, M. Perkowitz, Equations for part-of speech tagging,AAAI-93, 1993, 784 789.[4]K. Church, A stochastic parts program and noun phrase parser for unrestricted text, ANLP-88,1998, 136-143.[5]S. Sakai, Morphological category bigram: A single language model for both spoken language and text, ISSD-93, 1993, 97-90.[6]M. Yamamoto, A re-estimation method for stochastic language modeling from ambigous obser-vations, in Proceeding of WVLC-96, California, 1996, 155-167.[7]赵以宝, 孙圣和, 一种基于单字统计二元文法的自组词音字转换算法,电子学报, 1998, 26(10), 55-58.[8]F. Jelinek, Self-Organized Language Modeling for Speech Recognition, IBM Research Report,IBM T, J. Watson Research Center, 1985. Reprinted in Reading in Speech Recognition, Waibel,A., and Lee, K-F. (Eds.), Morgan Kaufann Publishers, 1990, 450-506.[9]S.M. Katz, Estimation of probailities from sparse data for the language model component ofspeech recognizer, IEEE Trans. on Acousttics, Speech, and Signal Processing, 1987, ASSP-35(3),400-401.[10]R. Rosenfeld, The CMU statistical language modeling toolkit and its use in the 1994 ARPA CSR evaluation, In the Proc. of ARPA Spoken Language Systems Technology Workshop, Washington, 1995, 47-50.

施引文献

资源附件(0)

访问统计