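This post is a quick introduction to language models (LMs).
Language models were first applied in speech recognition and were later extended to other areas such as OCR, handwriting recognition, statistical machine translation, spelling correction, and information retrieval.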
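The basics of language modeling covered here are:
(1) The definition of an LM.
(2) N-grams as the main tool for LMs. Everything below refers to n-gram models.
(3) The chain rule for LMs.
(4) MLE (Maximum Likelihood Estimation) for LMs.
(5) LM evaluation (cross-entropy, perplexity).
(6) The smoothing methods proposed to address data sparsity in LMs.
(7) Categories of smoothing methods
- Back-off models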
• Katz smoothing
• Kneser-Ney smoothing
- Linear interpolation models
• Additive smoothing
• Absolute discounting
• Jelinek-Mercer smoothing
• Witten-Bell smoothing
• Interpolated Kneser-Ney smoothing
(8) Other smoothing methods
- Church-Gale Smoothing
- Bayesian Smoothing
(9) Good-Turing estimation
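
Items (3) through (6) above can be illustrated with a small sketch. The chain rule factorizes a sentence probability into a product of conditional word probabilities, and a bigram model approximates each condition by the previous word only. The code below is a minimal sketch under stated assumptions, not a reference implementation: the function names, the `<s>` and `</s>` boundary markers, and the toy corpus are all hypothetical, and additive smoothing is used only because it is the simplest method in the list. It estimates bigram probabilities by MLE, optionally adds a constant `delta` to every count, and reports perplexity as 2 raised to the cross-entropy.

```python
import math
from collections import Counter

# Hypothetical sentence-boundary markers (an assumption of this sketch).
BOS, EOS = "<s>", "</s>"


def train_bigram(sentences):
    """Count histories and bigrams over a list of tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for tokens in sentences:
        padded = [BOS] + list(tokens) + [EOS]
        vocab.update(padded[1:])  # BOS is never predicted, so it is excluded
        for prev, word in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts, vocab


def bigram_prob(prev, word, history_counts, bigram_counts, vocab, delta=0.0):
    """P(word | prev). delta=0 gives the MLE; delta>0 gives additive smoothing."""
    numerator = bigram_counts[(prev, word)] + delta
    denominator = history_counts[prev] + delta * len(vocab)
    return numerator / denominator if denominator > 0 else 0.0


def perplexity(sentences, model, delta=0.0):
    """Perplexity = 2 ** cross-entropy, with cross-entropy in bits per event."""
    history_counts, bigram_counts, vocab = model
    log_prob = 0.0
    n_events = 0
    for tokens in sentences:
        padded = [BOS] + list(tokens) + [EOS]
        # Chain rule with a one-word history: sum the log conditional probabilities.
        for prev, word in zip(padded, padded[1:]):
            p = bigram_prob(prev, word, history_counts, bigram_counts, vocab, delta)
            if p == 0.0:
                return math.inf  # an unseen bigram makes the MLE model assign zero
            log_prob += math.log2(p)
            n_events += 1
    cross_entropy = -log_prob / n_events
    return 2.0 ** cross_entropy


if __name__ == "__main__":
    model = train_bigram([["a", "b"], ["a", "c"]])
    print(perplexity([["b", "a"]], model))             # MLE: infinite on unseen bigrams
    print(perplexity([["b", "a"]], model, delta=1.0))  # add-one smoothing: finite
```

The infinite perplexity of the unsmoothed model on a sentence with unseen bigrams is the data sparsity problem that the smoothing methods in (6) to (9) address.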
References:
- Stanley F. Chen and Joshua Goodman (1998). An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Computer Science Group, Harvard University. (Recommended)
- Chengxiang Zhai and John Lafferty (2001). A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 334-342.
- Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing [Chapter 6, Statistical Inference: n-gram Models over Sparse Data]. The MIT Press.
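- Berlin Chen (2003). Statistical Language Modeling for Speech Recognition. PPT slides.
- 统计语言建模中的平滑技术 (Smoothing Techniques in Statistical Language Modeling). Slides from the LCC group, Software Laboratory, Institute of Computing Technology, Chinese Academy of Sciences. (Recommended)
- Liu Ting. Language Models [online lecture notes, http://ir.hit.edu.cn/download/NLP_3.pdf]
- http://dingo.sbs.arizona.edu/~sandiway/ling538/ (contains many computational linguistics lecture notes)
- Statistical Methods in Computational Linguistics (contains many computational linguistics lecture notes).
- The State of the Art in Language Modeling. PowerPoint slides.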
(Note: I recommend starting with the slide deck 统计语言建模中的平滑技术. It takes the paper An Empirical Study of Smoothing Techniques for Language Modeling as its main thread and also covers material from many other papers. Reading the slides and the paper side by side saves a lot of time. If a specific point is still unclear, consult the other papers or slides.)
LM software tools