1. Basic Text Processing
    1. Regular Expressions
    2. Word Tokenization
    3. Word Normalization and Stemming
    4. Sentence Segmentation and Decision Trees
2. Minimum Edit Distance
    1. Definition of Minimum Edit Distance
    2. Computing Minimum Edit Distance
    3. Backtrace for Computing Alignments
    4. Weighted Minimum Edit Distance
    5. Minimum Edit Distance in Computational Biology
3. Language Modeling
    1. Introduction to N-grams
    2. Estimating N-gram Probabilities
    3. Evaluation and Perplexity
    4. Generalization and Zeros

Smoothing: Add-one (Laplace) smoothing

Add-1, Add-k, Unigram prior smoothing

(Figure: 3-68)

Interpolation, Backoff, and Web-Scale LMs

Summary: Add-1 works for text classification but is not suitable for language models; interpolation is the more commonly used method; web-scale N-grams use simple backoff, trigram -> bigram -> unigram.

Advanced: Good Turing Sm