MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Parts 1, 2, and 3)

MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Part 1)


Natural Language Processing: Probabilistic Language Modeling
Author: Regina Barzilay (MIT, EECS Department, November 15, 2004)
Translator: 52nlp (www.52nlp.cn, January 16, 2009)

Last time:
  Corpora processing
  Zipf's law
  Data sparseness
Today:
  Probabilistic language modeling

1. Introduction
a) Predicting String Probabilities
 i. Which string is more likely? (Which string is more grammatical?)
  1. Grill doctoral candidates.
  2. Grill doctoral updates.
  (example from Lee 1997)
 ii. Methods for assigning probabilities to strings are called language models.
b) Motivation
 i. Speech recognition, spelling correction, optical character recognition and other applications
 ii. Let E be the physical evidence; we need to determine whether the string W is the message encoded by E
 iii. Use Bayes' rule:
     P(W \mid E) = \frac{P_{LM}(W)\, P(E \mid W)}{P(E)}
 where P_{LM}(W) is the language model probability
 iv. P_{LM}(W) provides the information necessary for disambiguation, especially when the physical evidence alone is not sufficient (see the sketch below).
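
A minimal sketch of how the language model enters this comparison. The probability table and the constant channel score are invented placeholders; the only point is that P(E) is the same for every candidate W and can be dropped when ranking them.

# Hedged sketch: noisy-channel ranking with a language model (toy numbers).

def p_lm(w):
    """Toy language model probability P_LM(W); values are invented."""
    table = {
        "Grill doctoral candidates.": 1e-8,
        "Grill doctoral updates.": 1e-11,
    }
    return table.get(w, 1e-12)

def p_evidence_given_w(evidence, w):
    """Channel model P(E|W); assumed to come from the recognizer or OCR engine."""
    return 0.5  # placeholder: the evidence fits both candidates equally well

def best_hypothesis(evidence, candidates):
    # P(W|E) is proportional to P_LM(W) * P(E|W); P(E) is constant across
    # candidates, so it can be ignored when comparing them.
    return max(candidates, key=lambda w: p_lm(w) * p_evidence_given_w(evidence, w))

print(best_hypothesis("signal", ["Grill doctoral candidates.", "Grill doctoral updates."]))
# -> "Grill doctoral candidates."
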
c) How to Compute It?
 i. Naive approach:
  1. Use the maximum likelihood estimate (MLE): the number of times the string occurs in the corpus S, normalized by the corpus size:
   P_{MLE}(\text{Grill doctorate candidates}) = \frac{count(\text{Grill doctorate candidates})}{|S|}
  2. For unseen events, P_{MLE} = 0
  - dreadful behavior in the presence of data sparseness (see the sketch below).
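
A minimal sketch of the whole-string MLE and its failure on unseen strings; the four-sentence "corpus" is invented for illustration.

# Hedged sketch: whole-string MLE, count(s) / |S|, on a toy corpus.

from collections import Counter

corpus = [
    "grill doctorate candidates",
    "cook professors",
    "ask professors",
    "grill doctorate candidates",
]

counts = Counter(corpus)

def p_mle(sentence):
    """P_MLE(sentence) = count(sentence) / |S|; zero for unseen strings."""
    return counts[sentence] / len(corpus)

print(p_mle("grill doctorate candidates"))  # 0.5
print(p_mle("grill doctorate updates"))     # 0.0 -- the data-sparseness problem
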
d) Two Famous Sentences
 i. “It is fair to assume that neither sentence
  “Colorless green ideas sleep furiously”
  nor
  “Furiously sleep ideas green colorless”
  … has ever occurred … Hence, in any statistical model … these sentences will be ruled out on identical grounds as equally “remote” from English. Yet (1), though nonsensical, is grammatical, while (2) is not.” [Chomsky 1957]
 ii. Translator's note: this is from page 9 of Chomsky's Syntactic Structures: neither of the two sentences below has ever occurred in English discourse, and statistically both are equally "remote" from English, yet only sentence 1 is grammatical:
  1) Colorless green ideas sleep furiously.
  2) Furiously sleep ideas green colorless.
  Whether they have "never occurred in English discourse" and are "statistically equally remote from English" depends on the point of view: setting the specific words aside and looking at word classes instead, the pattern of sentence 1 probably has a higher statistical frequency than that of sentence 2 and has in fact occurred in English.

To be continued: Part 2

Appendix: MIT course page with downloadable lecture slides (PDF):
   http://people.csail.mit.edu/regina/6881/

Note: This translation is published in accordance with the MIT OpenCourseWare Creative Commons terms; when reposting, please credit the source "我爱自然语言处理" (www.52nlp.cn).

from:http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-first-part/



MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Part 2)



Natural Language Processing: Probabilistic Language Modeling
Author: Regina Barzilay (MIT, EECS Department, November 15, 2004)
Translator: 52nlp (www.52nlp.cn, January 17, 2009)

2. Constructing a Language Model
a) The Language Modeling Problem
 i. Start with some vocabulary:
  V = {the, a, doctorate, candidate, Professors, grill, cook, ask, …}
 ii. Get a training sample drawn from V:
  Grill doctorate candidate.
  Cook Professors.
  Ask Professors.
  ……
 iii. Assumption: the training sample is drawn from some underlying distribution P
 iv. Goal: learn a probability distribution P' "as close" to P as possible (see the sketch below):
     \sum_{x \in V} P'(x) = 1, \quad P'(x) \ge 0
     e.g. P'(\text{candidates}) = 10^{-5}
          P'(\text{ask candidates}) = 10^{-8}
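
A minimal sketch of estimating a distribution P' from the toy training sample above (lowercased); it is only a unigram MLE, and its sole purpose is to show that the result is a proper probability distribution.

# Hedged sketch: unigram MLE estimate P'(x) from the toy training sample.

from collections import Counter

training_sample = [
    "grill doctorate candidate",
    "cook professors",
    "ask professors",
]

word_counts = Counter(w for s in training_sample for w in s.split())
total = sum(word_counts.values())
p_prime = {w: c / total for w, c in word_counts.items()}

assert abs(sum(p_prime.values()) - 1.0) < 1e-12   # probabilities sum to one
assert all(p >= 0 for p in p_prime.values())      # and are non-negative
print(p_prime["professors"])                      # 2/7 under this tiny sample
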
b) Deriving the Language Model
 i. Assign a probability to a word sequence w_{1}w_{2}…w_{n}
 ii. Apply the chain rule:
  1. P(w_{1}w_{2}…w_{n}) = P(w_{1}|S) · P(w_{2}|S,w_{1}) · P(w_{3}|S,w_{1},w_{2}) ⋯ P(E|S,w_{1},w_{2},…,w_{n}), where S and E mark the start and end of the sentence (a worked instance appears after this list)
  2. History-based model: we predict what follows from what came before
  3. How much context do we need to take into account?
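
For instance, with the toy string from Part 1 (S marking the sentence start and E its end), the chain rule expands as:

  P(\text{Grill doctorate candidates}) = P(\text{Grill} \mid S)\, P(\text{doctorate} \mid S, \text{Grill})\,
      P(\text{candidates} \mid S, \text{Grill}, \text{doctorate})\, P(E \mid S, \text{Grill}, \text{doctorate}, \text{candidates})
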
c) Markov Assumption
 i. For arbitrarily long contexts, P(w_{i}|w_{i−n}…w_{i−1}) is difficult to estimate
 ii. Markov assumption: w_{i} depends only on the n preceding words
 iii. Trigrams (second-order Markov model):
  1. P(w_{i}|S,w_{1},w_{2},…,w_{i−1}) = P(w_{i}|w_{i−2},w_{i−1})
  2. P(w_{1}w_{2}…w_{n}) = P(w_{1}|S) · P(w_{2}|S,w_{1}) · P(w_{3}|w_{1},w_{2}) ⋯ P(E|w_{n−1},w_{n}) (see the sketch below)
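
A minimal sketch of a trigram model with MLE estimates, trained on the toy sample from a) above; the sentences and the resulting counts are illustrative only.

# Hedged sketch: trigram (second-order Markov) model with MLE estimates.

from collections import Counter

START, END = "<s>", "</s>"

sentences = [
    ["grill", "doctorate", "candidates"],
    ["cook", "professors"],
    ["ask", "professors"],
]

trigrams, bigrams = Counter(), Counter()
for s in sentences:
    padded = [START, START] + s + [END]
    for i in range(2, len(padded)):
        trigrams[tuple(padded[i - 2:i + 1])] += 1   # (w_{i-2}, w_{i-1}, w_i)
        bigrams[tuple(padded[i - 2:i])] += 1        # context (w_{i-2}, w_{i-1})

def p(w, w1, w2):
    """MLE estimate of P(w | w1, w2)."""
    return trigrams[(w1, w2, w)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

def sentence_prob(words):
    padded = [START, START] + words + [END]
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= p(padded[i], padded[i - 2], padded[i - 1])
    return prob

print(sentence_prob(["grill", "doctorate", "candidates"]))  # 1/3 under this toy data

Note that any sentence containing an unseen trigram still receives probability zero under plain MLE, which is the same data-sparseness problem flagged in Part 1.
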
d) A Computational Model of Language
 i. A useful conceptual and practical device: coin-flipping models (see the sampling sketch after this list)
  1. A sentence is generated by a randomized algorithm
  - The generator can be in one of several "states"
  - Flip coins to choose the next state
  - Flip other coins to decide which letter or word to output
 ii. Shannon: "The states will correspond to the 'residue of influence' from preceding letters"
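
A minimal sketch of the coin-flipping view: a sentence is generated by repeatedly sampling the next word from a conditional distribution. The trigram table below and its probabilities are invented placeholders, not taken from the lecture.

# Hedged sketch: generating a sentence by sampling from a toy trigram table.

import random

next_word_dist = {
    ("<s>", "<s>"):              {"grill": 0.4, "cook": 0.3, "ask": 0.3},
    ("<s>", "grill"):            {"doctorate": 1.0},
    ("<s>", "cook"):             {"professors": 1.0},
    ("<s>", "ask"):              {"professors": 1.0},
    ("grill", "doctorate"):      {"candidates": 1.0},
    ("doctorate", "candidates"): {"</s>": 1.0},
    ("cook", "professors"):      {"</s>": 1.0},
    ("ask", "professors"):       {"</s>": 1.0},
}

def generate():
    w1, w2, out = "<s>", "<s>", []
    while True:
        dist = next_word_dist[(w1, w2)]
        # "Flip coins": pick the next state in proportion to its probability.
        w = random.choices(list(dist), weights=list(dist.values()))[0]
        if w == "</s>":
            return " ".join(out)
        out.append(w)
        w1, w2 = w2, w

print(generate())  # e.g. "grill doctorate candidates"
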
e) Word-Based Approximations
 Note: the sentences below were generated at random from models trained on Shakespeare's works; cf. Jurafsky and Martin, Speech and Language Processing.
 i. Unigram approximation (the MIT slides mislabel this as a "first-order approximation")
  1. To him swallowed confess hear both. which. OF save
  2. on trail for are ay device and rote life have
  3. Every enter now severally so, let
  4. Hill he late speaks; or! a more to leg less first you
  5. enter
 ii. Trigram approximation (the slides mislabel this as a "third-order approximation")
  1. King Henry. What! I will go seek the traitor Gloucester.
  2. Exeunt some of the watch. A great banquet serv’d in;
  3. Will you tell me how I am?
  4. It cannot be but so.

To be continued: Part 3

Appendix: MIT course page with downloadable lecture slides (PDF):
   http://people.csail.mit.edu/regina/6881/

Note: This translation is published in accordance with the MIT OpenCourseWare Creative Commons terms; when reposting, please credit the source "我爱自然语言处理" (www.52nlp.cn).

from:http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-second-part/



MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Part 3)


Natural Language Processing: Probabilistic Language Modeling
Author: Regina Barzilay (MIT, EECS Department, November 15, 2004)
Translator: 52nlp (www.52nlp.cn, January 18, 2009)

3. Evaluating Language Models
a) Evaluating a Language Model
 i. We have n test strings:
     S_{1}, S_{2}, …, S_{n}
 ii. Consider their probability under our model:
      \prod_{i=1}^{n} P(S_{i})
 or the log probability:
   \log \prod_{i=1}^{n} P(S_{i}) = \sum_{i=1}^{n} \log P(S_{i})
 iii. Perplexity (see the sketch after this list):
     Perplexity = 2^{-x}
   where x = \frac{1}{W} \sum_{i=1}^{n} \log P(S_{i})
   and W is the total number of words in the test data.
 iv. Perplexity is a measure of the effective "branching factor"
  1. Suppose we have a vocabulary V of size N, and the model predicts
   P(w) = 1/N for all words in V.
 v. What is the perplexity then?
      Perplexity = 2^{-x}
    where x = \log \frac{1}{N}
    so Perplexity = N
 vi. Estimate of human performance (Shannon, 1951)
  1. Shannon game: humans guess the next letter in a text
  2. PP=142(1.3 bits/letter), uncased, open vocabulary
 vii. Estimate for a trigram language model (Brown et al., 1992)
  PP=790(1.75 bits/letter), cased, open vocabulary
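
A minimal sketch of the perplexity computation above, plus a sanity check of the branching-factor view. The numbers are illustrative; log base 2 is assumed, matching Perplexity = 2^{-x}.

# Hedged sketch: perplexity = 2^{-x} with x = (1/W) * sum_i log2 P(S_i).

import math

def perplexity(sentence_probs, total_words):
    x = sum(math.log2(p) for p in sentence_probs) / total_words
    return 2 ** (-x)

# Sanity check of the "branching factor" view: a uniform model over a
# vocabulary of N words gives each word probability 1/N, so a 4-word test
# sentence has probability (1/N)^4 and the perplexity comes out as N.
N = 1000
print(perplexity([(1 / N) ** 4], total_words=4))  # ~1000.0, i.e. N
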

To be continued: Part 4

Appendix: MIT course page with downloadable lecture slides (PDF):
   http://people.csail.mit.edu/regina/6881/

Note: This translation is published in accordance with the MIT OpenCourseWare Creative Commons terms; when reposting, please credit the source "我爱自然语言处理" (www.52nlp.cn).

from:http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-third-part/

