I've started Michael Collins's NLP course on Coursera. The lectures are good, but my own understanding is still not very deep, so I'm taking notes here to fill in the gaps.
What is NLP?
A common definition: getting computers to use natural language as input and/or output.
NLU: understanding input
NLG: generating output
NLP tasks include:
the oldest, Machine Translation, as well as Information Extraction, Text Summarization, and Dialogue Systems. The most basic NLP problems fall into two classes, Tagging and Parsing:
Tagging assigns a specific label to each separable unit of a sentence.
For example:
- Part-of-speech tagging: noun, verb, preposition, …
- Profits (N) soared (V) at (P) Boeing (N)
- Named Entity Recognition: companies, locations, people
- Profits (NA) soared (NA) at (NA) Boeing (C)
Parsing is mainly the analysis of a sentence's grammatical (syntactic) structure.
For example:
turning “Boeing is located in Seattle” into a parse tree, sketched below.
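A minimal sketch of such a tree as plain nested tuples; the labels (S, NP, VP, …) are my assumption about the intended analysis, not taken from the lecture:

```python
# A toy parse tree for "Boeing is located in Seattle" as nested tuples
# of the form (label, child, child, ...). Labels are my assumption.
tree = ("S",
        ("NP", "Boeing"),
        ("VP", ("V", "is"),
               ("VP", ("V", "located"),
                      ("PP", ("P", "in"),
                             ("NP", "Seattle")))))

def show(node, depth=0):
    """Print the tree with one level of indentation per depth."""
    if isinstance(node, str):          # leaf: a word
        print("  " * depth + node)
        return
    label, *children = node
    print("  " * depth + label)
    for child in children:
        show(child, depth + 1)

show(tree)
```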
Why is NLP hard?
Ambiguity arises at many levels: syntactic, acoustic, semantic, and discourse (multi-clause).
For example:
- “At last, a computer that understands you like your mother”; three interpretations at the syntactic level.
- But also occurs at an acoustic level: “like your” sounds like “lie cured”.
- One reading is more likely than the other, but without further information it is difficult to tell them apart.
- At semantic level, words often have more than one meaning. Need context to disambiguate.
- “I saw her duck with a telescope”.
- At discourse (multi-clause) level.
- “Alice says they’ve built a computer that understands you like your mother”
- If you start a sentence saying “but she…”, who is she referring to?
The Language Modeling Problem
V: the finite set of words we have (the vocabulary).
e.g. V = {the, a, man, telescope, Beckham, two, ...}
V+ denotes the countably infinite set of strings made up of words from V. The goal of a language model is to judge whether each such string is well-formed, and the most common approach today is to express a sentence's plausibility through a probability distribution.
e.g. V+ = {"the STOP", "a STOP", "the fan STOP", ...}
- We have a training sample of example sentences in English.
- Sentences from the New York Times in the last 10 years.
- Sentences from a large set of web pages.
- In the 1990s, 20 million words was common; by the end of the 90s, 1 billion words.
- Nowadays, hundreds of billions of words.
With this training sample we want to “learn” a probability distribution p, i.e. p is a function that satisfies:
- For any sentence x ∈ V+, p(x) ≥ 0.
- The probabilities over all sentences sum to one: Σ_{x∈V+} p(x) = 1.
- The probability assigned to each sentence or string then gives a relatively principled way to judge how well-formed it is; a toy check follows below.
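This little p is my own illustration, concentrated on just four strings of V+ (a real model spreads mass over all of V+), using the example strings from above:

```python
# A toy distribution p over four strings of V+, checked against the
# two conditions of a probability distribution.
p = {
    "the STOP":     0.4,
    "a STOP":       0.3,
    "the man STOP": 0.2,
    "the fan STOP": 0.1,
}

assert all(prob >= 0 for prob in p.values())   # p(x) >= 0 for every x
assert abs(sum(p.values()) - 1.0) < 1e-12      # probabilities sum to 1
```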
The most widely used model in language modeling is the Markov model. It makes a very strong assumption about language: the probability of a word depends only on the one (or few) words immediately before it. This assumption is clearly wrong, yet it works very well in current applications, mainly because it makes computation much simpler. Expressed in formulas, the chain rule followed by the assumption gives:

$$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) = P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_{i-1} = x_{i-1})$$

where the last equality uses the assumption.

Because each word's probability depends only on the single preceding word, the above is called a First-Order Markov Process. There are also Second-Order Markov Processes, where a word depends on the two preceding words:

$$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1)\, P(X_2 = x_2 \mid X_1 = x_1) \prod_{i=3}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$$

For notational convenience we define x0 = x−1 = ∗, where '∗' is a special start symbol, so the Second-Order process can also be written as:

$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$$
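As a concrete instance (my own worked example, not from the notes), for the three-word sentence “the dog STOP” the second-order factorization reads:

$$P(X_1 = \text{the}, X_2 = \text{dog}, X_3 = \text{STOP}) = P(\text{the} \mid \ast, \ast)\; P(\text{dog} \mid \ast, \text{the})\; P(\text{STOP} \mid \text{the}, \text{dog})$$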
Trigram Language Models
- A trigram language model consists of the following components:
- A finite set V (the words, the vocabulary).
- For each trigram u,v,w, a sequence of three words, a parameter q(w|u,v), where w can be any element of V ∪ {STOP} and u,v can be any element of V ∪ {∗}.
- For any sentence x1,…,xn where xi ∈ V for i=1…(n−1), and xn = STOP, the probability of the sentence under the trigram model is:

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})$$

- where we define x0 = x−1 = ∗.
- i.e. for any sentence, its probability is the product of the second-order Markov probabilities of its constituent trigrams, as sketched in code below.
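A minimal sketch of this computation; sentence_prob is a hypothetical helper of my own, assuming q is given as a dict mapping (u, v, w) to q(w|u,v):

```python
# Probability of a sentence (ending in STOP) under a trigram model.
# `q` is assumed to map (u, v, w) -> q(w | u, v).
def sentence_prob(sentence, q):
    padded = ["*", "*"] + list(sentence)   # define x0 = x_{-1} = *
    prob = 1.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        prob *= q.get((u, v, w), 0.0)      # unseen trigram -> probability 0
    return prob
```

For example, sentence_prob(["the", "dog", "STOP"], q) multiplies q(the|∗,∗) · q(dog|∗,the) · q(STOP|the,dog).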
The Trigram Estimation Problem
- A natural estimate: the maximum likelihood estimate (ML).
- Recall that we have a training set of example sentences in our language, typically millions or billions of sentences.
- From these sentences we can derive counts: how often does each trigram occur? The ML estimate is then

$$q_{ML}(w \mid u, v) = \frac{\mathrm{Count}(u, v, w)}{\mathrm{Count}(u, v)}$$
This runs into the Sparse Data Problem: the denominator Count(u,v) may be 0, which is very likely in practice, and as |V| grows the number of parameters q explodes, since there are |V|^3 of them. This is left to be addressed later; a counting sketch follows below.
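A minimal counting sketch (my own illustration, not course code): derive Count(u,v,w) and Count(u,v) from a training sample and form the ML estimate.

```python
from collections import defaultdict

# ML estimate q_ML(w | u, v) = Count(u, v, w) / Count(u, v).
def train_trigram(sentences):
    """sentences: an iterable of word lists, each ending in 'STOP'."""
    tri = defaultdict(int)   # Count(u, v, w)
    bi = defaultdict(int)    # Count(u, v)
    for s in sentences:
        padded = ["*", "*"] + list(s)
        for i in range(2, len(padded)):
            u, v, w = padded[i - 2], padded[i - 1], padded[i]
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1

    def q(w, u, v):
        # Sparse data problem in action: for an unseen bigram (u, v)
        # the denominator Count(u, v) is 0; here we just return 0.0.
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

    return q
```

Feeding the returned q into sentence_prob above ties the two pieces together; note how any unseen bigram (u,v) leaves the estimate undefined, which is exactly the sparse-data problem.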
Evaluating Language Models: Perplexity
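Given a test corpus of sentences s1, …, sm containing M words in total, perplexity is defined through the model's average log-probability:

$$l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i), \qquad \text{perplexity} = 2^{-l}$$

The lower the perplexity, the better the model fits the test data; as a sanity check, a model that assigns uniform probability 1/N to each of N possible words has perplexity exactly N.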
Linear Interpolation
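Linear interpolation addresses the sparse-data problem above by mixing the trigram, bigram, and unigram ML estimates:

$$q(w \mid u, v) = \lambda_1\, q_{ML}(w \mid u, v) + \lambda_2\, q_{ML}(w \mid v) + \lambda_3\, q_{ML}(w)$$

where λ1 + λ2 + λ3 = 1 and each λi ≥ 0, which guarantees the mixture is again a probability distribution. The λ values are typically chosen to maximize the likelihood of held-out data.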
These two sections are excerpted from http://blog.csdn.net/dark_scope/article/details/8616672, which I found to be a good summary.