NLP applications:
1) machine translation : automatically translate text from one language to another
2) information extraction : given text, produce a structured database representation
3) text summarization : given multiple text sources, generate a summary that captures the main ideas
4) dialogue system : interact with a computer to get answers
Tagging:
1) part-of-speech tagging : produce a sequence of tags (adjective, preposition, ...) for the words
2) named entity recognition : decide which entity class (person, location, ...) each word belongs to
Parsing:
map the input sentence to a parse tree that represents its hierarchical structure
Basic language modeling problem:
finite vocabulary set V = {....}
sentence s whose words come from V
sum over all sentences of p(s) = 1; learn p. More plausible sentences are assigned higher probability
--Naive model:
p(s) = count of sentence s / total count of sentences -> does not generalize (any unseen sentence gets probability 0)
--Trigram model:
based on a second-order Markov assumption
P(X1 = x1, X2 = x2, ..., Xn = xn) = product over i=1~n { P(Xi = xi | Xi-1 = xi-1, Xi-2 = xi-2) }
where x0 = x-1 = * (a special start symbol) and xn = STOP
Each random variable (word, including STOP) in the chain is conditioned only on the previous two variables (words).
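The product above can be sketched in Python; `trigram_prob` is a hypothetical function returning P(Xi = xi | Xi-1 = xi-1, Xi-2 = xi-2):

```python
def sentence_prob(words, trigram_prob):
    """P(x1..xn) under the second-order Markov assumption.

    words: sentence as a list of tokens (without * or STOP).
    trigram_prob(w_prev2, w_prev1, w): conditional probability of w
    given the two preceding words (assumed to be supplied elsewhere).
    """
    # pad with the start symbols and append STOP, as in the model
    padded = ["*", "*"] + list(words) + ["STOP"]
    p = 1.0
    for i in range(2, len(padded)):
        p *= trigram_prob(padded[i - 2], padded[i - 1], padded[i])
    return p
```

With a dummy model that assigns 0.5 to every trigram, a two-word sentence has three factors (two words plus STOP), giving 0.5^3 = 0.125.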
--How to get P(Xi = xi | Xi-1 = xi-1, Xi-2 = xi-2 ) ?
use a weighted combination of three maximum likelihood estimates: linear interpolation
trigram: est3 = count(Xi-2 = xi-2, Xi-1 = xi-1, Xi = xi) / count(Xi-2 = xi-2, Xi-1 = xi-1)
bigram: est2 = count(Xi-1 = xi-1, Xi = xi) / count(Xi-1 = xi-1)
unigram: est1 = count(Xi = xi) / count()   Note: count() = total number of tokens, including STOP
P(Xi = xi | Xi-1 = xi-1, Xi-2 = xi-2) = a3*est3 + a2*est2 + a1*est1, where a1 + a2 + a3 = 1
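A minimal sketch of the three ML estimates and their interpolation, computed from raw counts (the function and variable names are illustrative, not from the notes):

```python
from collections import Counter

def make_interpolated_prob(corpus, a1, a2, a3):
    """Linearly interpolated trigram estimate from raw counts.

    corpus: list of sentences, each a list of words; a1 + a2 + a3 == 1.
    Returns prob(w2, w1, w) = a3*est3 + a2*est2 + a1*est1.
    """
    uni, bi, tri = Counter(), Counter(), Counter()
    bi_ctx, tri_ctx = Counter(), Counter()
    total = 0  # count(): total tokens, including STOP
    for s in corpus:
        padded = ["*", "*"] + s + ["STOP"]
        for i in range(2, len(padded)):
            w2, w1, w = padded[i - 2], padded[i - 1], padded[i]
            total += 1
            uni[w] += 1
            bi[(w1, w)] += 1
            bi_ctx[w1] += 1
            tri[(w2, w1, w)] += 1
            tri_ctx[(w2, w1)] += 1

    def prob(w2, w1, w):
        est1 = uni[w] / total
        est2 = bi[(w1, w)] / bi_ctx[w1] if bi_ctx[w1] else 0.0
        est3 = tri[(w2, w1, w)] / tri_ctx[(w2, w1)] if tri_ctx[(w2, w1)] else 0.0
        return a3 * est3 + a2 * est2 + a1 * est1

    return prob
```

Note that even when est3 is zero (unseen trigram), the bigram and unigram terms keep the probability nonzero, which is the point of interpolating.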
--How to estimate a1, a2, a3? turn it into an optimization problem
pick validation data; let c'(w1, w2, w3) be the count of trigram (w1, w2, w3) there; choose a1, a2, a3 to maximize L(a1, a2, a3)
L(a1, a2, a3) = sum over trigrams of c'(w1, w2, w3) * log P(w3 | w1, w2)
where P(w3 | w1, w2) = a3*est3 + a2*est2 + a1*est1 and a1 + a2 + a3 = 1
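One simple way to carry out the maximization (not the only one) is a coarse grid search over the constrained simplex a1 + a2 + a3 = 1. This sketch assumes est3, est2, est1 are functions giving the ML estimates from training data:

```python
import itertools
import math

def pick_lambdas(valid_trigram_counts, est3, est2, est1):
    """Grid-search a1, a2, a3 (with a1 + a2 + a3 = 1) to maximize
    L = sum over trigrams of c'(u, v, w) * log P(w | u, v).

    valid_trigram_counts: dict mapping (u, v, w) -> count c' in validation data.
    est3/est2/est1: functions (u, v, w) -> ML estimate (assumed given).
    """
    grid = [i / 10 for i in range(11)]  # step 0.1
    best, best_L = None, float("-inf")
    for a1, a2 in itertools.product(grid, repeat=2):
        a3 = round(1 - a1 - a2, 10)
        if a3 < 0:
            continue
        L, ok = 0.0, True
        for (u, v, w), c in valid_trigram_counts.items():
            p = a3 * est3(u, v, w) + a2 * est2(u, v, w) + a1 * est1(u, v, w)
            if p <= 0:
                ok = False
                break
            L += c * math.log(p)
        if ok and L > best_L:
            best_L, best = L, (a1, a2, a3)
    return best
```

In practice the weights can also be fit with EM or a finer search; the grid just makes the objective concrete.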
Evaluate model : perplexity (lower is better)
l = (1/M) * sum over sentences of log2 p(s), M = total word count (including STOP) in the sentence set
(this can be viewed as the average log probability the model assigns per word)
if the model is uniform over the vocabulary, l = log2(1/N), N = size of the vocabulary set, so perplexity = N
perplexity = 2^-l
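Perplexity follows directly from the per-sentence probabilities; a minimal sketch:

```python
import math

def perplexity(sentence_probs, M):
    """perplexity = 2^(-l), l = (1/M) * sum_i log2 p(s_i).

    sentence_probs: probability of each sentence under the model.
    M: total word count (including STOP) across all the sentences.
    """
    l = sum(math.log2(p) for p in sentence_probs) / M
    return 2 ** (-l)
```

Sanity check: a uniform model over N = 4 words gives a 3-word sentence probability (1/4)^3 = 1/64, so l = -2 and perplexity = 4 = N, matching the note above.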