NLP Coursera note 1

NLP applications:

1) machine translation : use a computer to translate between languages automatically

2) information extraction : given text, produce a structured, database-style representation

3) text summarization : given multiple text sources, generate something that captures the main ideas

4) dialogue system : interact with a computer in natural language to get answers


Tagging:

1) part-of-speech tagging : produce a sequence of tags (adjective, preposition, ...), one per word

2) named entity recognition : identify which entity type (person, location, company, ...) each word or phrase belongs to

Parsing:

map the input sentence to a parse tree that represents its hierarchical (syntactic) structure


Basic language modeling problem:

finite vocabulary set V = {...}

sentences s whose words come from V

a language model p satisfies p(s) >= 0 and sum over all sentences of p(s) = 1; the goal is to learn p so that more plausible sentences are assigned higher probability


--Naive model:

p(s) = count of sentence s in training data / total count of training sentences -> any sentence never seen verbatim gets probability 0, so this does not generalize
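A minimal sketch of this count-based estimate on a toy, made-up corpus (all names here are my own), showing the generalization problem:

```python
from collections import Counter

# Naive model: p(s) = relative frequency of the whole sentence.
corpus = [
    "the dog barks",
    "the cat sleeps",
    "the dog barks",
]
counts = Counter(corpus)
total = len(corpus)

def p_naive(sentence):
    # Any sentence never seen verbatim in training gets probability 0.
    return counts[sentence] / total

print(p_naive("the dog barks"))   # 0.666... (seen twice out of three)
print(p_naive("the dog sleeps"))  # 0.0 -- plausible but unseen
```

Almost every grammatical sentence is absent from any finite corpus, which is what the trigram model below addresses.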


--Trigram model:

based on a second-order Markov assumption

P(X1 = x1, X2 = x2, ..., Xn = xn) = product over i = 1..n of P(Xi = xi | Xi-2 = xi-2, Xi-1 = xi-1)

where x0 = x-1 = * (a special start symbol) and xn = STOP

Each random variable (a word, including the final STOP) in the chain is conditioned only on the previous two words.
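As a sketch, scoring a sentence under this definition looks like the following; the placeholder q is hypothetical, and a real q would come from the interpolated estimates in the next section:

```python
def q(w, u, v):
    # Hypothetical placeholder for P(Xi = w | Xi-2 = u, Xi-1 = v);
    # a real model would use the interpolated counts defined below.
    return 1.0 / 1000

def sentence_prob(words):
    # Pad with two * start symbols and append STOP, matching the
    # second-order Markov definition above.
    seq = ["*", "*"] + words + ["STOP"]
    p = 1.0
    for i in range(2, len(seq)):
        p *= q(seq[i], seq[i - 2], seq[i - 1])
    return p

print(sentence_prob(["the", "dog", "barks"]))  # product of 4 conditionals
```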

--How to get P(Xi = xi | Xi-2 = xi-2, Xi-1 = xi-1) ?

use a weighted combination of three maximum likelihood estimates: linear interpolation

trigram: est3 = count(Xi-2 = xi-2, Xi-1 = xi-1, Xi = xi) / count(Xi-2 = xi-2, Xi-1 = xi-1)

bigram: est2 = count(Xi-1 = xi-1, Xi = xi) / count(Xi-1 = xi-1)

unigram: est1 = count(Xi = xi) / count(), where count() is the total number of word tokens in the corpus, STOP occurrences included

P(Xi = xi | Xi-2 = xi-2, Xi-1 = xi-1) = a3*est3 + a2*est2 + a1*est1, where a1 + a2 + a3 = 1 and each ai >= 0
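A minimal sketch of building the three estimates from counts and combining them; the toy corpus and the helper names tri, bi, ctx2, ctx1, q_interp are my own:

```python
from collections import Counter

corpus = [["the", "dog", "barks"], ["the", "dog", "sleeps"]]
tri, bi, uni = Counter(), Counter(), Counter()
ctx2, ctx1 = Counter(), Counter()   # history counts for the denominators
total = 0
for words in corpus:
    seq = ["*", "*"] + words + ["STOP"]
    for i in range(2, len(seq)):
        tri[(seq[i - 2], seq[i - 1], seq[i])] += 1
        ctx2[(seq[i - 2], seq[i - 1])] += 1
        bi[(seq[i - 1], seq[i])] += 1
        ctx1[seq[i - 1]] += 1
        uni[seq[i]] += 1
        total += 1  # count(): every token, STOP included

def q_interp(w, u, v, a3=0.6, a2=0.3, a1=0.1):
    # Linear interpolation of the three maximum likelihood estimates;
    # unseen histories contribute 0 rather than dividing by zero.
    est3 = tri[(u, v, w)] / ctx2[(u, v)] if ctx2[(u, v)] else 0.0
    est2 = bi[(v, w)] / ctx1[v] if ctx1[v] else 0.0
    est1 = uni[w] / total
    return a3 * est3 + a2 * est2 + a1 * est1

print(q_interp("barks", "the", "dog"))  # 0.6*0.5 + 0.3*0.5 + 0.1*0.125
```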


--How to estimate a1, a2, a3? Turn it into an optimization problem.

hold out validation data; let c'(x1, x2, x3) be the count of trigram (x1, x2, x3) there, then choose the weights that maximize L(a1, a2, a3)

L(a1, a2, a3) = sum over trigrams of c'(x1, x2, x3) * log P(x3 | x1, x2)

where P(x3 | x1, x2) = a3*est3 + a2*est2 + a1*est1 and a1 + a2 + a3 = 1
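One simple way to sketch this optimization is a grid search over the simplex a1 + a2 + a3 = 1, maximizing validation log-likelihood. Here est3, est2, est1 are assumed to be estimator functions built from training counts as above, and val_trigrams a dict mapping each validation trigram to its count c'; both names are hypothetical. (EM is another common way to fit interpolation weights.)

```python
import math

def log_likelihood(a1, a2, a3, val_trigrams, est3, est2, est1):
    # L(a1, a2, a3) = sum over trigrams of c' * log P(x3 | x1, x2)
    L = 0.0
    for (x1, x2, x3), c in val_trigrams.items():
        p = a3 * est3(x3, x1, x2) + a2 * est2(x3, x2) + a1 * est1(x3)
        if p <= 0:
            return float("-inf")  # this weight setting zeroes a seen trigram
        L += c * math.log(p)
    return L

def best_weights(val_trigrams, est3, est2, est1, step=0.05):
    # Exhaustive search over a1, a2 on a grid; a3 is forced by the
    # constraint a1 + a2 + a3 = 1.
    best, best_L = None, float("-inf")
    n = int(round(1 / step))
    for i in range(n + 1):
        for j in range(n + 1 - i):
            a1, a2 = i * step, j * step
            a3 = max(0.0, 1.0 - a1 - a2)
            L = log_likelihood(a1, a2, a3, val_trigrams, est3, est2, est1)
            if L > best_L:
                best, best_L = (a1, a2, a3), L
    return best
```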


Evaluating the model : perplexity (lower is better)

l = (1/M) * sum over sentences of log2 p(s), where M is the total number of words in the test set

(l can be viewed as the average log-probability the model assigns per word)

if the model is uniform over a vocabulary of size N, l = log2(1/N), so perplexity = N

perplexity = 2^-l
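A minimal sketch of the computation, reusing a sentence_prob function like the one above; counting one STOP per sentence in M is an assumption here, so adjust if your convention differs:

```python
import math

def perplexity(test_sentences, sentence_prob):
    # M = total number of words, counting one STOP per sentence.
    M = sum(len(s) + 1 for s in test_sentences)
    # l = average per-word log2 probability; requires p(s) > 0 for every
    # test sentence (interpolation with a1 > 0 gives this for
    # in-vocabulary words).
    l = sum(math.log2(sentence_prob(s)) for s in test_sentences) / M
    return 2 ** (-l)
```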

 






