Reposted from: http://blog.sina.com.cn/s/blog_72d083c701017r9t.html
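A reader asked me: what type of parsers are the Stanford parser and the Berkeley parser?
In my view, the Stanford parser is essentially a lexicalized probabilistic context-free grammar (PCFG) parser that also performs dependency analysis. It can output different analyses according to different grammatical viewpoints, so it can be regarded as a parser that uses a hybrid approach.
The Berkeley Parser is mainly a probabilistic context-free grammar parser.
Below, we take the Stanford Parser as an example and look at it in detail.
Let us analyze the following sentence. The Stanford parser can produce several different results for it: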
The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.
1. The part-of-speech tagged text for this sentence:
The/DT strongest/JJS rain/NN ever/RB recorded/VBN in/IN India/NNP
shut/VBD down/RP the/DT financial/JJ hub/NN of/IN Mumbai/NNP ,/,
snapped/VBD communication/NN lines/NNS ,/, closed/VBD airports/NNS
and/CC forced/VBD thousands/NNS of/IN people/NNS to/TO sleep/VB in/IN
their/PRP$ offices/NNS or/CC walk/VB home/NN during/IN the/DT night/NN
,/, officials/NNS said/VBD today/NN ./.
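The tagged text above uses a simple `word/TAG` convention, so it is easy to read back into pairs. The following is a minimal sketch in Python and is not part of the Stanford parser; the function name `parse_tagged` is a hypothetical helper introduced here for illustration.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains '/' keeps it and only the final part is the tag.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = "The/DT strongest/JJS rain/NN ever/RB recorded/VBN in/IN India/NNP ,/,"
pairs = parse_tagged(tagged)

# Penn Treebank verb tags all start with "VB" (VB, VBD, VBN, ...).
verbs = [word for word, tag in pairs if tag.startswith("VB")]
print(pairs[:3])
print(verbs)
```

Splitting on the last slash is a deliberate choice: punctuation tokens such as `,/,` then come out as the pair `(",", ",")` without special handling.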
2. The tree representation under a context-free phrase structure grammar:
(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
        (, ,)
        (VP (VBD snapped)
          (NP (NN communication) (NNS lines)))
        (, ,)
        (VP (VBD closed)
          (NP (NNS airports)))
        (CC and)
        (VP (VBD forced)
          (NP
            (NP (NNS thousands))
            (PP (IN of)
              (NP (NNS people))))
          (S
            (VP (TO to)
              (VP
                (VP (VB sleep)
                  (PP (IN in)
                    (NP (PRP$ their) (NNS offices))))
                (CC or)
                (VP (VB walk)
                  (NP (NN home))
                  (PP (IN during)
                    (NP (DT the) (NN night))))))))))
    (, ,)
    (NP (NNS officials))
    (VP (VBD said)
      (NP-TMP (NN today)))
    (. .)))
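The bracketed tree above is just nested parentheses, so it can be read back into a data structure with a few lines of code. The following is a minimal sketch in Python, not the Stanford parser's own reader; the names `parse_tree`, `leaves`, and `phrases` are hypothetical helpers introduced here for illustration, and the input is assumed to be well-formed.

```python
import re


def parse_tree(text):
    """Read one bracketed tree into nested (label, children) tuples."""
    # Tokens are "(", ")", or any run of characters that is neither
    # whitespace nor a parenthesis (labels and words, including "," and ".").
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])   # a word at a leaf
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    words = []
    for child in node[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def phrases(node, wanted):
    """Return the text of every constituent whose label equals `wanted`."""
    found = []
    if node[0] == wanted:
        found.append(" ".join(leaves(node)))
    for child in node[1]:
        if not isinstance(child, str):
            found.extend(phrases(child, wanted))
    return found


subject = parse_tree(
    "(NP (NP (DT The) (JJS strongest) (NN rain)) "
    "(VP (ADVP (RB ever)) (VBN recorded) (PP (IN in) (NP (NNP India)))))"
)
print(leaves(subject))
print(phrases(subject, "NP"))
```

The example feeds in only the subject noun phrase from the tree above; the same functions work on the full `(ROOT ...)` tree, since punctuation nodes such as `(, ,)` are tokenized like any other label and word.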
3. The typed dependency representation.
We first number the words in the sentence:
1. The
2. strongest
3. rain
4. ever
5. recorded
6. in
7. India
8. shut
9. down
10. the
11. financial
12. hub
13. of
14. Mumbai
15. ,
16. snapped
17. communication
18. lines
19. ,
20. closed
21. airports
22. and
23. forced
24. thousands
25. of
26. people
27. to
28. sleep
29. in
30. their
31. offices
32. or
33. walk
34. home
35. during
36. the
37. night
38. ,
39. officials
40. said
41. today
42. .
Below are the dependency analysis results. In each relation, the first argument is the governor and the second is the dependent.
det(rain-3, The-1)
amod(rain-3, strongest-2)
nsubj(shut-8, rain-3)
nsubj(snapped-16, rain-3)
nsubj(closed-20, rain-3)
nsubj(forced-23, rain-3)
advmod(recorded-5, ever-4)
partmod(rain-3, recorded-5)
prep_in(recorded-5, India-7)
ccomp(said-40, shut-8)
prt(shut-8, down-9)
det(hub-12, the-10)
amod(hub-12, financial-11)
dobj(shut-8, hub-12)
prep_of(hub-12, Mumbai-14)
conj_and(shut-8, snapped-16)
ccomp(said-40, snapped-16)
nn(lines-18, communication-17)
dobj(snapped-16, lines-18)
conj_and(shut-8, closed-20)
ccomp(said-40, closed-20)
dobj(closed-20, airports-21)
conj_and(shut-8, forced-23)
ccomp(said-40, forced-23)
dobj(forced-23, thousands-24)
prep_of(thousands-24, people-26)
aux(sleep-28, to-27)
xcomp(forced-23, sleep-28)
poss(offices-31, their-30)
prep_in(sleep-28, offices-31)
xcomp(forced-23, walk-33)
conj_or(sleep-28, walk-33)
dobj(walk-33, home-34)
det(night-37, the-36)
prep_during(walk-33, night-37)
nsubj(said-40, officials-39)
tmod(said-40, today-41)
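Each line above has the shape `relation(governor-index, dependent-index)`, which a regular expression can take apart. The following is a minimal sketch in Python under that assumption; the names `Dep`, `PATTERN`, and `parse_dep` are hypothetical and introduced here only for illustration.

```python
import re
from collections import namedtuple

# One typed dependency: relation name, governor word and position,
# dependent word and position.
Dep = namedtuple("Dep", "rel gov gov_idx dep dep_idx")

PATTERN = re.compile(r"^(\w+)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dep(line):
    """Parse a line such as 'nsubj(shut-8, rain-3)' into a Dep tuple."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a typed dependency: {line!r}")
    rel, gov, gov_idx, dep, dep_idx = match.groups()
    return Dep(rel, gov, int(gov_idx), dep, int(dep_idx))


lines = [
    "nsubj(shut-8, rain-3)",
    "nsubj(snapped-16, rain-3)",
    "nsubj(closed-20, rain-3)",
    "nsubj(forced-23, rain-3)",
    "prep_in(recorded-5, India-7)",
    "nsubj(said-40, officials-39)",
]
deps = [parse_dep(line) for line in lines]

# Which verbs take "rain" as their subject?
rain_verbs = [d.gov for d in deps if d.rel == "nsubj" and d.dep == "rain"]
print(rain_verbs)
```

Run on the lines above, this makes one property of the output easy to see: the single subject "rain" is attached to all four coordinated verbs, "shut", "snapped", "closed", and "forced".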
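All of these are different outputs produced according to different grammatical viewpoints.
The first four sentences of the English eulogy written by Professor Wei Naixing (卫乃兴) of the School of Foreign Languages, Beihang University, in memory of the renowned corpus linguist Sinclair are: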
We are shocked to hear that Professor John Sinclair has left us.
Undoubtedly, the 13th of March 2007 is a saddest day to the world linguistics, Corpus Linguistics in particular.
The gap left by the departure of this innovative thinker and distinguished linguist will be felt in the hearts of the researchers working along the lines he has set.
In deepest sorrow, we, linguists at Shanghai Jiao Tong University, China, found that we cannot express with words our gratitude and respect to John.
The Stanford parser produces the following results:
Parsed 94 words in 4 sentences (13.73 wds/sec; 0.58 sents/sec).
The tree structure of each sentence is as follows:
1. We are shocked to hear that Professor John Sinclair has left us.
Probabilistic phrase structure grammar result:
(ROOT
  (S
    (NP (PRP We))
    (VP (VBP are)
      (ADJP (JJ shocked)
        (S
          (VP (TO to)
            (VP (VB hear)
              (SBAR (IN that)
                (S
                  (NP (NNP Professor) (NNP John) (NNP Sinclair))
                  (VP (VBZ has)
                    (VP (VBN left)
                      (NP (PRP us)))))))))))
    (. .)))
Dependency grammar result:
nsubj(shocked-3, We-1)
cop(shocked-3, are-2)
aux(hear-5, to-4)
xcomp(shocked-3, hear-5)
complm(left-11, that-6)
nn(Sinclair-9, Professor-7)
nn(Sinclair-9, John-8)
nsubj(left-11, Sinclair-9)
aux(left-11, has-10)
ccomp(hear-5, left-11)
dobj(left-11, us-12)
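In the dependency lists shown here, the head of the sentence can be picked out as the one word that appears as a governor but never as a dependent. The following is a minimal sketch of that check in Python, assuming the `relation(governor-index, dependent-index)` line format used above; `find_roots` is a hypothetical helper name, not part of the parser.

```python
import re

DEP_LINE = re.compile(r"^\w+\((.+-\d+), (.+-\d+)\)$")


def find_roots(lines):
    """Return the word-index tokens that govern something but depend on nothing."""
    governors, dependents = set(), set()
    for line in lines:
        match = DEP_LINE.match(line.strip())
        if match:
            governors.add(match.group(1))
            dependents.add(match.group(2))
    return sorted(governors - dependents)


sentence_1 = """\
nsubj(shocked-3, We-1)
cop(shocked-3, are-2)
aux(hear-5, to-4)
xcomp(shocked-3, hear-5)
complm(left-11, that-6)
nn(Sinclair-9, Professor-7)
nn(Sinclair-9, John-8)
nsubj(left-11, Sinclair-9)
aux(left-11, has-10)
ccomp(hear-5, left-11)
dobj(left-11, us-12)
""".splitlines()

roots = find_roots(sentence_1)
print(roots)  # ['shocked-3']
```

For this sentence the head is the adjective "shocked" rather than the copula "are", which reflects how this dependency scheme treats copular clauses: the verb "are" is attached to the predicate through the `cop` relation.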
2. Undoubtedly, the 13th of March 2007 is a saddest day to the world linguistics, Corpus Linguistics in particular.
Probabilistic phrase structure grammar result:
(ROOT
  (S
    (ADVP (RB Undoubtedly))
    (, ,)
    (NP
      (NP (DT the) (NN 13th))
      (PP (IN of)
        (NP (NNP March) (CD 2007))))
    (VP (VBZ is)
      (NP
        (NP (DT a) (JJ saddest) (NN day))
        (PP (TO to)
          (NP
            (NP (DT the) (NN world) (NNS linguistics))
            (, ,)
            (NP
              (NP (NNP Corpus) (NNP Linguistics))
              (PP (IN in)
                (NP (NN particular))))))))
    (. .)))
Dependency grammar result:
advmod(day-11, Undoubtedly-1)
det(13th-4, the-3)
nsubj(day-11, 13th-4)
prep_of(13th-4, March-6)
num(March-6, 2007-7)
cop(day-11, is-8)
det(day-11, a-9)
amod(day-11, saddest-10)
det(linguistics-15, the-13)
nn(linguistics-15, world-14)
prep_to(day-11, linguistics-15)
nn(Linguistics-18, Corpus-17)
appos(linguistics-15, Linguistics-18)
prep_in(Linguistics-18, particular-20)
3. The gap left by the departure of this innovative thinker and distinguished linguist will be felt in the hearts of the researchers working along the lines he has set.
Probabilistic phrase structure grammar result:
(ROOT
  (S
    (NP
      (NP
        (NP (DT The) (NN gap))
        (VP (VBN left)
          (PP (IN by)
            (NP
              (NP (DT the) (NN departure))
              (PP (IN of)
                (NP (DT this) (JJ innovative) (NN thinker)))))))
      (CC and)
      (NP (VBN distinguished) (NN linguist)))
    (VP (MD will)
      (VP (VB be)
        (VP (VBN felt)
          (PP (IN in)
            (NP
              (NP (DT the) (NNS hearts))
              (PP (IN of)
                (NP (DT the) (NNS researchers)))))
          (S
            (VP (VBG working)
              (PRT (RP along))
              (NP
                (NP (DT the) (NNS lines))
                (SBAR
                  (S
                    (NP (PRP he))
                    (VP (VBZ has)
                      (VP (VBN set)))))))))))
    (. .)))
Dependency grammar result:
det(gap-2, The-1)
nsubjpass(felt-16, gap-2)
partmod(gap-2, left-3)
det(departure-6, the-5)
prep_by(left-3, departure-6)
det(thinker-10, this-8)
amod(thinker-10, innovative-9)
prep_of(departure-6, thinker-10)
amod(linguist-13, distinguished-12)
conj_and(gap-2, linguist-13)
aux(felt-16, will-14)
auxpass(felt-16, be-15)
det(hearts-19, the-18)
prep_in(felt-16, hearts-19)
det(researchers-22, the-21)
prep_of(hearts-19, researchers-22)
partmod(felt-16, working-23)
prt(working-23, along-24)
det(lines-26, the-25)
dobj(working-23, lines-26)
nsubj(set-29, he-27)
aux(set-29, has-28)
rcmod(lines-26, set-29)
4. In deepest sorrow, we, linguists at Shanghai Jiao Tong University, China, found that we cannot express with words our gratitude and respect to John.
Probabilistic phrase structure grammar result:
(ROOT
  (S
    (PP (IN In)
      (NP (JJS deepest) (NN sorrow)))
    (, ,)
    (NP
      (NP (PRP we))
      (, ,)
      (NP
        (NP (NNS linguists))
        (PP (IN at)
          (NP
            (NP (NNP Shanghai) (NNP Jiao) (NNP Tong) (NNP University))
            (, ,)
            (NP (NNP China)))))
      (, ,))
    (VP (VBD found)
      (SBAR (IN that)
        (S
          (NP (PRP we))
          (VP (MD can) (RB not)
            (VP (VB express)
              (PP (IN with)
                (NP (NNS words)))
              (NP
                (NP (PRP$ our) (NN gratitude)
                  (CC and)
                  (NN respect))
                (PP (TO to)
                  (NP (NNP John)))))))))
    (. .)))
Dependency grammar result:
amod(sorrow-3, deepest-2)
prep_in(found-16, sorrow-3)
nsubj(found-16, we-5)
appos(we-5, linguists-7)
nn(University-12, Shanghai-9)
nn(University-12, Jiao-10)
nn(University-12, Tong-11)
prep_at(linguists-7, University-12)
appos(University-12, China-14)
complm(express-21, that-17)
nsubj(express-21, we-18)
aux(express-21, can-19)
neg(express-21, not-20)
ccomp(found-16, express-21)
prep_with(express-21, words-23)
poss(gratitude-25, our-24)
dobj(express-21, gratitude-25)
conj_and(gratitude-25, respect-27)
prep_to(gratitude-25, John-29)
The Berkeley Parser is mainly a probabilistic context-free grammar parser, so I will not go into detail about it here.