Mastering Natural Language Processing with Python
Deepti Chopra (India)
Translated by Wang Wei
Chapter 5 Parsing: Analyzing Training Data
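Parsing (also called syntactic analysis) is the process of checking whether a sequence of characters written in natural language conforms to the rules defined in a formal grammar. It breaks a sentence into a sequence of words or phrases and assigns each one a constituent category (n/adj/prep).
5.1 Introducing parsing
A parser is a piece of software that accepts input text and builds a parse tree (syntax tree) for it.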
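Parsing falls into two categories:
Approach | Description and example parsers |
---|---|
Top-down parsing | Starts at the start symbol and works down to the individual sentence constituents. Examples: recursive descent parser, LL parser, Earley parser |
Bottom-up parsing | Starts at the individual sentence constituents and works up to the start symbol. Examples: operator-precedence parser, simple precedence parser, simple LR parser, LALR parser, canonical LR parser, GLR parser, CYK parser, recursive ascent parser, shift-reduce parser |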
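To make the top-down idea concrete, here is a minimal sketch of a recursive descent recognizer in plain Python (no NLTK needed). The toy grammar, the function names `expand` and `recognize`, and the sample sentences are all assumptions made for illustration; they are not from the book. The grammar deliberately avoids left recursion, which a naive recursive descent parser cannot handle.

```python
# Toy grammar (hypothetical): nonterminal -> list of alternative right-hand sides.
# Any symbol that is not a key of GRAMMAR is treated as a terminal word.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N": [["dog"], ["cat"]],
    "V": [["saw"], ["slept"]],
}


def expand(symbol, tokens, pos):
    """Yield every position reachable after deriving `symbol` from tokens[pos:]."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the next input token.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for alternative in GRAMMAR[symbol]:
        # Thread the set of possible positions through the right-hand side.
        positions = [pos]
        for child in alternative:
            positions = [q for p in positions for q in expand(child, tokens, p)]
        yield from positions


def recognize(sentence, start="S"):
    """Top-down: begin at the start symbol and try to cover the whole sentence."""
    tokens = sentence.split()
    return any(end == len(tokens) for end in expand(start, tokens, 0))


print(recognize("the dog saw a cat"))
print(recognize("dog the saw"))
```

A bottom-up parser would work in the opposite direction, combining words into ever larger constituents until only the start symbol is left.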
5.2 Treebank construction
Getting the file identifiers with the fileids() function:
import nltk
import nltk.corpus
print(str(nltk.corpus.treebank).replace('\\\\','/'))
print(nltk.corpus.treebank.fileids())
from nltk.corpus import treebank
print(treebank.words('wsj_0007.mrg'))
print(treebank.tagged_words('wsj_0007.mrg'))
The Treebank corpus reader:
import nltk
from nltk.corpus import treebank
print(treebank.parsed_sents('wsj_0007.mrg')[2])
import nltk
from nltk.corpus import treebank_chunk
print(treebank_chunk.chunked_sents()[1])
# draw() opens a window showing the tree and returns None, so it is not wrapped in print()
treebank_chunk.chunked_sents()[1].draw()
import nltk
from nltk.corpus import treebank_chunk
print(treebank_chunk.chunked_sents()[1].leaves())
print(treebank_chunk.chunked_sents()[1].pos())
print(treebank_chunk.chunked_sents()[1].productions())
print(nltk.corpus.treebank.tagged_words())
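Getting the tags and their frequencies:
import nltk
from nltk.probability import FreqDist
from nltk.corpus import treebank
# Count how often each part-of-speech tag occurs in the Treebank sample
fd = FreqDist(tag for (word, tag) in treebank.tagged_words())
print(fd.most_common(10))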
Accessing the Sinica Treebank corpus:
import nltk
from nltk.corpus import sinica_treebank
print(sinica_treebank.sents())
print(sinica_treebank.parsed_sents()[27])
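Treebank trees are stored as bracketed strings, and grammar rules can be read off them by emitting one production per internal node. The following is a minimal plain-Python sketch of that idea; the helper names `parse_bracketed` and `productions` and the sample sentence are hypothetical and hand-written for illustration (the sample is not taken from the corpus).

```python
def parse_bracketed(text):
    """Parse a Penn-Treebank-style bracketed string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "a subtree must start with '('"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    return read()


def productions(tree):
    """Return one 'LHS -> RHS' rule per internal node, in pre-order."""
    label, children = tree
    rhs = [c[0] if isinstance(c, tuple) else repr(c) for c in children]
    rules = [f"{label} -> {' '.join(rhs)}"]
    for child in children:
        if isinstance(child, tuple):
            rules.extend(productions(child))
    return rules


# Hand-written sample tree (an assumption, not a corpus sentence).
sample = "(S (NP (DT the) (NN dog)) (VP (VBD barked)))"
for rule in productions(parse_bracketed(sample)):
    print(rule)
```

In NLTK the same information is available through the `productions()` method of a tree, as used on the chunked sentences earlier in this section.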
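5.3 Extracting context-free grammar rules from Treebank
A context-free grammar (CFG) consists of the following parts:
- a finite set of nonterminal symbols (N);
- a finite set of terminal symbols (T);
- a start symbol (S);
- a finite set of productions (P), for example A -> a.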
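As a rough illustration of these four parts, the sketch below writes a tiny grammar out as plain Python values, checks that the productions only use declared symbols, and then performs a leftmost derivation from the start symbol. The grammar and the helper names `is_well_formed` and `derive` are assumptions for illustration only.

```python
# The four components of a toy CFG (hypothetical example).
N = {"S", "NP", "VP"}                      # nonterminals
T = {"the", "dog", "barks"}                # terminals
S = "S"                                    # start symbol
P = [                                      # productions as (lhs, rhs) pairs
    ("S", ("NP", "VP")),
    ("NP", ("the", "dog")),
    ("VP", ("barks",)),
]


def is_well_formed(nonterminals, terminals, start, prods):
    """Check that the start symbol and every production use declared symbols."""
    if start not in nonterminals or nonterminals & terminals:
        return False
    for lhs, rhs in prods:
        if lhs not in nonterminals:
            return False
        if any(sym not in nonterminals | terminals for sym in rhs):
            return False
    return True


def derive(start, prods):
    """Leftmost derivation that always picks the first matching production.

    This only terminates for non-recursive grammars such as the toy one above.
    """
    form = [start]
    steps = [list(form)]
    while True:
        for i, sym in enumerate(form):
            rule = next((rhs for lhs, rhs in prods if lhs == sym), None)
            if rule is not None:
                form[i:i + 1] = rule       # rewrite the leftmost nonterminal
                steps.append(list(form))
                break
        else:
            return steps                   # no nonterminal left to rewrite


print(is_well_formed(N, T, S, P))
for step in derive(S, P):
    print(" ".join(step))
```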
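At the sentence level, a CFG is built from the following four structures:
- declarative structure;
- imperative structure;
- yes-no question structure;
- wh-question structure.
Context-free grammar rules: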
import nltk
from nltk import Nonterminal, nonterminals, Production, CFG
nonterminal1 = Nonterminal('NP')
nonterminal2 = Nonterminal('VP')
nonterminal3 = Nonterminal('PP')
print(nonterminal1.symbol())
print(nonterminal2.symbol())
print(nonterminal3.symbol())
print(nonterminal1==nonterminal2)
print(nonterminal2==nonterminal3)
print(nonterminal1==nonterminal3)
S, NP, VP, PP = nonterminals('S, NP, VP, PP')
N, V, P, DT = nonterminals('N, V, P, DT')
production1 = Production(S, [NP, VP])
production2 = Production(NP, [DT, NP])
production3 = Production(VP, [V, NP,NP,PP])
print(production1.lhs())
print(production1.rhs())
print(production3.lhs())
print(production3.rhs())
print(production3 == Production(VP, [V,NP,NP,PP]))
print(production2 == production3)
Accessing the ATIS grammar:
import nltk
gram1 = nltk.data.load('grammars/large_grammars/atis.cfg')
print(gram1)
Extracting test sentences from ATIS:
import nltk
sent = nltk.data.load('grammars/large_grammars/atis_sentences.txt')
sent = nltk.parse.util.extract_test_sentences(sent)
print(len(sent))
testingsent=sent[25]
print(testingsent[1])
print(testingsent[0])
sent=testingsent[0]
Bottom-up parsing:
import nltk
gram1 = nltk.data.load('grammars/large_grammars/atis.cfg')
sent = nltk.data.load('grammars/large_grammars/atis_sentences.txt')
sent = nltk.parse.util.extract_test_sentences(sent)
testingsent=sent[25]
sent=testingsent[0]
parser1 = nltk.parse.BottomUpChartParser(gram1)
chart1 = parser1.chart_parse(sent)
print((chart1.num_edges()))
print((len(list(chart1.parses(gram1.start())))))
Bottom-up, left-corner parsing:
import nltk
gram1 = nltk.data.load('grammars/large_grammars/atis.cfg')
sent = nltk.data.load('grammars/large_grammars/atis_sentences.txt')
sent = nltk.parse.util.extract_test_sentences(sent)
testingsent=sent[25]
sent=testingsent[0]
parser2 = nltk.parse.BottomUpLeftCornerChartParser(gram1)
chart2 = parser2.chart_parse(sent)
print((chart2.num_edges()))
print((len(list(chart2.parses(gram1.start())))))
Left-corner parsing with a bottom-up filter:
import nltk
gram1 = nltk.data.load('grammars/large_grammars/atis.cfg')
sent = nltk.data.load('grammars/large_grammars/atis_sentences.txt')
sent = nltk.parse.util.extract_test_sentences(sent)
testingsent=sent[25]
sent=testingsent[0]
parser3 = nltk.parse.LeftCornerChartParser(gram1)
chart3 = parser3.chart_parse(sent)
print((chart3.num_edges()))
print((len(list(chart3.parses(gram1.start())))))
Top-down parsing:
import nltk
gram1 = nltk.data.load('grammars/large_grammars/atis.cfg')
sent = nltk.data.load('grammars/large_grammars/atis_sentences.txt')
sent = nltk.parse.util.extract_test_sentences(sent)
testingsent=sent[25]
sent=testingsent[0]
parser4 = nltk.parse.TopDownChartParser(gram1)
chart4 = parser4.chart_parse(sent)
print((chart4.num_edges()))
print((len(list(chart4.parses(gram1.start())))))
Incremental bottom-up parsing:
import nltk
gram1 = nltk.data.load('grammars/large_grammars/atis.cfg')
sent = nltk.data.load('grammars/large_grammars/atis_sentences.txt')
sent = nltk.parse.util.extract_test_sentences(sent)
testingsent=sent[25]
sent=testingsent[0]
parser5 = nltk.parse.IncrementalBottomUpChartParser(gram1)
chart5 = parser5.chart_parse(sent)
print((chart5.num_edges()))
print((len(list(chart5.parses(gram1.start())))))
Incremental bottom-up, left-corner parsing:
import nltk
gram1 = nltk.data.load('grammars/large_grammars/atis.cfg')
sent = nltk.data.load('grammars/large_grammars/atis_sentences.txt')
sent = nltk.parse.util.extract_test_sentences(sent)
testingsent=sent[25]
sent=testingsent[0]
parser6 = nltk.parse.IncrementalBottomUpLeftCornerChartParser(gram1)
chart6 = parser6.chart_parse(sent)
print((chart6.num_edges()))
print((len(list(chart6.parses(gram1.start())))))
Incremental left-corner parsing with a bottom-up filter:
import nltk
gram1 = nltk.data.load('grammars/large_grammars/atis.cfg')
sent = nltk.data.load('grammars/large_grammars/atis_sentences.txt')
sent = nltk.parse.util.extract_test_sentences(sent)
testingsent=sent[25]
sent=testingsent[0]
parser7 = nltk.parse.IncrementalLeftCornerChartParser(gram1)
chart7 = parser7.chart_parse(sent)
print((chart7.num_edges()))
print((len(list(chart7.parses(gram1.start())))))
Incremental top-down parsing:
import nltk
gram1 = nltk.data.load('grammars/large_grammars/atis.cfg')
sent = nltk.data.load('grammars/large_grammars/atis_sentences.txt')
sent = nltk.parse.util.extract_test_sentences(sent)
testingsent=sent[25]
sent=testingsent[0]
parser8 = nltk.parse.IncrementalTopDownChartParser(gram1)
chart8 = parser8.chart_parse(sent)
print((chart8.num_edges()))
print((len(list(chart8.parses(gram1.start())))))
Earley parsing:
import nltk
gram1 = nltk.data.load('grammars/large_grammars/atis.cfg')
sent = nltk.data.load('grammars/large_grammars/atis_sentences.txt')
sent = nltk.parse.util.extract_test_sentences(sent)
testingsent=sent[25]
sent=testingsent[0]
parser9 = nltk.parse.EarleyChartParser(gram1)
chart9 = parser9.chart_parse(sent)
print((chart9.num_edges()))
print((len(list(chart9.parses(gram1.start())))))
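5.4 Creating a probabilistic context-free grammar (PCFG) from a CFG
In a PCFG, a probability is attached to every production of the CFG, and the probabilities of all productions that share the same left-hand side sum to 1. The probability of a parse tree is the product of the probabilities of all the productions used to build it.
PCFG rule information: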
import nltk
from nltk.corpus import treebank
from itertools import islice
from nltk.grammar import PCFG, induce_pcfg, toy_pcfg1, toy_pcfg2
gram2 = PCFG.fromstring("""
A -> B B [.3] | C B C [.7]
B -> B D [.5] | C [.5]
C -> 'a' [.1] | 'b' [0.9]
D -> 'b' [1.0]
""")
prod1 = gram2.productions()[0]
print(prod1)
prod2 = gram2.productions()[1]
print(prod2)
print(prod2.lhs())
print(prod2.rhs())
print((prod2.prob()))
print(gram2.start())
print(gram2.productions())
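The two properties of a PCFG (probabilities sum to 1 per left-hand side, and a tree's probability is the product of its rule probabilities) can be checked by hand. The sketch below copies the rules of gram2 into a plain dictionary and scores one hand-built tree; the names `RULES`, `lhs_totals` and `tree_probability`, and the tree itself, are assumptions for illustration and do not use the NLTK API.

```python
import math
from collections import defaultdict

# The rules of gram2, written as (lhs, rhs) -> probability.
RULES = {
    ("A", ("B", "B")): 0.3,
    ("A", ("C", "B", "C")): 0.7,
    ("B", ("B", "D")): 0.5,
    ("B", ("C",)): 0.5,
    ("C", ("a",)): 0.1,
    ("C", ("b",)): 0.9,
    ("D", ("b",)): 1.0,
}


def lhs_totals(rules):
    """Sum the probabilities of all productions that share a left-hand side."""
    totals = defaultdict(float)
    for (lhs, _rhs), prob in rules.items():
        totals[lhs] += prob
    return dict(totals)


def tree_probability(tree, rules):
    """Multiply the probabilities of every production used in the tree.

    A tree is (label, [children]); a child is either a subtree or a word.
    """
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    prob = rules[(label, rhs)]
    for child in children:
        if isinstance(child, tuple):
            prob *= tree_probability(child, rules)
    return prob


# Hand-built tree for the string "b b": A -> B B, B -> C, C -> 'b' (twice).
tree = ("A", [("B", [("C", ["b"])]), ("B", [("C", ["b"])])])

print(lhs_totals(RULES))
print(tree_probability(tree, RULES))  # 0.3 * (0.5 * 0.9) * (0.5 * 0.9)
```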
Probabilistic chart parsing:
import nltk
from nltk.corpus import treebank
from itertools import islice
from nltk.grammar import PCFG, induce_pcfg, toy_pcfg1, toy_pcfg2
tokens = "Jack told Bob to bring my cookie".split()
grammar = toy_pcfg2
print(grammar)
5.5 The CYK chart parsing algorithm
CYK chart parsing uses dynamic programming and is one of the simplest chart parsing algorithms.
CYK chart parsing:
import nltk
tok = ["the", "kids", "opened", "the", "box", "on", "the", "floor"]
# nltk.parse_cfg() was removed in NLTK 3; CFG.fromstring() is its replacement
gram = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N -> 'kids' | 'box' | 'floor'
V -> 'opened'
P -> 'on'
""")
Building the initial chart (a well-formed substring table, WFST):
def init_wfst(tok, gram):
    numtokens1 = len(tok)
    # fill with dots
    wfst = [["." for i in range(numtokens1 + 1)] for j in range(numtokens1 + 1)]
    # fill in the diagonal with the category of each word
    for i in range(numtokens1):
        prod = gram.productions(rhs=tok[i])
        wfst[i][i + 1] = prod[0].lhs()
    return wfst
Filling in the chart:
def complete_wfst(wfst, tok, trace=False):
    # uses the module-level grammar `gram` defined earlier
    index1 = {}
    for prod in gram.productions():
        # build a reverse lookup: right-hand side -> left-hand side
        index1[prod.rhs()] = prod.lhs()
    numtokens1 = len(tok)
    for span in range(2, numtokens1 + 1):
        for start1 in range(numtokens1 + 1 - span):
            # go down towards the diagonal
            end1 = start1 + span
            for mid1 in range(start1 + 1, end1):
                nt1, nt2 = wfst[start1][mid1], wfst[mid1][end1]
                if (nt1, nt2) in index1:
                    if trace:
                        print("[%s] %3s [%s] %3s [%s] ==> [%s] %3s [%s]" %
                              (start1, nt1, mid1, nt2, end1, start1, index1[(nt1, nt2)], end1))
                    wfst[start1][end1] = index1[(nt1, nt2)]
    return wfst
Displaying the chart:
def display(wfst, tok):
    print('\nWFST ' + ' '.join([("%-4d" % i) for i in range(1, len(wfst))]))
    for i in range(len(wfst) - 1):
        print("%d   " % i, end=" ")
        for j in range(1, len(wfst)):
            print("%-4s" % wfst[i][j], end=" ")
        print()
Getting the output:
tok = ["the", "kids", "opened", "the", "box", "on", "the", "floor"]
res1 = init_wfst(tok, gram)
display(res1, tok)
res2 = complete_wfst(res1, tok)
display(res2, tok)
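The table above keeps a single nonterminal per cell, so a later match overwrites an earlier one. A common variation is to store a set of nonterminals in each cell, which keeps every analysis of an ambiguous span. The following plain-Python sketch shows that variation for the same grammar; the names `BINARY`, `LEXICAL`, `cyk_table` and `recognizes` are assumptions for illustration, and the grammar is written as tuples rather than as an NLTK object.

```python
# The grammar of this section in Chomsky normal form, as plain Python data.
BINARY = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("NP", ("NP", "PP")),
    ("VP", ("V", "NP")),
    ("VP", ("VP", "PP")),
    ("PP", ("P", "NP")),
]
LEXICAL = {
    "the": {"Det"},
    "kids": {"N"}, "box": {"N"}, "floor": {"N"},
    "opened": {"V"},
    "on": {"P"},
}


def cyk_table(tokens):
    """Fill table[i][j] with every nonterminal that derives tokens[i:j]."""
    n = len(tokens)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # Spans of length 1 come straight from the lexicon.
    for i, word in enumerate(tokens):
        table[i][i + 1] = set(LEXICAL.get(word, ()))
    # Longer spans are built from two shorter, already-filled spans.
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for lhs, (left, right) in BINARY:
                    if left in table[start][mid] and right in table[mid][end]:
                        table[start][end].add(lhs)
    return table


def recognizes(tokens, start="S"):
    """A sentence is accepted if the start symbol covers the whole input."""
    return start in cyk_table(tokens)[0][len(tokens)]


tokens = "the kids opened the box on the floor".split()
print(recognizes(tokens))
print(cyk_table(tokens)[2][8])  # categories spanning "opened ... floor"
```

Each cell depends only on shorter spans, which is what makes the computation a dynamic program.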
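5.6 The Earley chart parsing algorithm
The Earley algorithm was proposed by Earley in 1970. It resembles top-down parsing, can handle left recursion, and does not require the grammar to be converted to CNF (Chomsky normal form). The algorithm fills in the chart from left to right.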
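The left-to-right filling of the chart can be sketched in a few lines of plain Python. The recognizer below uses the usual three Earley operations (predict, scan, complete) on a small grammar that contains left-recursive rules such as NP -> NP PP. It is a simplified sketch under stated assumptions: the grammar, the function name `earley_recognize` and the test sentences are hypothetical, empty (epsilon) productions are not supported, and it only answers yes or no rather than building trees as NLTK's EarleyChartParser does.

```python
# Hypothetical toy grammar: nonterminal -> list of right-hand sides (tuples).
# Symbols that are not keys are terminals. Note the left-recursive rules.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("NP", "PP"), ("Det", "N"), ("I",)],
    "VP": [("VP", "PP"), ("V", "NP")],
    "PP": [("P", "NP")],
    "Det": [("a",), ("my",)],
    "N": [("dog",), ("cookie",)],
    "V": [("saw",)],
    "P": [("with",)],
}


def earley_recognize(tokens, grammar, start="S"):
    """Return True if `tokens` can be derived from `start`.

    A chart item is (lhs, rhs, dot, origin): the rule lhs -> rhs has been
    matched up to position `dot`, starting at input position `origin`.
    """
    n = len(tokens)
    chart = [[] for _ in range(n + 1)]     # chart[k]: items ending at position k
    seen = [set() for _ in range(n + 1)]   # for duplicate detection

    def add(k, item):
        if item not in seen[k]:
            seen[k].add(item)
            chart[k].append(item)

    for rhs in grammar[start]:
        add(0, (start, rhs, 0, 0))

    for k in range(n + 1):                 # fill the chart left to right
        i = 0
        while i < len(chart[k]):           # chart[k] grows while we scan it
            lhs, rhs, dot, origin = chart[k][i]
            i += 1
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in grammar:
                    # Predict: expect every expansion of the next nonterminal.
                    for alt in grammar[nxt]:
                        add(k, (nxt, alt, 0, k))
                elif k < n and tokens[k] == nxt:
                    # Scan: the next terminal matches the input word.
                    add(k + 1, (lhs, rhs, dot + 1, origin))
            else:
                # Complete: advance every item that was waiting for `lhs`.
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        add(k, (l2, r2, d2 + 1, o2))

    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[n])


print(earley_recognize("I saw a dog".split(), GRAMMAR))
print(earley_recognize("I saw a dog with my cookie".split(), GRAMMAR))
print(earley_recognize("saw I a dog".split(), GRAMMAR))
```

Because a predicted item is added only once per chart position, the left-recursive rule NP -> NP PP does not cause the infinite loop that a naive recursive descent parser would fall into.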
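Parsing with the Earley chart parser:
import nltk
# demo() asserts that the number of parses found equals numparses; the demo grammar gives one parse for this sentence
nltk.parse.earleychart.demo(print_times=False, trace=1, sent='I saw a dog', numparses=1)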
Parsing with the NLTK chart parser:
import nltk
nltk.parse.chart.demo(2, print_times=False, trace=1,sent='John saw a dog', numparses=1)
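Parsing with the Stepping chart parser in NLTK:
import nltk
# numparses must match the number of parses found; the demo grammar gives one parse for this sentence
nltk.parse.chart.demo(5, print_times=False, trace=1, sent='John saw a dog', numparses=1)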
Feature chart parsing:
import nltk
nltk.parse.featurechart.demo(print_times=False,print_grammar=True,parser=nltk.parse.featurechart.FeatureChartParser,sent='I saw a dog')
Implementing the Earley algorithm:
def demo(print_times=True, print_grammar=False,
         print_trees=True, trace=2,
         sent='I saw John with a dog with my cookie', numparses=5):
    """
    A demonstration of the Earley parsers.
    """
    import time
    from nltk.parse.chart import demo_grammar
    from nltk.parse.earleychart import EarleyChartParser
    # The grammar for ChartParser and SteppingChartParser:
    grammar = demo_grammar()
    if print_grammar:
        print("* Grammar")
        print(grammar)
    # Tokenize the sample sentence.
    print("* Sentence:")
    print(sent)
    tokens = sent.split()
    print(tokens)
    print()
    # Do the parsing.
    earley = EarleyChartParser(grammar, trace=trace)
    t = time.perf_counter()  # time.clock() was removed in Python 3.8
    chart = earley.chart_parse(tokens)
    parses = list(chart.parses(grammar.start()))
    t = time.perf_counter() - t
    # Print results.
    if numparses:
        assert len(parses) == numparses, 'Not all parses found'
    if print_trees:
        for tree in parses:
            print(tree)
    else:
        print("Nr trees:", len(parses))
    if print_times:
        print("Time:", t)
if __name__ == '__main__':
    demo()
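"""***Author's note: This post collects the content of Chapter 5 (parsing) of Mastering Natural Language Processing with Python. Parsing is crucial for part-of-speech tagging and is also the first task in understanding a sentence. The later chapters of the book will be written up in follow-up posts. This blog records every piece of code in the book, and I hope it helps others who are reading it. FIGHTING... (Criticism, corrections and discussion are all very welcome.)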
(Precious time, which cannot be recovered once lost.) ***"""