python nltk 8: Analyzing Sentence Structure

lakomi

Category: NLTK. Tags: python, nltk, NLP

Article link: https://blog.csdn.net/Q_s_qiu/article/details/107296255

8 Analyzing Sentence Structure

English documentation: http://www.nltk.org/book/
Chinese documentation: https://www.bookstack.cn/read/nlp-py-2e-zh/0.md
The section numbering below follows my own convention.

1 Some Grammatical Dilemmas

Ambiguity is pervasive in language, and an important goal is to be able to understand natural language in spite of it.
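To make the point about ambiguity concrete, here is a minimal sketch built on the "elephant in my pajamas" grammar from the NLTK book. The variable names are my own choice, and the grammar is a toy one, so treat it as an illustration rather than a realistic model of English.

```python
import nltk

# Toy grammar from the NLTK book; the PP "in my pajamas" can attach
# either to the noun phrase or to the verb phrase.
groucho_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I'
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas'
    V -> 'shot'
    P -> 'in'
    """)

sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)

# Collect every parse so that the ambiguity becomes visible as a count.
trees = list(parser.parse(sent))
for tree in trees:
    print(tree)
print(len(trees), "parses")
```

A chart parser is used here because the rule VP -> VP PP is left-recursive, which a recursive descent parser cannot handle.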

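2 What's the Use of Syntax?

2.1 Beyond n-grams

In the figure below, we systematically replace longer sequences with shorter ones while preserving grammaticality. Each sequence that forms a unit can in fact be replaced by a single word, and in the end we are left with just two elements.
[Figure: substitution of word sequences]
Substitution of word sequences: starting from the first row, we can replace particular sequences of words (e.g. the brook) with individual words (e.g. it); repeating this process, we arrive at a grammatical two-word sentence.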

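3 Context Free Grammar

Context-free grammars are defined in the nltk.grammar module. The example below defines a simple grammar and uses it to parse sentences.
In a context-free grammar, a production applies to a symbol regardless of the context in which that symbol occurs.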

# A simple context-free grammar
import nltk
from nltk import CFG

def asimple_grammar():
    grammar1 = CFG.fromstring("""
        S -> NP VP
        VP -> V NP | V NP PP
        PP -> P NP
        V -> "saw" | "ate" | "walked"
        NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
        Det -> "a" | "an" | "the" | "my"
        N -> "man" | "dog" | "cat" | "telescope" | "park"
        P -> "in" | "on" | "by" | "with"
        """)

    sent1 = "Mary saw Bob".split()
    rd_parser = nltk.RecursiveDescentParser(grammar1)
    # Parse sent1: there is only one result
    for tree in rd_parser.parse(sent1):
        print(tree)

    sent2 = "the dog saw a man in the park".split()
    # Parse sent2: two trees are returned, so the sentence is structurally
    # ambiguous (prepositional phrase attachment ambiguity)
    for tree in rd_parser.parse(sent2):
        print(tree)

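You can also write a grammar in a file of your own and load it.

# Write your own grammar in the file mygrammar.cfg,
# then load the custom grammar file
grammar1 = nltk.data.load('file:mygrammar.cfg')

The contents of mygrammar.cfg are as follows: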

S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
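If you would rather not depend on how resource URLs are resolved, one alternative is to read the file yourself and hand the text to CFG.fromstring. The sketch below assumes a temporary directory and reuses the file name mygrammar.cfg purely for illustration; it is not the approach used in the post.

```python
import os
import tempfile

from nltk import CFG, ChartParser

grammar_text = """
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical location; any readable path works the same way.
    path = os.path.join(tmp_dir, "mygrammar.cfg")
    with open(path, "w", encoding="utf-8") as f:
        f.write(grammar_text)

    # Read the file back and build the grammar from its text.
    with open(path, encoding="utf-8") as f:
        file_grammar = CFG.fromstring(f.read())

file_trees = list(ChartParser(file_grammar).parse("Mary saw Bob".split()))
for tree in file_trees:
    print(tree)
```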

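A grammar is said to be recursive if a category that occurs on the left-hand side of a production also appears on the right-hand side of a production.
In the example below, Nom -> Adj Nom is directly recursive, while S -> NP VP and VP -> V S together make S indirectly recursive.

import nltk

# Nom -> Adj Nom is the directly recursive production.
# (The note is kept outside the grammar string: a trailing "#" comment
# on a production line is not parsed as a comment.)
grammar2 = nltk.CFG.fromstring("""
      S  -> NP VP
      NP -> Det Nom | PropN
      Nom -> Adj Nom | N
      VP -> V Adj | V NP | V S | V NP PP
      PP -> P NP
      PropN -> 'Buster' | 'Chatterer' | 'Joe'
      Det -> 'the' | 'a'
      N -> 'bear' | 'squirrel' | 'tree' | 'fish' | 'log'
      Adj  -> 'angry' | 'frightened' |  'little' | 'tall'
      V ->  'chased'  | 'saw' | 'said' | 'thought' | 'was' | 'put'
      P -> 'on'
      """)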

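As a rough illustration of both kinds of recursion, the following sketch parses one sentence with stacked adjectives (direct recursion through Nom) and one with nested clauses (indirect recursion through S). The grammar is repeated so that the block runs on its own; the sentence choices and variable names are assumptions of mine.

```python
import nltk

recursive_grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det Nom | PropN
    Nom -> Adj Nom | N
    VP -> V Adj | V NP | V S | V NP PP
    PP -> P NP
    PropN -> 'Buster' | 'Chatterer' | 'Joe'
    Det -> 'the' | 'a'
    N -> 'bear' | 'squirrel' | 'tree' | 'fish' | 'log'
    Adj  -> 'angry' | 'frightened' |  'little' | 'tall'
    V ->  'chased'  | 'saw' | 'said' | 'thought' | 'was' | 'put'
    P -> 'on'
    """)

parser = nltk.ChartParser(recursive_grammar)

# Direct recursion: Nom -> Adj Nom lets adjectives stack up.
direct = "the angry bear chased the frightened little squirrel".split()
# Indirect recursion: VP -> V S embeds a sentence inside a sentence.
indirect = "Chatterer said Buster thought the tree was tall".split()

direct_trees = list(parser.parse(direct))
indirect_trees = list(parser.parse(indirect))

for tree in direct_trees + indirect_trees:
    print(tree)
```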
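4 Parsing With Context Free Grammar

A parser processes input sentences according to the productions of a grammar and builds one or more constituent structures that conform to the grammar. A grammar is a declarative specification of well-formedness; it is just a string, not a program. For example, a question-answering system parses the questions submitted to it.
Simple parsing algorithms: (1) recursive descent parsing, a top-down method; (2) shift-reduce parsing, a bottom-up method.
More sophisticated parsing algorithms: (1) left-corner parsing, a top-down method with bottom-up filtering; (2) chart parsing, a dynamic programming technique.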
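The strategies listed above can be compared on a single sentence. This is only a sketch: it reuses the simple grammar from section 3, picks LeftCornerChartParser from nltk.parse.chart as the left-corner variant, and the dictionary labels are my own.

```python
import nltk
from nltk.parse.chart import LeftCornerChartParser

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate" | "walked"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "man" | "dog" | "cat" | "telescope" | "park"
    P -> "in" | "on" | "by" | "with"
    """)

sentence = "the dog saw a man in the park".split()

# Each parser is built from the same declarative grammar; only the
# search strategy differs.
parsers = {
    "recursive descent (top-down)": nltk.RecursiveDescentParser(grammar),
    "left-corner chart": LeftCornerChartParser(grammar),
    "chart (dynamic programming)": nltk.ChartParser(grammar),
}

counts = {}
for name, parser in parsers.items():
    counts[name] = len(list(parser.parse(sentence)))
    print(name, "->", counts[name], "parses")
```

All three should agree on the number of parses, since they differ in how they search, not in which trees the grammar licenses.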

4.1 Recursive Descent Parsing

# nltk provides a recursive descent parser
rd_parser = nltk.RecursiveDescentParser(grammar1)

With this method, the top-down expansion of the parse tree can be watched in a graphical demo:

nltk.app.rdparser()

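4.2 Shift-Reduce Parsing

A shift-reduce parser is a simple bottom-up parser. It tries to find sequences of words and phrases that correspond to the right-hand side of a grammar production and replaces them with the left-hand side, until the whole sentence is reduced to S.

# nltk provides a shift-reduce parser. It finds at most one parse.
sr_parser = nltk.ShiftReduceParser(grammar1)

View the parsing process: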

nltk.app.srparser()
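The "at most one parse" limitation can be seen by running the shift-reduce parser next to a chart parser. The sketch below repeats the simple grammar from section 3 so that it runs on its own; the results dictionary is an illustrative name. Constructing ShiftReduceParser may print warnings about productions that can never be used, which is part of what the example is meant to show.

```python
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate" | "walked"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "man" | "dog" | "cat" | "telescope" | "park"
    P -> "in" | "on" | "by" | "with"
    """)

sr_parser = nltk.ShiftReduceParser(grammar)
chart_parser = nltk.ChartParser(grammar)

results = {}
for text in ["Mary saw Bob", "the dog saw a man in the park"]:
    tokens = text.split()
    # The shift-reduce parser never backtracks, so it yields 0 or 1 trees.
    sr_count = len(list(sr_parser.parse(tokens)))
    chart_count = len(list(chart_parser.parse(tokens)))
    results[text] = (sr_count, chart_count)
    print(text, "| shift-reduce:", sr_count, "| chart:", chart_count)
```

Because the parser reduces greedily, it can commit to a reduction too early and reach a dead end even when the grammar does license a parse.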

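5 Dependencies and Dependency Grammar

Phrase structure grammar is concerned with how words and sequences of words combine to form constituents.
Dependency grammar instead focuses on how words relate to other words. Dependency is a binary asymmetric relation that holds between a head (usually a verb) and its dependents. Every other word either depends on the head of the sentence or is connected to it through a path of dependencies.

# One way NLTK encodes a dependency grammar. It only captures which words
# depend on which; the type of dependency is not specified.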
groucho_dep_grammar = nltk.DependencyGrammar.fromstring("""
     'shot' -> 'I' | 'elephant' | 'in'
     'elephant' -> 'an' | 'in'
     'in' -> 'pajamas'
     'pajamas' -> 'my'
     """)
print(groucho_dep_grammar)
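A dependency grammar like this one can be used for parsing as well. The sketch below follows the NLTK book in using ProjectiveDependencyParser; the grammar is repeated so that the block is self-contained, and the variable names are my own.

```python
import nltk

dep_grammar = nltk.DependencyGrammar.fromstring("""
    'shot' -> 'I' | 'elephant' | 'in'
    'elephant' -> 'an' | 'in'
    'in' -> 'pajamas'
    'pajamas' -> 'my'
    """)

pdp = nltk.ProjectiveDependencyParser(dep_grammar)
sent = 'I shot an elephant in my pajamas'.split()

# Each result is a tree whose root is the head of the sentence.
dep_trees = list(pdp.parse(sent))
for tree in dep_trees:
    print(tree)
```

Since 'in' may depend on either 'shot' or 'elephant', the attachment ambiguity seen with the phrase structure grammar shows up here too.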

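In the dependency grammar tradition, the verbs in the table below are said to have different valencies. Valency restrictions apply not only to verbs but also to other classes of heads.
[Image: table of verbs with different valencies]
So far, scaling a grammar up to cover a large corpus of natural language remains very difficult. It is also hard to modularize a grammar so that each part can be developed independently.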

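6 Grammar Development

Accessing treebanks helps in developing broad-coverage grammars.

import nltk
from nltk.corpus import treebank

# The treebank corpus contains many sentences with hand-annotated parse trees
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print(t)

# Sinica Treebank corpus (Academia Sinica)
nltk.corpus.sinica_treebank.parsed_sents()[3450].draw()

As the coverage of a grammar increases and input sentences get longer, the number of parse trees grows as well, which leads to ambiguity. For example: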

grammar = nltk.CFG.fromstring("""
    S -> NP V NP
    NP -> NP Sbar
    Sbar -> NP V
    NP -> 'fish'
    V -> 'fish'
    """)
tokens = ["fish"] * 5
cp = nltk.ChartParser(grammar)
for tree in cp.parse(tokens):
    print(tree)

The code above outputs two trees:
[Image: the two parse trees]
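To see how quickly the number of analyses grows, the same toy grammar can be run on longer strings of "fish". This is a small sketch; the choice of lengths 3, 5 and 7 and the dictionary name are my own.

```python
import nltk

fish_grammar = nltk.CFG.fromstring("""
    S -> NP V NP
    NP -> NP Sbar
    Sbar -> NP V
    NP -> 'fish'
    V -> 'fish'
    """)
parser = nltk.ChartParser(fish_grammar)

parse_counts = {}
for n in (3, 5, 7):
    tokens = ["fish"] * n
    # Count the trees without keeping them all in memory.
    parse_counts[n] = sum(1 for _ in parser.parse(tokens))
    print(n, "tokens ->", parse_counts[n], "parses")
```

For these lengths the counts are 1, 2 and 5, and they keep growing rapidly as the sentence gets longer.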
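Handling ambiguity is therefore key to developing broad-coverage parsers. Although chart parsers improve the efficiency of computing multiple analyses of the same sentence, they are still overwhelmed by the sheer number of possible analyses. An effective alternative is to use weighted grammars and probabilistic parsing algorithms.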

def give(t):
    return t.label() == 'VP' and len(t) > 2 and t[1].label() == 'NP' and (
            t[2].label() == 'PP-DTV' or t[2].label() == 'NP') and (
                   'give' in t[0].leaves() or 'gave' in t[0].leaves())
                   
def sent(t):
    return ' '.join(token for token in t.leaves() if token[0] not in '*-0')

def print_node(t, width):
    output = "%s %s: %s / %s: %s" % (sent(t[0]), t[1].label(), sent(t[1]), t[2].label(), sent(t[2]))
    if len(output) > width:
        output = output[:width] + "..."
    print(output)

# Weighted grammar
def weighted_grammar():
    # Examine all instances of prepositional dative and double object
    # constructions involving "give"
    for tree in nltk.corpus.treebank.parsed_sents():
        for t in tree.subtrees(give):
            print_node(t, 72)

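A probabilistic context-free grammar (PCFG) attaches a probability to each production and requires that the probabilities of all productions with the same left-hand side sum to 1. Each parse tree is assigned a probability, and the parser returns the most probable parse.

# Probabilistic context-free grammar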
grammar = nltk.PCFG.fromstring("""
    S    -> NP VP              [1.0]
    VP   -> TV NP              [0.4]
    VP   -> IV                 [0.3]
    VP   -> DatV NP NP         [0.3]
    TV   -> 'saw'              [1.0]
    IV   -> 'ate'              [1.0]
    DatV -> 'gave'             [1.0]
    NP   -> 'telescopes'       [0.8]
    NP   -> 'Jack'             [0.2]
    """)
viterbi_parser = nltk.ViterbiParser(grammar)
for tree in viterbi_parser.parse(['Jack', 'saw', 'telescopes']):
    print(tree)
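The probability of a parse is the product of the probabilities of the productions used in it. The sketch below checks this by hand for the sentence above and also shows what happens when the probabilities for one left-hand side no longer sum to 1; the names pcfg_text, best, manual and rejected are illustrative.

```python
import nltk

pcfg_text = """
    S    -> NP VP              [1.0]
    VP   -> TV NP              [0.4]
    VP   -> IV                 [0.3]
    VP   -> DatV NP NP         [0.3]
    TV   -> 'saw'              [1.0]
    IV   -> 'ate'              [1.0]
    DatV -> 'gave'             [1.0]
    NP   -> 'telescopes'       [0.8]
    NP   -> 'Jack'             [0.2]
    """
pcfg = nltk.PCFG.fromstring(pcfg_text)

# The Viterbi parser yields the single most probable tree.
best = next(nltk.ViterbiParser(pcfg).parse(['Jack', 'saw', 'telescopes']))

# S -> NP VP, NP -> 'Jack', VP -> TV NP, TV -> 'saw', NP -> 'telescopes'
manual = 1.0 * 0.2 * 0.4 * 1.0 * 0.8
print(best.prob(), manual)

# Lowering one NP probability breaks the sum-to-1 requirement (0.5 + 0.2).
bad_text = pcfg_text.replace("[0.8]", "[0.5]")
try:
    nltk.PCFG.fromstring(bad_text)
    rejected = False
except ValueError as err:
    rejected = True
    print("rejected:", err)
```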
