Semantic Role Labeling (SRL): A First Look (notes from an English tutorial)

I recently did a quick survey of semantic role labeling; my notes follow.

  • Goal: structure linguistic information so that a computer can understand the semantics contained in a sentence.

    Semantic Role Labeling (SRL) is a shallow semantic analysis technique that labels certain phrases in a sentence as arguments (semantic roles) of a given predicate, such as agent, patient, time, and location. It can benefit applications such as question answering, information extraction, and machine translation.

  • Limitations of semantic role labeling

    • Arguments are labeled only for a given predicate; sentences with multiple predicates are not handled.
    • Semantics elided from the sentence are not recovered, so some information is lost.
  • Core semantic roles: six types, A0-A5. A0 is usually the agent of the action, A1 is usually the entity affected by the action, and A2-A5 take on different meanings depending on the predicate verb.

  • Adjunct semantic roles:

    • ADV: adverbial (default tag)
    • BNE: beneficiary
    • CND: condition
    • DIR: direction
    • DGR: degree
    • EXT: extent
    • FRQ: frequency
    • LOC: locative
    • MNR: manner
    • PRP: purpose or reason
    • TMP: temporal
    • TPC: topic
    • CRD: coordinated arguments
    • PRD: predicate
    • PSR: possessor
    • PSE: possessee
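The adjunct tags above are just a fixed inventory, so they can be kept as a simple lookup table; a minimal sketch (the dict and function names are my own, the tags and glosses come from the list above):

```python
# Adjunct (ArgM-style) role tags and their meanings, as listed above.
ADJUNCT_ROLES = {
    "ADV": "adverbial, default tag",
    "BNE": "beneficiary",
    "CND": "condition",
    "DIR": "direction",
    "DGR": "degree",
    "EXT": "extent",
    "FRQ": "frequency",
    "LOC": "locative",
    "MNR": "manner",
    "PRP": "purpose or reason",
    "TMP": "temporal",
    "TPC": "topic",
    "CRD": "coordinated arguments",
    "PRD": "predicate",
    "PSR": "possessor",
    "PSE": "possessee",
}

def describe(tag: str) -> str:
    """Return the human-readable meaning of an adjunct role tag."""
    return ADJUNCT_ROLES.get(tag, "unknown tag")

print(describe("TMP"))  # temporal
```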
  • Traditional approaches

    • They rely on the output of syntactic parsing. Since syntactic parsing includes phrase-structure parsing, shallow (chunk) parsing, and dependency parsing, SRL methods can be classified along the same lines:
    • SRL based on phrase-structure trees
    • SRL based on shallow parsing results
    • SRL based on dependency parsing results
    • Feature-vector-based SRL
    • Maximum-entropy-classifier-based SRL
    • Kernel-based SRL
    • Conditional-random-field-based SRL
    • These methods differ mainly in how they detect candidate arguments.
  • The common labeling pipeline: syntactic parsing -> candidate argument pruning -> argument identification -> argument classification -> SRL result

    • Argument pruning: remove from the many candidate spans those that are definitely not arguments
    • Argument identification: a binary classification problem (argument vs. not an argument)
    • Argument classification: a multi-class classification problem
# Phrase-structure parse of "我 吃 肉" ("I eat meat")
S
├── NN 我
└── VP
    ├── Vt 吃
    └── NN 肉
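The toy tree above can be written down directly as nested tuples, which is enough to experiment with tree traversals; a minimal sketch (the tuple encoding and function name are my own):

```python
# Phrase-structure tree for "我 吃 肉" ("I eat meat") as nested tuples:
# (label, children...) for internal nodes, (label, word) for leaves.
TREE = ("S",
        ("NN", "我"),
        ("VP",
         ("Vt", "吃"),
         ("NN", "肉")))

def leaves(node):
    """Collect the words at the leaves, left to right."""
    label, *rest = node
    if len(rest) == 1 and isinstance(rest[0], str):
        return [rest[0]]                 # leaf node: (label, word)
    return [w for child in rest for w in leaves(child)]

print(leaves(TREE))  # ['我', '吃', '肉']
```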
  • How should the features for these classifiers be designed? Commonly used features include:

    • the predicate itself,
    • the path in the phrase-structure tree,
    • the phrase type,
    • the position of the argument relative to the predicate,
    • the voice of the predicate,
    • the head word of the argument,
    • the subcategorization frame,
    • the first and last words of the argument,
    • combination features.
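The feature set above can be sketched as a single extraction function; a minimal sketch, assuming each candidate constituent arrives as a dict with its phrase type, tokens, start index, and tree path already computed (all names and the dict layout are my own; voice and subcategorization are omitted for brevity):

```python
def extract_features(predicate, pred_index, constituent):
    """Build a feature dict for one candidate argument constituent.

    `constituent` is assumed to have keys: phrase_type, words (token
    list), start (index of its first token), and path (the category
    path from the constituent to the predicate in the parse tree).
    """
    words = constituent["words"]
    position = "before" if constituent["start"] < pred_index else "after"
    return {
        "predicate": predicate,                    # the predicate itself
        "path": constituent["path"],               # tree path to predicate
        "phrase_type": constituent["phrase_type"], # e.g. NP, PP
        "position": position,                      # relative to predicate
        "head_word": words[-1],                    # crude head: last word
        "first_word": words[0],
        "last_word": words[-1],
        # combination feature: phrase type + relative position
        "type+position": constituent["phrase_type"] + "_" + position,
    }

feats = extract_features("吃", 1, {
    "phrase_type": "NN", "words": ["肉"], "start": 2, "path": "NN↑VP↓Vt"})
print(feats["position"])  # after
```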
  • Application areas

    • digital library construction
    • information retrieval
    • information extraction
    • knowledge extraction from scientific literature
  • Drawbacks of current labeling methods

    • Accuracy depends on the accuracy of syntactic parsing
    • Domain adaptation is poor
    • How much potential is left in existing classification algorithms, and how many new features can still be designed? Probably not much.
    • End-to-end approaches would remove the dependence on syntactic parsing results
    • Could multilingual parallel corpora help compensate for the accuracy problem?

NAACL 2009 tutorial

  • Linguistic Background, Resources, Annotation

    • Motivation: From Sentences to Propositions (extracting the core meaning of a sentence)
    • Capturing semantic roles

    • Case Theory

      • Case relations occur in deep-structure
        • Surface-structure cases are derived
      • A sentence is a verb + one or more NPs

        • Each NP has a deep-structure case
          • A(gentive)
          • I(nstrumental)
          • D(ative) - recipient
          • F(actitive) – result
          • L(ocative)
          • O(bjective) – affected object, theme
        • Subject is no more important than Object
          • Subject/Object are surface structure
      • Case Theory Benefits - Generalizations

        • Fewer tokens
          • Fewer verb senses
          • E.g. cook/bake [ __O(A)] covers
            • Mother is cooking/baking the potatoes
            • The potatoes are cooking/baking.
            • Mother is cooking/baking.
        • Fewer types
          • “Different” verbs may be the same semantically, but with different subject selection preferences
          • E.g. like and please are both [ __O+D]
      • Oops, problems with Cases/Thematic Roles

        • How many and what are they?
        • Fragmentation: 4 Agent subtypes? (Cruse, 1973)
          • The sun melted the ice./This clothes dryer doesn’t dry clothes well
        • Ambiguity: Andrews (1985)
          • Argument/adjunct distinctions – Extent?
          • The kitten licked my fingers. – Patient or Theme?
        • Θ-Criterion (GB Theory): each NP of predicate in lexicon assigned unique θ-role (Chomsky 1981).
      • Argument Selection Principle

        • Proto-Agent (e.g. the mother)
          • Volitional involvement in the event or state
          • Sentience (and/or perception)
          • Causes an event or change of state in another participant
          • Movement (relative to the position of another participant)
          • (Exists independently of the event named)
          • *may be discourse pragmatic
        • Proto-Patient (e.g. the cake)
          • Undergoes change of state
          • Incremental theme
          • Causally affected by another participant
          • Stationary relative to movement of another participant
          • (Does not exist independently of the event, or at all)
          • *may be discourse pragmatic
        • Why numbered arguments?

          • Lack of consensus concerning semantic role labels
          • Numbers correspond to verb-specific labels
          • Arg0 is the Proto-Agent and Arg1 the Proto-Patient (Dowty, 1991)
          • Args 2-5 are highly variable and overloaded, which hurts performance
        • Why do we need Frameset ID’s?

          • Because a single verb can have multiple senses in different contexts
      • Annotation procedure, WSJ PropBank (Palmer et al., 2005)

        • PTB II: extraction of all sentences with a given verb
          • Create a Frame File for that verb (Paul Kingsbury)
            • (3100+ lemmas, 4400 framesets, 118K predicates)
            • Over 300 created automatically via VerbNet
        • First pass: automatic tagging (Joseph Rosenzweig)
  • Supervised Semantic Role Labeling and Leveraging Parallel PropBanks

    • basic knowledge

      • SRL on a constituency parse
        • A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled. For a simple sentence “John sees Bill”, a constituency parse would be:
          • (In short: non-leaf nodes are phrases, leaf nodes are words, and edges carry no labels.)
      • SRL on a dependency parse
        • A dependency parse connects words according to their relationships. Each vertex in the tree represents a word, child nodes are words that are dependent on the parent, and edges are labeled by the relationship. A dependency parse of “John sees Bill” would be:
        • (In short: each node is a word, and each edge is labeled with the relation between the words it connects.)
      • A dependency tree can be derived from a constituency tree, but not the other way around; the conversion uses the head-finding rules from Zhang and Clark (2008).
      • The head word generally refers to the central word of a phrase.
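The dependency parse of “John sees Bill” described above can be written down as a small edge list; a minimal sketch (the relation names follow common dependency-grammar conventions and are my own choice here):

```python
# Dependency parse of "John sees Bill": each edge is
# (head, dependent, relation); "sees" is the root verb.
EDGES = [
    ("sees", "John", "nsubj"),   # John is the subject of sees
    ("sees", "Bill", "dobj"),    # Bill is the direct object of sees
]

def dependents(word):
    """Return the (dependent, relation) pairs attached to `word`."""
    return [(dep, rel) for head, dep, rel in EDGES if head == word]

print(dependents("sees"))  # [('John', 'nsubj'), ('Bill', 'dobj')]
```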
    • SRL Supervised ML Pipeline

      1. Syntactic Parse
      2. Prune Constituents [Xue, Palmer 2004]
        • For the predicate and each of its ancestors, collect their sisters unless the sister is coordinated with the predicate
        • If a sister is a PP (prepositional phrase), also collect its immediate children
      3. Argument Identification(ML)
        • Extract features from sentence, syntactic parse, and
          other sources for each candidate constituent
        • Train statistical ML classifier to identify arguments
      4. Argument Classification (ML)
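The pruning heuristic from step 2 (Xue and Palmer, 2004) can be sketched over a nested-tuple constituency tree; a minimal sketch, using a (label, children...) tuple encoding of my own and omitting the coordination check from the original algorithm:

```python
# Xue & Palmer (2004) pruning sketch: starting at the predicate, walk
# up the tree and collect the sisters of the predicate and of each of
# its ancestors; for a PP sister, also collect its immediate children.
# Trees are (label, children...) tuples; leaves are (label, word).

TREE = ("S",
        ("NP", ("NN", "Mother")),
        ("VP",
         ("Vt", "bakes"),
         ("NP", ("NN", "potatoes"))))

def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

def prune(tree, predicate_word):
    """Return candidate argument constituents for `predicate_word`."""
    def walk(node):
        # Returns (found, candidates): found is True if the predicate
        # occurs inside `node`; candidates are the sisters collected
        # on the way back up from the predicate.
        if is_leaf(node):
            return node[1] == predicate_word, []
        label, *children = node
        for i, child in enumerate(children):
            found, cands = walk(child)
            if found:
                new = []
                for j, sister in enumerate(children):
                    if j == i:
                        continue
                    new.append(sister)
                    if not is_leaf(sister) and sister[0] == "PP":
                        new.extend(sister[1:])  # PP: add immediate children
                return True, cands + new
        return False, []
    return walk(tree)[1]

cands = prune(TREE, "bakes")
print([c[0] for c in cands])  # ['NP', 'NP']
```

On this toy tree the surviving candidates are the object NP (sister of the predicate) and the subject NP (sister of the predicate's VP ancestor), which is exactly the effect the heuristic is after: most spans are discarded before argument identification ever runs.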