Semantic Role Labeling
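I recently surveyed semantic role labeling; my notes follow.
The goal is to structure linguistic information so that a computer can more easily understand the semantics a sentence carries.
Semantic Role Labeling (SRL) is a shallow semantic analysis technique that labels certain phrases in a sentence as arguments (semantic roles) of a given predicate, such as agent, patient, time, and location. It benefits applications such as question answering, information extraction, and machine translation.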
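Limitations of semantic role labeling
- It labels arguments only for a given predicate; sentences with multiple predicates are not addressed.
- It does not recover the parts of the meaning that the sentence omits, so some information is lost.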
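Core semantic roles: six labels, A0 to A5. A0 usually denotes the agent of the action, A1 usually denotes what the action affects, and A2 to A5 take different meanings depending on the predicate verb.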
Additional semantic roles (stated as 15 types, although 16 tags are listed):
- ADV adverbial, default tag
- BNE beneficiary
- CND condition
- DIR direction
- DGR degree
- EXT extent
- FRQ frequency
- LOC locative
- MNR manner
- PRP purpose or reason
- TMP temporal
- TPC topic
- CRD coordinated arguments
- PRD predicate (verb)
- PSR possessor
- PSE possessee
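Traditional methods
- They rely on the output of syntactic parsing. Since syntactic parsing covers phrase-structure parsing, shallow parsing, and dependency parsing, SRL methods can be classified along the same lines.
- SRL based on phrase-structure trees
- SRL based on shallow parsing results
- SRL based on dependency parsing results
- SRL based on feature vectors
- SRL based on a maximum-entropy classifier
- SRL based on kernel functions
- SRL based on conditional random fields
- The methods differ mainly in how they detect arguments.
The common labeling pipeline: syntactic parsing -> candidate argument pruning -> argument identification -> argument labeling -> SRL result
- Argument pruning: from the many candidates, discard the spans that are certainly not arguments
- Argument identification: a binary classification problem (argument or not an argument)
- Argument labeling: a multi-class classification problem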
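The stages above can be sketched on a toy tree for the sentence "I eat meat". This is only an illustration under stated assumptions: the `Node` class, `prune_candidates`, `is_argument`, and `classify_role` are hypothetical names, the pruning step keeps only sisters of the predicate and of its ancestors (a simplified version of the heuristic attributed to Xue and Palmer, 2004, without the coordination and PP rules), and the two "classifiers" are hand-written rules standing in for trained models.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # eq=False keeps identity comparison, so nodes compare with `is`
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False)

    def __post_init__(self) -> None:
        # Link every child back to this node so we can walk upwards later.
        for child in self.children:
            child.parent = self

    def leaves(self) -> List["Node"]:
        if self.word is not None:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

    def text(self) -> str:
        return " ".join(leaf.word for leaf in self.leaves())


def prune_candidates(predicate: Node) -> List[Node]:
    """Pruning: keep sisters of the predicate and of each of its ancestors."""
    candidates: List[Node] = []
    current = predicate
    while current.parent is not None:
        candidates.extend(s for s in current.parent.children if s is not current)
        current = current.parent
    return candidates


def is_argument(candidate: Node) -> bool:
    """Identification (binary): hypothetical rule in place of a trained classifier."""
    return candidate.label in {"NN", "NP", "PP"}


def classify_role(before_predicate: bool) -> str:
    """Labeling (multi-class): hypothetical rule in place of a trained classifier."""
    return "A0" if before_predicate else "A1"


def label_arguments(root: Node, predicate: Node) -> Dict[str, str]:
    order = root.leaves()
    predicate_position = order.index(predicate)
    roles: Dict[str, str] = {}
    for candidate in prune_candidates(predicate):
        if not is_argument(candidate):
            continue
        first_leaf = candidate.leaves()[0]
        roles[candidate.text()] = classify_role(order.index(first_leaf) < predicate_position)
    return roles


subject = Node("NN", word="I")
verb = Node("Vt", word="eat")
obj = Node("NN", word="meat")
tree = Node("S", [subject, Node("VP", [verb, obj])])
roles = label_arguments(tree, verb)
print(roles)
```

Keeping the three stages as separate functions mirrors the pipeline: each stage can be replaced by a learned model without touching the others.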
# Phrase-structure analysis
S
├── NN  我 (I)
└── VP
    ├── Vt  吃 (eat)
    └── NN  肉 (meat)
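How should the features for these classification problems be designed?
- The predicate itself
- The path through the phrase-structure tree
- The phrase type
- The position of the argument relative to the predicate
- The voice of the predicate
- The head word of the argument
- The subcategorization of the predicate
- The first and last words of the argument
- Combinations of the features above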
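As an illustration of the tree-path feature, the sketch below walks from the predicate up to the lowest common ancestor and then down to the candidate constituent, joining the node labels along the way. The `Node` class, the `ancestors` and `tree_path` helpers, and the arrow notation are assumptions made for this example; real systems differ in direction and formatting of the path.

```python
class Node:
    """Minimal parse-tree node with a parent pointer (hypothetical helper)."""

    def __init__(self, label, children=(), word=None):
        self.label = label
        self.word = word
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def tree_path(predicate, constituent):
    """Labels from the predicate up to the common ancestor, then down to the constituent."""
    up = ancestors(predicate)
    down = ancestors(constituent)
    down_ids = {id(n) for n in down}
    # Lowest common ancestor: first node above the predicate that also dominates the constituent.
    lca_index = next(i for i, n in enumerate(up) if id(n) in down_ids)
    lca = up[lca_index]
    up_labels = [n.label for n in up[: lca_index + 1]]
    down_labels = [n.label for n in down[: down.index(lca)]]
    down_labels.reverse()
    return "↑".join(up_labels) + "".join("↓" + label for label in down_labels)


subject = Node("NN", word="I")
verb = Node("Vt", word="eat")
obj = Node("NN", word="meat")
tree = Node("S", [subject, Node("VP", [verb, obj])])

path_to_subject = tree_path(verb, subject)
path_to_object = tree_path(verb, obj)
print(path_to_subject, path_to_object)
```

The two candidates share the phrase type `NN` but get different paths, which is why the path is useful alongside the phrase-type and position features.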
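Application areas
- Digital library construction
- Information retrieval
- Information extraction
- Knowledge extraction from scientific literature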
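Drawbacks of current labeling methods
- They depend on the accuracy of syntactic parsing
- They adapt poorly to new domains
- How much potential is left in the existing classification algorithms? Likewise, how many new features can still be designed? Probably not many.
- End-to-end approaches no longer depend on the output of syntactic parsing
- Could multilingual parallel corpora help compensate for the accuracy problem?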
NAACL 2009 tutorial
Linguistic Background, Resources, Annotation
- Motivation: From Sentences to Propositions (extracting the core meaning of a sentence)
Capturing semantic roles
Case Theory
- Case relations occur in deep-structure
- Surface-structure cases are derived
A sentence is a verb + one or more NPs
- Each NP has a deep-structure case
- A(gentive)
- I(nstrumental)
- D(ative) - recipient
- F(actitive) – result
- L(ocative)
- O(bjective) – affected object, theme
- Subject is no more important than Object
- Subject/Object are surface structure
Case Theory Benefits - Generalizations
- Fewer tokens
- Fewer verb senses
- E.g. cook/bake [ __O(A)] covers
- Mother is cooking/baking the potatoes
- The potatoes are cooking/baking.
- Mother is cooking/baking.
- Fewer types
- “Different” verbs may be the same semantically, but with different subject selection preferences
- E.g. like and please are both [ __O+D]
Oops, problems with Cases/Thematic Roles
- How many and what are they?
- Fragmentation: 4 Agent subtypes? (Cruse, 1973)
- The sun melted the ice./This clothes dryer doesn’t dry clothes well
- Ambiguity: Andrews (1985)
- Argument/adjunct distinctions – Extent?
- The kitten licked my fingers. – Patient or Theme?
- Θ-Criterion (GB Theory): each NP of predicate in lexicon assigned unique θ-role (Chomsky 1981).
Argument Selection Principle
- Proto-Agent – the mother
- Volitional involvement in event or state
- Sentience (and/or perception)
- Causes an event or change of state in another participant
- Movement (relative to position of another participant)
- (exists independently of event named)
- *may be discourse pragmatic
- Proto-Patient – the cake
- Undergoes change of state
- Incremental theme
- Causally affected by another participant
- Stationary relative to movement of another participant
- (does not exist independently of the event, or at all)
- *may be discourse pragmatic
Why numbered arguments?
- Lack of consensus concerning semantic role labels
- Numbers correspond to verb-specific labels
- Arg0 – Proto-Agent, and Arg1 – Proto-Patient, (Dowty, 1991)
- Args 2-5 are highly variable and overloaded – poor performance
Why do we need Frameset ID’s?
- Because a verb can have several senses in different contexts
Annotation procedure, WSJ PropBank (Palmer et al., 2005)
- PTB II - Extraction of all sentences with given verb
- Create Frame File for that verb (Paul Kingsbury)
- (3100+ lemmas, 4400 framesets, 118K predicates)
- Over 300 created automatically via VerbNet
- First pass: Automatic tagging (Joseph Rosenzweig)
- http://www.cis.upenn.edu/~josephr/TIDES/index.html#lexicon
- Second pass: Double blind hand correction (Paul Kingsbury)
- Tagging tool highlights discrepancies (Scott Cotton)
- Third pass: Solomonization (adjudication)
- Betsy Klipple, Olga Babko-Malaya
Supervised Semantic Role Labeling and Leveraging Parallel PropBanks
basic knowledge
- SRL on Constituent Parse (constituency parsing)
- A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled. For a simple sentence “John sees Bill”, a constituency parse would be: (Sentence (Noun Phrase John) (Verb Phrase (Verb sees) (Noun Phrase Bill)))
- In short: non-leaf nodes are phrases, leaf nodes are words, and edges carry no labels.
- SRL on Dependency Parse
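- A dependency parse connects words according to their relationships. Each vertex in the tree represents a word, child nodes are words that are dependent on the parent, and edges are labeled by the relationship. A dependency parse of “John sees Bill” would be: “sees” is the root, with “John” attached as its subject and “Bill” as its object.
- In short: a dependency parse links words by their relations; each node is a word, and each edge is labeled with the relation.
- A dependency tree can be derived from a constituency tree, but a constituency tree cannot be derived from a dependency tree. The conversion uses the head-finding rules from Zhang and Clark (2008).
- The head word generally refers to the central word of a phrase in the phrase structure.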
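To make the constituency-to-dependency conversion concrete, here is a minimal sketch that percolates head words up a toy tree for “John sees Bill” and emits unlabeled (head, dependent) arcs. The `HEAD_CHILD` table is a made-up toy rule set for this one sentence, not the actual head-finding rules of Zhang and Clark (2008), and the tuple-based tree format, the Penn-style labels, and the `to_dependencies` function are likewise assumptions.

```python
# Toy head rules (assumption): for each phrase label, which child label supplies the head.
HEAD_CHILD = {"S": "VP", "VP": "VBZ", "NP": "NNP"}


def to_dependencies(tree, arcs):
    """Return the lexical head of `tree`, appending (head, dependent) pairs to `arcs`.

    A tree is (label, word) for a leaf or (label, [subtrees]) for a phrase.
    """
    label, body = tree
    if isinstance(body, str):
        return body  # a leaf is its own head
    child_heads = [(child[0], to_dependencies(child, arcs)) for child in body]
    wanted = HEAD_CHILD.get(label)
    # Pick the child named by the rule; fall back to the first child if no rule matches.
    head_index = next(
        (i for i, (child_label, _) in enumerate(child_heads) if child_label == wanted), 0
    )
    head = child_heads[head_index][1]
    for i, (_, word) in enumerate(child_heads):
        if i != head_index:
            arcs.append((head, word))  # every non-head child depends on the head
    return head


sentence = (
    "S",
    [
        ("NP", [("NNP", "John")]),
        ("VP", [("VBZ", "sees"), ("NP", [("NNP", "Bill")])]),
    ],
)

arcs = []
root = to_dependencies(sentence, arcs)
print(root, arcs)
```

The sketch also hints at why the reverse direction is not possible: the arcs keep only word-to-word links, so the phrase labels and bracketing of the original tree are gone.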
SRL Supervised ML Pipeline
- Syntactic Parse
- Prune Constituents [Xue, Palmer 2004]
- For the predicate and each of its ancestors, collect their sisters unless the sister is coordinated with the predicate
- If a sister is a PP (prepositional phrase), also collect its immediate children
- Argument Identification(ML)
- Extract features from sentence, syntactic parse, and other sources for each candidate constituent
- Train statistical ML classifier to identify arguments
- Argument Classification(ML&#x