Semantic Role Labeling
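I recently surveyed semantic role labeling; these are my notes.
The goal is to structure linguistic information so that a computer can more easily understand the semantics carried by a sentence.
Semantic Role Labeling (SRL) is a shallow semantic analysis technique: it labels certain phrases in a sentence as arguments (semantic roles) of a given predicate, such as agent, patient, time and location. It can benefit applications such as question answering, information extraction and machine translation.
Shortcomings of semantic role labeling
- Arguments are labeled only with respect to one given predicate; how several predicates in a sentence relate to each other is not covered.
- Semantics that the sentence leaves out (elided material) is not recovered, so some information is missing.
Core semantic roles: six labels, A0-A5. A0 usually denotes the agent of the action, A1 usually denotes what the action affects, and A2-A5 take different meanings depending on the predicate verb.
Additional semantic roles (16 tags are listed below):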
- ADV adverbial, default tag
- BNE beneficiary
- CND condition
- DIR direction
- DGR degree
- EXT extent
- FRQ frequency
- LOC locative
- MNR manner
- PRP purpose or reason
- TMP temporal
- TPC topic
- CRD coordinated arguments
- PRD predicate
- PSR possessor
- PSE possessee
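Traditional methods
- They rely on the output of syntactic analysis. Since syntactic analysis comprises phrase-structure parsing, shallow parsing (chunking) and dependency parsing, SRL methods can be grouped the same way.
- SRL based on phrase-structure trees
- SRL based on shallow parsing results
- SRL based on dependency parsing results
- SRL based on feature vectors
- SRL based on maximum entropy classifiers
- SRL based on kernel functions
- SRL based on conditional random fields
- The methods differ mainly in how they detect arguments.
Common labeling pipeline: syntactic parsing -> candidate argument pruning -> argument identification -> argument classification -> SRL output
- Argument pruning: discard candidates (spans) that are certainly not arguments
- Argument identification: a binary classification problem (argument vs. non-argument)
- Argument classification: a multi-class classification problem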
# Phrase-structure parse of "我吃肉" ("I eat meat")
(S (NN 我) (VP (Vt 吃) (NN 肉)))
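How should features for these classifiers be designed?
- the predicate itself
- the path in the phrase-structure tree
- the phrase type
- the position of the argument relative to the predicate
- the voice of the predicate
- the head word of the argument
- the subcategorization frame
- the first and last word of the argument
- combined features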
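Application areas
- Digital library construction
- Information retrieval
- Information extraction
- Knowledge extraction from scientific literature
Drawbacks of current labeling methods
- They depend on the accuracy of syntactic parsing
- Poor domain adaptation
- How much potential is left in existing classification algorithms, and how many new features can still be designed? Probably not much.
- End-to-end models do not need to depend on syntactic parsing results
- Could multilingual parallel corpora help compensate for the accuracy problem?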
tutorial of NAACL2009
Linguistic Background, Resources, Annotation
- Motivation: From Sentences to Propositions (extract the core meaning of a sentence)
Capturing semantic roles
Case Theory
- Case relations occur in deep-structure
- Surface-structure cases are derived
A sentence is a verb + one or more NPs
- Each NP has a deep-structure case
- A(gentive)
- I(nstrumental)
- D(ative) - recipient
- F(actitive) – result
- L(ocative)
- O(bjective) – affected object, theme
- Subject is no more important than Object
- Subject/Object are surface structure
Case Theory Benefits - Generalizations
- Fewer tokens
- Fewer verb senses
- E.g. cook/bake [ __O(A)] covers
- Mother is cooking/baking the potatoes
- The potatoes are cooking/baking.
- Mother is cooking/baking.
- Fewer types
- “Different” verbs may be the same semantically, but with different subject selection preferences
- E.g. like and please are both [ __O+D]
Oops, problems with Cases/Thematic Roles
- How many and what are they?
- Fragmentation: 4 Agent subtypes? (Cruse, 1973)
- The sun melted the ice./This clothes dryer doesn’t dry clothes well
- Ambiguity: Andrews (1985)
- Argument/adjunct distinctions – Extent?
- The kitten licked my fingers. – Patient or Theme?
- Θ-Criterion (GB Theory): each NP of predicate in lexicon assigned unique θ-role (Chomsky 1981).
Argument Selection Principle
- Proto-Agent- the mother
- Volitional involvement in event or state
- Sentience (and/or perception)
- Causes an event or change of state in another participant
- Movement (relative to position of another participant)
- (exists independently of event named)
*may be discourse pragmatic
Proto-Patient – the cake
- Undergoes change of state
- Incremental theme
- Causally affected by another participant
- Stationary relative to movement of another participant
- (does not exist independently of the event, or at all)
- *may be discourse pragmatic
Why numbered arguments?
- Lack of consensus concerning semantic role labels
- Numbers correspond to verb-specific labels
- Arg0 – Proto-Agent, and Arg1 – Proto-Patient, (Dowty, 1991)
- Args 2-5 are highly variable and overloaded – poor performance
Why do we need Frameset ID’s?
- Because a verb can have several senses in different contexts
Annotation procedure, WSJ PropBank [Palmer et al., 2005]
- PTB II - Extraction of all sentences with given verb
- Create Frame File for that verb (Paul Kingsbury)
- (3100+ lemmas, 4400 framesets, 118K predicates)
- Over 300 created automatically via VerbNet
- First pass: Automatic tagging (Joseph Rosenzweig)
- http://www.cis.upenn.edu/~josephr/TIDES/index.html#lexicon
- Second pass: Double blind hand correction (Paul Kingsbury)
- Tagging tool highlights discrepancies (Scott Cotton)
- Third pass: Solomonization (adjudication)
- Betsy Klipple, Olga Babko-Malaya
Supervised Semantic Role Labeling and Leveraging Parallel PropBanks
basic knowledge
- SRL on Constituent Parse
- A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled. A constituency parse of the simple sentence “John sees Bill” is shown in the example later in these notes.
- In short: non-leaf nodes are phrases, leaf nodes are words, and edges carry no labels.
- SRL on Dependency Parse
- A dependency parse connects words according to their relationships. Each vertex in the tree represents a word, child nodes are words that are dependent on the parent, and edges are labeled by the relationship. In a dependency parse of “John sees Bill”, “sees” is the root, “John” is its subject and “Bill” is its object.
- In short: a dependency parse links words by their relations; each node is a word and each edge is labeled with a relation.
- A dependency tree can be derived from a constituency tree, but a constituency tree cannot be derived from a dependency tree. The conversion uses the head-finding rules from Zhang and Clark 2008.
- The head word generally refers to the central word of a phrase.
SRL Supervised ML Pipeline
- Syntactic Parse
- Prune Constituents [Xue, Palmer 2004]
- For the predicate and each of its ancestors, collect their sisters unless the sister is coordinated with the predicate
- If a sister is a PP (prepositional phrase) also collect its immediate children
- Argument Identification(ML)
- Extract features from sentence, syntactic parse, and other sources for each candidate constituent
- Train statistical ML classifier to identify arguments
- Argument Classification(ML)
- Extract features
- Train statistical ML classifier to select appropriate label
- SVM, Linear (MaxEnt, LibLinear, etc), structured (CRF) classifiers for arguments
- All vs one, pairwise, structured multi-label classification
- Structural Inference(heuristic or ML optimization)
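The constituent pruning step of the pipeline above [Xue, Palmer 2004] can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Node` class and `prune_candidates` are hypothetical names, and coordination is approximated by simply checking for a `CC` node among the siblings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """A constituent in a toy parse tree (hypothetical helper class)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, *kids: "Node") -> "Node":
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self


def prune_candidates(predicate: Node) -> List[Node]:
    """Walk from the predicate up to the root and collect sisters as candidates."""
    candidates = []
    current = predicate
    while current.parent is not None:
        siblings = current.parent.children
        # Assumption: a CC node among the siblings signals coordination,
        # in which case the sisters at this level are skipped.
        coordinated = any(s.label == "CC" for s in siblings)
        if not coordinated:
            for sister in siblings:
                if sister is current:
                    continue
                candidates.append(sister)
                if sister.label == "PP":
                    # For a PP, its immediate children are candidates too.
                    candidates.extend(sister.children)
        current = current.parent
    return candidates


# "He drove the car over the cliff"
he = Node("NP", word="He")
drove = Node("VBD", word="drove")
car = Node("NP", word="the car")
over = Node("IN", word="over")
cliff = Node("NP", word="the cliff")
pp = Node("PP").add(over, cliff)
vp = Node("VP").add(drove, car, pp)
s = Node("S").add(he, vp)

cands = prune_candidates(drove)
print([(c.label, c.word) for c in cands])
```

Every constituent that is not returned is dropped before argument identification, which is where the training and decoding speed-up comes from.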
Commonly Used Features: Phrase Type
- Intuition: different roles tend to be realized by different syntactic categories
- For dependency parse, the dependency label can serve similar function
- Phrase Type indicates the syntactic category of the phrase expressing the semantic roles
- Syntactic categories from the Penn Treebank
- FrameNet distributions:
- NP (47%) – noun phrase
- PP (22%) – prepositional phrase
- ADVP (4%) – adverbial phrase
- PRT (2%) – particles (e.g. make something up)
- SBAR (2%), S (2%) - clauses
Governing Category
- Intuition: There is often a link between semantic roles and their syntactic realization as subject or direct object
- He drove the car over the cliff
- Subject NP more likely to fill the agent role
- Approximating grammatical function from parse
- Function tags in constituent parses (typically not recovered in automatic parses)
- Dependency labels in dependency parses
Features: Parse Tree Path
- Intuition: need a feature that factors in relation to the target word.
- Feature representation: string of symbols indicating the up and down traversal to go from the target word to the constituent of interest
For dependency parses, use dependency path
Issues:
- Parser quality (error rate)
- Data sparseness
- 2978 possible values excluding frame elements with no matching parse constituent
- 4086 possible values including them
- Of a total of 35,138 frame elements identified as NP, only 4% have a path feature without a VP or S ancestor [Gildea and Jurafsky, 2002]
- Compress path by removing consecutive phrases of the same type, retain only clauses in path, etc
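To make the parse tree path feature concrete, here is a small sketch that builds the up/down string between a predicate and a candidate constituent. The `Node` class, `tree_path` and the arrow notation are illustrative assumptions; real systems read the tree from a parser.

```python
class Node:
    """Toy parse-tree node (hypothetical helper, not from a parser library)."""

    def __init__(self, label, children=()):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def ancestors(node):
    """Return the node followed by all of its ancestors up to the root."""
    chain = [node]
    while chain[-1].parent is not None:
        chain.append(chain[-1].parent)
    return chain


def tree_path(predicate, constituent):
    """Path string: go up from the predicate, then down to the constituent."""
    up = ancestors(predicate)
    down = ancestors(constituent)
    # The lowest common ancestor is the first node on the way up
    # that also dominates the constituent.
    common = next(n for n in up if n in down)
    up_part = up[: up.index(common) + 1]
    down_part = down[: down.index(common)]
    path = "↑".join(n.label for n in up_part)
    path += "".join("↓" + n.label for n in reversed(down_part))
    return path


# (S (NP He) (VP (VBD slammed) (NP the door)))
subj = Node("NP")
verb = Node("VBD")
obj = Node("NP")
vp = Node("VP", [verb, obj])
s = Node("S", [subj, vp])

print(tree_path(verb, subj))  # path to the subject NP
print(tree_path(verb, obj))   # path to the object NP
```

Because every distinct string becomes its own feature value, long or unusual paths are rare in training data, which is the sparseness issue noted above.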
Features: Subcategorization
- List of child phrase types of the VP
- highlight the constituent in consideration
- Intuition: Knowing the number of arguments to the verb constrains the possible set of semantic roles
For dependency parse, collect dependents of predicate
Features: Position
- Intuition: grammatical function is highly correlated with position in the sentence
- Subjects appear before a verb
- Objects appear after a verb
- Representation:
- Binary value – does node appear before or after the predicate
Features: Voice
- Direct objects in active <> Subject in passive
- He slammed the door.
- The door was slammed by him.
- Approach:
- Use passive identifying patterns / templates (language dependent)
- Passive auxiliary (to be, to get), past participle
- bei construction in Chinese
Features: Tree kernel
- Compute sub-trees and partial-trees similarities between training parses and decoding parse
- Does not require exact feature match
- Advantage when training data is small (less likely to have exact feature match)
- Well suited for kernel space classifiers (SVM)
- All possible sub-trees and partial trees do not have to be enumerated as individual features
- Tree comparison can be made in polynomial time even when the number of possible sub/partial trees are exponential
More Features
- Head word
- Head of constituent
- Named entities
- Verb cluster
- Similar verbs share similar argument sets
- First/last word of constituent
- Constituent order/distance
- Whether certain phrase types appear before the argument
- Argument set
- Possible arguments in frame file
- Previous role
- Last found argument type
- Argument order
- Order of arguments from left to right
Nominal Predicates
- Verb predicate annotation doesn’t always capture fine semantic details
- Arguments of Nominal Predicates can be harder to classify because arguments are not as well constrained by syntax
- Find the “supporting” verb predicate and its argument candidates
- Usually under the VP headed by the verb predicate and is part of an argument to the verb
Structural Inference
- Take advantage of predicate-argument structures to re-rank argument label set
- Arguments should not overlap
- Numbered arguments (arg0-5) should not repeat
- R-arg[type] and C-arg[type] should have an associated arg[type]
- Optimize log probability of label set
- Beam search
- Formulate into integer linear programming (ILP) problem
- Re-rank top label sets that conform to constraints
- Choose n-best label sets
- Train structural classifier (CRF, etc)
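As a rough sketch of structural inference, the snippet below picks the label set with the highest log probability among those that satisfy two of the constraints listed above (no overlapping arguments, no repeated numbered arguments). It enumerates all label sets, which only works for a handful of candidates; beam search or ILP would replace the enumeration in practice. The function names and the probabilities are made up for illustration.

```python
import itertools
import math

CORE = {"A0", "A1", "A2", "A3", "A4", "A5"}


def overlaps(a, b):
    """Spans are (start, end) token indices, inclusive."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_valid(spans, labels):
    kept = [(s, l) for s, l in zip(spans, labels) if l != "NONE"]
    core = [l for _, l in kept if l in CORE]
    if len(core) != len(set(core)):
        return False  # a numbered argument is repeated
    return not any(overlaps(a, b) for (a, _), (b, _) in itertools.combinations(kept, 2))


def best_label_set(spans, probs):
    """probs[i] maps each label of candidate i to its local probability (> 0)."""
    best, best_score = None, -math.inf
    for labels in itertools.product(*(p.keys() for p in probs)):
        if not is_valid(spans, labels):
            continue
        score = sum(math.log(p[l]) for p, l in zip(probs, labels))
        if score > best_score:
            best, best_score = labels, score
    return best


spans = [(0, 0), (2, 3), (4, 6)]
probs = [
    {"A0": 0.7, "A1": 0.2, "NONE": 0.1},
    {"A0": 0.5, "A1": 0.4, "NONE": 0.1},
    {"LOC": 0.6, "A1": 0.3, "NONE": 0.1},
]
local = tuple(max(p, key=p.get) for p in probs)  # per-candidate argmax
print(local, best_label_set(spans, probs))
```

The local argmax assigns A0 twice; the constrained search trades a slightly less likely label on the second candidate for a well-formed label set.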
SRL ML Notes
- Syntactic parse input
- Training parse accuracy needs to match decoding parse accuracy
- Generate parses via cross-validation
- Cross-validation folds need to be selected with low correlation
- Training data from the same document source needs to be in the same fold
- Separate stages of constituent pruning, argument identification and argument labeling
- Constituent pruning and argument identification reduce training/decoding complexity, but usually incurs a slight accuracy penalty
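One simple way to keep sentences from the same document in the same fold is to assign whole documents to folds. The sketch below does this greedily; `document_folds` and the document ids are hypothetical, and the balancing heuristic (largest document first into the smallest fold) is just one reasonable choice.

```python
from collections import defaultdict


def document_folds(sentence_doc_ids, n_folds):
    """Return n_folds lists of sentence indices; a document never spans two folds."""
    by_doc = defaultdict(list)
    for idx, doc in enumerate(sentence_doc_ids):
        by_doc[doc].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Largest documents first, each into the currently smallest fold.
    for doc in sorted(by_doc, key=lambda d: (-len(by_doc[d]), d)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_doc[doc])
    return folds


doc_ids = ["wsj_01", "wsj_01", "wsj_02", "wsj_03", "wsj_03", "wsj_03", "wsj_04"]
folds = document_folds(doc_ids, n_folds=2)
print(folds)
```

To generate training parses, the parser would then be trained on all folds but one and used to parse the held-out fold, so that the SRL model sees parses of roughly the same quality at training and decoding time.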
Linear Classifier Notes
- Popular choices: LibLinear, MaxEnt, RRM
- Perceptron model in feature space
- each feature j contributes positively or negatively to a label i
- How about position and voice features for classifying the agent?
- He slammed the door.
- The door was slammed by him.
- Position (left): positive indicator since active construction is more frequent
- Voice (active): weak positive indicator by itself (agent can be omitted in passive construction)
- Combine the 2 features as a single feature
- left-active and right-passive are strong positive indicators
- left-passive and right-active are strong negative indicators
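The effect of combining position and voice can be illustrated with a tiny linear scorer for the agent label. The weights below are hand-set for illustration only (they are not learned from data), and `agent_features` / `agent_score` are hypothetical helper names.

```python
def agent_features(position, voice):
    """position: 'left' or 'right' of the predicate; voice: 'active' or 'passive'."""
    return {
        f"position={position}": 1.0,
        f"voice={voice}": 1.0,
        # Conjoined feature: a linear model cannot express this interaction
        # from the two single features alone.
        f"position+voice={position}-{voice}": 1.0,
    }


# Illustrative weights for the agent label (assumed, not trained).
WEIGHTS = {
    "position=left": 0.5,
    "voice=active": 0.2,
    "position+voice=left-active": 2.0,
    "position+voice=right-passive": 2.0,
    "position+voice=left-passive": -2.0,
    "position+voice=right-active": -2.0,
}


def agent_score(position, voice):
    feats = agent_features(position, voice)
    return sum(WEIGHTS.get(name, 0.0) * value for name, value in feats.items())


# "He slammed the door." / "The door was slammed by him."
for phrase, position, voice in [
    ("He", "left", "active"),
    ("the door", "right", "active"),
    ("The door", "left", "passive"),
    ("by him", "right", "passive"),
]:
    print(phrase, agent_score(position, voice))
```

With only the two single features, "The door" (left, passive) would still receive a positive score from its position; the conjoined feature is what lets the model push it down.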
Support Vector Machine Notes
- Popular choices: LibSVM, SVM light
- Kernel space classification (linear kernel example)
- The correlation (c_j) of the features of the input sample with each training sample j contributes positively or negatively to a label i
- Creates an N × N dense correlation matrix during training (N is the number of training samples)
- Requires a lot of memory during training for large corpus
- Use a linear classifier for argument identification
- Train base model with a small subset of samples, iteratively add a portion of incorrectly classified training samples and retrain
- Decoding speed not as adversely affected
- Trained model typically only has a small number of “support vectors”
- Tend to perform better when training data is limited
Evaluation
- Precision – percentage of labels output by the system which are correct
- Recall – percentage of true labels correctly identified by the system
- F-measure, F_beta – (beta-weighted) harmonic mean of precision and recall
- Lots of choices when evaluating in SRL:
- Arguments (must the whole span match, or is a correct head word enough?)
- Full span (CoNLL-2005)
- Headword only (CoNLL-2008)
- Predicates (are the predicates given in the data, or must the system find them?)
- Given (CoNLL-2005)
- System Identifies (CoNLL-2008)
- Verb and nominal predicates (CoNLL-2008)
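The evaluation measures above can be computed over sets of labeled arguments. The sketch below uses full-span matching, where an argument is a `(start, end, label)` triple and counts as correct only if all three agree; the function name and the example triples are illustrative.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """gold / predicted: iterables of (start, end, label) triples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


gold = {(0, 0, "A0"), (2, 3, "A1"), (4, 6, "LOC")}
pred = {(0, 0, "A0"), (2, 3, "A2"), (4, 6, "LOC"), (7, 7, "TMP")}
p, r, f1 = precision_recall_f(gold, pred)
print(p, r, f1)
```

For head-word evaluation the triples would be replaced by `(head_index, label)` pairs; the arithmetic stays the same.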
Applications
- Question & answer systems (structured information)
- Machine translation generation/evaluation
- Identifying/recovering implicit arguments across language
- Chinese dropped pronoun
# Constituency parse example: "John sees Bill"
(Sentence (Noun Phrase John) (Verb Phrase (Verb sees) (Noun Phrase Bill)))
# Example: “在秋天的时候,陶喆爱吃苹果” ("In autumn, Tao Zhe loves eating apples")
(ROOT (IP (PP (P 在) (NP (DNP (NP (NN 秋天)) (DEC 的)) (NP (NN 时候)))) (PU ,) (NP (NR 陶喆)) (VP (VV 爱) (IP (VP (VV 吃) (NP (NN 苹果)))))))
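A bracketed parse like the one above is easy to read into a nested structure. The following is a minimal sketch of such a reader, assuming well-formed input; `parse_bracketed` and `leaves` are hypothetical helper names, and each tree is represented as a `(label, children)` tuple.

```python
import re


def parse_bracketed(text):
    """Parse '(LABEL child ...)' into (label, [children]); leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def leaves(tree):
    """Return the words of the tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


example = (
    "(ROOT (IP (PP (P 在) (NP (DNP (NP (NN 秋天)) (DEC 的)) (NP (NN 时候)))) "
    "(PU ,) (NP (NR 陶喆)) (VP (VV 爱) (IP (VP (VV 吃) (NP (NN 苹果)))))))"
)
tree = parse_bracketed(example)
print(tree[0], leaves(tree))
```

The constituents of this structure are the candidate spans that the pruning and identification stages work on.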
# Example: dependency parse of the same sentence
root(ROOT-0, 爱-7)
case(时候-4, 在-1)
nmod:assmod(时候-4, 秋天-2)
dep(秋天-2, 的-3)
nmod:prep(爱-7, 时候-4)
punct(爱-7, ,-5)
nsubj(爱-7, 陶喆-6)
ccomp(爱-7, 吃-8)
dobj(吃-8, 苹果-9)
Semi-, unsupervised and cross-lingual approaches
Shortcomings of Supervised Methods
- Rely on large expert-annotated datasets (FrameNet and PropBank > 100k predicates)
- Even then they do not provide high coverage (esp. with FrameNet)
- ~50% oracle performance on new data [Palmer and Sporleder, 2010]
- Resulting methods are domain-specific [Pradhan et al., 2008]
- Such resources are not available for many languages
How can we reduce reliance of SRL methods on labeled data?
- Transfer a model or annotation from a more resource-rich language (crosslingual transfer / projection)
- Complement labeled data with unlabeled data (semi-supervised learning)
- Induce SRL representations in an unsupervised fashion (unsupervised learning)
Outline
- Crosslingual annotation and model transfer
- Annotation projection
- Direct transfer
- Semi-supervised learning
- methods creating surrogate supervision
- parameter sharing methods
- Unsupervised learning
- agglomerative clustering
- generative modeling
Exploiting crosslingual correspondences: classes of methods
- The set-up:
- Annotated resources or a SRL model is available for the source language (often English)
- No or little annotated data is available for the target language
- How can we build a semantic-role labeller for the target language?
- If we have parallel data, we can project annotation from the source language to the target language (annotation projection)
- If no parallel data, we can directly apply a source SRL model to the target language (direct model transfer) [Kozhevnikov and Titov, 2013]
Crosslingual annotation projection: basic idea
- Start with an aligned sentence pair
- Label the source sentence
- Check if a target predicate can evoke the same frame
- Project roles from source to target sentence
Word-based projection (errors and gaps in the word alignment introduce noise)
- For each source semantic role:
- Follow alignment links
- Target role spans all the projected words
- Ensure contiguity
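The word-based projection steps can be sketched in a few lines. This is a simplified illustration: `project_role` is a hypothetical helper, the alignment is given as a set of `(source_index, target_index)` links, and contiguity is enforced in the simplest possible way, by taking the smallest target span that covers all projected words.

```python
def project_role(source_span, alignment):
    """Project an inclusive (start, end) source span through word alignment links."""
    start, end = source_span
    targets = sorted({t for s, t in alignment if start <= s <= end})
    if not targets:
        return None  # no alignment link: the role cannot be projected
    # Ensure contiguity: cover everything between the first and last projected word.
    return (targets[0], targets[-1])


# Toy alignment between a 4-word source and a 5-word target sentence.
alignment = {(0, 0), (1, 2), (2, 1), (3, 4)}
source_roles = {"A0": (0, 0), "A1": (2, 3)}

target_roles = {role: project_role(span, alignment) for role, span in source_roles.items()}
print(target_roles)
```

In the example, the projected A1 words are target positions 1 and 4, so the contiguous span also swallows positions 2 and 3. This is one way alignment errors and gaps turn into noisy role spans.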
Syntax-based projection
- Find alignment between constituents
- For each source semantic role:
- Identify a set of constituents in the source sentences
- Label aligned constituents with the semantic role
- Define semantic alignment as an optimization task on a graph
- Graph for each sentence pair
- Choose an optimal alignment graph, maybe with some constraints (note how the alignment is written as an optimization problem):
- Covers all target constituents (edge cover)
- Edges in the alignment do not have common endpoints (matching)
Direct transfer of models
- Is this realistic at all?
- Requires (maximally) language-independent feature representation (i.e. designing features that are shared across languages)
- Has been tried successfully for syntax
- Performance depends on how different the languages are
Language independent feature representations
- Instead of words use either
- cross-lingual word clusters [Tackstrom et al., 2012] or
- cross-lingual distributed word features [Klementiev et al., 2012]
- Instead of fine-grain part-of-speech (PoS) tags use coarse universal PoS tags[Petrov et al., 2012]
- Instead of rich (constituent or dependency) syntax use either
- unlabeled dependencies or
- transfer syntactic annotation from the source language before transferring semantic annotation and use it
CoNLL-2009 data (dependency representation for semantics)
- Target syntax is obtained using direct transfer
- Only accuracy on labeling arguments (not identification)
Semi-supervised learning
There are three main groups of semi-supervised learning (SSL) methods considered for SRL:
- methods creating surrogate supervision: automatically annotate unlabeled data and treat it as new labeled data (annotation projection / bootstrapping methods)
- parameter sharing methods: use unlabeled data to induce less sparse representations of words (clusters or distributed representations)
- semi-unsupervised learning: adding labeled data (and other forms of supervision) to guide unsupervised models
Creating surrogate supervision
- Choose examples (sentences) to label from an unlabeled dataset (How do we choose examples?)
- Automatically annotate the examples (How do we annotate examples?)
- Add them to the labeled training set
- Train a classifier on the expanded training set
- Optional: Repeat (makes sense only if the classifier is used at stages 1 or 2)
Basic self-training
- Use the classifier itself to label examples (and, often, its confidence to choose examples at stage 1)
- Does not produce noticeable improvement for SRL [He and Gildea, 2006]
- Need a better method for choosing and annotating unlabeled examples
Monolingual projection: an idea
- Assumptions: sentences similar in their lexical material and syntactic structure are likely to share the same frame-semantic structure
- An example:
- Labeled sentence: [His back]_Impactor [thudded]_Impact [against the wall]_Impactee
- Unlabeled sentence: The rest of his body thumped against the front of the cage
- An Implementation (roughly)
- Choose labeled examples which are similar to an unlabeled example (compute scored alignments between them, select pairs with high scores)
- Use alignments to project semantic role information to the unlabeled sentences
- How do we compute these alignments?
Monolingual projection: alignment
- Start with an unlabeled sentence, and a target predicate
- Check a labeled sentence (one by one)
- Find the best alignment (use a heuristic to select the alignment domain), with Score = Lexical Score + Syntactic Score
parameter sharing methods
- Reducing sparsity of word representations
- Lexical features are crucial for accurate semantic role labeling
- However, they are problematic as they are sparse
- Less sparse features capturing lexical information are needed
- Representations can be learnt from unlabeled data in the context of the language model task and then used as features in SRL systems, for example:
- Brown clusters [Brown et al., 1992]
- Distributed word representations [Bengio et al., 2003]
- Challenge: they might not capture phenomena relevant to SRL or not have the needed granularity.
Learning lexical representations
- Share word representations across tasks and learn them simultaneously so that they are useful for both tasks
Unsupervised learning (agglomerative clustering / generative modeling)
Defining Unsupervised SRL
- Semantic role labeling is typically divided into two sub-tasks:
- Identification: identification of predicate arguments
- Arguably the easier sub-task; can be handled with heuristics, e.g. [Lang and Lapata, 2010]
- Labeling: assignment of their semantic roles
- Equivalent to clustering of argument occurrences (or “coloring” them)
- Goal: induce semantic roles automatically from unannotated texts
Evaluating Unsupervised SRL
- Before we begin, a note about evaluating unsupervised SRL
- We do not have labels for clusters, so we use standard clustering metrics instead
- Purity (PU) measures the degree to which each induced role contains arguments sharing the same gold (“true”) role
- Collocation (CO) evaluates the degree to which arguments with the same gold roles are assigned to a single induced role
- Report F1, harmonic mean of PU and CO
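The clustering metrics can be written down directly from the descriptions above: purity credits each induced cluster with its most frequent gold role, and collocation credits each gold role with the induced cluster that holds most of it. The sketch below is a minimal version computed over one list of argument occurrences (published evaluations typically compute this per predicate and then aggregate); the function name is hypothetical.

```python
from collections import Counter


def purity_collocation_f1(gold_roles, induced_clusters):
    """Both arguments are parallel lists with one entry per argument occurrence."""
    n = len(gold_roles)
    pair_counts = Counter(zip(induced_clusters, gold_roles))
    best_per_cluster = Counter()  # largest gold-role overlap for each induced cluster
    best_per_gold = Counter()     # largest cluster overlap for each gold role
    for (cluster, gold), count in pair_counts.items():
        best_per_cluster[cluster] = max(best_per_cluster[cluster], count)
        best_per_gold[gold] = max(best_per_gold[gold], count)
    pu = sum(best_per_cluster.values()) / n
    co = sum(best_per_gold.values()) / n
    return pu, co, 2 * pu * co / (pu + co)


gold = ["A0", "A0", "A1", "A1", "A1", "TMP"]
singletons = [1, 2, 3, 4, 5, 6]   # every occurrence in its own cluster
one_cluster = [1, 1, 1, 1, 1, 1]  # everything merged into one cluster

print(purity_collocation_f1(gold, singletons))
print(purity_collocation_f1(gold, one_cluster))
```

The two extremes show why both measures are needed: singleton clusters are perfectly pure but have low collocation, while a single big cluster has perfect collocation but low purity.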
3.1. agglomerative clustering [Lang and Lapata, 2011b]
Role Labeling as Clustering of Argument Keys
- Associate argument occurrences with syntactic signatures or argument keys
- Will include simple syntactic cues such as verb voice and position relative to predicate
- Argument keys are designed to map to a single semantic role as much as possible (for an individual predicate)
- Instead of clustering argument occurrences, the method clusters their argument keys (i.e. it groups occurrences that share the same key features)
- Here, we would cluster ACTIVE:RIGHT:OBJ and ACTIVE:RIGHT:PMOD_up together
Role Labeling via “Split-Merge” Clustering
- Agglomerative clustering of arguments
- Start with each argument key in its own cluster (high purity, low collocation)
- Merge clusters together to improve collocation
For a pair of clusters score
- whether a pair contains lexically similar arguments
- whether arguments have similar parts of speech
- whether the constraint that arguments in a clause should be in different roles is satisfied
Prioritization
- Instead of greedily choosing the highest scoring pair at each step, start with larger clusters and select best match for each of them (a non-greedy strategy, aimed at a better global solution)
3.2 generative modeling [Titov and Klementiev, 2012a][Titov and Klementiev, 2012b][Titov and Klementiev, 2011]
A Bayesian model for role labeling
- Idea: propose a generative model for inducing argument clusters
- As before, clusters are of argument keys, not argument occurrences
- Learning signals are similar to Lang and Lapata (2011a, 2011b), e.g.
- Selection preferences (distribution of argument fillers is sparse for every role)
- Duplicate roles are unlikely to occur. E.g. this clustering is a bad idea:
John taught students math
How can we encode these signals in a generative story?
The approaches we discussed induce roles for each predicate independently
- These clusterings define permissible alternations
- But many alternations are shared across verbs
Can we share this information across verbs?
Idea: keep track of how likely a pair of argument keys should be clustered
- Define a similarity matrix (or similarity graph)
A formal way to encode this: dd-CRP
- Can use CRP to define a prior on the partition of argument keys:
- The first customer (argument key) sits the first table (role)
- m-th customer sits at a table according to:
- An extension is distance-dependent CRP (dd-CRP):
- m-th customer chooses a customer to sit with according to:
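The formulas for the seating rules are missing from these notes, so the sketch below relies on the standard textbook formulation, stated here as an assumption: in a CRP the m-th customer joins an occupied table with probability proportional to the number of customers already there and opens a new table with probability proportional to a concentration parameter alpha; in a dd-CRP the m-th customer instead picks another customer j with probability proportional to a similarity d[m][j], or picks itself with probability proportional to alpha, and tables are the connected components of these links. `sample_ddcrp` is a hypothetical name and the similarity matrix is made up.

```python
import random


def sample_ddcrp(similarity, alpha, rng):
    """Sample a partition of argument keys; returns one cluster id per key."""
    n = len(similarity)
    links = []
    for m in range(n):
        # Weight alpha for a self-link, similarity for a link to another key.
        # Assumes every row has at least one positive weight.
        weights = [alpha if j == m else similarity[m][j] for j in range(n)]
        links.append(rng.choices(range(n), weights=weights)[0])

    # Tables (roles) are the connected components of the link graph (union-find).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for m, j in enumerate(links):
        parent[find(m)] = find(j)
    return [find(m) for m in range(n)]


# Four argument keys; keys 0/1 and keys 2/3 are similar to each other only.
similarity = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
roles = sample_ddcrp(similarity, alpha=0.0, rng=random.Random(0))
print(roles)
```

With alpha set to zero and a block-structured similarity matrix the outcome is deterministic; with a positive alpha, keys can also stay on their own, which yields more and smaller clusters.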
Qualitative
- Looking into the induced graph encoding ‘priors’ over clustering argument keys, the most highly ranked pairs encode (or partially encode)
- Passivization
- Near-equivalence of subordinating conjunctions and prepositions
- Benefactive alternation
- Dativization
- Recovery of unnecessary splits introduced by argument keys
Generalization of the role induction model
- The model can be generalized for joint induction of predicate-argument structure of an entire sentence
- start with a (transformed) syntactic dependency graph (~ argument identification)
- predict decomposition and labeling of its parts
- labels on nodes are frames (or semantic classes of arguments)
- labels on edges are roles (frame elements)
Conclusions
- We looked at examples of key directions in exploiting unlabeled data and cross-lingual correspondences
- a lot of relevant recent work has not been covered
- Still a new direction with a lot of ongoing work
- research in the related area of information extraction should also be closely watched
NN for SRL (tutorial of EMNLP2017)
Outline: the fall and rise of syntax in SRL
- Early SRL methods
- Symbolic approaches + Neural networks (syntax-aware models)
- Syntax-agnostic neural methods
- Syntax-aware neural methods
Early SRL methods (pipeline)
Given a predicate
- Argument identification
- Hand-crafted rules on the full syntactic tree [Xue and Palmer, 2004]
- Binary classifier [Pradhan et al., 2005; Toutanova et al., 2008]
- Both [Punyakanok et al., 2008]
- Role labeling
- Labeling is performed using a classifier (SVM, logistic regression)
- For each argument we get a label distribution
- Argmax over roles will result in a local assignment
- Disadvantage: No guarantee the labeling is well formed
- overlapping arguments, duplicate core roles, etc.
- Global and/or constrained inference
- Enforce linguistic and structural constraint (e.g., no overlaps, discontinuous arguments, reference arguments, …)
- Viterbi decoding (k-best list with constraints) [Täckström et al., 2015]
- Dynamic programming [Täckström et al.,2015; Toutanova et al., 2008]
- Integer linear programming [Punyakanok et al., 2008]
- Re-ranking [Toutanova et al., 2008; Björkelund et al., 2009]
Early symbolic models
- 3 steps pipeline
- Massive feature engineering
- argument identification
- role labeling
- re-ranking
- Most of the features are syntactic [Gildea and Jurafsky, 2002]
Symbolic approaches + Neural networks (syntax-aware models)
Fitzgerald et al., 2015
- model
- Rule based argument identification
- as in [Xue and Palmer, 2004] but for dependency parsing
- Neural network for local role labeling
- Global structural inference based on dynamic programming
- [Täckström et al., 2015]
- innovation
- Predicate-role composition
- Predicate-specific role representation
- Learning distributed predicate representation across different formalisms
- State of the art on FrameNet dataset
- Feature embeddings
- Use “simple” span features
- Let the network figure out how to compose them
- Reduced feature engineering
Roth and Lapata, 2016: Dependency path embeddings
- model
- Dependency-based SRL
- Syntactic paths between predicates and arguments are an important feature
- It may be extremely sparse
- Creating a distributed representation can solve the problem
- Use LSTM [Hochreiter and Schmidhuber, 1995] to encode paths
- Neural network with dependency path embeddings as local classifier
- Argument identification
- Role labeling
- Global re-ranking of k-best local assignments
- innovation
- Encode syntactic paths with LSTMs
- Overcome sparsity
- Combination of symbolic features and continuous syntactic paths
Syntax-agnostic neural methods (the fall)
- SRL as a sequence labeling task
- Argument identification and role labeling in one step(end-to-end)
- General architecture
- Word encoding
- Sentence encoding (via LSTM)
- Decoding
- No use of any kind of treebank syntax (not trivial to encode it)
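Treating SRL as sequence labeling means turning argument spans into one tag per token. A common way to do this is BIO encoding; the sketch below assumes that encoding (the exact tag scheme varies between papers) and uses a hypothetical helper name.

```python
def spans_to_bio(n_tokens, spans):
    """spans: (start, end, role) triples with inclusive token indices, non-overlapping."""
    tags = ["O"] * n_tokens
    for start, end, role in spans:
        tags[start] = f"B-{role}"          # first token of the argument
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{role}"          # remaining tokens of the argument
    return tags


tokens = ["He", "slammed", "the", "door", "."]
spans = [(0, 0, "A0"), (2, 3, "A1")]  # arguments of the predicate "slammed"
tags = spans_to_bio(len(tokens), spans)
print(list(zip(tokens, tags)))
```

A tagger that predicts these tags for a given predicate performs argument identification and role labeling in a single step, since both the span boundaries and the role are read off the tag sequence.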
Differentiable end-to-end [Collobert et al., 2011]
- Zhou and Xu, 2015: Sentence encoding
- model
- Pretrained word embedding
- Distance from the predicate
- Predicate context (for disambiguation)
- Predicate region mark
- Bidirectional LSTM
- Forward (left context)
- Backward (right context)
- Snake BiLSTM
- Conditional Random Field
- [Lafferty et al., 2001]
- Markov assumption between role labels
innovation
- No syntax
- Minimal word representation
Sentence encoding with “Snake” BiLSTM
- He et al., 2017 ‘What Works and What’s Next’
- model
- wdd
innovation
- No syntax
- Super minimal word representation
- Exploit the representational power of NNs as far as possible
- Highway networks
- Recurrent dropout
- Marcheggiani et al., 2017 ‘A Simple and Accurate Syntax-Agnostic Neural Model for Dependency-based Semantic Role Labeling’
model
- Dependency-based SRL
- Shallow syntactic information (POS tags)
- Intuitions from syntactic dependency parsing
- Local classifier
- Word encoding
- Pretrained word embedding
- Randomly initialized embedding
- Randomly initialized embedding of POS tags
- Embeddings of the predicate lemmas
- Predicate flag
- Standard (non-snake) BI-LSTM
- Forward LSTM encodes left context
- Backward LSTM encodes right context
- Forward and backward states are concatenated
innovation
- Little bit of syntax (POS tags)
- More sophisticated word representation
- Fast local classifier conditioned on predicate representation
Syntax-aware neural methods (syntax strikes back!)
Is syntax important for semantics?
- POS tags are beneficial [Marcheggiani et al., 2017]
- Gold syntax is beneficial (but hard to encode) [He et al., 2017]
Encoding syntax with Graph Convolutional Networks
[Marcheggiani and Titov, 2017]
- Marcheggiani and Titov, 2017 ‘Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling’
- model
- Word encoding [Marcheggiani et al., 2017]
- Sentence encoding with BiLSTM [Marcheggiani et al., 2017]
- Syntax encoding with Graph Convolutional Networks (GCN)
- Skip connections [Kipf and Welling, 2016]
- Each word is enriched with the representation of its syntactic neighborhood (longer dependencies are captured)
- Local classifier [Marcheggiani et al., 2017]
- innovation
- Encoding structured prior linguistic knowledge in NN
- Syntax
- Semantic
- Coreference
- Discourse
- Complement LSTM with skip connections for long dependencies
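To show what "enriching each word with the representation of its syntactic neighborhood" means, here is a heavily simplified graph convolution layer in plain Python. It is a sketch under explicit assumptions: edges are treated as undirected, one weight matrix is shared by all edges, and there are no edge labels or gates, all of which a real syntactic GCN would handle differently. The function name and the toy vectors are made up.

```python
def gcn_layer(h, edges, w, b):
    """One simplified GCN layer.

    h: list of word vectors; edges: (head, dependent) index pairs;
    w: weight matrix as a list of rows (out_dim x in_dim); b: bias vector.
    """
    n, dim = len(h), len(h[0])
    neighbours = {v: [v] for v in range(n)}  # self-loop keeps the word's own state
    for head, dep in edges:
        neighbours[head].append(dep)
        neighbours[dep].append(head)         # simplification: ignore edge direction
    out = []
    for v in range(n):
        # Sum the representations of the word and its syntactic neighbours.
        summed = [sum(h[u][k] for u in neighbours[v]) for k in range(dim)]
        z = [sum(w[i][k] * summed[k] for k in range(dim)) + b[i] for i in range(len(w))]
        out.append([max(0.0, x) for x in z])  # ReLU
    return out


# "John sees Bill": "sees" (index 1) is the head of both other words.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(1, 0), (1, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]
layer1 = gcn_layer(h, edges, identity, [0.0, 0.0])
print(layer1)
```

One layer mixes in direct neighbours only; stacking a second layer on top of `layer1` would let "John" see information from "Bill" through "sees", which is how longer dependencies are captured.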
We can live without syntax (out of domain)
But life with syntax is better
- and the better the syntax (parsers) the better our semantic role labeler
What’s the (present) future?
- Multi-task learning
- Swayamdipta et al. (2017) frame-semantic parsing + syntax
- Peng et al. (2017) multi-task on different semantic formalisms
- Neural networks work (I kid you not) …
- … but we do have (a lot of) linguistic prior knowledge…
- … and it is time to use it again.