Summary of Open-domain relation extraction papers

 The preceding content (the figure) is another author's summary; I no longer remember which site it came from. Many thanks to them!

[figure: the borrowed summary mentioned above]

 

Lexical analysis:

  1. POS (part-of-speech) tagging (see the sketch below)

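As a concrete illustration (not part of the original notes), here is a minimal POS-tagging sketch using NLTK's off-the-shelf tagger; the example sentence is made up.

```python
# Minimal POS-tagging sketch with NLTK (illustrative only).
import nltk

# Resources needed by the tokenizer and tagger; resource names may
# differ slightly across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Barack Obama was born in Honolulu."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('Barack', 'NNP'), ('Obama', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ...]
```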

Syntactic parsing comes in two forms:

  1. Syntactic structure parsing: the syntactic structure is generally represented with a tree data structure called a syntactic parse tree (parse tree for short); the program module that performs this analysis is called a syntactic parser (parser for short).

  2. Dependency parsing: dependency syntax explains the syntactic structure of a sentence by analyzing the dependency relations among the units inside it. It treats the verb as the central element that governs the other components of the sentence, while the verb itself is governed by nothing; every governed component is subordinate to its governor through some dependency relation.
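To make the dependency view concrete, here is a minimal sketch with spaCy (my own choice of tool, not one used in these papers); it prints each token, its dependency label, and its head.

```python
# Dependency-parsing sketch with spaCy (assumes
# `pip install spacy` and `python -m spacy download en_core_web_sm`).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    # token.dep_ is the dependency label, token.head is its governor.
    print(f"{token.text:<6} --{token.dep_}--> {token.head.text}")
```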

Shallow parsing, also called partial parsing or chunking:

 

  1. Shallow parsing only requires identifying certain structurally simple constituents, such as non-recursive noun phrases and verb phrases. The identified structures are usually called chunks; the terms chunk and phrase are often used interchangeably (a toy chunking example follows after this list).

  2. Two subtasks:

    1. Chunk identification and analysis

    2. Analysis of the attachment relations between chunks
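A minimal NP-chunking sketch using NLTK's regular-expression chunker; the toy grammar and the pre-tagged sentence are my own illustration, not from the original notes.

```python
# Shallow parsing (NP chunking) with NLTK's regex chunker.
import nltk

# A sentence already annotated with POS tags.
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Toy grammar: an NP chunk is an optional determiner, any number of
# adjectives, then a noun (non-recursive, as in classic chunking).
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
print(chunker.parse(sentence))
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
```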


Semantic analysis

       The task of semantic analysis differs by language unit. At the word level, the basic task is word sense disambiguation (WSD); at the sentence level, it is semantic role labeling (SRL); at the discourse level, it is anaphora resolution, also known as coreference resolution.

  • Semantic Role Labeling (SRL) is a shallow semantic analysis technique that labels certain phrases in a sentence as arguments (semantic roles) of a given predicate, i.e. it recovers the sentence's predicate(Predicate)-argument(Argument) structure, with roles such as agent, patient, time and location. The predicate is the statement or description made about the subject, indicating “what is done”, “what it is” or “how it is”; it represents the core of an event, and the nouns that combine with the predicate are called its arguments. A semantic role is the role an argument plays in the event denoted by the verb. The main roles include: Agent, Patient, Theme, Experiencer, Beneficiary, Instrument, Location, Goal, and Source.
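Purely as an illustration of the predicate-argument view (the sentence and the exact role assignments below are my own example, loosely following the role inventory above):

```python
# Illustrative predicate-argument structure for the sentence
# "The chef cooked dinner in the kitchen yesterday."
srl_frame = {
    "predicate": "cooked",
    "arguments": {
        "Agent":    "The chef",        # who performs the action
        "Patient":  "dinner",          # what the action is done to
        "Location": "in the kitchen",  # where
        "Time":     "yesterday",       # when
    },
}
print(srl_frame)
```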

 






1.Open Information Extraction from the Web【2007】

This is the first-generation Open IE work; its main contributions:

  1. Proposed the Open IE paradigm;
  2. Proposed TEXTRUNNER, a complete open-domain extraction framework, analyzed its architecture, and compared it against the traditional extraction system KNOWITALL, reducing the extraction error rate by 33%;
  3. Analyzed the entities extracted by TEXTRUNNER, demonstrating the system's scalability.

About KNOWITALL [Etzioni et al., 2005]:

 a state-of-the-art Web extraction system that addresses the automation challenge by learning to label its own training examples using a small set of domain-independent extraction patterns. KNOWITALL also addresses corpus heterogeneity by relying on a part-of-speech tagger instead of a parser, and by not requiring a NER. However, KNOWITALL requires large numbers of search engine queries and Web page downloads. As a result, experiments using KNOWITALL can take weeks to complete. Finally, KNOWITALL takes relation names as input. Thus, the extraction process has to be run, and rerun, each time a relation of interest is identified. The OIE paradigm retains KNOWITALL's benefits but eliminates its inefficiencies.

The remainder of the paper describes TEXTRUNNER and the experiments.


 

2.The Tradeoffs Between Open and Traditional Relation Extraction 【Michele Banko and Oren Etzioni,ACL 2008】

  • Analysis of traditional extraction approaches:
    1. They take as input either hand-crafted extraction patterns or patterns learned from labeled examples; porting them to a new corpus is time-consuming and labor-intensive.
  • Difficulties faced by the newer approach, Open IE:
    1. An Open IE system has to locate both the set of entities believed to participate in a relation, and the salient textual cues that indicate the relation among them.
    2. A relation-independent extraction process makes it difficult to leverage the full set of features typically used when performing extraction one relation at a time.
  • relation-specific (“lexicalized”) extraction
  • relation-independent (“unlexicalized”) extraction
  • Proposed the O-CRF extraction model: precision and F1 both improve, while recall is only average (a minimal CRF tagging sketch follows after this list).
  • Also presented R1-CRF and the hybrid O-CRF + O-NB combination.
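O-CRF casts open relation extraction as sequence labeling over unlexicalized features. The sketch below conveys only that general idea using sklearn-crfsuite; the feature set and tag scheme are simplified stand-ins of my own, not the paper's.

```python
# Rough sketch of CRF-based relation-phrase tagging in the spirit of
# O-CRF, using sklearn-crfsuite. Features and labels are simplified.
import sklearn_crfsuite

def token_features(pos_tags, i):
    # Unlexicalized features: POS of the token and its neighbours only.
    feats = {"pos": pos_tags[i]}
    if i > 0:
        feats["prev_pos"] = pos_tags[i - 1]
    if i + 1 < len(pos_tags):
        feats["next_pos"] = pos_tags[i + 1]
    return feats

# One toy training sentence: "Edison invented the phonograph".
pos_seq = ["NNP", "VBD", "DT", "NN"]
labels = ["ARG", "B-REL", "ARG", "ARG"]   # toy tags marking argument tokens and the relation span

X = [[token_features(pos_seq, i) for i in range(len(pos_seq))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```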

3.BLSTM-CRF for Sequence Tagging【2015】

  1. For sequence tagging, this work compares a family of models (LSTM, BI-LSTM, LSTM-CRF, BI-LSTM-CRF) and advocates the BI-LSTM-CRF variant (see the sketch below).
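A minimal BiLSTM-CRF tagger sketch in PyTorch, assuming the third-party pytorch-crf package for the CRF layer; dimensions and data are arbitrary placeholders, not the paper's setup.

```python
# Minimal BiLSTM-CRF sequence tagger (assumes `pip install pytorch-crf`).
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(hidden, num_tags)     # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)  # transition scores + Viterbi decoding

    def loss(self, tokens, tags):
        emissions = self.proj(self.lstm(self.embed(tokens))[0])
        return -self.crf(emissions, tags)           # negative log-likelihood

    def predict(self, tokens):
        emissions = self.proj(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(emissions)           # best tag sequence per sentence

model = BiLSTMCRF(vocab_size=1000, num_tags=5)
x = torch.randint(0, 1000, (2, 7))                  # batch of 2 sentences, 7 tokens each
y = torch.randint(0, 5, (2, 7))
print(model.loss(x, y).item(), model.predict(x))
```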

4.Effectiveness and Efficiency of Open Relation Extraction【Filipe, EMNLP 2013】

  1. Compared eight extraction methods on five datasets of varying difficulty, and proposed EXEMPLAR: which applies a key idea in semantic approaches (namely, to identify the precise connection between the argument and the predicate words in a relation) over a dependency parse tree (i.e., without applying SRL). The goal is to achieve the higher accuracy of the semantic approaches at the lower computational cost of the dependency parsing approaches. EXEMPLAR is a rule-based system derived from a careful study of all dependency types identified by the Stanford parser. (A toy dependency-rule extractor follows after this list.)

  2. Current ORE systems rely on one of:
    1.  shallow parsing,
      1.  ReVerb (Fader et al., 2011) and SONEX (Merhav et al., 2012), both descendants of TextRunner that broaden the kinds of relations extracted.
    2.  dependency parsing
      1.  PATTY (Nakashole et al., 2012), based on dependency graphs; OLLIE (Mausam et al., 2012), based on bootstrapped pattern templates; and TreeKernel (Xu et al., 2013), based on dependency trees.
    3. semantic role labelling (SRL)
      1.  Lund (Johansson and Nugues, 2008) and SwiRL (Surdeanu et al., 2003). SwiRL can only label arguments of verb predicates, while Lund, being based on dependency parsing, can extract relations with both verb and noun predicates.

  3. Ran experiments and reported evaluation metrics for effectiveness and efficiency.
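To give a flavor of rule-based extraction over dependency parses (EXEMPLAR's actual rule set over Stanford dependencies is far richer), here is a toy subject-verb-object extractor built on spaCy; the rule and the example sentence are mine.

```python
# Toy rule over a dependency parse: emit (subject, verb, object) triples.
import spacy

nlp = spacy.load("en_core_web_sm")

def svo_triples(sentence):
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

print(svo_triples("Microsoft acquired GitHub in 2018."))
# e.g. [('Microsoft', 'acquire', 'GitHub')]
```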


5.Neural Models for Sequence Chunking【2017.1】

Proposes three ways to use DNNs for sequence chunking:

1.

2.

3.


6.Neural Open Information Extraction【2018.5】


7.Supervised Open Information Extraction【2018.6】

 

 


8.Supervised Neural Models Revitalize the Open Relation Extraction【2018.9】

 

 


9.Open Information Extraction using Wikipedia【2010.7】

Proposed a self-supervised extractor, WOE. Specifically, WOE generates relation-specific training examples by matching Infobox attribute values to corresponding sentences, but abstracts these examples into relation-independent training data to learn an unlexicalized extractor, akin to that of TextRunner. [ !!! Important ]

The model comes in two variants:

  1. Restricted to POS tag features
  2. Dependency-parse features; this variant achieves higher precision and recall.

The key idea underlying WOE is the automatic construction of training examples by heuristically matching Wikipedia infobox values and corresponding text (a simplified sketch of this matching step appears after the component list below).

The system consists of three components: preprocessor, matcher, and learning extractors.

  1. There are two extractors:
    1. WOE-parse: uses features from dependency-parse trees. WOE-parse uses a pattern learner to classify whether the shortest dependency path between two noun phrases indicates a semantic relation.
    2. WOE-pos: limited to shallow features like POS tags. WOE-pos (like TextRunner) trains a CRF to output certain text between noun phrases when the text denotes such a relation.
    3. Neither extractor uses individual words or lexical information for features. 
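A much-simplified sketch of the matcher's self-supervision idea: align infobox (attribute, value) pairs with article sentences to produce training examples. The naive string matching and the example data below are my own; WOE's actual heuristics are considerably more careful.

```python
# Simplified WOE-style self-supervision: pair infobox entries with
# article sentences that mention both the article subject and the
# attribute value, yielding (sentence, arg1, relation, arg2) examples.
def match_infobox_to_sentences(subject, infobox, sentences):
    examples = []
    for attribute, value in infobox.items():
        for sent in sentences:
            if subject in sent and value in sent:   # naive exact-string match
                examples.append((sent, subject, attribute, value))
    return examples

infobox = {"birth_place": "Honolulu", "spouse": "Michelle Obama"}
sentences = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Barack Obama married Michelle Obama in 1992.",
]
for ex in match_infobox_to_sentences("Barack Obama", infobox, sentences):
    print(ex)
```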

The remainder of the paper covers the experiments.


12.Toward an Architecture for Never-Ending Language Learning【2010】

Author: john159151
Source: CSDN
Original post: https://blog.csdn.net/john159151/article/details/53573441

Paper overview:
      This paper builds NELL (Never-Ending Language Learner), a framework that keeps extracting information from the web without stopping, builds a knowledge base from it, and uses the accumulated knowledge to improve subsequent tasks. After a 67-day run, NELL had extracted 242,000+ beliefs with an estimated precision of 74%.

1. NELL's knowledge comes in two forms:

  • ① categories: made up of noun phrases, e.g. cities, companies, and sports teams;
  • ② relations: relations between pairs of noun phrases, e.g. hasOfficesIn(organization, location);

2. NELL framework:
 

[figure: NELL framework architecture]

 

  1. CPL (Coupled Pattern Learner): extracts categories and relations from co-occurrence statistics between noun phrases and contextual patterns such as “mayor of X” and “X plays for Y”;
  2. CSEAL (Coupled SEAL): issues web queries for the candidate categories and relations and checks for mutually exclusive relations, which is used to filter the extracted categories and relations;
  3. CMC (Coupled Morphological Classifier): builds a binary L2-regularized logistic regression model per category to classify noun phrases, i.e. to decide whether a noun phrase belongs to that category;
  4. RL (Rule Learner): learns rules, and the learned rules are used to infer new relation instances;
  5. KI (Knowledge Integrator): promotes candidate facts to beliefs, using two criteria (see the sketch after this list):
    1. (i) one of the four components above assigns a very high posterior probability (> 0.9);
    2. (ii) several components assign fairly high posterior probabilities;
  6. Difference between a belief and a fact: a belief is a high-confidence fact validated by human assessment; time-sensitive facts, such as coach of the team, can also be promoted to beliefs here, so time-sensitive facts are not excluded;
  7. Knowledge base representation: multiple NoSQL key-value stores [1];
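A toy rendering of the KI promotion rule described in item 5; the 0.9 threshold comes from the summary above, while the 0.7 "fairly high" threshold and the minimum of two agreeing components are my own assumptions for illustration.

```python
# Toy version of NELL's Knowledge Integrator promotion rule: promote a
# candidate fact to a belief if one component is very confident, or if
# several components are fairly confident.
def promote_to_belief(posteriors, high=0.9, moderate=0.7, min_agreeing=2):
    # posteriors: dict mapping component name -> posterior probability.
    if any(p > high for p in posteriors.values()):
        return True
    return sum(1 for p in posteriors.values() if p >= moderate) >= min_agreeing

candidate = {"CPL": 0.75, "CSEAL": 0.72, "CMC": 0.40, "RL": 0.10}
print(promote_to_belief(candidate))   # True: two components are fairly confident
```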

 

10.Open Information Extraction: The Second Generation【2011】

  1. This paper presents the second-generation Open IE systems: ReVerb, which relies purely on verbs, and R2A2, which adds the argument identifier ARGLEARNER. What makes both methods notable: a thorough linguistic analysis of randomly sampled sentences, showing the importance and simple, regular behaviour of verbs (predicates) in extraction.
    1. REVERB: implements a novel relation phrase identifier based on generic syntactic and lexical constraints (a rough rendering of the syntactic constraint follows after this list)
    2. R2A2: adds an argument identifier, ARGLEARNER, to better extract the arguments for these relation phrases.
  2. Analyzed the weaknesses of earlier extraction methods: incoherent and uninformative extractions.
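The sketch below is an approximate rendering of ReVerb's syntactic constraint (a relation phrase is a verb, or a verb followed by words ending in a preposition) as a regular expression over coarse POS tags; the exact tag grouping is my simplification, not the paper's.

```python
# Approximate ReVerb syntactic constraint over POS-tag sequences:
# a relation phrase is V, or V P, or V W* P, where
#   V = verb (optionally with particle/adverb),
#   W = noun | adjective | adverb | pronoun | determiner,
#   P = preposition / particle / infinitive marker.
import re

V = r"(?:VERB(?:_PART)?(?:_ADV)?)"
W = r"(?:NOUN|ADJ|ADV|PRON|DET)"
P = r"(?:ADP|PART)"
RELATION_PATTERN = re.compile(rf"^{V}(?:(?:_{W})*_{P})?$")

def is_reverb_relation(pos_tags):
    return bool(RELATION_PATTERN.match("_".join(pos_tags)))

print(is_reverb_relation(["VERB"]))                 # e.g. "invented" -> True
print(is_reverb_relation(["VERB", "NOUN", "ADP"]))  # e.g. "has offices in" -> True
print(is_reverb_relation(["NOUN", "ADP"]))          # e.g. "capital of" -> False (no verb)
```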

13.PATTY: A Taxonomy of Relational Patterns with Semantic Types【2012.7】
 


11.Open Language Learning for Information Extraction【2012.7】

Problems with REVERB and WOE:

  1. they extract only relations that are mediated by verbs,
  2. they ignore context, thus extracting tuples that are not asserted as factual
  3. More specifically:
    1. REVERB, which uses shallow syntactic processing to identify relation phrases that begin with a verb and occur between the argument phrases
    2. WOE-parse, which uses bootstrapping from entries in Wikipedia info-boxes to learn extraction patterns in dependency parses.

To address this, the paper proposes OLLIE:

  1. It admits many more relation mediators: where earlier systems used only verbs, OLLIE also handles relations mediated by nouns, adjectives, and so on.
  2. A context-analysis step takes the sentence's context into account, which raises extraction precision and widens the lexical range of relation phrases.
  3. It adds a more precise extraction procedure.
  4. Main contributions:

    1. A new approach based on learning open pattern templates: relation-independent dependency-parse patterns learned automatically from a bootstrapped training set.
    2. How it works (a skeletal sketch follows after this list):
      • First, it uses a set of high precision seed tuples from REVERB to bootstrap a large training set.
      • Second, it learns open pattern templates over this training set.
      • Next, OLLIE applies these pattern templates at extraction time.
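A skeletal, assumption-laden outline of that three-step pipeline; every function body below is a placeholder standing in for the real components (ReVerb, a dependency parser, OLLIE's pattern learner), so only the data flow is meant to be informative.

```python
# Skeletal outline of OLLIE-style bootstrapping; all bodies are stubs.
def reverb_seed_tuples(corpus):
    # Step 0: high-precision (arg1, relation, arg2) seeds, e.g. from ReVerb.
    return [("Einstein", "was born in", "Ulm")]

def bootstrap_training_set(seeds, corpus):
    # Step 1: collect sentences containing both arguments of a seed tuple.
    return [(seed, sent) for seed in seeds for sent in corpus
            if seed[0] in sent and seed[2] in sent]

def learn_open_patterns(training_set):
    # Step 2: generalize the dependency paths linking the arguments into
    # relation-independent open pattern templates (shown here as a string).
    return ["{arg1} <-nsubjpass- born -prep_in-> {arg2}"]

def apply_patterns(patterns, sentence):
    # Step 3: match templates against the sentence's dependency parse.
    return [("Einstein", "be born in", "Ulm")] if "born" in sentence else []

corpus = ["Einstein was born in Ulm in 1879."]
patterns = learn_open_patterns(bootstrap_training_set(reverb_seed_tuples(corpus), corpus))
print(apply_patterns(patterns, corpus[0]))
```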

14.A Survey on Open Information Extraction【2018.6】

Surveys all Open IE work from 2007 through June 2018. A very important paper.

 


 
