Paper notes - joint extraction - Joint entity recognition and relation extraction as a multi-head selection problem

振哥在，世界充满爱！

Last modified: 2023-06-24 20:51:56


First published: 2023-06-24 20:48:47

Article link: https://blog.csdn.net/qq_30507287/article/details/131343188


Paper information

(1) Title: Joint entity recognition and relation extraction as a multi-head selection problem

(2) Download link: Redirecting

(3) Code: -

(4) Authors: -

Abstract:

1、Introduction

2、Related Work

2.1 Named entity recognition

2.2 Relation extraction

2.3 Joint entity and relation extraction

3、Joint model

3.1 Embedding layer

3.2 Bidirectional LSTM encoding layer

3.3 Named entity recognition

3.4 Relation extraction as multi-head selection

3.5 Edmonds' algorithm

4、Experimental setup

4.1 Datasets and evaluation metrics

4.2 Word embeddings

4.3 Hyperparameters and implementation details

5、Result and discussion

5.1 Results

5.2 Analysis of feature contribution (ablation study)

6、Conclusion

Appendix A

Abstract:

State-of-the-art models for joint entity recognition and relation extraction strongly rely on external natural language processing (NLP) tools such as POS (part-of-speech) taggers and dependency parsers. Thus, the performance of such joint models depends on the quality of the features obtained from these NLP tools. However, these features are not always accurate for various languages and contexts. In this paper, we propose a joint neural model which performs entity recognition and relation extraction simultaneously, without the need of any manually extracted features or the use of any external tool. Specifically, we model the entity recognition task using a CRF (Conditional Random Fields) layer and the relation extraction task as a multi-head selection problem (i.e., potentially identify multiple relations for each entity). We present an extensive experimental setup, to demonstrate the effectiveness of our method using datasets from various contexts (i.e., news, biomedical, real estate) and languages (i.e., English, Dutch). Our model outperforms the previous neural model that use automatically extracted features, while it performs within a reasonable margin of feature-based neural models, or even beats them.


1、Introduction

The goal of the entity recognition and relation extraction is to discover relational structures of entity mentions from unstructured texts. It is a central problem in information extraction since it is critical for tasks such as knowledge base population and question answering.


The problem is traditionally approached as two separate subtasks, namely (i) named entity recognition (NER) (Nadeau & Sekine, 2007) and (ii) relation extraction (RE) (Bach & Badaskar, 2007), in a pipeline setting. The main limitations of the pipeline models are: (i) error propagation between the components (i.e., NER and RE) and (ii) possible useful information from the one task is not exploited by the other (e.g., identifying a Works for relation might be helpful for the NER module in detecting the type of the two entities, i.e., PER, ORG and vice versa). On the other hand, more recent studies propose to use joint models to detect entities and their relations overcoming the aforementioned issues and achieving state-of-the-art performance (Li & Ji, 2014; Miwa & Sasaki, 2014).

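Note: this paragraph describes the limitations of pipeline approaches: (1) error propagation between the components (NER and RE); (2) potentially useful information from one task is not exploited by the other. For example, identifying a relation may help the NER module detect the types of the two entities (i.e., PER, ORG), and vice versa.

Tip: Li & Ji (2014) and Miwa & Sasaki (2014) are worth reading.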

The previous joint models heavily rely on hand-crafted features. Recent advances in neural networks alleviate the issue of manual feature engineering, but some of them still depend on NLP tools (e.g., POS taggers, dependency parsers). Miwa and Bansal (2016) propose a Recurrent Neural Network (RNN)-based joint model that uses a bidirectional sequential LSTM (Long Short Term Memory) to model the entities and a tree-LSTM that takes into account dependency tree information to model the relations between the entities. The dependency information is extracted using an external dependency parser. Similarly, in the work of Li, Zhang, Fu, and Ji (2017) for entity and relation extraction from biomedical text, a model which also uses tree-LSTMs is applied to extract dependency information. Gupta, Schütze, and Andrassy (2016) propose a method that relies on RNNs but uses a lot of hand-crafted features and additional NLP tools to extract features such as POS-tags, etc. Adel and Schütze (2017) replicate the context around the entities with Convolutional Neural Networks (CNNs). Note that the aforementioned works examine pairs of entities for relation extraction, rather than modeling the whole sentence directly. This means that relations of other pairs of entities in the same sentence — which could be helpful in deciding on the relation type for a particular pair — are not taken into account. Katiyar and Cardie (2017) propose a neural joint model based on LSTMs where they model the whole sentence at once, but still they do not have a principled way to deal with multiple relations. Bekoulis, Deleu, Demeester, and Develder (2018) introduce a quadratic scoring layer to model the two tasks simultaneously. The limitation of this approach is that only a single relation can be assigned to a token, while the time complexity for the entity recognition task is increased compared to the standard approaches with linear complexity.

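Paragraph summary: earlier joint models rely heavily on hand-crafted features. Recent advances in neural networks alleviate manual feature engineering, but some neural models still depend on NLP tools (e.g., POS taggers, dependency parsers).

Miwa and Bansal (2016): an RNN-based joint model with a bidirectional sequential LSTM for the entities and a tree-LSTM over dependency tree information for the relations between them; the dependency information comes from an external dependency parser.

Li, Zhang, Fu, and Ji (2017): entity and relation extraction from biomedical text, also using tree-LSTMs to extract dependency information.

Gupta, Schütze, and Andrassy (2016): an RNN-based method that still uses many hand-crafted features and additional NLP tools to extract features such as POS tags.

Adel and Schütze (2017): replicate the context around the entities with CNNs. All of the works above examine pairs of entities for relation extraction rather than modeling the whole sentence directly, so the relations of other entity pairs in the same sentence (which could help decide the relation type of a particular pair) are not taken into account.

Katiyar and Cardie (2017): an LSTM-based neural joint model that models the whole sentence at once, but still has no principled way to handle multiple relations.

Bekoulis, Deleu, Demeester, and Develder (2018): a quadratic scoring layer that models the two tasks simultaneously. Its limitation is that only a single relation can be assigned to a token, and the time complexity of entity recognition increases compared with standard linear-complexity approaches.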

In this work, we focus on a new general purpose joint model that performs the two tasks of entity recognition and relation extraction simultaneously, and that can handle multiple relations together. Our model achieves state-of-the-art performance in a number of different contexts (i.e., news, biomedical, real estate) and languages (i.e., English, Dutch) without relying on any manually engineered features nor additional NLP tools. In summary, our proposed model (which will be detailed next in Section 3) solves several shortcomings that we identified in related works (Section 2) for joint entity recognition and relation extraction: (i) our model does not rely on external NLP tools nor hand-crafted features, (ii) entities and relations within the same text fragment (typically a sentence) are extracted simultaneously, where (iii) an entity can be involved in multiple relations at once.

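Paragraph summary: this paragraph gives an overview of the paper's work, a new general-purpose joint model that performs entity recognition and relation extraction simultaneously and can handle multiple relations together.

The proposed model addresses several shortcomings of related work on joint entity recognition and relation extraction: (1) it relies on neither external NLP tools nor hand-crafted features; (2) entities and relations within the same text fragment are extracted simultaneously; (3) an entity can be involved in multiple relations at once.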

Specifically, the model of Miwa and Bansal (2016) depends on dependency parsers, which perform particularly well on specific languages (i.e., English) and contexts (i.e., news). Yet, our ambition is to develop a model that generalizes well in various setups, therefore using only automatically extracted features that are learned during training. For instance, Miwa and Bansal (2016) and Li et al. (2017) use exactly the same model in different contexts, i.e., news (ACE04) and biomedical data (ADE), respectively. Comparing our results to the ADE dataset, we obtain a 1.8% improvement on the NER task and ∼3% on the RE task. On the other hand, our model performs within a reasonable margin (∼0.6% in the NER task and ∼1% on the RE task) on the ACE04 dataset without the use of pre-calculated features. This shows that the model of Miwa and Bansal (2016) strongly relies on the features extracted by the dependency parsers and cannot generalize well into different contexts where dependency parser features are weak. Comparing to Adel and Schütze (2017), we train our model by modeling all the entities and the relations of the sentence at once. This type of inference is beneficial in obtaining information about neighboring entities and relations instead of just examining a pair of entities each time. Finally, we solve the underlying problem of the models proposed by Katiyar and Cardie (2017) and Bekoulis, Deleu, Demeester, and Develder (2017), who essentially assume classes (i.e., relations) to be mutually exclusive: we solve this by phrasing the relation extraction component as a multi-label prediction problem.

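Paragraph summary: the paper is compared with the works cited above to bring out its advantages.

Advantage 1: the model of Miwa and Bansal (2016) depends on dependency parsers, which perform particularly well on specific languages (i.e., English) and contexts (i.e., news). This paper instead aims for a model that generalizes well across setups, so it uses only automatically extracted features learned during training.

Advantage 2: Miwa and Bansal (2016) and Li et al. (2017) use exactly the same model in different contexts, news (ACE04) and biomedical data (ADE), respectively. On ADE the proposed model improves NER by 1.8% and RE by ∼3%, while on ACE04 it stays within a reasonable margin (∼0.6% on NER, ∼1% on RE) without pre-computed features. The model of Miwa and Bansal (2016) therefore relies strongly on dependency parser features and does not generalize well to contexts where those features are weak, whereas the proposed model does.

Advantage 3: compared with Adel and Schütze (2017), all entities and relations of the sentence are modeled at once. This kind of inference gives access to information about neighboring entities and relations instead of examining one entity pair at a time.

Advantage 4: Katiyar and Cardie (2017) and Bekoulis, Deleu, Demeester, and Develder (2017) essentially assume the classes (i.e., relations) to be mutually exclusive. The paper solves this by phrasing the relation extraction component as multi-label prediction.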

To demonstrate the effectiveness of the proposed method, we conduct the largest experimental evaluation to date (to the best of our knowledge) in jointly performing both entity recognition and relation extraction (see Sections 4 and 5), using different datasets from various domains (i.e., news, biomedical, real estate) and languages (i.e., English, Dutch). Specifically, we apply our method to four datasets, namely ACE04 (news), Adverse Drug Events (ADE), Dutch Real Estate Classifieds (DREC) and CoNLL’04 (news). Our method outperforms all state-of-the-art methods that do not rely on any additional features or tools, while performance is very close (or even better in the biomedical dataset) compared to methods that do exploit hand-engineered features or NLP tools.

Paragraph summary: a brief account of the experiments and their results.

2、Related Work

The tasks of entity recognition and relation extraction can be applied either one by one in a pipeline setting (Bekoulis et al., 2017; Fundel, Küffner, & Zimmer, 2007; Gurulingappa, MateenRajput & Toldo, 2012) or in a joint model (Bekoulis et al., 2018; Miwa & Bansal, 2016; Miwa & Sasaki, 2014). In this section, we present related work for each task (i.e., named entity recognition and relation extraction) as well as prior work into joint entity and relation extraction.


2.1 Named entity recognition

In our work, NER is the first task which we solve in order to address the end-to-end relation extraction problem. A number of different methods for the NER task that are based on hand-crafted features have been proposed, such as CRFs (Lafferty, McCallum, & Pereira, 2001), Maximum Margin Markov Networks (Taskar, Guestrin, & Koller, 2003) and support vector machines (SVMs) for structured output (Tsochantaridis, Hofmann, Joachims, & Altun, 2004), to name just a few. Recently, deep learning methods such as CNN- and RNN-based models have been combined with CRF loss functions (Collobert et al., 2011; Huang, Xu, & Yu, 2015; Lample, Ballesteros, Subramanian, Kawakami, & Dyer, 2016; Ma & Hovy, 2016) for NER. These methods achieve state-of-the-art performance on publicly available NER datasets without relying on hand-crafted features.

Note: the references above cover the main methods and prior work for NER.

2.2 Relation extraction

We consider relation extraction as the second task of our joint model. The main approaches for relation extraction rely either on hand-crafted features (Kambhatla, 2004; Zelenko, Aone, & Richardella, 2003) or neural networks (Socher, Huval, Manning, & Ng, 2012; Zeng, Liu, Lai, Zhou, & Zhao, 2014). Feature-based methods focus on obtaining effective hand-crafted features, for instance defining kernel functions (Culotta & Sorensen, 2004; Zelenko et al., 2003) and designing lexical, syntactic, semantic features, etc. (Kambhatla, 2004; Rink & Harabagiu, 2010). Neural network models have been proposed to overcome the issue of manually designing hand-crafted features leading to improved performance. CNN- (dos Santos, Xiang, & Zhou, 2015; Xu, Feng, Huang, & Zhao, 2015; Zeng et al., 2014) and RNN-based (Socher, Chen, Manning, & Ng, 2013; Xu, Mou et al., 2015; Zhang & Wang, 2015) models have been introduced to automatically extract lexical and sentence level features leading to a deeper language understanding. Vu, Adel, Gupta, and Schütze (2016) combine CNNs and RNNs using an ensemble scheme to achieve state-of-the-art results.

Note: the references above cover the main methods and prior work for RE: hand-crafted features, neural networks, feature-based methods (kernel functions; lexical, syntactic and semantic features), CNNs, RNNs, and CNN+RNN ensembles.

2.3 Joint entity and relation extraction

Entity and relation extraction includes the task of (i) identifying the entities (described in Section 2.1) and (ii) extracting the relations among them (described in Section 2.2). Feature-based joint models (Kate & Mooney, 2010; Li & Ji, 2014; Miwa & Sasaki, 2014; Yang & Cardie, 2013) have been proposed to simultaneously solve the entity recognition and relation extraction (RE) subtasks. These methods rely on the availability of NLP tools (e.g., POS taggers) or manually designed features and thus (i) require additional effort for the data preprocessing, (ii) perform poorly in different application and language settings where the NLP tools are not reliable, and (iii) increase the computational complexity. In this paper, we introduce a joint neural network model to overcome the aforementioned issues and to automatically perform end-to-end relation extraction without the need of any manual feature engineering or the use of additional NLP components.


Neural network approaches have been considered to address the problem in a joint setting (end-to-end relation extraction) and typically include the use of RNNs and CNNs (Li et al., 2017; Miwa & Bansal, 2016; Zheng et al., 2017). Specifically, Miwa and Bansal (2016) propose the use of bidirectional tree-structured RNNs to capture dependency tree information (where parse trees are extracted using state-of-the-art dependency parsers) which has been proven beneficial for relation extraction (Xu, Feng et al., 2015; Xu, Mou et al., 2015). Li et al. (2017) apply the work of Miwa and Bansal (2016) to biomedical text, reporting state-of-the-art performance for two biomedical datasets. Gupta et al. (2016) propose the use of a lot of hand-crafted features along with RNNs. Adel and Schütze (2017) solve the entity classification task (which is different from NER since in entity classification the boundaries of the entities are known and only the type of the entity should be predicted) and relation extraction problems using an approximation of a global normalization objective (i.e., CRF): they replicate the context of the sentence (left and right part of the entities) to feed one entity pair at a time to a CNN for relation extraction. Thus, they do not simultaneously infer other potential entities and relations within the same sentence. Katiyar and Cardie (2017) and Bekoulis et al. (2018) investigate RNNs with attention for extracting relations between entity mentions without using any dependency parse tree features. Different from Katiyar and Cardie (2017), in this work, we frame the problem as a multi-head selection problem by using a sigmoid loss to obtain multiple relations and a CRF loss for the NER component. This way, we are able to independently predict classes that are not mutually exclusive, instead of assigning equal probability values among the tokens. We overcome the issue of additional complexity described by Bekoulis et al. (2018), by dividing the loss functions into a NER and a relation extraction component. Moreover, we are able to handle multiple relations instead of just predicting single ones, as was described for the application of structured real estate advertisements of Bekoulis et al. (2018).

3、Joint model

An overall description of the joint model follows.

In this section, we present our multi-head joint model illustrated in Fig. 1. The model is able to simultaneously identify the entities (i.e., types and boundaries) and all the possible relations between them at once. We formulate the problem as a multi-head selection problem extending previous work (Bekoulis et al., 2018; Zhang, Cheng, & Lapata, 2017) as described in Section 2.3. By multi-head, we mean that any particular entity may be involved in multiple relations with other entities. The basic layers of the model, shown in Fig. 1, are: (i) embedding layer, (ii) bidirectional sequential LSTM (BiLSTM) layer, (iii) CRF layer and the (iv) sigmoid scoring layer. In Fig. 1, an example sentence from the CoNLL04 dataset is presented. The input of our model is a sequence of tokens (i.e., words of the sentence) which are then represented as word vectors (i.e., word embeddings). The BiLSTM layer is able to extract a more complex representation for each word that incorporates the context via the RNN structure. Then the CRF and the sigmoid layers are able to produce the outputs for the two tasks. The outputs for each token (e.g., Smith) are twofold: (i) an entity recognition label (e.g., I-PER, denoting the token is inside a named entity of type PER) and (ii) a set of tuples comprising the head tokens of the entity and the types of relations between them (e.g., {(Center, Works for), (Atlanta, Lives in)}). Since we assume token-based encoding, we consider only the last token of the entity as head of another token, eliminating redundant relations. For instance, there is a Works for relation between entities “John Smith” and “Disease Control Center”. Instead of connecting all tokens of the entities, we connect only “Smith” with “Center”. Also, for the case of no relation, we introduce the “N” label and we predict the token itself as the head.

Note: because the encoding is token-based, only the last token of an entity is linked as the head of another token, e.g., "Smith" is connected to "Center" for the Works for relation between "John Smith" and "Disease Control Center". The relation is thus attached to only part of each entity, which feels like a bit of a shortcut. When a token has no relation, the "N" label is used and the token itself is predicted as its head.
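The token-based head encoding described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `encode_heads`, the span representation and the example sentence are hypothetical and are not taken from the paper's code. Only the entities and relation names come from the text.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices of an entity


def encode_heads(
    n_tokens: int,
    relations: List[Tuple[Span, Span, str]],
) -> Dict[int, List[Tuple[int, str]]]:
    """Map entity-level relations to per-token (head index, relation) lists.

    Only the last token of each entity is linked. Every token that takes
    part in no relation gets itself as head with the "N" label.
    """
    heads: Dict[int, List[Tuple[int, str]]] = {i: [] for i in range(n_tokens)}
    for (_, src_end), (_, dst_end), label in relations:
        heads[src_end].append((dst_end, label))
    for i in range(n_tokens):
        if not heads[i]:
            heads[i].append((i, "N"))
    return heads


# Hypothetical sentence built around the entities named in the text.
tokens = ["John", "Smith", "works", "at", "the",
          "Disease", "Control", "Center", "in", "Atlanta"]
relations = [
    ((0, 1), (5, 7), "Works for"),  # "John Smith" -> "Disease Control Center"
    ((0, 1), (9, 9), "Lives in"),   # "John Smith" -> "Atlanta"
]
heads = encode_heads(len(tokens), relations)
for i, pairs in heads.items():
    print(tokens[i], [(tokens[j], label) for j, label in pairs])
```

In this sketch "Smith" ends up with two heads ("Center" and "Atlanta"), while "John" and all other tokens point to themselves with the "N" label.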

Fig. 1. The multi-head selection model for joint entity and relation extraction. The input of our model is the words of the sentence which are then represented as word vectors (i.e., embeddings). The BiLSTM layer extracts a more complex representation for each word. Then the CRF and the sigmoid layers are able to produce the outputs for the two tasks. The outputs for each token (e.g., Smith) are: (i) an entity recognition label (e.g., I-PER) and (ii) a set of tuples comprising the head tokens of the entity and the types of relations between them (e.g., {(Center, Works for), (Atlanta, Lives in)}).


3.1 Embedding layer

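Embedding layer strategy: a sentence w is treated as a sequence of tokens, and word embeddings map each token to a word vector $w_{word2vec}$. The paper uses pre-trained word embeddings from the Skip-Gram word2vec model (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013).

The paper also uses character embeddings, which are commonly applied in neural NER (Lample et al., 2016; Ma & Hovy, 2016). Character embeddings can capture morphological features such as prefixes and suffixes. For example, in the Adverse Drug Events (ADE) dataset, the suffix "toxicity" can indicate an adverse drug event entity such as "neurotoxicity" or "hepatotoxicity" and is therefore highly informative. Another example is the Dutch suffix "kamer" in the Dutch real estate dataset DREC, which indicates the space entities "badkamer" and "slaapkamer". Character-level embeddings are learned during training, as in Ma and Hovy (2016) and Lample et al. (2016). In Lample et al. (2016), character embeddings improved the NER F1 score by up to 3%. In this paper, incorporating character embeddings increases the overall F1 score by ∼2%, as shown in Table 2.

Figure 2 illustrates the neural architecture that generates word embeddings from characters. Each character of a word is represented by a character vector (i.e., embedding). The character embeddings are fed into a BiLSTM and the two final states (forward and backward) are concatenated. The resulting vector $w_{chars}$ is the character-level representation of the word. This vector is then concatenated with the word-level representation $w_{word2vec}$ to obtain the complete word embedding vector.

Fig. 2. Detailed structure of the embedding layer. The character-level embeddings of the word "Man" are passed through a BiLSTM, the two final states are concatenated to form $w_{chars}$, and $w_{chars}$ is then concatenated with $w_{word2vec}$.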

3.2 Bidirectional LSTM encoding layer

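RNNs are commonly used to model sequential data and have been applied successfully to various NLP tasks (Lample et al., 2016; Miwa & Bansal, 2016; Sutskever, Vinyals, & Le, 2014). The paper uses multi-layer LSTMs, a specific type of RNN that captures long-term dependencies well (Bengio, Simard, & Frasconi, 1994; Pascanu, Mikolov, & Bengio, 2013). A BiLSTM is adopted, which encodes each word from left to right (past to future) and from right to left (future to past). The bidirectional information for each word is combined by concatenating the forward and backward outputs at timestep i.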

3.3 Named entity recognition

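The paper formulates entity recognition as a sequence labeling problem using the BIO encoding scheme, similar to previous joint learning models (references) and named entity recognition work (references). Each entity consists of multiple consecutive tokens in the sentence, and a tag is assigned to every token. This identifies the entity arguments (start and end position) and their type (e.g., ORG). To do so, the B-type (beginning) tag is assigned to the first token of an entity, the I-type (inside) tag to every other token within the entity, and the O tag (outside) to tokens that are not part of any entity. In the CRF layer (Fig. 1), the B-ORG and I-ORG tags indicate the beginning and inside tokens of the entity "Disease Control Center", respectively. On top of the BiLSTM layer, either a softmax or a CRF layer computes the most probable entity tag for each token. The score of each token $w_i$ for each entity tag is computed as follows: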

$s^{(e)}(h_{i})=V^{(e)}f(U^{(e)}h_{i}+b^{(e)})$

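Notation: the superscript (e) denotes the entity recognition task;

f(·) is an element-wise activation function (i.e., relu, tanh);

$V^{(e)} \in \mathbb{R}^{p\times l}$ , $U^{(e)} \in \mathbb{R}^{l \times 2d}$ , $b^{(e)} \in \mathbb{R}^{l}$ , where d is the hidden size of the LSTM, p is the number of NER tags (e.g., B-ORG) and l is the layer width.

The probabilities of all candidate tags for a given token $w_i$ are computed as:

$Pr(tag|w_{i})=softmax(s^{(e)}(h_{i}))$ , where $Pr(tag|w_{i}) \in \mathbb R^{p}$ .

The softmax approach is used for the entity classification (EC) task (similar to NER), where the boundaries are assumed to be given and only the entity type (e.g., PER) of each token needs to be predicted. The CRF approach is used for the NER task, which includes both entity type and boundary recognition.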
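The BIO scheme described in this subsection can be sketched as a pair of conversion helpers. This is a minimal illustration: the function names, the `(start, end, type)` span format with inclusive ends, and the 10-token example are assumptions made for this sketch, not part of the paper.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start, end, type), end index inclusive


def spans_to_bio(n_tokens: int, entities: List[Entity]) -> List[str]:
    """Convert entity spans into one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{etype}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Entity]:
    """Recover entity spans (boundaries and type) from a BIO tag sequence."""
    spans: List[Entity] = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((start, i - 1, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


# Hypothetical 10-token sentence: "John Smith" (PER), "Disease Control
# Center" (ORG) and "Atlanta" (LOC) at the positions given below.
entities = [(0, 1, "PER"), (5, 7, "ORG"), (9, 9, "LOC")]
tags = spans_to_bio(10, entities)
print(tags)
print(bio_to_spans(tags))
```

In the EC setting the boundaries are given and only the type part of each tag has to be predicted, whereas in the NER setting both the B/I/O part and the type are predicted.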

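In the softmax approach, entity types are assigned to tokens greedily at prediction time (i.e., the selected tag is simply the highest-scoring tag among all possible tags). Although assuming an independent tag distribution is beneficial for entity classification tasks (e.g., POS tagging), this is not the case when there are strong dependencies between tags. Specifically, in NER the BIO tagging scheme enforces several restrictions (e.g., B-LOC cannot be followed by I-PER). The softmax method makes local decisions (i.e., for the tag of each token $w_i$): even though the BiLSTM captures information about neighboring words, the neighboring tags are not taken into account when deciding the tag of a specific token. For example, in the entity "John Smith", tagging "Smith" as PER is useful for deciding that "John" is B-PER. To this end, for NER a linear-chain CRF is used, similar to Lample et al. (2016), who report an improvement of ∼1% in NER F1 when using a CRF. In this paper, using a CRF also yields an overall performance improvement of ∼1%, as shown in Table 2 (see Section 5.2). Given the word vectors w, a sequence of score vectors $s^{(e)}_{1},...,s^{(e)}_{n}$ and a vector of tag predictions $y^{(e)}_{1},...,y^{(e)}_{n}$ , the linear-chain CRF score is defined as: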

$S(y^{(e)}_{1}, ..., y^{(e)}_{n})=\sum_{i=0}^{n}s^{(e)}_{i,y^{(e)}_{i}}+\sum_{i=1}^{n-1}T_{y^{(e)}_{i}, y^{(e)}_{i+1}}$

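Notation: $S \in \mathbb R$ , and $s^{(e)}_{i,y^{(e)}_{i}}$ is the score of the predicted tag for token $w_i$. T is a square transition matrix in which each entry is the transition score from one tag to another.

$T \in \mathbb R^{(p+2)\times (p+2)}$ because $y^{(e)}_{0}$ and $y^{(e)}_{n}$ are two auxiliary tags that represent the start and end tags of the sentence. The probability of a given tag sequence over all possible tag sequences for the input sentence w is then defined as: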

$Pr(y^{(e)}_{1}, ..., y^{(e)}_{n}|w)=\frac{e^{S(y^{(e)}_{1},..., y^{(e)}_{n})}}{\sum_{\widetilde{y^{(e)}_{1}}, ..., \widetilde{y^{(e)}_{n}}}^{}e^{S(\widetilde{y^{(e)}_{1}} ,..., \widetilde{y^{(e)}_{n}})}}$

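Viterbi decoding is used to obtain the highest-scoring tag sequence y. Both the softmax (for the EC task) and the CRF (for NER) are trained by minimizing the cross-entropy loss. The entity tags are used as input to the relation extraction layer by learning label embeddings, inspired by Miwa and Bansal (2016), who report a 2% F1 improvement from label embeddings. In this paper, label embeddings increase the F1 score by 1%. The input to the next layer is twofold: the output states of the LSTM and the learned label embedding representation, encoding the intuition that knowledge of named entities can be useful for relation extraction. During training the gold entity tags are used, whereas at prediction time the predicted entity tags are fed to the next layer. The input to the next layer is the concatenation of the hidden LSTM state $h_i$ and the label embedding $g_i$ of token $w_i$: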

$z_{i}=[h_{i};g_{i}], i=0,1,...,n$
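The linear-chain CRF score, the sequence probability and Viterbi decoding from this subsection can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: it omits the two auxiliary start/end tags, normalises by brute-force enumeration (feasible only for toy inputs), and all emission and transition numbers are made up.

```python
from itertools import product
from math import exp
from typing import List, Sequence

Matrix = List[List[float]]


def sequence_score(emissions: Matrix, transitions: Matrix, tags: Sequence[int]) -> float:
    """Sum of per-token tag scores plus tag-to-tag transition scores."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def viterbi(emissions: Matrix, transitions: Matrix) -> List[int]:
    """Return the highest-scoring tag sequence by dynamic programming."""
    n_tags = len(transitions)
    best = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for row in emissions[1:]:
        step, new_best = [], []
        for cur in range(n_tags):
            prev = max(range(n_tags), key=lambda p: best[p] + transitions[p][cur])
            step.append(prev)
            new_best.append(best[prev] + transitions[prev][cur] + row[cur])
        backpointers.append(step)
        best = new_best
    tag = max(range(n_tags), key=lambda t: best[t])
    path = [tag]
    for step in reversed(backpointers):
        tag = step[tag]
        path.append(tag)
    return path[::-1]


def sequence_probability(emissions: Matrix, transitions: Matrix, tags: Sequence[int]) -> float:
    """Probability of `tags`, normalised over every possible tag sequence."""
    n_tags = len(transitions)
    z = sum(exp(sequence_score(emissions, transitions, cand))
            for cand in product(range(n_tags), repeat=len(emissions)))
    return exp(sequence_score(emissions, transitions, tags)) / z


TAGS = ["O", "B-PER", "I-PER"]
emissions = [[0.2, 1.5, 0.1],   # made-up scores for "John"
             [0.9, 0.1, 1.0]]   # made-up scores for "Smith"
transitions = [[0.5, 0.0, -10.0],  # O -> I-PER is heavily penalised
               [0.0, -1.0, 1.0],
               [0.0, -1.0, 0.5]]
path = viterbi(emissions, transitions)
print([TAGS[t] for t in path], sequence_probability(emissions, transitions, path))
```

The large negative O to I-PER transition shows how the transition matrix can encode BIO restrictions that a per-token softmax cannot.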

3.4 Relation extraction as multi-head selection

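This section describes the relation extraction task, formulated as a multi-head selection problem (Bekoulis et al., 2018; Zhang et al., 2017). In the general formulation of the method, each token $w_i$ can have multiple heads (i.e., multiple relations with other tokens). The model predicts the tuple ($y_i$, $c_i$), where $y_i$ is the vector of heads and $c_i$ is the vector of the corresponding relations for each token $w_i$. This differs from the standard head selection used previously for dependency parsing (Zhang et al., 2017) in that (1) it is extended to predict multiple heads and (2) the decisions for the heads and the relations are taken jointly (i.e., instead of first predicting the heads and then, in a next step, predicting the relations with an additional classifier). Given a token sequence w and a set of relation labels R as input, the goal is to identify for each token $w_i$ the vector of the most probable heads $y_i \subseteq w$ and the vector of the most probable corresponding relation labels $r_i \subseteq R$. The score between tokens $w_i$ and $w_j$ given a label $r_k$ is computed as follows: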

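Eq. (7) is the probability of token $w_j$ being selected as the head of token $w_i$ with the relation label $r_k$ between them, where $\sigma$ denotes the sigmoid function. The cross-entropy loss $L_{rel}$ is minimized during training: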
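The three equations this subsection refers to (the relation score, Eq. (7), and the loss) did not survive in this post. The following is a reconstruction from the surrounding definitions, written in the same notation as the entity scoring layer; the weight name $W^{(r)}$ and the exact summation bounds are assumptions and should be checked against the paper.

$s^{(r)}(z_{j}, z_{i}, r_{k})=V^{(r)}f(U^{(r)}z_{j}+W^{(r)}z_{i}+b^{(r)})$

$Pr(head=w_{j}, label=r_{k}|w_{i})=\sigma(s^{(r)}(z_{j}, z_{i}, r_{k}))$ (Eq. 7)

$L_{rel}=\sum_{i=0}^{n}\sum_{j=0}^{m}-\log Pr(head=y_{i,j}, relation=r_{i,j}|w_{i})$

Here $z_i=[h_i;g_i]$ is the concatenated input defined at the end of Section 3.3, f(·) is the same element-wise activation as in the entity scoring layer, and the superscript (r) marks the parameters of the relation task.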

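where $y_i \subseteq w$ and $r_i \subseteq R$ are the ground-truth vectors of heads and associated relation labels of $w_i$, and m is the number of relations (heads) of $w_i$. After training, the combinations of heads $\hat{y}_i$ and relation labels $\hat{r}_i$ whose estimated joint probability exceeds a threshold are kept. Unlike previous work on joint models (Katiyar & Cardie, 2017), the model can predict multiple relations because the classes are treated as independent rather than mutually exclusive (the probabilities of the different classes do not necessarily sum to 1). For the joint entity and relation extraction task, the final objective is $L_{ner} + L_{rel}$.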
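Threshold-based multi-head decoding and the relation loss described above can be sketched as follows. The raw scores, the label set, the three-token example and the 0.5 threshold are all hypothetical values chosen for illustration; the paper only states that a threshold on the estimated joint probability is used.

```python
from math import exp, log
from typing import Dict, List, Tuple

LABELS = ["N", "Works for", "Lives in"]  # hypothetical relation label set
Scores = List[List[List[float]]]         # scores[i][j][k]: j is head of i with LABELS[k]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + exp(-x))


def decode_heads(scores: Scores, threshold: float = 0.5) -> Dict[int, List[Tuple[int, str]]]:
    """Keep every (head, relation) pair whose sigmoid probability exceeds the threshold."""
    predicted = {}
    for i, per_head in enumerate(scores):
        predicted[i] = [
            (j, LABELS[k])
            for j, per_label in enumerate(per_head)
            for k, s in enumerate(per_label)
            if sigmoid(s) > threshold
        ]
    return predicted


def relation_loss(scores: Scores, gold: Dict[int, List[Tuple[int, int]]]) -> float:
    """Sum of -log probabilities of the gold (head index, label index) pairs."""
    return sum(
        -log(sigmoid(scores[i][j][k]))
        for i, pairs in gold.items()
        for j, k in pairs
    )


# Tokens 0, 1, 2 stand for "Smith", "Center" and "Atlanta" (made-up scores).
n = 3
scores = [[[-5.0] * len(LABELS) for _ in range(n)] for _ in range(n)]
scores[0][1][1] = 3.0  # Smith -> Center, Works for
scores[0][2][2] = 2.0  # Smith -> Atlanta, Lives in
scores[1][1][0] = 4.0  # Center has no relation: head is itself, label N
scores[2][2][0] = 4.0  # Atlanta has no relation: head is itself, label N

predicted = decode_heads(scores)
gold = {0: [(1, 1), (2, 2)], 1: [(1, 0)], 2: [(2, 0)]}
loss = relation_loss(scores, gold)
print(predicted, loss)
```

Because each (head, label) pair gets its own sigmoid, token 0 can keep two heads at once, which a softmax over heads (the single-head variant) could not express.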

3.5 Edmonds' algorithm

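The model extracts entity mentions and the relations between them simultaneously. (Note: the term "entity mention" comes up frequently; it refers to an occurrence of an entity in the context, and each occurrence counts as one more mention.) To demonstrate the effectiveness and the general-purpose nature of the model, the paper also tests it on the recently introduced Dutch Real Estate Classifieds (DREC) dataset (Bekoulis et al., 2017), in which the entities need to form a tree structure. Threshold-based inference does not guarantee a tree structure for the relations, so tree structure constraints should be enforced on the model. To this end, the output of the system is post-processed with Edmonds' maximum spanning tree algorithm for directed graphs (Chu & Liu, 1965; Edmonds, 1967). A fully connected directed graph G = (V, E) is constructed, where the vertices V are the last tokens of the identified entities (as predicted by NER) and the edges E are the highest-scoring relations, with their scores as weights. Edmonds' algorithm is applied in the cases where threshold inference has not already produced a tree.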
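To make the goal of this post-processing step concrete, the sketch below finds the maximum-weight tree over a tiny graph by exhaustive search. It is deliberately not Edmonds' algorithm: brute force stands in for it here because it is easy to verify, and it only works for a handful of nodes. The fixed root, the weight matrix and the function names are assumptions made for this illustration.

```python
from itertools import product
from typing import Dict, List


def reaches_root(heads: Dict[int, int], node: int, root: int) -> bool:
    """Follow head pointers from `node`; True if the root is reached without a cycle."""
    seen = set()
    while node != root:
        if node in seen:
            return False
        seen.add(node)
        node = heads[node]
    return True


def best_tree(weights: List[List[float]], root: int = 0) -> Dict[int, int]:
    """Pick one head per non-root node so the result is a tree rooted at
    `root` with maximum total edge weight (exhaustive search).

    weights[h][d] is the score of the edge from head h to dependent d.
    """
    n = len(weights)
    nodes = [v for v in range(n) if v != root]
    candidates = [[h for h in range(n) if h != d] for d in nodes]
    best_heads: Dict[int, int] = {}
    best_score = float("-inf")
    for choice in product(*candidates):
        heads = dict(zip(nodes, choice))
        if not all(reaches_root(heads, d, root) for d in nodes):
            continue  # contains a cycle, so it is not a tree
        score = sum(weights[h][d] for d, h in heads.items())
        if score > best_score:
            best_heads, best_score = heads, score
    return best_heads


# Made-up scores: picking the best head for each node independently gives
# 1 <- 2 and 2 <- 1, which is a cycle rather than a tree.
weights = [
    [0.0, 1.0, 1.0],  # edges from node 0 (root)
    [0.0, 0.0, 5.0],  # edges from node 1
    [0.0, 6.0, 0.0],  # edges from node 2
]
tree = best_tree(weights)
print(tree)
```

Edmonds' algorithm reaches the same optimum without enumerating all head assignments, by contracting cycles in the greedy solution; the exhaustive version only shows what is being optimized.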

4、Experimental setup

4.1 Datasets and evaluation metrics

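The experiments are run on the following four datasets:

(1) Automatic Content Extraction, ACE04 (Doddington et al., 2004);

(2) Adverse Drug Events, ADE (Gurulingappa, Mateen-Rajput et al., 2012);

(3) Dutch Real Estate Classifieds, DREC (Bekoulis et al., 2017);

(4) the CoNLL'04 dataset with entity and relation recognition corpora (Roth & Yih, 2004).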

4.2 Word embeddings

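The paper uses the pre-trained word2vec embeddings from previous work, so that the model keeps the same inputs and the results are comparable and not affected by the input embeddings. Specifically, for the ACE04 dataset it uses the 200-dimensional word embeddings from Miwa and Bansal (2016), trained on Wikipedia.

For the CoNLL04 corpus, it uses the 50-dimensional word embeddings from Adel and Schütze (2017), also trained on Wikipedia.

For DREC, it uses the 128-dimensional word2vec embeddings from Bekoulis et al. (2018), trained on a large collection of 887K Dutch real estate advertisements.

For the ADE dataset, it uses the 200-dimensional embeddings from Li et al. (2017), trained on a combination of PubMed and PMC texts and text extracted from English Wikipedia (Moen & Ananiadou, 2013).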

4.3 Hyperparameters and implementation details

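The joint model is developed in Python with the TensorFlow machine learning library (Abadi et al., 2016). Training uses the Adam optimizer (Kingma & Ba, 2015) with a learning rate of $10^{-3}$. The size of the LSTM is fixed to d = 64 and the layer width of the neural network to l = 64 (for both the entity and the relation scoring layers). Dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) is used to regularize the network. It is applied to the input embeddings and in between the hidden layers for both tasks, with dropout values that differ by setting (0.2 to 0.4). The hidden dimension of the character-based LSTMs is 25. The label embedding size is fixed to b = 25 for all datasets except CoNLL04, where label embeddings were not beneficial and were not used. Of the tanh and relu activation functions, relu is used only on ACE04 and tanh on all other datasets.

Early stopping based on the validation set is employed. In all the datasets examined in this study, the best hyperparameters are obtained after 60 to 200 epochs, depending on the size of the dataset. The best epoch is selected according to the results on the validation set. More details on the effect of each hyperparameter on model performance can be found in Appendix A.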

5、Result and discussion

5.1 Results

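The proposed models are as follows:

(1) multi-head is the proposed model, with a CRF layer for NER and a sigmoid loss for multi-head prediction;

(2) multi-head+E is the proposed model with the Edmonds algorithm added, to guarantee a tree-structured output for the DREC dataset;

(3) single-head is the proposed method, but it predicts only one head per token using a softmax loss instead of a sigmoid;

(4) multi-head EC is the proposed method with a softmax to predict the entity classes, assuming the boundaries are given, and a sigmoid loss for multi-head selection.

Table 1 also indicates whether each setting includes hand-crafted features or features derived from NLP tools (e.g., POS taggers, dependency parsers). The √ symbol marks a model that includes such additional features, and the × symbol marks a model based only on automatically extracted features. None of the variants of the proposed model rely on any additional features. The next column states the type of evaluation conducted for each experiment. Different evaluation types are included so that the results can be compared with those of previous studies. Specifically, three evaluation types are used:

(1) Strict: an entity is considered correct if both its boundaries and its type are correct; a relation is correct when its type and its argument entities are both correct; (entity boundaries, entity type and relation all correct)

(2) Boundaries: an entity is considered correct if only its boundaries are correct (the entity type is not considered); a relation is correct when its type and its argument entities are both correct; (entity boundaries and relation correct)

(3) Relaxed: assuming the boundaries are given, a multi-token entity is scored as correct if at least one of its tokens has the correct type; a relation is correct when its type and its argument entities are both correct. (for a multi-token entity one correctly typed token is enough; the relation must still be correct)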
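The difference between the Strict and Boundaries settings for entities can be sketched with a small F1 helper. The function names, the `(start, end, type)` entity format and the example predictions are assumptions for this illustration; the Relaxed setting is left out because it is scored per token with given boundaries.

```python
from typing import Iterable, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, type)


def f1_score(gold: Set, pred: Set) -> float:
    """Harmonic mean of precision and recall over exact set matches."""
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def entity_f1(gold: Iterable[Entity], pred: Iterable[Entity], mode: str = "strict") -> float:
    """Entity F1 under the Strict or Boundaries evaluation type."""
    if mode == "strict":       # boundaries and type must both match
        return f1_score(set(gold), set(pred))
    if mode == "boundaries":   # the entity type is ignored
        return f1_score({(s, e) for s, e, _ in gold},
                        {(s, e) for s, e, _ in pred})
    raise ValueError(f"unsupported mode: {mode}")


gold_entities = [(0, 1, "PER"), (5, 7, "ORG")]
pred_entities = [(0, 1, "PER"), (5, 7, "LOC")]  # right span, wrong type
strict_f1 = entity_f1(gold_entities, pred_entities, "strict")
boundaries_f1 = entity_f1(gold_entities, pred_entities, "boundaries")
print(strict_f1, boundaries_f1)
```

The second prediction has the right span but the wrong type, so it counts as an error under Strict and as a hit under Boundaries.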

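The next three columns present the results for the entity recognition task, followed by three columns for the relation extraction task. The last column reports the average F1 over the two subtasks. In Table 1, the best result for each dataset among the models that use automatically extracted features is marked in bold.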

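On ACE04, the model outperforms the model of Katiyar and Cardie (2017) by ∼2% on both tasks. This improvement can be explained by the multi-head selection method, which naturally captures multiple relations and models them as a multi-label problem. Unlike the work of Katiyar and Cardie (2017), the class probabilities do not necessarily sum to 1, since the classes are considered independent. Moreover, the NER task is modeled with a CRF layer to capture the dependencies between sequential tokens. Finally, character-level embeddings give more effective word representations.

On the other hand, compared to Miwa and Bansal (2016), the model performs within a reasonable margin (∼0.5% for the NER task and ∼1% for the RE task). This difference can be explained by the fact that the model of Miwa and Bansal (2016) relies on syntactic features obtained from POS tagging and dependency parsing. However, such features depend on NLP tools, which are not always accurate for various languages and contexts. For instance, the same model was adopted by Li et al. (2017) on the ADE biomedical dataset, where the proposed model reports an improvement of more than 3% on the RE task. This shows that the model produces automatically extracted features that perform reasonably well in all contexts (e.g., news, biomedical).

For the CoNLL04 dataset there are two different evaluation settings, Relaxed and Strict. In the relaxed setting, the EC task is performed instead of NER, assuming the boundaries of the entities are given. This setting is adopted to produce results comparable with previous studies (Adel & Schütze, 2017; Gupta et al., 2016). As in Adel and Schütze (2017), results are reported for a single model rather than an ensemble. The model substantially outperforms all previous models that do not rely on complex hand-crafted features (by > 4% on both tasks). Unlike those studies, which consider pairs of entities to obtain the entity types and the corresponding relations, the whole sentence is modeled at once. The method can therefore infer all the entities and relations of a sentence directly and benefit from their possible interactions, which cannot be modeled when training on each entity pair separately, one at a time. In the same setting, the results of Gupta et al. (2016) are also reported; they use multiple complex hand-crafted features derived from NLP tools.

The model performs slightly better on the EC task and is within a 1% margin in terms of overall F1 score. The difference in overall performance arises because the model uses only automatically generated features. Results are also reported on the same dataset for NER (i.e., predicting entity types and boundaries) evaluated with the strict measure, similar to Miwa and Sasaki (2014). These results are not directly comparable with Miwa and Sasaki (2014) because the splits provided by Gupta et al. (2016) are used. Nevertheless, the results of Miwa and Sasaki (2014) are given as a reference. The overall F1 score improves by ∼2%, which indicates that the neural model extracts more informative representations than feature-based approaches.

Results are also reported for the DREC dataset with two different evaluation settings, Boundaries and Strict. The earlier results of Bekoulis et al. (2018) are converted to the Boundaries setting to make them comparable to this model, since that work reports token-based F1 scores, which is not a common evaluation metric for relation extraction. In addition, that work focuses only on identifying the boundaries of the entities and not their types (e.g., floor, space). In the Boundaries evaluation, the model achieves an improvement of ∼3% on both tasks. This is because their quadratic scoring layer benefits the RE task but complicates NER, which is usually modeled as a sequence labeling task. Results are also reported with the strict evaluation used in most related work. Using the prior knowledge that each entity has only one head, the model can be simplified to predict a single head each time (i.e., using a softmax loss). The difference between the single-head and multi-head models is marginal (< 0.1% on both tasks). This shows that the multi-head model can adapt to various environments, even when the setting is single-head (in terms of the application, and thus also in both training and test data).

Finally, the model is compared with previous work on the ADE dataset (Li et al., 2017; Li et al., 2016). Both previous models use hand-crafted features or features derived from NLP tools. Nevertheless, the model outperforms both of them under the Strict evaluation metric, improving the NER and RE tasks by ∼2% and ∼3%, respectively. The work of Li et al. (2017) is similar to Miwa and Bansal (2016) and relies strongly on dependency parsers to extract syntactic information. A possible explanation for the better results is that the pre-computed syntactic information obtained with external tools is either not so accurate or not so important for biomedical data.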

5.2 Analysis of feature contribution （消融实验）

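An ablation test is conducted on the ACE04 dataset, reported in Table 2, to analyze the effectiveness of the individual parts of the joint model. When the label embeddings layer is removed and only the LSTM hidden states are used as input for the RE task, the performance on the RE task decreases (by ∼1% in F1 score). This shows that the NER labels, as expected, provide meaningful information for the RE component.

Removing the character embeddings also degrades the performance of both the NER (∼1%) and the RE (∼2%) tasks by a relatively large margin. This illustrates that composing words from character representations is effective, and that the method benefits from additional information such as capital letters, suffixes and prefixes within a token (i.e., its character sequence).

Finally, the NER task is evaluated with the CRF loss layer removed and replaced by a softmax. Assuming an independent distribution of tags (i.e., softmax) leads to a slight decrease in the F1 performance of the NER module and a ∼2% decrease in the performance of the RE task. This happens because the CRF loss captures the strong tag dependencies present in the dataset (e.g., I-LOC cannot follow B-PER), instead of assuming that the tag decision for each token is independent of the tag decisions for neighboring tokens.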

6、Conclusion

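This work presents a joint neural model that simultaneously extracts entities and relations from textual data. The model comprises a CRF layer for the entity recognition task and a sigmoid layer for the relation extraction task. Specifically, relation extraction is modeled as a multi-head selection problem, since one entity can have multiple relations. Previous models for this task rely heavily on external NLP tools (i.e., POS taggers, dependency parsers), so their performance is affected by the accuracy of the extracted features. Unlike previous studies, the model produces features automatically rather than relying on hand-crafted features or existing NLP tools. Given its independence from such NLP or other feature-generating tools, the approach can be easily adapted to any language and context. The effectiveness of the approach is demonstrated through a large-scale experimental study. The model outperforms neural methods that generate features automatically, while its results are marginally similar to (and sometimes better than) those of feature-based neural network approaches.

As future work, the authors aim to explore the effectiveness of entity pre-training for the entity recognition module. This approach has been shown to be beneficial for both the entity and the relation extraction modules in the work of Miwa and Bansal (2016). In addition, they plan to explore a way to reduce the computation in the quadratic relation scoring layer. For instance, a straightforward approach is to use in the sigmoid layer only the tokens that have been identified as entities.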

Appendix A

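This section reports additional results for the multi-head selection framework. Specifically, it (i) compares the model with the model of Lample et al. (2016) (i.e., optimized only on the NER task), (ii) explores several hyperparameters of the network (e.g., dropout, LSTM size, character embedding size), and (iii) reports F1 scores using word embeddings different from those used in previous works.

Table 1 of the main paper focuses on comparing the model with other joint models that solve the two tasks (i.e., NER and relation extraction) simultaneously, mainly to demonstrate the benefit of phrasing relation extraction as a multi-head selection problem (able to extract multiple relations at once). Table A.1 evaluates the performance of the first module of the joint multi-head model: the NER component is compared with the state-of-the-art NER model of Lample et al. (2016). The results indicate a marginal performance improvement over Lample's NER baseline in 3 out of 4 datasets. The improvement of the NER part is not substantial, since (i) the NER part is almost identical to Lample's, and (ii) recent advances in NER performance among neural systems are relatively small (improvements in the order of 0.1 F1 points; for instance, the contributions of Ma and Hovy (2016) and Lample et al. (2016) on the CoNLL-2003 test set are 0.01% and 0.17% F1 points, respectively). This slight improvement suggests that the interaction of the two components through the shared underlying LSTM layer is indeed beneficial (e.g., identifying a Works for relation might help the NER module detect the types of the two entities, i.e., PER and ORG, and vice versa). Note that improving NER in isolation is not the objective of the multi-head model; the aim is to compare the model against other joint models that solve entity recognition and relation identification simultaneously. The authors therefore do not claim or aim to achieve state-of-the-art performance in each individual building block of the joint model.

Tables A.2-A.4 show the performance of the model on the test set for different values of the embedding dropout, LSTM layer dropout and LSTM output dropout hyperparameters, respectively. Note that the hyperparameter values used for the results in Section 5 were obtained by tuning on the development set, and these values are indicated in bold in the tables. One hyperparameter is varied at a time to assess the effect of that particular hyperparameter. The main outcomes from these tables are twofold: (i) low dropout values (e.g., 0, 0.1) lead to a performance decrease in the overall F1 score (see Table A.3, where a ∼3% F1 decrease is reported on the ACE04 dataset), and (ii) average dropout values (i.e., 0.2 to 0.4) lead to consistently similar results.

Tables A.5-A.8 report results for different values of the LSTM size, the character embedding size, the label embedding size and the layer width l of the neural network (for both the entity and the relation scoring layers), respectively. The reported results show that different hyperparameter settings do lead to noticeable performance differences, but no clear trend is observed. Moreover, no significant performance improvement is observed that would affect the overall ranking of the models reported in Table 1. On the other hand, the results indicate that increasing the (character and label) embedding size and the layer dimensions leads to a slight decrease in performance on the CoNLL04 dataset. This can be explained by the fact that the CoNLL04 dataset is relatively small, and using more trainable parameters (i.e., larger hyperparameter values) can make the multi-head selection method overfit quickly on the training set. In almost all other cases, varying the hyperparameters does not affect the ranking of the models reported in Table 1.

In the main results (see Section 5), embeddings that were also used in previous studies were adopted, to guarantee a fair comparison with previous work and to obtain comparable results that are not affected by the input embeddings. To assess the performance of the system under variations of the input, results are also reported on the ACE04 dataset with different word embeddings (see Table A.9), namely those of Adel and Schütze (2017) and Li et al. (2017). The results show that, even with different word embeddings, the model still performs better than other works that, like it, do not rely on additional NLP tools.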