CHAPTER 17 Information Extraction
Speech and Language Processing ed3 reading notes
This chapter presents techniques for extracting limited kinds of semantic content from text. This process of information extraction (IE) turns the unstructured information embedded in texts into structured data, for example for populating a relational database to enable further processing.
We begin with the first step in most IE tasks, finding the proper names or named entities in a text. The task of named entity recognition (NER) is to find each mention of a named entity in the text and label its type. Once all the named entities in a text have been extracted, they can be linked together in sets corresponding to real-world entities, inferring which mentions refer to the same entity. This is the joint task of coreference resolution and entity linking, which we defer until Chapter 20.
Next, we turn to the task of relation extraction: finding and classifying semantic relations among the text entities. These are often binary relations like child-of, employment, part-whole, and geospatial relations. Relation extraction has close links to populating a relational database.
Finally, we discuss three tasks related to events. Event extraction is finding events in which these entities participate. Event coreference (Chapter 20) is needed to figure out which event mentions in a text refer to the same event.
To figure out when the events in a text happened, we extract temporal expressions like days of the week (Friday and Thursday), relative expressions like two days from now or next year, and times such as 3:30 P.M. These expressions must be normalized onto specific calendar dates or times of day to situate events in time.
Finally, many texts describe recurring stereotypical events or situations. The task of template filling is to find such situations in documents and fill in the template slots. These slot-fillers may consist of text segments extracted directly from the text, or concepts like times, amounts, or ontology entities that have been inferred from text elements through additional processing.
17.1 Named Entity Recognition
The first step in information extraction is to detect the entities in the text. A named entity is, roughly speaking, anything that can be referred to with a proper name: a person, a location, an organization. The term is commonly extended to include things that aren't entities per se, including dates, times, and other kinds of temporal expressions, and even numerical expressions like prices. Here's the sample text introduced earlier with the named entities marked:
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
The text contains 13 mentions of named entities including 5 organizations, 4 locations, 2 times, 1 person, and 1 mention of money.
In addition to their use in extracting events and the relationship between participants, named entities are useful for many other language processing tasks.
Figure 17.1 shows typical generic named entity types. Many applications will also need to use specific entity types like proteins, genes, commercial products, or works of art.
[Figure 17.1: typical generic named entity types]
Named entity recognition means finding spans of text that constitute proper names and then classifying the type of the entity. Recognition is difficult partly because of the ambiguity of segmentation; we need to decide what’s an entity and what isn’t, and where the boundaries are. Another difficulty is caused by type ambiguity. The mention JFK can refer to a person, the airport in New York, or any number of schools, bridges, and streets around the United States. Some examples of this kind of cross-type confusion are given in Figures 17.2 and 17.3.
[Figure 17.2: examples of cross-type confusion in named entity mentions]
[Figure 17.3: examples of cross-type confusion in named entity mentions]
17.1.1 NER as Sequence Labeling
The standard algorithm for named entity recognition treats it as a word-by-word sequence labeling task, in which the assigned tags capture both the boundary and the type. A sequence classifier like an MEMM/CRF or a bi-LSTM is trained to label the tokens in a text with tags that indicate the presence of particular kinds of named entities. Consider the following simplified excerpt from our running example.
[ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
Figure 17.4 shows the same excerpt represented with IOB tagging. In IOB tagging we introduce a tag for the beginning (B) and inside (I) of each entity type, and one for tokens outside (O) any entity. This gives 2n + 1 tags, where n is the number of entity types. IOB tagging can represent exactly the same information as the bracketed notation.
[Figure 17.4: the excerpt with IOB and IO tags]
We've also shown IO tagging, which loses some information by eliminating the B tag. Without the B tag, IO tagging is unable to distinguish between two entities of the same type that are right next to each other. Since this situation doesn't arise very often (usually there is at least some punctuation or other delimiter), IO tagging may be sufficient, and has the advantage of using only n + 1 tags.
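To make the two encodings concrete, here is a small sketch (the function name and span format are my own, not from the chapter) that converts bracketed entity spans into IOB or IO tag sequences:

```python
def spans_to_tags(tokens, spans, scheme="IOB"):
    """Convert entity spans to IOB (or IO) tags.

    tokens: list of token strings
    spans:  list of (start, end, type) with end exclusive, e.g. (0, 2, "ORG")
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        for i in range(start, end):
            if scheme == "IOB" and i == start:
                tags[i] = "B-" + etype
            else:
                tags[i] = "I-" + etype
    return tags

tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", "Corp.", ",",
          "immediately", "matched", "the", "move", ",", "spokesman",
          "Tim", "Wagner", "said", "."]
spans = [(0, 2, "ORG"), (6, 8, "ORG"), (15, 17, "PER")]
print(spans_to_tags(tokens, spans))          # B-ORG I-ORG O ... B-PER I-PER ...
print(spans_to_tags(tokens, spans, "IO"))    # I-ORG I-ORG O ... I-PER I-PER ...
```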
In the following three sections we introduce the three standard families of algorithms for NER tagging: feature based (MEMM/CRF), neural (bi-LSTM), and rule-based.
17.1.2 A feature-based algorithm for NER
The first approach is to extract features and train an MEMM or CRF sequence model of the type we saw for part-of-speech tagging in Chapter 8. Figure 17.5 lists standard features used in such feature-based systems. We’ve seen many of these features before in the context of part-of-speech tagging, particularly for tagging unknown words. This is not surprising, as many unknown words are in fact named entities. Word shape features are thus particularly important in the context of NER. Recall that word shape features are used to represent the abstract letter pattern of the word by mapping lower-case letters to ‘x’, upper-case to ‘X’, numbers to ’d’, and retaining punctuation. Thus for example I.M.F would map to X.X.X. and DC10-30 would map to XXdd-dd. A second class of shorter word shape features is also used. In these features consecutive character types are removed, so DC10-30 would be mapped to Xd-d but I.M.F would still map to X.X.X. This feature by itself accounts for a considerable part of the success of feature-based NER systems for English news text. Shape features are also particularly important in recognizing names of proteins and genes in biological texts.
[Figure 17.5: typical features for a feature-based NER system]
For example the named entity token L’Occitane would generate the following non-zero valued feature values:
prefix(wi) = L suffix(wi) = tane
prefix(wi) = L’ suffix(wi) = ane
prefix(wi) = L’O suffix(wi) = ne
prefix(wi) = L’Oc suffix(wi) = e
word-shape(wi) = X’Xxxxxxxx short-word-shape(wi) = X’Xx
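Word shape and short word shape are easy to compute; the following sketch (the function names are my own) reproduces the examples above:

```python
import re

def word_shape(word):
    """Map letters/digits to X/x/d and keep punctuation: DC10-30 -> XXdd-dd."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)

def short_word_shape(word):
    """Collapse runs of the same shape character: DC10-30 -> Xd-d."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))

for w in ["I.M.F", "DC10-30", "L'Occitane"]:
    print(w, word_shape(w), short_word_shape(w))
# I.M.F       X.X.X       X.X.X
# DC10-30     XXdd-dd     Xd-d
# L'Occitane  X'Xxxxxxxx  X'Xx
```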
A gazetteer is a list of place names, often providing millions of entries for locations with detailed geographical and political information. A related resource is name lists; the United States Census Bureau provides extensive lists of first names and surnames derived from its decennial census in the U.S. Similar lists of corporations, commercial products, and all manner of things biological and mineral are also available from a variety of sources. Gazetteer and name features are typically implemented as a binary feature for each name list. Unfortunately, such lists can be difficult to create and maintain, and their usefulness varies considerably. While gazetteers can be quite effective, lists of persons and organizations are not always helpful (Mikheev et al., 1999).
Feature effectiveness depends on the application, genre, media, and language. For example, shape features, critical for English newswire texts, are of little use with automatic speech recognition transcripts, or other non-edited or informally-edited sources, or for languages like Chinese that don’t use orthographic case. The features in Fig. 17.5 should therefore be thought of as only a starting point.
Figure 17.6 illustrates the result of adding part-of-speech tags, syntactic base-phrase chunk tags, and some shape information to our earlier example.
[Figure 17.6: the example sentence with part-of-speech tags, chunk tags, and shape features added]
Given such a training set, a sequence classifier like an MEMM can be trained to label new sentences. Figure 17.7 illustrates the operation of such a sequence labeler at the point where the token Corp. is next to be labeled. If we assume a context window that includes the two preceding and following words, then the features available to the classifier are those shown in the boxed area.
[Figure 17.7: sequence labeling for NER at the point where the token Corp. is to be labeled]
17.1.3 A neural algorithm for NER
The standard neural algorithm for NER is based on the bi-LSTM introduced in Chapter 9. Recall that in that model, word and character embeddings are computed for input word $w_i$. These are passed through a left-to-right LSTM and a right-to-left LSTM, whose outputs are concatenated (or otherwise combined) to produce a single output layer at position $i$. In the simplest method, this layer can then be passed directly to a softmax that creates a probability distribution over all NER tags, and the most likely tag is chosen as $t_i$.
For named entity tagging this greedy approach to decoding is insufficient, since it doesn’t allow us to impose the strong constraints neighboring tokens have on each other (e.g., the tag I-PER must follow another I-PER or B-PER). Instead a CRF layer is normally used on top of the bi-LSTM output, and the Viterbi decoding algorithm is used to decode. Fig. 17.8 shows a sketch of the algorithm.
[Figure 17.8: sketch of the bi-LSTM/CRF algorithm for NER]
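For concreteness, here is a minimal PyTorch sketch of the bi-LSTM tagger just described, with independent per-token softmax outputs in place of the CRF layer the text recommends; the dimensions, class name, and toy data are illustrative assumptions, not the book's implementation.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal bi-LSTM NER tagger: embeddings -> bi-LSTM -> per-token logits.
    A CRF layer with Viterbi decoding would normally sit on top of the logits."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):                       # (batch, seq_len)
        h, _ = self.lstm(self.embedding(token_ids))     # (batch, seq_len, 2*hidden)
        return self.out(h)                              # (batch, seq_len, num_tags)

# Greedy decoding over a toy batch; real systems add character embeddings
# and replace the independent argmax with CRF/Viterbi decoding.
model = BiLSTMTagger(vocab_size=10000, num_tags=9)      # e.g. 2*4+1 IOB tags
tokens = torch.randint(0, 10000, (1, 19))
tags = model(tokens).argmax(dim=-1)                     # (1, 19) predicted tag ids
```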
17.1.4 Rule-based NER
While machine-learned (neural or MEMM/CRF) sequence models are the norm in academic research, commercial approaches to NER are often based on pragmatic combinations of lists and rules, with some smaller amount of supervised machine learning (Chiticariu et al., 2013). For example IBM System T is a text understanding architecture in which a user specifies complex declarative constraints for tagging tasks in a formal query language that includes regular expressions, dictionaries, semantic constraints, NLP operators, and table structures, all of which the system compiles into an efficient extractor (Chiticariu et al., 2018).
One common approach is to make repeated rule-based passes over a text, allowing the results of one pass to influence the next. The stages typically first involve the use of rules that have extremely high precision but low recall. Subsequent stages employ more error-prone statistical methods that take the output of the first pass into account.
- First, use high-precision rules to tag unambiguous entity mentions.
- Then, search for substring matches of the previously detected names.
- Consult application-specific name lists to identify likely named entity mentions from the given domain.
- Finally, apply probabilistic sequence labeling techniques that make use of the tags from previous stages as additional features.
The intuition behind this staged approach is twofold. First, some of the entity mentions in a text will be more clearly indicative of a given entity’s class than others. Second, once an unambiguous entity mention is introduced into a text, it is likely that subsequent shortened versions will refer to the same entity (and thus the same type of entity).
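As an illustration of the first two stages, here is a toy sketch with invented rules: a single high-precision suffix rule for organization names, followed by a substring pass that picks up shortened mentions of the names found in stage one. A real system (e.g. SystemT) would use far richer rule languages and dictionaries.

```python
import re

def high_precision_orgs(text):
    """Stage 1: unambiguous ORG mentions via a high-precision suffix rule."""
    pattern = r"\b(?:[A-Z][\w&.']*\s)+(?:Corp\.|Inc\.|Airlines|Co\.)"
    return set(m.group(0) for m in re.finditer(pattern, text))

def substring_mentions(text, full_names):
    """Stage 2: shortened versions of names found in stage 1 (e.g. 'United')."""
    shortened = set()
    for name in full_names:
        head = name.split()[0]
        if re.search(r"\b" + re.escape(head) + r"\b", text):
            shortened.add(head)
    return shortened

text = ("United Airlines said Friday it has increased fares. "
        "American Airlines, a unit of AMR Corp., matched the move. "
        "United said the increase took effect Thursday.")
orgs = high_precision_orgs(text)   # {'United Airlines', 'American Airlines', 'AMR Corp.'}
print(orgs)
print(substring_mentions(text, orgs))   # includes the shortened mention 'United'
```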
17.1.5 Evaluation of Named Entity Recognition
The familiar metrics of recall, precision, and F1 measure are used to evaluate NER systems. Remember that recall is the ratio of the number of correctly labeled responses to the total that should have been labeled; precision is the ratio of the number of correctly labeled responses to the total labeled; and F-measure is the harmonic mean of the two. For named entities, the entity rather than the word is the unit of response. Thus in the example in Fig. 17.6, the two entities Tim Wagner and AMR Corp. and the non-entity said would each count as a single response.
The fact that named entity tagging has a segmentation component which is not present in tasks like text categorization or part-of-speech tagging causes some problems with evaluation. For example, a system that labeled American but not American Airlines as an organization would cause two errors, a false positive for O and a false negative for I-ORG. In addition, using entities as the unit of response but words as the unit of training means that there is a mismatch between the training and test conditions.
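Because the entity is the unit of response, evaluation code typically compares sets of (start, end, type) spans; the following is a minimal sketch of that idea (shared-task scorers differ in details such as partial-match credit).

```python
def entity_prf(gold_spans, pred_spans):
    """Entity-level precision/recall/F1 over exact (start, end, type) matches."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Gold: "American Airlines" (tokens 0-1, ORG); the system found only "American".
gold = [(0, 2, "ORG"), (15, 17, "PER")]
pred = [(0, 1, "ORG"), (15, 17, "PER")]
print(entity_prf(gold, pred))   # (0.5, 0.5, 0.5): the partial match counts as an error
```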
17.2 Relation Extraction
Next on our list of tasks is to discern the relationships that exist among the detected entities. Let’s return to our sample airline text:
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
The text tells us, for example, that Tim Wagner is a spokesman for American Airlines, that United is a unit of UAL Corp., and that American is a unit of AMR. These binary relations are instances of more generic relations such as part-of or employs that are fairly frequent in news-style texts. Figure 17.9 lists the 17 relations used in the ACE relation extraction evaluations and Fig. 17.10 shows some sample relations. We might also extract more domain-specific relations such as the notion of an airline route. For example from this text we can conclude that United has routes to Chicago, Dallas, Denver, and San Francisco.
[Figure 17.9: the 17 relations used in the ACE relation extraction evaluations]
[Figure 17.10: sample ACE relations]
Figure 17.11 shows a model-based view of the set of entities and relations that can be extracted from our running example. Notice how this model-theoretic view subsumes the NER task as well; named entity recognition corresponds to the identification of a class of unary relations.
[Figure 17.11: a model-based view of the entities and relations in the sample text]
Sets of relations have been defined for many other domains as well. For example UMLS, the Unified Medical Language System from the US National Library of Medicine, has a network that defines 134 broad subject categories, entity types, and 54 relations between the entities, such as the following:
| Entity | Relation | Entity |
|---|---|---|
| Injury | disrupts | Physiological Function |
| Bodily Location | location-of | Biologic Function |
| Anatomical Structure | part-of | Organism |
| Pharmacologic Substance | causes | Pathological Function |
| Pharmacologic Substance | treats | Pathologic Function |
Given a medical sentence like this one:
(17.1) Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes
We could thus extract the UMLS relation:
Echocardiography, Doppler Diagnoses Acquired stenosis
Wikipedia also offers a large supply of relations, drawn from infoboxes, structured tables associated with certain Wikipedia articles. For example, the Wikipedia infobox for Stanford includes structured facts like `state = "California"` or `president = "Mark Tessier-Lavigne"`. These facts can be turned into relations like president-of or located-in, or into relations in a metalanguage called RDF (Resource Description Framework). An RDF triple is a tuple of entity-relation-entity, called a subject-predicate-object expression. Here's a sample RDF triple:
| subject | predicate | object |
|---|---|---|
| Golden Gate Park | location | San Francisco |
For example the crowdsourced DBpedia (Bizer et al., 2009) is an ontology derived from Wikipedia containing over 2 billion RDF triples. Another dataset derived from Wikipedia infoboxes, Freebase (Bollacker et al., 2008), has relations like
people/person/nationality
location/location/contains
WordNet or other ontologies offer useful ontological relations that express hierarchical relations between words or concepts. For example WordNet has the is-a or hypernym relation between classes,
Giraffe is-a ruminant is-a ungulate is-a mammal is-a vertebrate …
WordNet also has an Instance-of relation between individuals and classes, so that for example San Francisco is in the Instance-of relation with city. Extracting these relations is an important step in extending or building ontologies.
There are five main classes of algorithms for relation extraction: hand-written patterns, supervised machine learning, semi-supervised (via bootstrapping and via distant supervision), and unsupervised. We’ll introduce each of these in the next sections.
17.2.1 Using Patterns to Extract Relations
The following lexico-syntactic pattern
$$NP_0 \textrm{ such as } NP_1\{NP_2,\ldots, (and|or)\ NP_i\},\ i \ge 1$$
implies the following semantics
$$\forall NP_i, i\ge 1,\ \textrm{hyponym}(NP_i, NP_0)$$
allowing us to infer
$$\textrm{hyponym(Gelidium, red algae)}$$
Figure 17.12 shows five patterns Hearst (1992a, 1998) suggested for inferring the hyponym relation; we've shown $\textrm{NP}_\textrm{H}$ as the parent/hyponym. Modern versions of the pattern-based approach extend it by adding named entity constraints. For example if our goal is to answer questions about "Who holds what office in which organization?", we can use patterns like the following:
PER, POSITION of ORG:
George Marshall, Secretary of State of the United States
PER (named|appointed|chose|etc.) PER Prep? POSITION
Truman appointed Marshall Secretary of State
PER [be]? (named|appointed|etc.) Prep? ORG POSITION
George Marshall was named US Secretary of State
Hand-built patterns have the advantage of high precision, and they can be tailored to specific domains. On the other hand, they often have low recall, and it's a lot of work to create patterns for all the ways a relation might be expressed.
[Figure 17.12: Hearst's patterns for inferring the hyponym relation]
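As a rough illustration of pattern-based extraction, the sketch below applies a simplified version of the "such as" pattern with a regular expression over raw text; the crude NP approximation and the example sentence are invented, and real systems match over parsed NP chunks with named entity constraints.

```python
import re

# A deliberately crude stand-in for an NP: one or two word-like tokens.
NP = r"[A-Za-z][\w-]*(?:\s[A-Za-z][\w-]*)?"

def hearst_such_as(sentence):
    """Extract hyponym(NP_i, NP_0) tuples from 'NP0 such as NP1, NP2 and NP3'."""
    pairs = []
    m = re.search(rf"({NP})\s*,?\s+such as\s+(.+)", sentence)
    if m:
        hypernym, tail = m.group(1), m.group(2)
        for hyponym in re.split(r",\s*|\s+and\s+|\s+or\s+", tail.rstrip(". ")):
            if hyponym:
                pairs.append(("hyponym", hyponym, hypernym))
    return pairs

print(hearst_such_as("United competes against low-cost carriers such as Ryanair and EasyJet"))
# [('hyponym', 'Ryanair', 'low-cost carriers'), ('hyponym', 'EasyJet', 'low-cost carriers')]
```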
17.2.2 Relation Extraction via Supervised Learning
The most straightforward approach has three steps, illustrated in Fig. 17.13.
Step one is to find pairs of named entities (usually in the same sentence).
In step two, a filtering classifier is trained to make a binary decision as to whether a given pair of named entities are related (by any relation). Positive examples are extracted directly from all relations in the annotated corpus, and negative examples are generated from within-sentence entity pairs that are not annotated with a relation.
In step 3, a classifier is trained to assign a label to the relations that were found by step 2.
The use of the filtering classifier can speed up the final classification and also allows the use of distinct feature-sets appropriate for each task. For each of the two classifiers, we can use any of the standard classification techniques (logistic regression, neural network, SVM, etc.)
[Figure 17.13: the three steps of supervised relation extraction]
For the feature-based classifiers like logistic regression or random forests the most important step is to identify useful features. Let’s consider features for classifying the relationship between American Airlines (Mention 1, or M1) and Tim Wagner (Mention 2, M2) from this sentence:
(17.5) American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
Useful word features include

- The headwords of M1 and M2 and their concatenation: Airlines, Wagner, Airlines-Wagner
- Bag-of-words and bigrams in M1 and M2: American, Airlines, Tim, Wagner, American Airlines, Tim Wagner
- Words or bigrams in particular positions: M2: -1 spokesman; M2: +1 said
- Bag of words or bigrams between M1 and M2: a, AMR, of, immediately, matched, move, spokesman, the, unit
- Stemmed versions of the same

Embeddings can be used to represent words in any of these features. Useful named entity features include

- Named-entity types and their concatenation (M1: ORG, M2: PER, M1M2: ORG-PER)
- Entity Level of M1 and M2 (from the set NAME, NOMINAL, PRONOUN): M1: NAME [it or he would be PRONOUN]; M2: NAME [the company would be NOMINAL]
- Number of entities between the arguments (in this case 1, for AMR)
The syntactic structure of a sentence can also signal relationships among its entities. Syntax is often featurized by using strings representing syntactic paths: the (dependency or constituency) path traversed through the tree in getting from one entity to the other.
- Base syntactic chunk sequence from M1 to M2: NP NP PP VP NP NP
- Constituent paths between M1 and M2: $NP \uparrow NP \uparrow S \uparrow S \downarrow NP$
- Dependency-tree paths: $Airlines \leftarrow_{subj} matched \leftarrow_{comp} said \rightarrow_{subj} Wagner$
Figure 17.14 summarizes many of the features we have discussed that could be used for classifying the relationship between American Airlines and Tim Wagner from our example text.
[Figure 17.14: features for classifying the relation between American Airlines and Tim Wagner]
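To make the feature-based classifier concrete, here is a sketch that builds a sparse feature dictionary for an ordered mention pair, of the kind that could be fed to a logistic regression model; the feature names and the pre-tokenized input format are assumptions of mine, not the book's feature set.

```python
def relation_features(tokens, m1, m2):
    """Build a sparse feature dict for an ordered mention pair.

    tokens: list of words in the sentence
    m1, m2: dicts with 'start', 'end' (exclusive), 'head', 'ner' for each mention
    """
    feats = {}
    feats["head_M1=" + m1["head"]] = 1
    feats["head_M2=" + m2["head"]] = 1
    feats["heads=" + m1["head"] + "-" + m2["head"]] = 1
    feats["ner=" + m1["ner"] + "-" + m2["ner"]] = 1
    # Words in particular positions around M2
    if m2["start"] > 0:
        feats["M2_prev=" + tokens[m2["start"] - 1]] = 1
    if m2["end"] < len(tokens):
        feats["M2_next=" + tokens[m2["end"]]] = 1
    # Bag of words between the two mentions
    for w in tokens[m1["end"]:m2["start"]]:
        feats["between=" + w.lower()] = 1
    return feats

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said").split()
m1 = {"start": 0, "end": 2, "head": "Airlines", "ner": "ORG"}
m2 = {"start": 14, "end": 16, "head": "Wagner", "ner": "PER"}
print(sorted(relation_features(tokens, m1, m2)))
```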
Neural models for relation extraction similarly treat the task as supervised classification. One option is to use an architecture similar to the one we saw for named entity tagging: a bi-LSTM model with word embeddings as inputs and a single softmax classification of the sentence output as a 1-of-N relation label. Because relations often hold between entities that are far apart in a sentence (or across sentences), it may be possible to get higher performance from algorithms like convolutional nets (dos Santos et al., 2015) or chain or tree LSTMs (Miwa and Bansal 2016, Peng et al. 2017).
In general, if the test set is similar enough to the training set, and if there is enough hand-labeled data, supervised relation extraction systems can get high accuracies. But labeling a large training set is extremely expensive and supervised models are brittle: they don’t generalize well to different text genres. For this reason, much research in relation extraction has focused on the semi-supervised and unsupervised approaches we turn to next.
17.2.3 Semisupervised Relation Extraction via Bootstrapping
Supervised machine learning assumes that we have lots of labeled data. Unfortunately, this is expensive. But suppose we just have a few high-precision seed patterns, like those in Section 17.2.1, or perhaps a few seed tuples. That’s enough seed tuples to bootstrap a classifier! Bootstrapping proceeds by taking the entities in the seed pair, and then finding sentences (on the web, or whatever dataset we are using) that contain both entities. From all such sentences, we extract and generalize the context around the entities to learn new patterns. Fig. 17.15 sketches a basic algorithm.
[Figure 17.15: a basic bootstrapping algorithm for relation extraction]
Suppose, for example, that we need to create a list of airline/hub pairs, and we know only that Ryanair has a hub at Charleroi. We can use this seed fact to discover new patterns by finding other mentions of this relation in our corpus. We search for the terms Ryanair, Charleroi and hub in some proximity. Perhaps we find the following set of sentences:
(17.6) Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights out of the airport.
(17.7) All flights in and out of Ryanair’s Belgian hub at Charleroi airport were grounded on Friday…
(17.8) A spokesman at Charleroi, a main hub for Ryanair, estimated that 8000 passengers had already been affected.
We extract general patterns such as the following:
/ [ORG], which uses [LOC] as a hub /
/ [ORG]’s hub at [LOC] /
/ [LOC] a main hub for [ORG] /
These new patterns can then be used to search for additional tuples.
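A schematic version of this loop might look like the following, with a deliberately crude pattern generalizer (the context string between the two entities, with the entities replaced by capture slots) and a toy corpus; confidence scoring, discussed next, is omitted here.

```python
import re

def generalize(sentence, x, y):
    """Keep the context between the two entities as a pattern, replacing the
    entities with capture slots (assumes x occurs before y in the sentence)."""
    i, j = sentence.index(x), sentence.index(y)
    middle = re.escape(sentence[i + len(x):j])
    return r"(?P<X>[A-Z][\w-]*)" + middle + r"(?P<Y>[A-Z][\w-]*)"

def bootstrap(seeds, corpus, rounds=2):
    """Alternate between learning patterns from known tuples and extracting
    new tuples with the learned patterns (cf. the sketch in Fig. 17.15)."""
    tuples, patterns = set(seeds), set()
    for _ in range(rounds):
        for sent in corpus:                    # 1. sentences containing both entities
            for x, y in list(tuples):
                if x in sent and y in sent and sent.index(x) < sent.index(y):
                    patterns.add(generalize(sent, x, y))   # 2. generalize the context
        for pat in patterns:                   # 3. harvest new candidate tuples
            for sent in corpus:
                m = re.search(pat, sent)
                if m:
                    tuples.add((m.group("X"), m.group("Y")))
    return tuples, patterns

corpus = ["Ryanair, which uses Charleroi as a hub, scrapped all weekend flights.",
          "Lufthansa, which uses Frankfurt as a hub, added several new routes."]
tuples, _ = bootstrap({("Ryanair", "Charleroi")}, corpus)
print(tuples)   # now also contains ('Lufthansa', 'Frankfurt')
```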
Bootstrapping systems also assign confidence values to new tuples to avoid semantic drift. In semantic drift, an erroneous pattern leads to the introduction of erroneous tuples, which, in turn, lead to the creation of problematic patterns and the meaning of the extracted relations ‘drifts’. Consider the following example:
(17.9) Sydney has a ferry hub at Circular Quay.
If accepted as a positive example, this expression could lead to the incorrect introduction of the tuple $\langle \textrm{Sydney}, \textrm{Circular Quay}\rangle$. Patterns based on this tuple could propagate further errors into the database.
Given a document collection $\mathscr{D}$, a current set of tuples $T$, and a proposed pattern $p$, we need to track two factors:

- hits: the set of tuples in $T$ that $p$ matches while looking in $\mathscr{D}$
- finds: the total set of tuples that $p$ finds in $\mathscr{D}$
The following equation balances these considerations (Riloff and Jones, 1999).
$$Conf_{RlogF}(p) = \frac{hits_p}{finds_p}\times \log(finds_p)$$
This metric is generally normalized to produce a probability.
We can assess the confidence in a proposed new tuple by combining the evidence supporting it from all the patterns $P'$ that match that tuple in $\mathscr{D}$ (Agichtein and Gravano, 2000). One way to combine such evidence is the noisy-or technique. Assume that a given tuple is supported by a subset $P'$ of the patterns in $P$, each with its own confidence assessed as above. In the noisy-or model, we make two basic assumptions. First, that for a proposed tuple to be false, all of its supporting patterns must have been in error, and second, that the sources of their individual failures are all independent. If we loosely treat our confidence measures as probabilities, then the probability of any individual pattern $p$ failing is $1 - Conf(p)$; the probability of all of the supporting patterns for a tuple being wrong is the product of their individual failure probabilities, leaving us with the following equation for our confidence in a new tuple:

$$Conf(t) = 1 - \prod_{p\in P'}\left(1 - Conf(p)\right)$$
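Both confidence measures translate directly into code; a minimal sketch follows (the argument conventions are mine), remembering that the RlogF score would be normalized to a probability before being used in the noisy-or combination, as the text notes.

```python
import math

def pattern_confidence(hits, finds):
    """RlogF pattern confidence: (hits_p / finds_p) * log(finds_p)."""
    return (hits / finds) * math.log(finds) if finds else 0.0

def tuple_confidence(pattern_confs):
    """Noisy-or combination of the confidences of the supporting patterns."""
    prob_all_wrong = 1.0
    for conf in pattern_confs:
        prob_all_wrong *= (1.0 - conf)
    return 1.0 - prob_all_wrong

# A pattern that finds 10 tuples, 8 of which are already known, scores:
print(pattern_confidence(hits=8, finds=10))   # 0.8 * log(10) ~ 1.84 (then normalized)
# A tuple supported by two patterns with normalized confidences 0.7 and 0.5:
print(tuple_confidence([0.7, 0.5]))           # 1 - 0.3*0.5 = 0.85
```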
Setting conservative confidence thresholds for the acceptance of new patterns and tuples during the bootstrapping process helps prevent the system from drifting away from the targeted relation.
17.2.4 Distant Supervision for Relation Extraction
See also: https://blog.csdn.net/m0_38031488/article/details/79852238
The distant supervision method of Mintz et al. (2009) combines the advantages of bootstrapping with supervised learning. Instead of just a handful of seeds, distant supervision uses a large database (DBPedia or Freebase) to acquire a huge number of seed examples, creates lots of noisy pattern features from all these examples and then combines them in a supervised classifier.
For example suppose we are trying to learn the place-of-birth relationship between people and their birth cities. In the seed-based approach, we might have only 5 examples to start with. But Wikipedia-based databases like DBPedia or Freebase have tens of thousands of examples of many relations, including over 100,000 examples of place-of-birth (<Edwin Hubble, Marshfield>, <Albert Einstein, Ulm>, etc.). The next step is to run named entity taggers on large amounts of text—Mintz et al. (2009) used 800,000 articles from Wikipedia—and extract all sentences that have two named entities that match the tuple, like the following:
…Hubble was born in Marshfield…
…Einstein, born (1879), Ulm…
…Hubble’s birthplace in Marshfield…
Training instances can now be extracted from this data, one training instance for each identical tuple <relation, entity1, entity2>. Thus there will be one training instance for each of:
<born-in, Edwin Hubble, Marshfield>
<born-in, Albert Einstein, Ulm>
<born-year, Albert Einstein, 1879>
and so on.
We can then apply feature-based or neural classification. For feature-based classification, we can use standard supervised relation extraction features like the named entity labels of the two mentions, the words and dependency paths in between the mentions, and neighboring words. Each tuple will have features collected from many training instances; the feature vector for a single training instance like <born-in, Albert Einstein, Ulm> will have lexical and syntactic features from many different sentences that mention Einstein and Ulm.
Because distant supervision has very large training sets, it is also able to use very rich features that are conjunctions of these individual features. So we will extract thousands of patterns that conjoin the entity types with the intervening words or dependency paths like these:
PER was born in LOC
PER, born (XXXX), LOC
PER’s birthplace in LOC
To return to our running example, for this sentence:
(17.12) American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said

we would learn rich conjunction features like this one:

M1 = ORG & M2 = PER & nextword="said" & path = $NP \uparrow NP \uparrow S \uparrow S \downarrow NP$
The result is a supervised classifier that has a huge rich set of features to use in detecting relations. Since not every test sentence will have one of the training relations, the classifier will also need to be able to label an example as no-relation. This label is trained by randomly selecting entity pairs that do not appear in any Freebase relation, extracting features for them, and building a feature vector for each such tuple. The final algorithm is sketched in Fig. 17.16.
[Figure 17.16: the distant supervision algorithm for relation extraction]
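A sketch of the training-data construction step: match knowledge-base tuples against (already NER-tagged) sentences and pool simple features for each tuple. The token-list input and the hard-coded PER/LOC placeholders are simplifications for illustration, not the pipeline of Fig. 17.16.

```python
from collections import defaultdict

def distant_supervision_features(kb, sentences):
    """For each KB tuple <rel, e1, e2>, pool features from every sentence
    that mentions both entities."""
    features = defaultdict(list)
    for rel, e1, e2 in kb:
        for tokens in sentences:
            if e1 in tokens and e2 in tokens:
                i, j = tokens.index(e1), tokens.index(e2)
                lo, hi = min(i, j), max(i, j)
                between = " ".join(tokens[lo + 1:hi])
                features[(rel, e1, e2)].append("PER " + between + " LOC")
    return features

kb = [("born-in", "Hubble", "Marshfield"), ("born-in", "Einstein", "Ulm")]
sentences = ["Hubble was born in Marshfield".split(),
             "Einstein , born ( 1879 ) , Ulm".split(),
             "Hubble 's birthplace in Marshfield".split()]
for tup, feats in distant_supervision_features(kb, sentences).items():
    print(tup, feats)
# ('born-in', 'Hubble', 'Marshfield') ["PER was born in LOC", "PER 's birthplace in LOC"]
# ('born-in', 'Einstein', 'Ulm') ['PER , born ( 1879 ) , LOC']
```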
Distant supervision shares advantages with each of the methods we've examined. Like supervised classification, distant supervision uses a classifier with lots of features, supervised by detailed hand-created knowledge. Like pattern-based classifiers, it can make use of high-precision evidence for the relation between entities. Indeed, distant supervision systems learn patterns just like the hand-built patterns of early relation extractors. For example the is-a or hypernym extraction system of Snow et al. (2005) used hypernym/hyponym NP pairs from WordNet as distant supervision, and then learned new patterns from large amounts of text. Their system induced exactly the original 5 template patterns of Hearst (1992a), but also 70,000 additional patterns including these four:
| Pattern | Example |
|---|---|
| NP$_H$ like NP | Many hormones like leptin... |
| NP$_H$ called NP | ...using a markup language called XHTML |
| NP is a NP$_H$ | Ruby is a programming language... |
| NP, a NP$_H$ | IBM, a company with a long... |
This ability to use a large number of features simultaneously means that, unlike the iterative expansion of patterns in seed-based systems, there’s no semantic drift. Like unsupervised classification, it doesn’t use a labeled training corpus of texts, so it isn’t sensitive to genre issues in the training corpus, and relies on very large amounts of unlabeled data. Distant supervision also has the advantage that it can create training tuples to be used with neural classifiers, where features are not required.
But distant supervision can only help in extracting relations for which a large enough database already exists. To extract new relations without datasets, or relations for new domains, purely unsupervised methods must be used.
17.2.5 Unsupervised Relation Extraction
The goal of unsupervised relation extraction is to extract relations from the web when we have no labeled training data, and not even any list of relations. This task is often called open information extraction or Open IE. In Open IE, the relations are simply strings of words (usually beginning with a verb).
For example, the ReVerb system (Fader et al., 2011) extracts a relation from a sentence s in 4 steps:
- Run a part-of-speech tagger and entity chunker over s
- For each verb in s, find the longest sequence of words w that start with a verb and satisfy syntactic and lexical constraints, merging adjacent matches.
- For each phrase w, find the nearest noun phrase x to the left which is not a relative pronoun, wh-word or existential “there”. Find the nearest noun phrase y to the right.
- Assign confidence $c$ to the relation $r = (x, w, y)$ using a confidence classifier and return it.
A relation is only accepted if it meets syntactic and lexical constraints. The syntactic constraints ensure that it is a verb-initial sequence that might also include nouns (relations that begin with light verbs like make, have, or do often express the core of the relation with a noun, like have a hub in):
V | VP | VW*P
V = verb particle? adv?
W = (noun | adj | adv | pron | det )
P = (prep | particle | inf. marker)
The lexical constraints are based on a dictionary D that is used to prune very rare, long relation strings. The intuition is to eliminate candidate relations that don't occur with a sufficient number of distinct argument types and so are likely to be bad examples. The system first runs the above relation extraction algorithm offline on 500 million web sentences and extracts a list of all the relations that occur after normalizing them (removing inflection, auxiliary verbs, adjectives, and adverbs). Each relation r is added to the dictionary if it occurs with at least 20 different arguments. Fader et al. (2011) used a dictionary of 1.7 million normalized relations.
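The syntactic constraint can be approximated by matching a regular expression over coarse part-of-speech tags; this toy sketch (the tag inventory and mapping are mine, not ReVerb's actual implementation) finds verb-initial spans satisfying V | VP | VW*P.

```python
import re

# Map each coarse POS tag to a single character so that the V | VP | VW*P
# constraint becomes an ordinary regex: V=verb, W=noun/adj/adv/pron/det, P=prep.
TAG_TO_CHAR = {"VERB": "V", "NOUN": "W", "ADJ": "W", "ADV": "W",
               "PRON": "W", "DET": "W", "ADP": "P", "PRT": "P"}
RELATION_RE = re.compile(r"V(?:W*P)?")   # V alone, or V followed by W* and a P

def relation_phrases(tokens, tags):
    """Return verb-initial token spans satisfying the syntactic constraint."""
    chars = "".join(TAG_TO_CHAR.get(t, "-") for t in tags)
    return [" ".join(tokens[m.start():m.end()])
            for m in RELATION_RE.finditer(chars)]

tokens = "United has a hub in Chicago".split()
tags = ["NOUN", "VERB", "DET", "NOUN", "ADP", "NOUN"]
print(relation_phrases(tokens, tags))   # ['has a hub in']
```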
Finally, a confidence value is computed for each relation using a logistic regression classifier. The classifier is trained by taking 1000 random web sentences, running the extractor, and hand labelling each extracted relation as correct or incorrect. A confidence classifier is then trained on this hand-labeled data, using features of the relation and the surrounding words. Fig. 17.17 shows some sample features used in the classification.
- (x, r, y) covers all words in s
- the last preposition in r is for
- the last preposition in r is on
- len(s) $\ge$ 10
- there is a coordinating conjunction to the left of r in s
- r matches a lone V in the syntactic constraints
- there is a preposition to the left of x in s
- there is an NP to the right of y in s

Figure 17.17 Features for the classifier that assigns confidence to relations extracted by the Open Information Extraction system REVERB (Fader et al., 2011).
For example the following sentence:
(17.13) United has a hub in Chicago, which is the headquarters of United Continental Holdings.
has the relation phrases has a hub in and is the headquarters of (it also has has and is, but longer phrases are preferred). Step 3 finds United to the left and Chicago to the right of has a hub in, and skips over which to find Chicago to the left of is the headquarters of. The final output is:
`r1: <United, has a hub in, Chicago>`
`r2: <Chicago, is the headquarters of, United Continental Holdings>`
The great advantage of unsupervised relation extraction is its ability to handle a huge number of relations without having to specify them in advance. The disadvantage is the need to map these large sets of strings into some canonical form for adding to databases or other knowledge sources. Current methods focus heavily on relations expressed with verbs, and so will miss many relations that are expressed nominally.
17.2.6 Evaluation of Relation Extraction
Supervised relation extraction systems are evaluated by using test sets with human-annotated, gold-standard relations and computing precision, recall, and F-measure. Labeled precision and recall require the system to classify the relation correctly, whereas unlabeled methods simply measure a system’s ability to detect entities that are related.
Semi-supervised and unsupervised methods are much more difficult to evaluate, since they extract totally new relations from the web or a large text. Because these methods use very large amounts of text, it is generally not possible to run them solely on a small labeled test set, and as a result it’s not possible to pre-annotate a gold set of correct instances of relations.
For these methods it’s possible to approximate (only) precision by drawing a random sample of relations from the output, and having a human check the accuracy of each of these relations. Usually this approach focuses on the tuples to be extracted from a body of text rather than on the relation mentions; systems need not detect every mention of a relation to be scored correctly. Instead, the evaluation is based on the set of tuples occupying the database when the system is finished. That is, we want to know if the system can discover that Ryanair has a hub at Charleroi; we don’t really care how many times it discovers it. The estimated precision
$\hat P$ is then

$$\hat P = \frac{\sharp\textrm{ of correctly extracted relation tuples in the sample}}{\textrm{total }\sharp\textrm{ of extracted relation tuples in the sample}}$$
Another approach that gives us a little bit of information about recall is to compute precision at different levels of recall. Assuming that our system is able to rank the relations it produces (by probability, or confidence), we can separately compute precision for the top 1000 new relations, the top 10,000 new relations, the top 100,000, and so on. In each case we take a random sample of that set. This will show us how the precision curve behaves as we extract more and more tuples. But there is no way to directly evaluate recall.
17.3 Extracting Times
17.3.1 Temporal Expression Extraction
[Figure 17.18]
Temporal expressions are grammatical constructions that have temporal lexical triggers as their heads. Lexical triggers might be nouns, proper nouns, adjectives, and adverbs; full temporal expressions consist of their phrasal projections: noun phrases, adjective phrases, and adverbial phrases. Figure 17.19 provides examples.
[Figure 17.19: examples of temporal expressions and their lexical triggers]
Let’s look at the TimeML annotation scheme, in which temporal expressions are annotated with an XML tag, TIMEX3, and various attributes to that tag (Pustejovsky et al. 2005, Ferro et al. 2005). The following example illustrates the basic use of this scheme (we defer discussion of the attributes until Section 17.3.2).
A fare increase initiated <TIMEX3>last week</TIMEX3> by UAL Corp's United Airlines was matched by competitors over <TIMEX3>the weekend</TIMEX3>, marking the second successful fare increase in <TIMEX3>two weeks</TIMEX3>.
The temporal expression recognition task consists of finding the start and end of all of the text spans that correspond to such temporal expressions. Rule-based approaches to temporal expression recognition use cascades of automata to recognize patterns at increasing levels of complexity. Tokens are first part-of-speech tagged, and then larger and larger chunks are recognized from the results from previous stages, based on patterns containing trigger words (e.g., February) or classes (e.g., MONTH). Figure 17.20 gives a fragment from a rule-based system.
[Figure 17.20: fragment of a rule-based temporal expression recognizer]
Sequence-labeling approaches follow the same IOB scheme used for named entity tags, marking words that are either inside, outside or at the beginning of a TIMEX3-delimited temporal expression with the I, O, and B tags as follows:
A fare increase initiated last week by UAL Corp’s…
O O O O B I O O O
Features are extracted from the token and its context, and a statistical sequence labeler is trained (any sequence model can be used). Figure 17.21 lists standard features used in temporal tagging.
[Figure 17.21: standard features used in temporal expression tagging]
Temporal expression recognizers are evaluated with the usual recall, precision, and F-measures. A major difficulty for all of these very lexicalized approaches is avoiding expressions that trigger false positives:
(17.15) 1984 tells the story of Winston Smith…
(17.16) …U2’s classic Sunday Bloody Sunday
17.3.2 Temporal Normalization
Normalized times are represented with the VALUE attribute from the ISO 8601 standard for encoding temporal values (ISO8601, 2004). Fig. 17.22 reproduces our earlier example with the value attributes added in.
[Figure 17.22: the earlier example with TIMEX3 VALUE attributes added]
Figure 17.23 describes some of the basic ways that other times and durations are represented. Consult ISO8601 (2004), Ferro et al. (2005), and Pustejovsky et al. (2005) for more details.
[Figure 17.23: basic ways that other times and durations are represented]
Most current approaches to temporal normalization are rule-based (Chang and Manning 2012, Strotgen and Gertz 2013). Patterns that match temporal expressions are associated with semantic analysis procedures. As in the compositional rule-to-rule approach introduced in Chapter 15, the meaning of a constituent is computed from the meaning of its parts using a method specific to the constituent, although here the semantic composition rules involve temporal arithmetic rather than $\lambda$-calculus attachments.
Fully qualified date expressions contain a year, month, and day in some conventional form. The units in the expression must be detected and then placed in the correct place in the corresponding ISO pattern. The following pattern normalizes expressions like April 24, 1916.
$$FQTE \to Month\ Date,\ Year \qquad \{Year.val\textrm{-}Month.val\textrm{-}Date.val\}$$
The non-terminals Month, Date, and Year represent constituents that have already been recognized and assigned semantic values, accessed through the *.val notation. The value of this FQTE constituent can, in turn, be accessed as FQTE.val during further processing.
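A sketch of this rule in code, using a regular expression in place of the grammar rule; the month list, output format, and function name are illustrative choices of mine.

```python
import re

MONTHS = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5,
          "June": 6, "July": 7, "August": 8, "September": 9, "October": 10,
          "November": 11, "December": 12}

def normalize_fqte(text):
    """Normalize expressions like 'April 24, 1916' to the ISO value 1916-04-24."""
    m = re.search(r"(January|February|March|April|May|June|July|August|"
                  r"September|October|November|December)\s+(\d{1,2}),\s*(\d{4})", text)
    if not m:
        return None
    month, day, year = MONTHS[m.group(1)], int(m.group(2)), int(m.group(3))
    return f"{year:04d}-{month:02d}-{day:02d}"

print(normalize_fqte("The battle began on April 24, 1916."))   # 1916-04-24
```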
Fully qualified temporal expressions are fairly rare in real texts. Most temporal expressions in news articles are incomplete and are only implicitly anchored, often with respect to the dateline of the article, which we refer to as the document’s temporal anchor. The values of temporal expressions such as today, yesterday, or tomorrow can all be computed with respect to this temporal anchor. The semantic procedure for today simply assigns the anchor, and the attachments for tomorrow and yesterday add a day and subtract a day from the anchor, respectively. Of course, given the cyclic nature of our representations for months, weeks, days, and times of day, our temporal arithmetic procedures must use modulo arithmetic appropriate to the time unit being used.
Unfortunately, even simple expressions such as the weekend or Wednesday introduce a fair amount of complexity. In our current example, the weekend clearly refers to the weekend of the week that immediately precedes the document date. But this won’t always be the case, as is illustrated in the following example.
(17.17) Random security checks that began yesterday at Sky Harbor will continue at least through the weekend.
In this case, the expression the weekend refers to the weekend of the week that the anchoring date is part of (i.e., the coming weekend). The information that signals this meaning comes from the tense of continue, the verb governing the weekend.
Relative temporal expressions are handled with temporal arithmetic similar to that used for today and yesterday. The document date indicates that our example article is ISO week 27, so the expression last week normalizes to the current week minus 1. To resolve ambiguous next and last expressions we consider the distance from the anchoring date to the nearest unit. Next Friday can refer either to the immediately next Friday or to the Friday following that, but the closer the document date is to a Friday, the more likely it is that the phrase will skip the nearest one. Such ambiguities are handled by encoding language and domain-specific heuristics into the temporal attachments.
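A sketch of anchor-based normalization for a few relative expressions, using Python's datetime arithmetic; the document date below is simply chosen to fall in ISO week 27 as in the example above, and the heuristics for ambiguous next/last expressions are not implemented.

```python
from datetime import date, timedelta

def normalize_relative(expr, anchor):
    """Resolve a few relative temporal expressions against a document anchor date."""
    expr = expr.lower()
    if expr == "today":
        return anchor.isoformat()
    if expr == "yesterday":
        return (anchor - timedelta(days=1)).isoformat()
    if expr == "tomorrow":
        return (anchor + timedelta(days=1)).isoformat()
    if expr == "last week":
        year, week, _ = (anchor - timedelta(weeks=1)).isocalendar()
        return f"{year:04d}-W{week:02d}"          # ISO week value, e.g. 1998-W26
    return None

anchor = date(1998, 7, 2)                          # an illustrative document date, ISO week 27
for e in ["today", "yesterday", "tomorrow", "last week"]:
    print(e, "->", normalize_relative(e, anchor))
```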
17.4 Extracting Events and their Times
The task of event extraction is to identify mentions of events in texts. For the purposes of this task, an event mention is any expression denoting an event or state that can be assigned to a particular point, or interval, in time. The following markup of our sample airline text shows all the events in this text.
[EVENT Citing] high fuel prices, United Airlines [EVENT said] Friday it has [EVENT increased] fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately [EVENT matched] [EVENT the move], spokesman Tim Wagner [EVENT said]. United, a unit of UAL Corp., [EVENT said] [EVENT the increase] took effect Thursday and [EVENT applies] to most routes where it [EVENT competes] against discount carriers, such as Chicago to Dallas and Denver to San Francisco.
In English, most event mentions correspond to verbs, and most verbs introduce events. However, as we can see from our example, this is not always the case. Events can be introduced by noun phrases, as in the move and the increase, and some verbs fail to introduce events, as in the phrasal verb took effect, which refers to when the event began rather than to the event itself. Similarly, light verbs such as make, take, and have often fail to denote events; for light verbs the event is often expressed by the nominal direct object (took a flight), and these light verbs just provide a syntactic structure for the noun's arguments.
Various versions of the event extraction task exist, depending on the goal. For example in the TempEval shared tasks (Verhagen et al. 2009) the goal is to extract events and aspects like their aspectual and temporal properties. Events are to be classified as actions, states, reporting events (say, report, tell, explain), perception events, and so on. The aspect, tense, and modality of each event also needs to be extracted. Thus for example the various said events in the sample text would be annotated as (class=REPORTING, tense=PAST, aspect=PERFECTIVE).
Event extraction is generally modeled via supervised learning, detecting events via sequence models with IOB tagging, and assigning event classes and attributes with multi-class classifiers. Common features include surface information like parts of speech, lexical items, and verb tense information; see Fig. 17.24.
[Figure 17.24: features commonly used in event detection]
17.4.1 Temporal Ordering of Events
With both the events and the temporal expressions in a text having been detected, the next logical task is to use this information to fit the events into a complete timeline. Such a timeline would be useful for applications such as question answering and summarization. This ambitious task is the subject of considerable current research but is beyond the capabilities of current systems.
A somewhat simpler, but still useful, task is to impose a partial ordering on the events and temporal expressions mentioned in a text. Such an ordering can provide many of the same benefits as a true timeline. An example of such a partial ordering is the determination that the fare increase by American Airlines came after the fare increase by United in our sample text. Determining such an ordering can be viewed as a binary relation detection and classification task similar to those described earlier in Section 17.2. The temporal relation between events is classified into one of the standard set of Allen relations shown in Fig. 17.25 (Allen, 1984), using feature-based classifiers as in Section 17.2, trained on the TimeBank corpus with features like words/embeddings, parse paths, tense and aspect.
[Figure 17.25: the set of Allen temporal relations]
The TimeBank corpus consists of text annotated with much of the information we’ve been discussing throughout this section (Pustejovsky et al., 2003b). TimeBank 1.2 consists of 183 news articles selected from a variety of sources, including the Penn TreeBank and PropBank collections. Each article in the TimeBank corpus has had the temporal expressions and event mentions in them explicitly annotated in the TimeML annotation (Pustejovsky et al., 2003a). In addition to temporal expressions and events, the TimeML annotation provides temporal links between events and temporal expressions that specify the nature of the relation between them. Consider the following sample sentence and its corresponding markup shown in Fig. 17.26, selected from one of the TimeBank documents.
(17.18) Delta Air Lines earnings soared 33% to a record in the fiscal first quarter, bucking the industry trend toward declining profits.
[Figure 17.26: TimeML markup for example (17.18)]
As annotated, this text includes three events and two temporal expressions. The events are all in the occurrence class and are given unique identifiers for use in further annotations. The temporal expressions include the creation time of the article, which serves as the document time, and a single temporal expression within the text.
In addition to these annotations, TimeBank provides four links that capture the temporal relations between the events and times in the text, using the Allen relations from Fig. 17.25. The following are the within-sentence temporal relations annotated for this example.
- Soaring$_{e1}$ is included in the fiscal first quarter$_{t58}$
- Soaring$_{e1}$ is before 1989-10-26$_{t57}$
- Soaring$_{e1}$ is simultaneous with the bucking$_{e3}$
- Declining$_{e4}$ includes soaring$_{e1}$
17.5 Template Filling
Many texts contain reports of events, and possibly sequences of events, that often correspond to fairly common, stereotypical situations in the world. These abstract situations or stories, related to what have been called scripts (Schank and Abelson, 1977), consist of prototypical sequences of sub-events, participants, and their roles. The strong expectations provided by these scripts can facilitate the proper classification of entities, the assignment of entities into roles and relations, and most critically, the drawing of inferences that fill in things that have been left unsaid. In their simplest form, such scripts can be represented as templates consisting of fixed sets of slots that take as values slot-fillers belonging to particular classes. The task of template filling is to find documents that invoke particular scripts and then fill the slots in the associated templates with fillers extracted from the text. These slot-fillers may consist of text segments extracted directly from the text, or they may consist of concepts that have been inferred from text elements through some additional processing.
A filled template from our original airline story might look like the following.
FARE-RAISE ATTEMPT:
LEAD AIRLINE: UNITED AIRLINES
AMOUNT: $6
EFFECTIVE DATE: 2006-10-26
FOLLOWER: AMERICAN AIRLINES
This template has four slots (LEAD AIRLINE, AMOUNT, EFFECTIVE DATE, FOLLOWER). The next section describes a standard sequence-labeling approach to filling slots. Section 17.5.2 then describes an older system based on the use of cascades of finite-state transducers and designed to address a more complex template-filling task that current learning-based systems don’t yet address.
17.5.1 Machine Learning Approaches to Template Filling
In the standard paradigm for template filling, we are given fixed known templates with known slots, along with training documents in which examples of each template are labeled and the fillers of each slot are marked in the text. The goal is to create one template for each event in the input documents, with the slots filled with text from the document.
The task is generally modeled by training two separate supervised systems. The first system decides whether the template is present in a particular sentence. This task is called template recognition or sometimes, in a perhaps confusing bit of terminology, event recognition. Template recognition can be treated as a text classification task, with features extracted from every sequence of words that was labeled in training documents as filling any slot from the template being detected. The usual set of features can be used: tokens, embeddings, word shapes, part-of-speech tags, syntactic chunk tags, and named entity tags.
The second system has the job of role-filler extraction. A separate classifier is trained to detect each role (LEAD-AIRLINE, AMOUNT, and so on). This can be a binary classifier that is run on every noun-phrase in the parsed input sentence, or a sequence model run over sequences of words. Each role classifier is trained on the labeled data in the training set. Again, the usual set of features can be used, but now trained only on an individual noun phrase or the fillers of a single slot.
Multiple non-identical text segments might be labeled with the same slot label. For example in our sample text, the strings United or United Airlines might be labeled as the LEAD AIRLINE. These are not incompatible choices and the coreference resolution techniques introduced in Chapter 20 can provide a path to a solution.
A variety of annotated collections have been used to evaluate this style of approach to template filling, including sets of job announcements, conference calls for papers, restaurant guides, and biological texts. Recent work focuses on extracting templates in cases where there is no training data or even predefined templates, by inducing templates as sets of linked events (Chambers and Jurafsky, 2011).
17.5.2 Earlier Finite-State Template-Filling Systems
The templates above are relatively simple. But consider the task of producing a template that contained all the information in a text like this one (Grishman and Sundheim, 1995):
Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.
The MUC-5 ‘joint venture’ task (the Message Understanding Conferences were a series of U.S. government-organized information-extraction evaluations) was to produce hierarchically linked templates describing joint ventures. Figure 17.27 shows a structure produced by the FASTUS system (Hobbs et al., 1997). Note how the filler of the ACTIVITY slot of the TIE-UP template is itself a template with slots.
[Figure 17.27: hierarchically linked templates produced by the FASTUS system for the joint venture text]
Early systems for dealing with these complex templates were based on cascades of transducers based on hand-written rules, as sketched in Fig. 17.28.
[Figure 17.28: the FASTUS cascade of finite-state transducers]
The first four stages use hand-written regular expression and grammar rules to do basic tokenization, chunking, and parsing. Stage 5 then recognizes entities and events with a FST-based recognizer and inserts the recognized objects into the appropriate slots in templates. This FST recognizer is based on hand-built regular expressions like the following (NG indicates Noun-Group and VG Verb-Group), which matches the first sentence of the news story above.
`NG(Company/ies) VG(Set-up) NG(Joint-Venture) with NG(Company/ies)`
`VG(Produce) NG(Product)`
The result of processing these two sentences is the five draft templates (Fig. 17.29) that must then be merged into the single hierarchical structure shown in Fig. 17.27. The merging algorithm, after performing coreference resolution, merges two activities that are likely to be describing the same events.
17.6 Summary
This chapter has explored techniques for extracting limited forms of semantic content from texts.
- Named entities can be recognized and classified by feature-based or neural sequence labeling techniques.
- Relations among entities can be extracted by pattern-based approaches, supervised learning methods when annotated training data is available, lightly supervised bootstrapping methods when small numbers of seed tuples or seed patterns are available, distant supervision when a database of relations is available, and unsupervised or Open IE methods.
- Reasoning about time can be facilitated by detection and normalization of temporal expressions through a combination of statistical learning and rule-based methods.
- Events can be detected and ordered in time using sequence models and classifiers trained on temporally- and event-labeled data like the TimeBank corpus.
- Template-filling applications can recognize stereotypical situations in texts and assign elements from the text to roles represented as fixed sets of slots.