Named Entity Recognition
Named Entity Recognition (NER) is the process of finding entities such as people and things in text.
Common entity types:
- People
- Locations
- Organizations
- Money
- Time
- URLs
The entities in a document are vital and useful for NLP tasks such as:
- conducting simple searches
- processing queries
- resolving references
- disambiguation of text
- finding the meaning of text
The NER process involves two tasks:
- Detection of entities
- Classification of entities
Detection is concerned with locating the position of an entity within the text.
Once the position is located, classification determines what type of entity was discovered.
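The two tasks can be sketched in a few lines of Java. This is a minimal illustration only: a tiny hand-made dictionary lookup stands in for a real detector, and the class name `DictionaryNerSketch`, the `Entity` type, and the dictionary contents are all hypothetical. Detection yields the start and end offsets; classification attaches the type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: finds entities by exact lookup in a tiny hand-made dictionary. */
final class DictionaryNerSketch {

    /** A detected entity: where it is (start, end) and what it is (type). */
    static final class Entity {
        final int start;    // inclusive character offset
        final int end;      // exclusive character offset
        final String text;
        final String type;

        Entity(int start, int end, String text, String type) {
            this.start = start;
            this.end = end;
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return text + " [" + start + ":" + end + "] " + type;
        }
    }

    // Hypothetical dictionary mapping surface strings to entity types.
    static final Map<String, String> DICTIONARY = new LinkedHashMap<>();
    static {
        DICTIONARY.put("Seattle", "LOCATION");
        DICTIONARY.put("Colorworks", "ORGANIZATION");
        DICTIONARY.put("February", "TIME");
    }

    static List<Entity> findEntities(String text) {
        List<Entity> found = new ArrayList<>();
        for (Map.Entry<String, String> entry : DICTIONARY.entrySet()) {
            String term = entry.getKey();
            // Detection: locate every occurrence of the term in the text.
            int from = text.indexOf(term);
            while (from >= 0) {
                // Classification: the dictionary entry supplies the type.
                found.add(new Entity(from, from + term.length(), term, entry.getValue()));
                from = text.indexOf(term, from + term.length());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (Entity e : findEntities("He moved to Seattle in February.")) {
            System.out.println(e);
        }
    }
}
```

Results are grouped by dictionary entry rather than by position in the text; a real system would also have to handle overlapping and ambiguous matches.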
Why NER is difficult
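Although tokenizing a text reveals its components, understanding what they are can be difficult.
Reasons:
- Language ambiguity: the same word can refer to different kinds of entities (for example, "Washington" may be a person or a place)
- Phrases can be challenging, since a multi-word entity may contain words that are valid entities on their own
- NER is a sentence-level process, so it depends on correct sentence boundary detection
- Specialized text such as URLs, e-mail addresses, and specialized numbers can be difficult to isolate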
Techniques for name recognition
- Regular expressions
- Dictionary-based approaches
- Statistical classifiers: train a model to detect entities
The disadvantage of statistical classifiers is that someone must annotate the sample text, which is a time-consuming process. In addition, the trained model is domain dependent.
To measure how well a model has been trained, several measures are used:
- Precision: the percentage of entities found by the model that exactly match spans in the evaluation data
- Recall: the percentage of entities defined in the evaluation corpus that the model found in the same location
- Performance measure (F1): the harmonic mean of precision and recall, given by F1 = 2 * Precision * Recall / (Recall + Precision)
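As a worked illustration of these measures, the sketch below scores a set of predicted spans against a set of gold spans. It assumes that a match means identical start, end, and type; the class name `NerScoreSketch`, the helper `span`, and the sample offsets are hypothetical and not taken from any library.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative scoring of predicted entity spans against gold (annotated) spans. */
final class NerScoreSketch {

    /** Encodes a span as "start:end:type" so exact matches can be found by set lookup. */
    static String span(int start, int end, String type) {
        return start + ":" + end + ":" + type;
    }

    /** Number of predicted spans that also appear, exactly, in the gold set. */
    static int countMatches(Set<String> predicted, Set<String> gold) {
        Set<String> common = new HashSet<>(predicted);
        common.retainAll(gold);
        return common.size();
    }

    /** Fraction of predicted spans that are correct. */
    static double precision(Set<String> predicted, Set<String> gold) {
        return predicted.isEmpty() ? 0.0
                : (double) countMatches(predicted, gold) / predicted.size();
    }

    /** Fraction of gold spans that were found. */
    static double recall(Set<String> predicted, Set<String> gold) {
        return gold.isEmpty() ? 0.0
                : (double) countMatches(predicted, gold) / gold.size();
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (r + p);
    }

    public static void main(String[] args) {
        Set<String> gold = new HashSet<>(Arrays.asList(
                span(0, 5, "person"), span(10, 17, "person"),
                span(30, 36, "person"), span(50, 58, "person")));
        Set<String> predicted = new HashSet<>(Arrays.asList(
                span(0, 5, "person"), span(10, 17, "person"),
                span(31, 36, "person")));   // boundary is off by one, so no match

        double p = precision(predicted, gold);
        double r = recall(predicted, gold);
        System.out.println("precision=" + p + " recall=" + r + " F1=" + f1(p, r));
    }
}
```

With these sample sets, two of the three predictions match, so precision is 2/3; two of the four gold spans are found, so recall is 2/4; and F1 is 4/7, roughly 0.571. The off-by-one span shows why exact-match scoring is strict: a nearly correct boundary counts as both a false positive and a miss.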
Regular Expressions for NER
Two general approaches:
- Java's built-in regex support (java.util.regex), for relatively simple and consistent entities
- Third-party regex classes
// JDK built-in (java.util.regex)
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sample text containing several kinds of entities
String regularExpressionText
        = "He left his email address (rgb@colorworks.com) and his "
        + "phone number,800-555-1234. We believe his current address "
        + "is 100 Washington Place, Seattle, CO 12345-1234. I "
        + "understand you can also call at 123-555-1234 between "
        + "8:00 AM and 4:30 most days. His URL is http://example.com "
        + "and he was born on February 25, 1954 or 2/25/1954.";

// One regular expression per entity type
String phoneNumberRE = "\\d{3}-\\d{3}-\\d{4}";
String urlRE = "\\b(https?|ftp|file|ldap)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]";
String emailRE = "[a-zA-Z0-9'._%+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,4}";
String timeRE = "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?";

// Combine the expressions with alternation so one pass finds every type
Pattern pattern = Pattern.compile(phoneNumberRE + "|" + urlRE + "|" + emailRE + "|" + timeRE);
Matcher matcher = pattern.matcher(regularExpressionText);
while (matcher.find()) {
    // Print the matched text and its [start:end] character offsets.
    // groupCount() is the number of capturing groups in the pattern,
    // so it is the same for every match.
    System.out.println(matcher.group() + " [" + matcher.start()
            + ":" + matcher.end() + "] " + matcher.groupCount());
}
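The combined pattern above detects matches but does not report which kind of entity each one is. One possible way to add the classification step, sketched here under the assumption that keeping one compiled pattern per type is acceptable, is shown below. The class name `TypedRegexNerSketch`, the type labels, and the sample sentence are hypothetical; the expressions are copied from the snippet above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: runs one regex per entity type so each match carries a label. */
final class TypedRegexNerSketch {

    // One compiled pattern per entity type; the keys are illustrative labels.
    static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("PHONE", Pattern.compile("\\d{3}-\\d{3}-\\d{4}"));
        PATTERNS.put("EMAIL", Pattern.compile(
                "[a-zA-Z0-9'._%+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,4}"));
        PATTERNS.put("TIME", Pattern.compile(
                "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?"));
    }

    /** Returns one line per match in the form "TYPE text [start:end]". */
    static List<String> tag(String text) {
        List<String> tagged = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher m = entry.getValue().matcher(text);
            while (m.find()) {
                tagged.add(entry.getKey() + " " + m.group()
                        + " [" + m.start() + ":" + m.end() + "]");
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        String text = "Call 800-555-1234 or mail rgb@colorworks.com at 8:00 AM.";
        for (String line : tag(text)) {
            System.out.println(line);
        }
    }
}
```

Because each pattern is applied separately, matches are grouped by type rather than listed in text order, and the text is scanned once per type. Sorting the results by start offset would restore document order if that matters.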
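// OpenNLP: train a person-name model, save it, then evaluate it
// (uses the older OpenNLP 1.5.x style API; newer releases changed these signatures)
import java.io.*;
import java.util.Collections;
import opennlp.tools.namefind.*;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.eval.FMeasure;

try (OutputStream modelOutputStream = new BufferedOutputStream(
        new FileOutputStream(new File("modelFile")))) {
    // Read the annotated training data line by line
    ObjectStream<String> lineStream = new PlainTextByLineStream(
            new FileInputStream("en-ner-person.train"), "UTF-8");
    ObjectStream<NameSample> sampleStream =
            new NameSampleDataStream(lineStream);

    // Train the model and write it to disk
    TokenNameFinderModel model = NameFinderME.train(
            "en", "person", sampleStream,
            Collections.<String, Object>emptyMap());
    model.serialize(modelOutputStream);

    // Evaluate the trained model against held-out annotated data
    TokenNameFinderEvaluator evaluator =
            new TokenNameFinderEvaluator(new NameFinderME(model));
    lineStream = new PlainTextByLineStream(
            new FileInputStream("en-ner-person.eval"), "UTF-8");
    sampleStream = new NameSampleDataStream(lineStream);
    evaluator.evaluate(sampleStream);

    // Report precision, recall, and F-measure
    FMeasure result = evaluator.getFMeasure();
    System.out.println(result.toString());
} catch (IOException ex) {
    // Handle exception
    ex.printStackTrace();
}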
NER involves detecting entities and then classifying them. Common categories include people, locations, and organizations. Many applications rely on NER to support searching, resolving references, and finding the meaning of text, which makes it a frequent building block for downstream tasks.