Finding People and Things--NER

HoiDev

Category: NLP

Article link: https://blog.csdn.net/qq_33938256/article/details/52763777

Named Entity Recognition
Why NER is difficult
Techniques for name recognition
- Regular Expressions for NER

Named Entity Recognition

The process of finding people and things is referred to as Named Entity Recognition (NER).

Common entity types:

People
Locations
Organizations
Money
Time
URLs

These entities in a document are vital and useful for NLP tasks such as:

conducting simple searches
processing queries
resolving references
disambiguation of text
finding the meaning of text

The NER process involves two tasks:

Detection of entities
Classification of entities

Detection is concerned with the position of an entity within the text.

Once the position is located, it is important to determine what type of entity was discovered.
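To make the two tasks concrete, here is a minimal sketch of how one recognized entity could be represented: the character offsets record the result of detection, and the type label records the result of classification. The class name `EntityMention`, its fields, and the sample sentence are illustrative assumptions, not part of any NLP library.

```java
/** Hypothetical container for one recognized entity: where it is and what it is. */
final class EntityMention {
    final int start;    // inclusive character offset (result of detection)
    final int end;      // exclusive character offset (result of detection)
    final String type;  // entity category, e.g. "PERSON" (result of classification)

    EntityMention(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }

    /** Returns the text this mention covers in the given document. */
    String textIn(String document) {
        return document.substring(start, end);
    }
}

class EntityMentionDemo {
    public static void main(String[] args) {
        String text = "Alice moved to Seattle.";
        // Offsets and types are written by hand here; a real recognizer would produce them.
        EntityMention person = new EntityMention(0, 5, "PERSON");
        EntityMention place = new EntityMention(15, 22, "LOCATION");
        System.out.println(person.textIn(text) + " -> " + person.type);
        System.out.println(place.textIn(text) + " -> " + place.type);
    }
}
```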

Why NER is difficult

Although tokenizing a text reveals its components, understanding what they are can be difficult.

Reasons:

Language ambiguity

Multi-word phrases are challenging, because a single entity may span several tokens

NER is usually performed at the sentence level, so it depends on accurate sentence boundary detection

Specialized text such as URLs, e-mail addresses, and specialized numbers can be difficult to isolate

Techniques for name recognition

Regular expressions
Dictionary-based approaches

Statistical classifiers (training a model to detect entities)

The disadvantage of the statistical approach is that it requires someone to annotate the sample text, which is a time-consuming process. In addition, the resulting model is domain dependent.
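The dictionary-based approach listed above can be sketched as a lookup of known surface forms in the text. This is only a minimal illustration: the class `DictionaryNer`, the entries in its gazetteer, and the output format are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DictionaryNer {
    // Hypothetical gazetteer: surface form -> entity type (illustrative entries only).
    private static final Map<String, String> GAZETTEER = new LinkedHashMap<>();
    static {
        GAZETTEER.put("Washington Place", "LOCATION");
        GAZETTEER.put("Seattle", "LOCATION");
        GAZETTEER.put("February", "TIME");
    }

    /** Returns "surface [start:end] TYPE" for every dictionary entry found in the text. */
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : GAZETTEER.entrySet()) {
            String term = entry.getKey();
            int index = text.indexOf(term);
            while (index >= 0) {
                int end = index + term.length();
                hits.add(term + " [" + index + ":" + end + "] " + entry.getValue());
                index = text.indexOf(term, end);  // continue after this occurrence
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String hit : find("His address is 100 Washington Place, Seattle.")) {
            System.out.println(hit);
        }
    }
}
```

Plain substring lookup ignores word boundaries and case, so a production dictionary matcher would normally work on tokens instead. The sketch keeps the lookup naive to show why this approach is simple to build but limited to the names it already knows.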

To measure how well a model has been trained, several measures are used:

Precision: the percentage of entities found that exactly match the spans in the evaluation data

Recall: the percentage of entities defined in the corpus that were found in the same location

F-measure (F1): the harmonic mean of precision and recall, given by F1 = 2 * Precision * Recall / (Recall + Precision)
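As a worked illustration of these three measures, the following sketch computes them from raw counts. The class name `F1Example` and the counts used in `main` are made up for the example. A predicted entity counts as a true positive only when its span matches the evaluation data exactly.

```java
class F1Example {
    /** Fraction of predicted entities that are correct. */
    static double precision(int truePositives, int falsePositives) {
        int predicted = truePositives + falsePositives;
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    /** Fraction of annotated entities that were found. */
    static double recall(int truePositives, int falseNegatives) {
        int annotated = truePositives + falseNegatives;
        return annotated == 0 ? 0.0 : (double) truePositives / annotated;
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Made-up counts: 8 predicted spans are correct, 2 are spurious,
        // and 4 annotated entities were missed.
        double p = precision(8, 2);
        double r = recall(8, 4);
        System.out.println("precision=" + p + " recall=" + r + " f1=" + f1(p, r));
    }
}
```

With these counts, precision is 8/10, recall is 8/12, and F1 works out to 8/11, roughly 0.727.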

Regular Expressions for NER

Two general approaches:

Java's built-in regex classes (java.util.regex), for relatively simple and consistent entities
Third-party regex classes

// JDK built-in regex support (java.util.regex)
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexNer {
    public static void main(String[] args) {
        String regularExpressionText
                = "He left his email address (rgb@colorworks.com) and his "
                + "phone number,800-555-1234. We believe his current address "
                + "is 100 Washington Place, Seattle, CO 12345-1234. I "
                + "understand you can also call at 123-555-1234 between "
                + "8:00 AM and 4:30 most days. His URL is http://example.com "
                + "and he was born on February 25, 1954 or 2/25/1954.";

        String phoneNumberRE = "\\d{3}-\\d{3}-\\d{4}";
        String urlRE = "\\b(https?|ftp|file|ldap)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]";
        String emailRE = "[a-zA-Z0-9'._%+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,4}";
        String timeRE = "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?";

        // Combine the individual expressions into a single alternation
        Pattern pattern = Pattern.compile(phoneNumberRE + "|" + urlRE + "|" + emailRE + "|" + timeRE);
        Matcher matcher = pattern.matcher(regularExpressionText);

        // Print each match with its character offsets. groupCount() is the number
        // of capturing groups in the pattern, so it is the same for every match.
        while (matcher.find()) {
            System.out.println(matcher.group() + " [" + matcher.start()
                    + ":" + matcher.end() + "] " + matcher.groupCount());
        }
    }
}
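The loop above detects matches but does not report which expression produced each one, so the classification half of NER is missing. One possible way to recover the type, sketched here as an assumption rather than as the author's method, is to wrap each alternative in a named capturing group (available in `java.util.regex` since Java 7). The class name `TypedRegexNer`, the group names, and the slightly simplified time expression are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TypedRegexNer {
    // Each alternative sits in its own named group so the match can be classified.
    private static final Pattern ENTITY = Pattern.compile(
            "(?<PHONE>\\d{3}-\\d{3}-\\d{4})"
            + "|(?<EMAIL>[a-zA-Z0-9'._%+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,4})"
            + "|(?<TIME>(?:[01]?[0-9]|2[0-3]):[0-5][0-9])");

    private static final String[] TYPES = {"PHONE", "EMAIL", "TIME"};

    /** Returns "TYPE: text" for every entity found, in order of appearance. */
    static List<String> tag(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = ENTITY.matcher(text);
        while (matcher.find()) {
            for (String type : TYPES) {
                // Only the group of the alternative that matched is non-null.
                if (matcher.group(type) != null) {
                    result.add(type + ": " + matcher.group(type));
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String entity : tag("Call 800-555-1234 or mail rgb@colorworks.com after 8:00 AM")) {
            System.out.println(entity);
        }
    }
}
```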

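// OpenNLP: train a person-name model (statistical classifier), then evaluate it.
// These calls appear to target the OpenNLP 1.5.x API; newer releases use different signatures.
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderEvaluator;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.eval.FMeasure;

public class TrainPersonModel {
    public static void main(String[] args) {
        try (OutputStream modelOutputStream = new BufferedOutputStream(
                new FileOutputStream(new File("modelFile")))) {
            // Train on the annotated sample text and save the model
            ObjectStream<String> lineStream = new PlainTextByLineStream(
                    new FileInputStream("en-ner-person.train"), "UTF-8");
            ObjectStream<NameSample> sampleStream =
                    new NameSampleDataStream(lineStream);
            TokenNameFinderModel model = NameFinderME.train(
                    "en", "person", sampleStream,
                    Collections.<String, Object>emptyMap());
            model.serialize(modelOutputStream);

            // Evaluate the trained model against separate evaluation data
            TokenNameFinderEvaluator evaluator =
                    new TokenNameFinderEvaluator(new NameFinderME(model));
            lineStream = new PlainTextByLineStream(
                    new FileInputStream("en-ner-person.eval"), "UTF-8");
            sampleStream = new NameSampleDataStream(lineStream);
            evaluator.evaluate(sampleStream);

            // Report precision, recall, and F-measure
            FMeasure result = evaluator.getFMeasure();
            System.out.println(result.toString());
        } catch (IOException ex) {
            // Handle exception
            ex.printStackTrace();
        }
    }
}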

NER involves detecting entities and then classifying them. Common categories include names, locations, and things. Many applications rely on it to support searching, resolving references, and finding the meaning of text, so it is frequently a step in downstream tasks.