Finding People and Things--NER

Named Entity Recognition

The process of finding people and things in text is referred to as Named Entity Recognition
(NER).

Common entity types:

  • People
  • Locations
  • Organizations
  • Money
  • Time
  • URLs

These entities in a document are vital and useful for NLP tasks such as:

  • conducting simple searches
  • processing queries
  • resolving references
  • disambiguation of text
  • finding the meaning of text

The NER process involves two tasks:

  • Detection of entities
  • Classification of entities

Detection is concerned with locating the position of an entity within the text.

Once the position is located, classification determines what type of entity was discovered.
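
The two tasks can be sketched with a toy dictionary lookup. This is only an illustration, not a real NER system: the class `EntitySpanDemo`, the `Entity` type, the dictionary entries, and the sample sentence are all hypothetical. Detection is the `indexOf` search that yields character offsets; classification is the type the dictionary stores for the matched term.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EntitySpanDemo {
    /** A detected entity: where it is (start, end) and what it is (type). */
    static final class Entity {
        final int start;   // inclusive character offset
        final int end;     // exclusive character offset
        final String type; // e.g. PERSON, LOCATION
        final String text;

        Entity(int start, int end, String type, String text) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return text + " [" + start + ":" + end + "] " + type;
        }
    }

    /** Hypothetical toy dictionary mapping surface forms to entity types. */
    static Map<String, String> toyDictionary() {
        Map<String, String> dict = new LinkedHashMap<>();
        dict.put("Seattle", "LOCATION");
        dict.put("Washington Place", "LOCATION");
        dict.put("Bill Gates", "PERSON");
        return dict;
    }

    static List<Entity> find(String text, Map<String, String> dict) {
        List<Entity> found = new ArrayList<>();
        for (Map.Entry<String, String> entry : dict.entrySet()) {
            int from = 0;
            int at;
            // Detection: locate every occurrence of the dictionary term.
            while ((at = text.indexOf(entry.getKey(), from)) >= 0) {
                int end = at + entry.getKey().length();
                // Classification: the dictionary supplies the entity type.
                found.add(new Entity(at, end, entry.getValue(), entry.getKey()));
                from = end;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Bill Gates lives near Seattle.";
        for (Entity entity : find(text, toyDictionary())) {
            System.out.println(entity);
        }
    }
}
```

Results come back in dictionary order rather than text order; a real system would sort by offset and resolve overlapping matches.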

Why NER is difficult

Although the tokenization of a text will
reveal its components, understanding what they are can be difficult.

Reasons:
  • Language ambiguity
  • Phrases can be challenging
  • NER is typically applied at the sentence level, so it depends on correct sentence boundary detection
  • Specialized text such as URLs, e-mail addresses, and specialized numbers can be difficult to isolate
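
As a small illustration of the last point, the sketch below splits a sentence on whitespace only. The class name `NaiveTokenizerDemo` is hypothetical, and the splitter is deliberately naive rather than a tokenizer anyone would use in practice. The e-mail address stays wrapped in parentheses and the phone number stays glued to the preceding word, so neither token can be matched directly as an entity.

```java
import java.util.Arrays;
import java.util.List;

class NaiveTokenizerDemo {
    /** Splits on runs of whitespace only; punctuation stays attached. */
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "He left his email address (rgb@colorworks.com) and his "
                + "phone number,800-555-1234.";
        for (String token : whitespaceTokens(text)) {
            System.out.println(token);
        }
    }
}
```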

Techniques for name recognition

  • Regular expressions

  • Dictionary-based approaches

  • Training a model to detect entities (statistical classifiers)

The disadvantage of the statistical approach is that it requires someone to annotate the sample text, which is a time-consuming process. In addition, it is domain dependent.

To measure how well a model has been trained, several measures are used:

  • Precision: the percentage of entities found that exactly match the spans in the evaluation data
  • Recall: the percentage of entities defined in the corpus that were found in the same location
  • Performance measure (F1): the harmonic mean of precision and recall, given by F1 = 2 * Precision * Recall / (Recall + Precision)
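
The three measures can be worked through on a small hand-made example. This is a minimal sketch under stated assumptions: the class `SpanScoreDemo`, the `start:end:type` string encoding, and the sample spans are all made up for illustration, and counting a prediction as correct only when start, end, and type all agree is one reasonable reading of "match exactly".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SpanScoreDemo {
    /** Encodes a span as "start:end:type" so that exact matches compare equal. */
    static String span(int start, int end, String type) {
        return start + ":" + end + ":" + type;
    }

    /** Number of predicted spans that also appear in the gold annotations. */
    static int exactMatches(Set<String> predicted, Set<String> gold) {
        Set<String> both = new HashSet<>(predicted);
        both.retainAll(gold);
        return both.size();
    }

    static double precision(Set<String> predicted, Set<String> gold) {
        return predicted.isEmpty() ? 0.0
                : (double) exactMatches(predicted, gold) / predicted.size();
    }

    static double recall(Set<String> predicted, Set<String> gold) {
        return gold.isEmpty() ? 0.0
                : (double) exactMatches(predicted, gold) / gold.size();
    }

    /** Harmonic mean of precision and recall; defined as 0 when both are 0. */
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
    }

    public static void main(String[] args) {
        Set<String> gold = new HashSet<>(Arrays.asList(
                span(0, 10, "PERSON"), span(22, 29, "LOCATION"),
                span(35, 44, "ORGANIZATION"), span(50, 54, "PERSON")));
        Set<String> predicted = new HashSet<>(Arrays.asList(
                span(0, 10, "PERSON"), span(22, 29, "LOCATION"),
                span(35, 40, "ORGANIZATION"))); // wrong end offset: not an exact match
        double p = precision(predicted, gold);
        double r = recall(predicted, gold);
        System.out.printf("precision=%.3f recall=%.3f f1=%.3f%n", p, r, f1(p, r));
    }
}
```

Here two of three predictions are correct (precision 2/3) and two of four gold entities are found (recall 1/2), so F1 = 2 * (2/3) * (1/2) / (2/3 + 1/2) = 4/7.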

Regular Expressions for NER

Two general approaches:

  • Java's built-in regex support (java.util.regex), for relatively simple and consistent entities
  • Third-party regex classes
// JDK built-in regex (java.util.regex)
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexNER {
    public static void main(String[] args) {
        String regularExpressionText
                = "He left his email address (rgb@colorworks.com) and his "
                + "phone number,800-555-1234. We believe his current address "
                + "is 100 Washington Place, Seattle, CO 12345-1234. I "
                + "understand you can also call at 123-555-1234 between "
                + "8:00 AM and 4:30 most days. His URL is http://example.com "
                + "and he was born on February 25, 1954 or 2/25/1954.";

        String phoneNumberRE = "\\d{3}-\\d{3}-\\d{4}";
        String urlRE = "\\b(https?|ftp|file|ldap)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]";
        String emailRE = "[a-zA-Z0-9'._%+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,4}";
        String timeRE = "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?";

        // Combine the individual expressions into a single alternation.
        Pattern pattern = Pattern.compile(phoneNumberRE + "|" + urlRE + "|" + emailRE + "|" + timeRE);
        Matcher matcher = pattern.matcher(regularExpressionText);

        // Print each match with its start (inclusive) and end (exclusive) offsets.
        // groupCount() is the number of capturing groups in the whole pattern,
        // so it prints the same value for every match.
        while (matcher.find()) {
            System.out.println(matcher.group() + " [" + matcher.start()
                    + ":" + matcher.end() + "] " + matcher.groupCount());
        }
    }
}
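
The combined pattern above reports where a match was found but not which kind of entity it is. One possible way to recover the type, sketched here as an assumption rather than as the approach the original code takes, is to wrap each expression in a named capturing group and check which group took part in the match. The class `LabeledRegexDemo`, the label names, and the sample sentence are hypothetical; the three expressions are copied from the example above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabeledRegexDemo {
    /** Label -> regular expression; labels double as named-group names. */
    static Map<String, String> patterns() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("PHONE", "\\d{3}-\\d{3}-\\d{4}");
        p.put("EMAIL", "[a-zA-Z0-9'._%+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,4}");
        p.put("TIME", "(([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?");
        return p;
    }

    static List<String> label(String text) {
        Map<String, String> patterns = patterns();
        // Build (?<PHONE>...)|(?<EMAIL>...)|(?<TIME>...)
        StringBuilder combined = new StringBuilder();
        for (Map.Entry<String, String> e : patterns.entrySet()) {
            if (combined.length() > 0) {
                combined.append('|');
            }
            combined.append("(?<").append(e.getKey()).append('>')
                    .append(e.getValue()).append(')');
        }
        Matcher matcher = Pattern.compile(combined.toString()).matcher(text);
        List<String> labeled = new ArrayList<>();
        while (matcher.find()) {
            // Only the alternative that matched has a non-null named group.
            for (String name : patterns.keySet()) {
                if (matcher.group(name) != null) {
                    labeled.add(name + ": " + matcher.group());
                    break;
                }
            }
        }
        return labeled;
    }

    public static void main(String[] args) {
        for (String line : label("call 123-555-1234 or mail rgb@colorworks.com at 8:00")) {
            System.out.println(line);
        }
    }
}
```

Alternatives are tried left to right at each position, so the order of the entries decides which label wins when two expressions could match at the same place.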
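// OpenNLP: train a person-name model, save it, then evaluate it.
// These calls follow the older OpenNLP 1.5.x API; later releases changed the
// PlainTextByLineStream constructor and the NameFinderME.train signature.
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Collections;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderEvaluator;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.eval.FMeasure;

public class TrainPersonModel {
    public static void main(String[] args) {
        try (OutputStream modelOutputStream = new BufferedOutputStream(
                new FileOutputStream(new File("modelFile")))) {

            // Read the annotated training sentences, one per line.
            ObjectStream<String> lineStream = new PlainTextByLineStream(
                    new FileInputStream("en-ner-person.train"), "UTF-8");
            ObjectStream<NameSample> sampleStream =
                    new NameSampleDataStream(lineStream);

            // Train the model and write it to modelFile.
            TokenNameFinderModel model = NameFinderME.train(
                    "en", "person", sampleStream,
                    Collections.<String, Object>emptyMap());
            model.serialize(modelOutputStream);

            // Evaluate the trained model against separately annotated data.
            TokenNameFinderEvaluator evaluator =
                    new TokenNameFinderEvaluator(new NameFinderME(model));
            lineStream = new PlainTextByLineStream(
                    new FileInputStream("en-ner-person.eval"), "UTF-8");
            sampleStream = new NameSampleDataStream(lineStream);
            evaluator.evaluate(sampleStream);

            // Report precision, recall, and F-measure.
            FMeasure result = evaluator.getFMeasure();
            System.out.println(result.toString());

        } catch (IOException ex) {
            // Handle exception
            ex.printStackTrace();
        }
    }
}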

NER involves detecting entities and then classifying them. Common categories include names, locations, and things. This is an important task that many applications use to support searching, resolving references, and finding the meaning of the text. The process is frequently used in downstream tasks.
