How to use the Stanford POS Tagger for part-of-speech tagging [repost, in English]

[Reposted from] http://www.galalaly.me/index.php/2011/05/tagging-text-with-stanford-pos-tagger-in-java-applications/


Tagging text with Stanford POS Tagger in Java Applications

I was looking for a way to extract nouns from a set of strings in Java and, using Google, I found the amazing POS tagger from the Stanford NLP (Natural Language Processing) Group.

The library lets you “tag” the words in a string. That is, for each word, the tagger determines whether it is a noun, a verb, etc., and attaches the result to the word. For example:

“This is a sample sentence”

will be output as

This/DT is/VBZ a/DT sample/NN sentence/NN

To do this, the tagger has to load a “trained” file that contains the necessary information for the tagger to tag the string. This “trained” file is called a model and has the extension “.tagger”. There are several trained models provided by Stanford NLP group for different languages.
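
Since models are recognized by their “.tagger” extension, a quick way to see which models a folder contains is to filter its file names. The following is only a sketch in plain Java that does not touch the Stanford library; the class name `ModelFinder`, the method `filterModels`, and the default folder name "taggers" are assumptions made for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ModelFinder {
    // Keeps only the names that end with ".tagger", sorted alphabetically.
    // Companion files such as "x.tagger.props" are filtered out.
    static List<String> filterModels(List<String> fileNames) {
        List<String> models = new ArrayList<>();
        for (String name : fileNames) {
            if (name.endsWith(".tagger")) {
                models.add(name);
            }
        }
        Collections.sort(models);
        return models;
    }

    public static void main(String[] args) {
        // "taggers" is an assumed folder name; pass another path as the first argument.
        String folder = args.length > 0 ? args[0] : "taggers";
        String[] names = new File(folder).list(); // null if the folder is missing or unreadable
        if (names == null) {
            System.out.println("Not a readable folder: " + new File(folder).getAbsolutePath());
            return;
        }
        for (String model : filterModels(Arrays.asList(names))) {
            System.out.println(model);
        }
    }
}
```

Printing the absolute path when the folder is missing helps spot the common case where the program runs from a different working directory than expected.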

In this post I will show you how to use this library in a Java application using the Eclipse IDE.

  1. Create a new project.
  2. Create a new folder called “taggers”.
  3. Download the zip file provided by the Stanford NLP Group.
  4. Extract the zip file and open the extracted folder.
  5. Inside it is a folder called “models”. Open it and copy the model you want, together with its corresponding “.props” file (same name), to the “taggers” folder created earlier.
  6. Now import the library into the project so that Eclipse does not complain when we use it in our code: right-click your project > Build Path > Configure Build Path.
    In the new window, open the Libraries tab (at the top) and click the Add External JARs button.
    Locate the “stanford-postagger.jar” file in the extracted folder.

  7. Enough configuration; let’s start coding. In your project, create a new class and write the following in its main method:
    // Initialize the tagger
    MaxentTagger tagger = new MaxentTagger(
            "taggers/left3words-distsim-wsj-0-18.tagger");

    The MaxentTagger constructor takes the path to the model (trained file) as a parameter:

    “NAME_OF_FOLDER/NAME_OF_MODEL.tagger”.

    Once you write the code, Eclipse will prompt you to import MaxentTagger and warn you that the constructor throws some exceptions. Let Eclipse add the import and the exception handling to the code.

    Finally, we tag the string we want:

    // The sample string
    String sample = "This is a sample text";

    // The tagged string
    String tagged = tagger.tagString(sample);

    // Output the result
    System.out.println(tagged);

    This prints tagged output in the same format as the example at the beginning of the post.

    Here’s my entire class:

    import java.io.IOException;

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class TagText {
        public static void main(String[] args) throws IOException,
                ClassNotFoundException {

            // Initialize the tagger
            MaxentTagger tagger = new MaxentTagger(
                    "taggers/left3words-distsim-wsj-0-18.tagger");

            // The sample string
            String sample = "This is a sample text";

            // The tagged string
            String tagged = tagger.tagString(sample);

            // Output the result
            System.out.println(tagged);
        }
    }
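
Since the original goal was to extract nouns, the tagged string can be post-processed with plain Java. What follows is a minimal sketch, not part of the Stanford library: it assumes the word/TAG output format shown in this post, takes the separator as a parameter because other versions or configurations of the tagger may use a different one, and treats every tag starting with "NN" as a noun. The names `NounExtractor` and `extractNouns` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NounExtractor {
    // Returns the words whose tag starts with "NN" (NN, NNS, NNP, NNPS in the Penn Treebank set).
    static List<String> extractNouns(String tagged, char separator) {
        List<String> nouns = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            // Split on the last separator so that words containing the separator stay intact.
            int pos = token.lastIndexOf(separator);
            if (pos <= 0) {
                continue; // not in word<separator>TAG form
            }
            String word = token.substring(0, pos);
            String tag = token.substring(pos + 1);
            if (tag.startsWith("NN")) {
                nouns.add(word);
            }
        }
        return nouns;
    }

    public static void main(String[] args) {
        String tagged = "This/DT is/VBZ a/DT sample/NN sentence/NN";
        System.out.println(extractNouns(tagged, '/')); // prints [sample, sentence]
    }
}
```

In a real application the `tagged` string would come from `tagger.tagString(sample)` in the class above.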

Finally, we need to know what these abbreviations mean. For example, in this output:

This/DT is/VBZ a/DT sample/NN sentence/NN

What do “NN” and “DT” mean? The tagger uses the Penn Treebank tag set for English, as stated on the library’s homepage. The original post links to a list of the abbreviations. See the README-Models.txt file included in the models directory for more information about the tag sets for the other languages.
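
To make tagged output easier to read, a small lookup table can translate tags into descriptions. The sketch below covers only a handful of Penn Treebank tags and is deliberately incomplete; the class name `TagGlossary` and the method `describe` are made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TagGlossary {
    // A small, deliberately incomplete subset of the Penn Treebank tag set.
    static final Map<String, String> TAGS = new LinkedHashMap<>();
    static {
        TAGS.put("DT", "determiner");
        TAGS.put("NN", "noun, singular or mass");
        TAGS.put("NNS", "noun, plural");
        TAGS.put("NNP", "proper noun, singular");
        TAGS.put("VB", "verb, base form");
        TAGS.put("VBZ", "verb, 3rd person singular present");
        TAGS.put("JJ", "adjective");
        TAGS.put("RB", "adverb");
    }

    // Returns a description for the tag, or a placeholder if it is not in the subset.
    static String describe(String tag) {
        return TAGS.getOrDefault(tag, "unknown tag (not in this subset)");
    }

    public static void main(String[] args) {
        for (String token : "This/DT is/VBZ a/DT sample/NN sentence/NN".split(" ")) {
            String[] parts = token.split("/");
            System.out.println(parts[0] + " -> " + describe(parts[1]));
        }
    }
}
```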

For memory problems (quoting Akash’s comment on the original post):

It turns out that the problem is that Eclipse allocates only 256MB of memory by default. Right-click the project > Run As > Run Configurations, go to the Arguments tab, and under VM arguments type -Xmx2048m. This sets the allocated memory to 2GB, and all the tagger files should run now.
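
To confirm that the VM argument took effect, the JVM can report its own heap limit. This is a small sketch using the standard `Runtime.getRuntime().maxMemory()` call; the class name `HeapCheck` and the helper `toMiB` are hypothetical, and the reported figure may be somewhat lower than the value given to -Xmx.

```java
class HeapCheck {
    // Converts a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Upper bound on the heap this JVM will try to use.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap available to this JVM: " + toMiB(maxHeap) + " MiB");
    }
}
```

Running it from the same Eclipse run configuration as the tagger shows whether that configuration actually received the larger heap.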

Updated:
The original post links to a downloadable sample project (for use with Eclipse). It contains a tagger and a GUI example.

References
http://nlp.stanford.edu/software/tagger.shtml
http://www.englishclub.com/grammar/parts-of-speech_1.htm

About

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'.

This software is a Java implementation of the log-linear part-of-speech taggers described in these papers (if citing just one paper, cite the 2003 one):

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.

The tagger was originally written by Kristina Toutanova. Since that time, Dan Klein, Christopher Manning, William Morgan, Anna Rafferty, Michel Galley, and John Bauer have improved its speed, performance, usability, and support for other languages.

The system requires Java 1.6+ to be installed. Depending on whether you're running 32 or 64 bit Java and the complexity of the tagger model, you'll need somewhere between 60 and 200 MB of memory to run a trained tagger (i.e., you may need to give java an option like java -mx200m). Plenty of memory is needed to train a tagger. It again depends on the complexity of the model, but at least 1GB is usually needed, often more.

Several downloads are available. The basic download contains two trained tagger models for English. The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, and a German tagger model. Both versions include the same source and other required files. The tagger can be retrained on any language, given POS-annotated training text for the language.