How to use the Stanford POS tagger for part-of-speech tagging [repost, in English]

WSKINGS

[Reposted from] http://www.galalaly.me/index.php/2011/05/tagging-text-with-stanford-pos-tagger-in-java-applications/

Tagging text with Stanford POS Tagger in Java Applications


I was looking for a way to extract nouns from a set of strings in Java and, using Google, I found the amazing Stanford NLP (Natural Language Processing) Group POS tagger.

The library lets you “tag” the words in your string. That is, for each word, the tagger determines whether it is a noun, a verb, etc., and then attaches the result to the word. For example:

“This is a sample sentence”

will be output as

`This/DT is/VBZ a/DT sample/NN sentence/NN`

To do this, the tagger has to load a “trained” file that contains the information it needs to tag the string. This trained file is called a model and has the extension “.tagger”. The Stanford NLP Group provides several trained models for different languages.

In this post I will show you how to use this library in your Java application using the Eclipse IDE.

1. Create a new project.
2. In the project, create a new folder called “taggers”.
3. Download the zip file provided by the Stanford NLP Group.
4. Extract the zip file and open the extracted folder.
5. Open the folder called “models” and copy the model you want, together with its corresponding “.props” file (it has the same name), into the “taggers” folder created earlier.
6. Import the library into the project so that Eclipse does not complain when we use it in our code: right-click your project > Build Path > Configure Build Path.
7. In the new window, open the Libraries tab (at the top) and click the Add External JARs button.
8. Locate and select the “stanford-postagger.jar” file in the extracted folder.
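Before writing any tagging code, it can help to confirm that the model actually landed in the “taggers” folder. The following is a minimal sketch in plain Java that does not use the Stanford library at all. The class name `TaggerFolderCheck` and both helper methods are hypothetical names made up for this illustration, and the sketch assumes the “.props” file is named after the model file with “.props” appended, which may not hold for every release.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Stanford library.
class TaggerFolderCheck {

    // Returns the sorted names of all files in the folder ending in ".tagger".
    // Returns an empty list if the folder is missing or is not a directory.
    static List<String> listModels(File folder) {
        String[] names = folder.list((dir, name) -> name.endsWith(".tagger"));
        if (names == null) {
            return new ArrayList<>();
        }
        Arrays.sort(names);
        return new ArrayList<>(Arrays.asList(names));
    }

    // Assumption: the props file is the model file name plus ".props".
    static String propsNameFor(String modelFileName) {
        return modelFileName + ".props";
    }

    public static void main(String[] args) {
        File folder = new File("taggers"); // folder name from the steps above
        List<String> models = listModels(folder);
        if (models.isEmpty()) {
            System.out.println("No .tagger files found in " + folder.getAbsolutePath());
        }
        for (String model : models) {
            boolean hasProps = new File(folder, propsNameFor(model)).isFile();
            System.out.println(model + (hasProps ? "" : " (no matching .props file)"));
        }
    }
}
```

Run it from the project root, since the “taggers” path is relative to the working directory.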

That is enough configuration; let’s start coding. In your project, create a new class and in its main method write:

    // Initialize the tagger
    MaxentTagger tagger = new MaxentTagger(
            "taggers/left3words-distsim-wsj-0-18.tagger");

The MaxentTagger constructor takes the path to the model (trained file) as a parameter:

“NAME_OF_FOLDER/NAME_OF_MODEL.tagger”.

Once you write the code, Eclipse will prompt you to import MaxentTagger and will warn you that the constructor throws some exceptions. Use Eclipse’s suggested fixes to add the import and the throws clause to the code.

Finally, we tag the string we want:

 
    // The sample string
    String sample = "This is a sample text";

    // The tagged string
    String tagged = tagger.tagString(sample);

    // Output the result
    System.out.println(tagged);

This prints the sample string in the same word/TAG format shown at the beginning of the post.

Here’s my entire class:

 
    import java.io.IOException;

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class TagText {

        public static void main(String[] args) throws IOException,
                ClassNotFoundException {

            // Initialize the tagger
            MaxentTagger tagger = new MaxentTagger(
                    "taggers/left3words-distsim-wsj-0-18.tagger");

            // The sample string
            String sample = "This is a sample text";

            // The tagged string
            String tagged = tagger.tagString(sample);

            // Output the result
            System.out.println(tagged);
        }
    }

Finally, we need to know what these “abbreviations” mean. For example, in this output:

`This/DT is/VBZ a/DT sample/NN sentence/NN`

What do “NN” and “DT” mean? The tagger uses the Penn Treebank tag set for English, as stated on the library’s homepage. In that tag set, DT marks a determiner, NN a singular or mass noun, and VBZ a third-person singular present-tense verb. For the remaining abbreviations, consult a list of the Penn Treebank tags. See the included README-Models.txt in the models directory for more information about the tag sets for the other languages.
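Since the original goal was to extract nouns, here is a minimal sketch of one way to pull them out of the tagged string. It is plain Java and does not call the Stanford library. `NounExtractor` and `extractNouns` are hypothetical names invented for this example. The sketch assumes the word and tag are separated by “/” as in the output above (other tagger versions or settings may use a different separator) and treats every Penn Treebank tag starting with “NN” (NN, NNS, NNP, NNPS) as a noun.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Stanford library.
class NounExtractor {

    // Returns the words whose tag starts with "NN" (NN, NNS, NNP, NNPS).
    // Assumes tokens look like word/TAG and are separated by whitespace.
    static List<String> extractNouns(String tagged) {
        List<String> nouns = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            // Split on the last '/' so a word that itself contains '/' stays intact.
            int sep = token.lastIndexOf('/');
            if (sep <= 0) {
                continue; // not a word/TAG pair, skip it
            }
            String word = token.substring(0, sep);
            String tag = token.substring(sep + 1);
            if (tag.startsWith("NN")) {
                nouns.add(word);
            }
        }
        return nouns;
    }

    public static void main(String[] args) {
        String tagged = "This/DT is/VBZ a/DT sample/NN sentence/NN";
        System.out.println(extractNouns(tagged)); // prints [sample, sentence]
    }
}
```

In a real program the `tagged` string would come from the tagger rather than being hard-coded.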

For memory problems (quoting Akash’s comment on the original post):

It turns out that the problem is that Eclipse allocates only 256MB of memory by default. Right-click the project > Run As > Run Configurations, go to the Arguments tab, and under VM arguments type -Xmx2048m. This sets the allocated memory to 2GB, and all the tagger files should run now.
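To check whether the VM argument took effect, a small sketch like the following can be run with the same run configuration. It uses only the standard `Runtime` API. The class name `HeapCheck` is a hypothetical name chosen for this example, and the value reported is the JVM’s own upper bound, which may differ slightly from the number passed to -Xmx.

```java
// Hypothetical helper for checking the JVM heap limit.
class HeapCheck {

    // Maximum amount of memory the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```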

Updated:
A sample project (for use with Eclipse) is available for download from the original post. It contains a tagger and a GUI example.

References
http://nlp.stanford.edu/software/tagger.shtml
http://www.englishclub.com/grammar/parts-of-speech_1.htm