Part-of-speech tagging of English words with stanford-postagger

1. Introduction to the models

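This release of the tagger contains a models folder holding two kinds of files: .tagger files and .props files. A .tagger file is a trained part-of-speech tagging model, and the .props file is the properties file that belongs to it. The original post listed every file in the models folder in a figure:

[Figure: listing of the files in the models folder]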
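To make the pairing between model files and properties files concrete, here is a minimal sketch that lists a models folder using only the Java standard library. It assumes that each properties file is named after its model with ".props" appended (for example `x.tagger` and `x.tagger.props`); the class name `ModelLister` and the default folder name `models` are hypothetical and only serve the illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class ModelLister {

  /**
   * Maps each *.tagger file in modelsDir to the name of its companion
   * properties file, or to null when no companion is present.
   * Assumption: the companion is named "<model file name>.props".
   */
  public static Map<String, String> pairModels(Path modelsDir) throws IOException {
    Map<String, String> pairs = new TreeMap<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(modelsDir, "*.tagger")) {
      for (Path model : stream) {
        String name = model.getFileName().toString();
        Path props = modelsDir.resolve(name + ".props");
        pairs.put(name, Files.exists(props) ? props.getFileName().toString() : null);
      }
    }
    return pairs;
  }

  public static void main(String[] args) throws IOException {
    // "models" is a placeholder; pass the real folder as the first argument.
    Path dir = Paths.get(args.length > 0 ? args[0] : "models");
    if (!Files.isDirectory(dir)) {
      System.out.println("Not a directory: " + dir);
      return;
    }
    pairModels(dir).forEach((model, props) ->
        System.out.println(model + " -> " + (props == null ? "(no .props found)" : props)));
  }
}
```

Running it against the extracted distribution should print one line per model, which is a quick way to check that no .props file went missing when the models were copied elsewhere.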

2. Program and notes

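This open-source tagger ships with three English models: english-bidirectional-distsim.tagger, english-left3words-distsim.tagger, and wsj-0-18-bidirectional-nodistsim.tagger. According to its documentation, tagging accuracy is about 97.01%. The tool can also tag other languages, such as Chinese and German.

The tagging program and its output follow.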

2.1. Tagging program

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence; // named SentenceUtils in newer releases
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class Tagger {

  public static void main(String[] args) throws Exception {
    String str = "The list of prisoners who may be released in coming days includes militants" +
            " who threw firebombs, in one case at a bus carrying children; stabbed and shot" +
            " civilians, including women, elderly Jews and suspected Palestinian collaborators; " +
            "and ambushed and killed border guards, police officers, security agents and soldiers. " +
            "All of them have been in prison for at least two decades; some were serving life sentences.";
    // Load the trained model; adjust the path to wherever the .tagger file lives.
    MaxentTagger tagger = new MaxentTagger("c:/wsj-0-18-bidirectional-nodistsim.tagger");
    long start = System.currentTimeMillis();
    // Split the raw text into sentences, each a list of tokens.
    List<List<HasWord>> sentences = MaxentTagger.tokenizeText(new StringReader(str));
    // Note: this interval covers tokenization only; tagging happens in the loop below.
    System.out.println("Tagging took " + (System.currentTimeMillis() - start) + " ms");
    for (List<HasWord> sentence : sentences) {
      List<TaggedWord> tSentence = tagger.tagSentence(sentence);
      // Print every token of the sentence as word/TAG on one line.
      System.out.println(Sentence.listToString(tSentence, false));
    }
  }
}
```

2.2. Tagging output

```text
Tagging took 84 ms
The/DT list/NN of/IN prisoners/NNS who/WP may/MD be/VB released/VBN in/IN coming/VBG days/NNS includes/VBZ militants/NNS who/WP threw/VBD firebombs/NNS ,/, in/IN one/CD case/NN at/IN a/DT bus/NN carrying/VBG children/NNS ;/: stabbed/VBN and/CC shot/VBN civilians/NNS ,/, including/VBG women/NNS ,/, elderly/JJ Jews/NNS and/CC suspected/JJ Palestinian/JJ collaborators/NNS ;/: and/CC ambushed/VBN and/CC killed/VBN border/NN guards/NNS ,/, police/NN officers/NNS ,/, security/NN agents/NNS and/CC soldiers/NNS ./.
All/DT of/IN them/PRP have/VBP been/VBN in/IN prison/NN for/IN at/IN least/JJS two/CD decades/NNS ;/: some/DT were/VBD serving/VBG life/NN sentences/NNS ./.
```
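The tagger prints each token as `word/TAG`, separated by spaces, so the output is easy to post-process. The following is a minimal sketch, using only the Java standard library, that splits such a line into word/tag pairs and counts how often each tag occurs. The class `TaggedOutputParser` and its methods are hypothetical helpers and not part of the Stanford library. Splitting at the last slash is an assumption made so that words which themselves contain a slash keep it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TaggedOutputParser {

  /** Splits one "word/TAG" token at its last slash into {word, tag}. */
  public static String[] splitToken(String token) {
    int slash = token.lastIndexOf('/');
    if (slash < 0) {
      return new String[] {token, ""}; // no tag attached
    }
    return new String[] {token.substring(0, slash), token.substring(slash + 1)};
  }

  /** Parses a whitespace-separated line of word/TAG tokens. */
  public static List<String[]> parseLine(String line) {
    List<String[]> pairs = new ArrayList<>();
    for (String token : line.trim().split("\\s+")) {
      if (!token.isEmpty()) {
        pairs.add(splitToken(token));
      }
    }
    return pairs;
  }

  /** Counts occurrences of each tag, sorted by tag name. */
  public static Map<String, Integer> countTags(List<String[]> pairs) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String[] pair : pairs) {
      counts.merge(pair[1], 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // A fragment of the output shown above.
    String line = "The/DT list/NN of/IN prisoners/NNS who/WP may/MD be/VB released/VBN";
    for (String[] pair : parseLine(line)) {
      System.out.println(pair[0] + "\t" + pair[1]);
    }
    System.out.println(countTags(parseLine(line)));
  }
}
```

A tag count like this gives a quick sanity check when comparing the three English models on the same text.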

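The following table lists the English part-of-speech tags.

[Table: English part-of-speech tags]

Judging from the table above and the program's output, the tagging is quite accurate.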
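For readers without the tag table at hand, here is a small sketch of a lookup for the tags that occur in the output above. The descriptions follow the Penn Treebank tag set, which the English models use. The class `TagGlossary` is a hypothetical helper, and the glossary is deliberately partial: it covers only the tags seen in this example, not the full tag set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TagGlossary {

  // Partial glossary: only the tags that appear in the sample output.
  private static final Map<String, String> TAGS = new LinkedHashMap<>();

  static {
    TAGS.put("CC", "coordinating conjunction");
    TAGS.put("CD", "cardinal number");
    TAGS.put("DT", "determiner");
    TAGS.put("IN", "preposition or subordinating conjunction");
    TAGS.put("JJ", "adjective");
    TAGS.put("JJS", "adjective, superlative");
    TAGS.put("MD", "modal");
    TAGS.put("NN", "noun, singular or mass");
    TAGS.put("NNS", "noun, plural");
    TAGS.put("PRP", "personal pronoun");
    TAGS.put("VB", "verb, base form");
    TAGS.put("VBD", "verb, past tense");
    TAGS.put("VBG", "verb, gerund or present participle");
    TAGS.put("VBN", "verb, past participle");
    TAGS.put("VBP", "verb, non-3rd person singular present");
    TAGS.put("VBZ", "verb, 3rd person singular present");
    TAGS.put("WP", "wh-pronoun");
  }

  /** Returns the description of a tag, or a fallback for tags not listed here. */
  public static String describe(String tag) {
    return TAGS.getOrDefault(tag, "(not in this partial glossary)");
  }

  public static void main(String[] args) {
    for (String tag : new String[] {"DT", "NNS", "VBN", "JJS"}) {
      System.out.println(tag + ": " + describe(tag));
    }
  }
}
```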
About

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'.

This software is a Java implementation of the log-linear part-of-speech taggers described in these papers (if citing just one paper, cite the 2003 one):

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.

The tagger was originally written by Kristina Toutanova. Since that time, Dan Klein, Christopher Manning, William Morgan, Anna Rafferty, Michel Galley, and John Bauer have improved its speed, performance, usability, and support for other languages.

The system requires Java 1.6+ to be installed. Depending on whether you're running 32 or 64 bit Java and the complexity of the tagger model, you'll need somewhere between 60 and 200 MB of memory to run a trained tagger (i.e., you may need to give java an option like java -mx200m). Plenty of memory is needed to train a tagger. It again depends on the complexity of the model but at least 1GB is usually needed, often more.

Several downloads are available. The basic download contains two trained tagger models for English. The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, and a German tagger model. Both versions include the same source and other required files. The tagger can be retrained on any language, given POS-annotated training text for the language.
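Since the memory needed depends on the model, it can help to check the JVM's heap limit before loading one. The sketch below uses only `Runtime.getRuntime().maxMemory()` from the standard library. The class `HeapCheck` is hypothetical, and the 200 MB threshold is simply the upper figure quoted above, not a measured requirement for any particular model.

```java
public class HeapCheck {

  private static final long BYTES_PER_MB = 1024L * 1024L;

  /** Pure helper: is a heap limit of maxBytes at least requiredMb megabytes? */
  public static boolean isEnough(long maxBytes, long requiredMb) {
    return maxBytes / BYTES_PER_MB >= requiredMb;
  }

  /** Checks the heap limit of the running JVM against requiredMb. */
  public static boolean hasEnoughHeap(long requiredMb) {
    return isEnough(Runtime.getRuntime().maxMemory(), requiredMb);
  }

  public static void main(String[] args) {
    long requiredMb = 200; // assumption: upper figure for running a trained tagger
    long maxMb = Runtime.getRuntime().maxMemory() / BYTES_PER_MB;
    System.out.println("Max heap: " + maxMb + " MB");
    if (!hasEnoughHeap(requiredMb)) {
      System.out.println("Consider raising the heap limit, e.g. java -mx" + requiredMb + "m ...");
    }
  }
}
```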