A custom word segmentation class for Stanford CoreNLP

Stanford CoreNLP's Chinese word segmentation is sometimes unsatisfactory, so we implement a custom segmentation class that we can tailor to our own needs (for example, by plugging in various dictionaries). This post shows how to use IKAnalyzer as CoreNLP's segmenter.

The earlier post 《stanford corenlp的TokensRegex》 mentioned CoreNLP's configuration file CoreNLP-chinese.properties, where the customAnnotatorClass.segment entry specifies the segmentation class. All we need to do is implement our own Annotator modeled on ChineseSegmenterAnnotator and point that entry at it; a minimal registration sketch follows after the class below.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import edu.stanford.nlp.ling.ChineseCoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator;
import edu.stanford.nlp.util.CoreMap;

public class IKSegmenterAnnotator extends ChineseSegmenterAnnotator {

    public IKSegmenterAnnotator() {
        super();
    }

    public IKSegmenterAnnotator(boolean verbose) {
        super(verbose);
    }

    public IKSegmenterAnnotator(String segLoc, boolean verbose) {
        super(segLoc, verbose);
    }

    public IKSegmenterAnnotator(String segLoc, boolean verbose, String serDictionary, String sighanCorporaDict) {
        super(segLoc, verbose, serDictionary, sighanCorporaDict);
    }

    public IKSegmenterAnnotator(String name, Properties props) {
        super(name, props);
    }

    // Segment the text with IKAnalyzer (smart mode); on failure, fall back to
    // returning the whole string as a single token.
    private List<String> splitWords(String str) {
        List<String> words = new ArrayList<String>();
        try {
            IKSegmenter ik = new IKSegmenter(new StringReader(str), true);
            Lexeme lex;
            while ((lex = ik.next()) != null) {
                words.add(lex.getLexemeText());
            }
        } catch (IOException e) {
            e.printStackTrace();
            words.clear();
            words.add(str);
        }
        return words;
    }

    @Override
    public void runSegmentation(CoreMap annotation) {
        String text = annotation.get(CoreAnnotations.TextAnnotation.class);
        // Depending on the CoreNLP version, this may be SegmenterCoreAnnotations.CharactersAnnotation instead.
        List<CoreLabel> sentChars = annotation.get(ChineseCoreAnnotations.CharactersAnnotation.class);
        List<CoreLabel> tokens = new ArrayList<CoreLabel>();
        annotation.set(CoreAnnotations.TokensAnnotation.class, tokens);

        // Segment with IKAnalyzer instead of the CRF segmenter used by the parent class.
        List<String> words = splitWords(text);
        // Debug output: the original text and IK's segmentation of it.
        System.err.println(text);
        System.err.println("--->");
        System.err.println(words);

        // Walk the per-character CoreLabels to recover character offsets for each segmented word.
        int pos = 0;
        for (String w : words) {
            CoreLabel fl = sentChars.get(pos);
            fl.set(CoreAnnotations.ChineseSegAnnotation.class, "1");
            if (w.length() == 0) {
                continue;
            }
            CoreLabel token = new CoreLabel();
            token.setWord(w);
            token.set(CoreAnnotations.CharacterOffsetBeginAnnotation.class,
                    fl.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class));
            pos += w.length();
            fl = sentChars.get(pos - 1);
            token.set(CoreAnnotations.CharacterOffsetEndAnnotation.class,
                    fl.get(CoreAnnotations.CharacterOffsetEndAnnotation.class));
            tokens.add(token);
        }
    }
}
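
To connect this back to the configuration mentioned above, here is a minimal, untested sketch of registering the class and running a pipeline with it. The package name com.example.nlp is only a placeholder for wherever IKSegmenterAnnotator actually lives, the bundled Chinese properties file is assumed to be on the classpath as StanfordCoreNLP-chinese.properties, and because the class still extends ChineseSegmenterAnnotator, its constructor may still expect the segment.* entries from that file.

import java.io.IOException;
import java.util.Properties;

import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class IKSegmenterDemo {
    public static void main(String[] args) throws IOException {
        // Start from the bundled Chinese configuration, then point the "segment"
        // annotator at our own class instead of ChineseSegmenterAnnotator.
        Properties props = new Properties();
        props.load(IOUtils.readerFromString("StanfordCoreNLP-chinese.properties"));
        props.setProperty("annotators", "segment, ssplit");
        // "com.example.nlp" is a placeholder package for IKSegmenterAnnotator.
        props.setProperty("customAnnotatorClass.segment", "com.example.nlp.IKSegmenterAnnotator");

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation doc = new Annotation("斯坦福的中文分词有时不尽如人意。");
        pipeline.annotate(doc);

        // Print the tokens produced by the IK-based segmenter.
        for (CoreLabel token : doc.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.print(token.word() + " | ");
        }
    }
}

With customAnnotatorClass.segment overridden this way, every stage of the pipeline that previously consumed the CRF segmenter's output now sees IKAnalyzer's tokens instead, so downstream annotators such as ssplit, pos, or ner work off the dictionary-driven segmentation.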