Chinese word segmentation in Lucene 3 with mmseg4j

chaocy

Category: lucene. Tags: lucene, deprecated, input, token, api, class

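Lucene 3.x adopts a new API: the methods that were deprecated in the transitional 2.9 release have been removed entirely in 3.0. I did not have much to change, though. Most of the work was fixing the code that deals with `TokenStream`s, and `TokenStream` also seems to be the biggest change in 3.0.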

A new TokenStream API has been introduced with Lucene 2.9. This API has moved from being Token-based to Attribute-based. While Token still exists in 2.9 as a convenience class, the preferred way to store the information of a Token is to use AttributeImpls.

For Chinese word segmentation in Lucene I use mmseg4j 1.8.2. That release also targets Lucene 2.x, so mmseg4j was the first thing to fix.

All of the Lucene-related code lives in the `com.chenlb.mmseg4j.analysis` package, and very little of it needs changing. The main task is to replace `next()` in `MMSegTokenizer` with `boolean incrementToken()`:

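// class MMSegTokenizer (a Lucene Tokenizer)
public MMSegTokenizer(Seg seg, Reader input) {
    super(input);
    mmSeg = new MMSeg(input, seg);
    // Register the attributes once; consumers of this stream read the same instances.
    offsetAtt = addAttribute(OffsetAttribute.class);
    termAtt = addAttribute(TermAttribute.class);
}

@Override
public boolean incrementToken() throws IOException {
    // Reset attribute state left over from the previous token.
    clearAttributes();
    Word word = mmSeg.next();
    if (word != null) {
        // Publish the segmented word through the shared attributes.
        termAtt.setTermBuffer(word.getString());
        offsetAtt.setOffset(word.getStartOffset(), word.getEndOffset());
        return true;
    } else {
        // No more words: signal the end of the stream.
        return false;
    }
}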
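
To make the pattern above easier to see in isolation, here is a minimal sketch in plain Java that does not depend on Lucene or mmseg4j. Every name in it (`Attr`, `TermAttr`, `OffsetAttr`, `AttrSource`, `SpaceTokenizer`) is a hypothetical stand-in invented for illustration, and the real Lucene classes are considerably more involved. It only tries to show the idea that attributes are registered once per type, that producer and consumer share the same instance, and that each call to `incrementToken()` clears and then refills them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy stand-ins, NOT Lucene classes: every name here is hypothetical.
interface Attr { void clear(); }

final class TermAttr implements Attr {
    String term = "";
    public void clear() { term = ""; }
}

final class OffsetAttr implements Attr {
    int start, end;
    public void clear() { start = 0; end = 0; }
}

class AttrSource {
    private final Map<Class<? extends Attr>, Attr> attrs = new LinkedHashMap<>();

    // Returns the single shared instance for a type, creating it on first use.
    <A extends Attr> A addAttr(Class<A> type, Supplier<A> factory) {
        Attr existing = attrs.get(type);
        if (existing == null) {
            existing = factory.get();
            attrs.put(type, existing);
        }
        return type.cast(existing);
    }

    // Resets every registered attribute to its empty state.
    void clearAttrs() {
        for (Attr a : attrs.values()) a.clear();
    }
}

// Splits on spaces and reports each word through the shared attributes.
final class SpaceTokenizer extends AttrSource {
    private final String text;
    private int pos = 0;
    private final TermAttr termAtt = addAttr(TermAttr.class, TermAttr::new);
    private final OffsetAttr offsetAtt = addAttr(OffsetAttr.class, OffsetAttr::new);

    SpaceTokenizer(String text) { this.text = text; }

    boolean incrementToken() {
        clearAttrs();
        while (pos < text.length() && text.charAt(pos) == ' ') pos++; // skip spaces
        if (pos >= text.length()) return false;                        // exhausted
        int start = pos;
        while (pos < text.length() && text.charAt(pos) != ' ') pos++; // scan one word
        termAtt.term = text.substring(start, pos);
        offsetAtt.start = start;
        offsetAtt.end = pos;
        return true;
    }
}
```

A consumer of this toy class would call `addAttr(TermAttr.class, TermAttr::new)` on the tokenizer once, receive the instance the tokenizer itself writes to, and then read it after each successful `incrementToken()`.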

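The old Token-based style may be easier to grasp, but the Attribute-based style seems more concise. Every place that used to call `next()` now has to call `incrementToken()` instead, as in this method: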

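import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

static void printTokenStream(TokenStream ts) throws IOException {
  // Fetch the attribute once; the stream refills it on every incrementToken().
  // getAttribute is generic in 3.0, so no cast is needed.
  TermAttribute termAtt = ts.getAttribute(TermAttribute.class);
  while (ts.incrementToken()) {
      System.out.println(termAtt.term());
  }
}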
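
As a rough illustration of what migrating a consumer loop involves, the sketch below puts the two loop shapes side by side in plain Java. It does not use Lucene: `OldStyleStream`, `NewStyleStream`, and `LoopMigration` are hypothetical names made up for this example, and a `StringBuilder` merely plays the role of a reusable term attribute.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-ins, NOT Lucene classes: every name below is hypothetical.
final class OldStyleStream {
    private final String[] words;
    private int i = 0;
    OldStyleStream(String... words) { this.words = words; }

    // Token-based: hands back one value per call, null when exhausted.
    String next() { return i < words.length ? words[i++] : null; }
}

final class NewStyleStream {
    private final String[] words;
    private int i = 0;
    // Attribute-based: one reusable holder that the consumer fetches once.
    final StringBuilder termAtt = new StringBuilder();
    NewStyleStream(String... words) { this.words = words; }

    boolean incrementToken() {
        termAtt.setLength(0);                 // plays the role of clearAttributes()
        if (i >= words.length) return false;  // end of stream
        termAtt.append(words[i++]);
        return true;
    }
}

final class LoopMigration {
    // Old loop shape: pull a token, test it for null, repeat.
    static List<String> drainOld(OldStyleStream ts) {
        List<String> out = new ArrayList<>();
        for (String t = ts.next(); t != null; t = ts.next()) out.add(t);
        return out;
    }

    // New loop shape: fetch the holder once, then advance until false.
    static List<String> drainNew(NewStyleStream ts) {
        List<String> out = new ArrayList<>();
        StringBuilder term = ts.termAtt;
        while (ts.incrementToken()) {
            out.add(term.toString());         // copy, because the holder is reused
        }
        return out;
    }
}
```

One design point worth noting: because the holder is overwritten on every step, a consumer that wants to keep a term beyond the current iteration has to copy it, as `drainNew` does with `toString()`.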
