Lucene 4.10 Tutorial (Part 5): Lucene's Analyzers

一饼团队

Article link: https://blog.csdn.net/seven_zhao/article/details/42707267

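Lucene's built-in analyzers do include a Chinese one, but it splits text into single characters, so it is of little use in a real project. To use Lucene with Chinese text in a project, you therefore need to add a third-party Chinese analyzer such as IK, mmseg4j, or paoding. Comparisons of these analyzers and the situations each one suits are easy to find on Google and are not our focus here. Using a Chinese analyzer is simple: wherever an analyzer is passed in, substitute the Chinese one. For example, in IndexWriterConfig iwConfig = new IndexWriterConfig(luceneVersion, new StandardAnalyzer()); replace new StandardAnalyzer() with new MMSegAnalyzer().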
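
To make the difference concrete, here is a minimal plain-Java sketch (no Lucene dependency) that contrasts single-character splitting with a toy dictionary-based segmenter. The class name SegmentationSketch, the two-word dictionary, and the forward-maximum-matching strategy are all assumptions made for illustration; this is not how IK, mmseg4j, or paoding are actually implemented, only a rough idea of what a dictionary buys you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: per-character splitting versus a toy dictionary-based segmenter.
final class SegmentationSketch {

    // One token per char, which is what a single-character analyzer produces
    // (assumes the text contains no surrogate pairs).
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Forward maximum matching: at each position take the longest dictionary word,
    // falling back to a single character when nothing matches.
    static List<String> forwardMaxMatch(String text, Set<String> dictionary, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this file ASCII-only; the text is the five-character
        // Chinese phrase "zhong wen fen ci qi" (Chinese analyzer).
        String text = "\u4e2d\u6587\u5206\u8bcd\u5668";
        // Hypothetical dictionary holding two words: "zhong wen" and "fen ci qi".
        Set<String> dictionary = new HashSet<>(Arrays.asList("\u4e2d\u6587", "\u5206\u8bcd\u5668"));
        System.out.println(perCharacter(text));
        System.out.println(forwardMaxMatch(text, dictionary, 3));
    }
}
```

For the input 中文分词器, the first call yields five one-character tokens, while the second yields the two words 中文 and 分词器. A search for 分词器 then matches a real term instead of three unrelated characters.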

In most cases you do not need to write your own analyzer; a third-party Chinese analyzer will meet the requirements.

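When there are special requirements, however, you still have to write your own. To write a custom Chinese analyzer, you can adapt the Chinese tokenizer in the cn package of lucene-analyzers-common-4.10.2.jar so that it fits those requirements. Suppose we need every character in a document to become its own token; the core code is then as follows: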

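@Override
public boolean incrementToken() throws IOException {
    clearAttributes();

    length = 0;
    start = offset;

    while (true) {
        final char c;
        offset++;

        // Refill the I/O buffer once it has been fully consumed.
        if (bufferIndex >= dataLen) {
            dataLen = input.read(ioBuffer);
            bufferIndex = 0;
        }

        if (dataLen == -1) {
            // End of input: undo the offset increment and emit whatever is buffered.
            offset--;
            return flush();
        } else {
            c = ioBuffer[bufferIndex++];
        }

        switch (Character.getType(c)) {

        case Character.DECIMAL_DIGIT_NUMBER:
        case Character.LOWERCASE_LETTER:
        case Character.UPPERCASE_LETTER:
            // Tokenize letter by letter: the accumulation below is commented out,
            // so digits and letters deliberately fall through to OTHER_LETTER
            // and are emitted one character at a time.
            // push(c);
            // if (length == MAX_WORD_LEN) return flush();
            // break;

        case Character.OTHER_LETTER:
            if (length > 0) {
                // A token is already buffered: step back one character and emit it first.
                bufferIndex--;
                offset--;
                return flush();
            }
            push(c);
            return flush();

        default:
            // Separator: emit the buffered token if there is one, otherwise skip it.
            if (length > 0) return flush();
            break;
        }
    }
}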
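
The effect of commenting out those three lines can be seen in a small plain-Java replay of the same switch logic, run over a String instead of a Lucene Reader. This is only a sketch under assumptions: PerCharTokenizeSketch, Token, and the groupLatin flag are hypothetical names, and the sketch leaves out details of push() and flush() that the excerpt does not show, such as MAX_WORD_LEN and any case normalization.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: replays the switch logic on a String instead of a Lucene Reader.
final class PerCharTokenizeSketch {

    // A token with its start offset (inclusive) and end offset (exclusive).
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "[" + start + "," + end + ")";
        }
    }

    // groupLatin = true : digits and cased letters accumulate into one token
    //                     (the three lines left active)
    // groupLatin = false: they fall through and are emitted one by one
    //                     (the three lines commented out)
    static List<Token> tokenize(String text, boolean groupLatin) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int pendingStart = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int type = Character.getType(c);
            boolean latin = type == Character.DECIMAL_DIGIT_NUMBER
                    || type == Character.LOWERCASE_LETTER
                    || type == Character.UPPERCASE_LETTER;
            if (latin && groupLatin) {
                if (pending.length() == 0) pendingStart = i;
                pending.append(c);
                continue;
            }
            // Any other character first ends a pending run of digits and letters.
            if (pending.length() > 0) {
                tokens.add(new Token(pending.toString(), pendingStart, i));
                pending.setLength(0);
            }
            if (latin || type == Character.OTHER_LETTER) {
                tokens.add(new Token(String.valueOf(c), i, i + 1));
            }
            // Everything else (spaces, punctuation) is a separator and is skipped.
        }
        if (pending.length() > 0) {
            tokens.add(new Token(pending.toString(), pendingStart, text.length()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "ab1" followed by a space and two Chinese characters (written as escapes).
        String text = "ab1 \u4e2d\u6587";
        System.out.println(tokenize(text, true));
        System.out.println(tokenize(text, false));
    }
}
```

With grouping enabled, ab1 comes out as a single token followed by the two Chinese characters; with grouping disabled, as in the modified tokenizer, a, b, and 1 each become their own token, so the input produces five tokens in total.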

In addition, the analyzer has to specify which Tokenizer it uses. Part of the code is shown below:

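@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Dictionary loaded from a local data directory; adjust the path to your environment.
    // Note that dic is not used further in this excerpt.
    Dictionary dic = Dictionary.getInstance("/Users/ChinaMWorld/Desktop/WorkSpace/Johnny/lucene/lucene/src/main/java/com/johnny/lucene03/analyzer/data/");
    // The custom tokenizer is the source of the token stream.
    Tokenizer source = new MySameTokenizer(reader);
    // Wrap the tokenizer in the custom filter. csw is not defined in this excerpt;
    // it is presumably a field of the analyzer class.
    return new TokenStreamComponents(source, new MySameFilter(source, csw));
}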
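
The shape of this wiring, a source tokenizer wrapped by a filter and both chosen by the analyzer, can be sketched in plain Java without Lucene. Everything here is a hypothetical stand-in: AnalyzerChainSketch, source, lowerCaseFilter, and analyze are illustrative names, and lowercasing is an arbitrary example of what a filter might do, not what MySameFilter does.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Illustration only: plain-Java stand-ins for a tokenizer wrapped by a filter.
final class AnalyzerChainSketch {

    // Stand-in for the tokenizer: produces the raw tokens, here one per letter or digit.
    static Iterator<String> source(String text) {
        List<String> tokens = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens.iterator();
    }

    // Stand-in for the filter: wraps another token stream and rewrites each token.
    // Lowercasing is a hypothetical choice made only for this example.
    static Iterator<String> lowerCaseFilter(final Iterator<String> input) {
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return input.hasNext();
            }

            @Override
            public String next() {
                return input.next().toLowerCase(Locale.ROOT);
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    // Stand-in for createComponents: the analyzer decides which source and
    // which filter are chained together, then the chain is consumed token by token.
    static List<String> analyze(String text) {
        Iterator<String> chain = lowerCaseFilter(source(text));
        List<String> out = new ArrayList<>();
        while (chain.hasNext()) {
            out.add(chain.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Ab, C"));
    }
}
```

Swapping the tokenizer or the filter only touches analyze, which is the same reason the real createComponents is the single place where the analyzer names its Tokenizer.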