Implementing a Custom Tokenizer in Lucene

1. Tokenization splits on spaces, hyphens, and dots;

2. "hi" and "hello" are treated as synonyms when searching;

3. Synonym matches for "hi"/"hello" are highlighted as well.
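Before diving into the Lucene code, rule 1 (split on space, hyphen, and dot) can be illustrated with the standard library alone. A minimal sketch; the class and method names are illustrative, not part of the article's code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SplitDemo {
    // Split on runs of space, dot, and hyphen, dropping empty pieces —
    // the same delimiter set MyTokenizer uses below.
    static List<String> split(String text) {
        return Arrays.stream(text.split("[ .\\-]+"))
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("look hello-world on lucene.apache"));
        // [look, hello, world, on, lucene, apache]
    }
}
```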

MyAnalyzer implementation:

    public class MyAnalyzer extends Analyzer {
        // 0 = index/highlight mode (inject synonyms), 1 = search mode (no synonyms)
        private final int analyzerType;

        public MyAnalyzer(int type) {
            super();
            analyzerType = type;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            MyTokenizer tokenizer = new MyTokenizer(fieldName, reader, analyzerType);
            return new TokenStreamComponents(tokenizer);
        }
    }


MyTokenizer implementation:

    public class MyTokenizer extends Tokenizer {
        // A token: its surface form plus its offsets into the original input.
        public class WordUnit {
            WordUnit(String word, int start, int length) {
                this.word = word;
                this.start = start;
                this.length = length;
            }

            String word;
            int start;
            int length;
        }

        private int analyzerType;
        private int endPosition;
        private Iterator<WordUnit> it;
        private ArrayList<WordUnit> words;

        private final CharTermAttribute termAtt;
        private final OffsetAttribute offsetAtt;

        public MyTokenizer(String fieldName, Reader in, int type) {
            super(in);

            it = null;
            endPosition = 0;
            analyzerType = type;
            offsetAtt = addAttribute(OffsetAttribute.class);
            termAtt = addAttribute(CharTermAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            clearAttributes();

            // On the first call, read the input and split it into words.
            // (A single 1024-char read is enough for this demo; production
            // code should loop until the reader is exhausted.)
            char[] inputBuf = new char[1024];
            if (it == null) {
                int bufSize = input.read(inputBuf);
                if (bufSize <= 0) return false;

                int beginIndex = 0;
                int endIndex = 0;
                words = new ArrayList<WordUnit>();
                for (endIndex = 0; endIndex < bufSize; endIndex++) {
                    if (inputBuf[endIndex] != '-' && inputBuf[endIndex] != ' ' && inputBuf[endIndex] != '.') continue;

                    addWord(inputBuf, beginIndex, endIndex);
                    beginIndex = endIndex + 1;
                }
                addWord(inputBuf, beginIndex, endIndex); // add the last word

                if (words.isEmpty()) return false;
                it = words.iterator();
            }

            // Emit the next buffered token.
            if (it.hasNext()) {
                WordUnit word = it.next();
                termAtt.append(word.word);
                termAtt.setLength(word.word.length());

                endPosition = word.start + word.length;
                offsetAtt.setOffset(correctOffset(word.start), correctOffset(endPosition));

                return true;
            }

            return false;
        }

        @Override
        public void reset() throws IOException {
            super.reset();

            it = null;
            endPosition = 0;
        }

        @Override
        public final void end() {
            int finalOffset = correctOffset(this.endPosition);
            offsetAtt.setOffset(finalOffset, finalOffset);
        }

        private void addWord(char[] inputBuf, int begin, int end) {
            if (end <= begin) return;

            String word = new String(inputBuf, begin, end - begin);
            words.add(new WordUnit(word, begin, end - begin));

            // In index/highlight mode (type 0), inject the synonym alongside
            // the original word. The synonym deliberately reuses the ORIGINAL
            // word's span ("hi" covers 2 chars, "hello" covers 5) so the
            // highlighter wraps the text actually present in the document.
            if (analyzerType == 0 && word.equals("hi")) words.add(new WordUnit("hello", begin, 2));
            if (analyzerType == 0 && word.equals("hello")) words.add(new WordUnit("hi", begin, 5));
        }
    }
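The core splitting-plus-synonym logic of `incrementToken`/`addWord` can be exercised outside Lucene. A minimal plain-Java sketch of the same algorithm (class and method names here are illustrative, not from the article's code):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {
    // A token: surface form plus the start/length it should highlight,
    // expressed against the ORIGINAL text.
    static final class Word {
        final String word;
        final int start;
        final int length;

        Word(String word, int start, int length) {
            this.word = word;
            this.start = start;
            this.length = length;
        }
    }

    // Mirrors MyTokenizer with analyzerType == 0: split on ' ', '-', '.'
    // and inject the hi/hello synonym, giving the synonym the offsets of
    // the source word so highlighting still wraps the document's own text.
    static List<Word> tokenize(String text) {
        List<Word> words = new ArrayList<>();
        int begin = 0;
        for (int end = 0; end <= text.length(); end++) {
            boolean atDelim = end == text.length()
                    || text.charAt(end) == ' '
                    || text.charAt(end) == '-'
                    || text.charAt(end) == '.';
            if (!atDelim) continue;
            if (end > begin) {
                String w = text.substring(begin, end);
                words.add(new Word(w, begin, end - begin));
                if (w.equals("hi"))    words.add(new Word("hello", begin, 2));
                if (w.equals("hello")) words.add(new Word("hi", begin, 5));
            }
            begin = end + 1;
        }
        return words;
    }

    public static void main(String[] args) {
        for (Word w : tokenize("I am hi China")) {
            System.out.println(w.word + " [" + w.start + "," + (w.start + w.length) + ")");
        }
        // "hi" at [5,7) is followed by the injected "hello", also at [5,7)
    }
}
```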

Analyzer type used when indexing: analyzerType=0;

Analyzer type used when searching: analyzerType=1;

Analyzer type used when highlighting: analyzerType=0;
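This split is what makes the synonym highlighting work: synonyms are expanded at index and highlight time (type 0) but not at query time (type 1), and because the injected synonym token carries the source word's offsets, the highlighter's `<em>` tags wrap the text actually stored in the document. A self-contained sketch of that last step (the helper name is illustrative; in real code the offsets would come from `OffsetAttribute`):

```java
public class HighlightSketch {
    // Wrap the matched span of the ORIGINAL text in <em> tags. When a
    // query for "hello" matches the injected synonym token emitted for
    // "hi", that token's offsets point at the literal "hi", so the
    // highlight lands on the document's own text.
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<em>" + text.substring(start, end) + "</em>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        // The "hello" synonym emitted for "hi" carries offsets (5, 7):
        System.out.println(highlight("I am hi China Chinese", 5, 7));
        // I am <em>hi</em> China Chinese
    }
}
```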

Searching for hello produces:

    Score doc 0 highlighted to: look <em>hello</em> on  
    Score doc 1 highlighted to: I am <em>hi</em> China Chinese  


As you can see, the document containing "hi" is also returned by the search, and it is highlighted as well.