Implementing a Custom Tokenizer in Lucene

1. Tokenization splits on spaces, hyphens, and dots;

2. "hi" and "hello" are treated as synonyms when searching;

3. Synonym matches for "hi"/"hello" are highlighted as well.
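Before diving into the Lucene code, rule 1 (split on space, hyphen, and dot) can be illustrated with the standard library alone. A minimal sketch; the class and method names are illustrative, not part of the article's code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SplitDemo {
    // Split on runs of space, dot, and hyphen, dropping empty pieces —
    // the same delimiter set MyTokenizer uses below.
    static List<String> split(String text) {
        return Arrays.stream(text.split("[ .\\-]+"))
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("look hello-world on lucene.apache"));
        // [look, hello, world, on, lucene, apache]
    }
}
```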

MyAnalyzer implementation:

    public class MyAnalyzer extends Analyzer {
        // 0 = index/highlight mode (inject synonyms), 1 = search mode (no synonyms)
        private final int analyzerType;

        public MyAnalyzer(int type) {
            super();
            analyzerType = type;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            MyTokenizer tokenizer = new MyTokenizer(fieldName, reader, analyzerType);
            return new TokenStreamComponents(tokenizer);
        }
    }


MyTokenizer implementation:

    public class MyTokenizer extends Tokenizer {
        // A token: its surface form plus its offsets into the original input.
        public class WordUnit {
            WordUnit(String word, int start, int length) {
                this.word = word;
                this.start = start;
                this.length = length;
            }

            String word;
            int start;
            int length;
        }

        private int analyzerType;
        private int endPosition;
        private Iterator<WordUnit> it;
        private ArrayList<WordUnit> words;

        private final CharTermAttribute termAtt;
        private final OffsetAttribute offsetAtt;

        public MyTokenizer(String fieldName, Reader in, int type) {
            super(in);

            it = null;
            endPosition = 0;
            analyzerType = type;
            offsetAtt = addAttribute(OffsetAttribute.class);
            termAtt = addAttribute(CharTermAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            clearAttributes();

            // On the first call, read the input and split it into words.
            // (A single 1024-char read is enough for this demo; production
            // code should loop until the reader is exhausted.)
            char[] inputBuf = new char[1024];
            if (it == null) {
                int bufSize = input.read(inputBuf);
                if (bufSize <= 0) return false;

                int beginIndex = 0;
                int endIndex = 0;
                words = new ArrayList<WordUnit>();
                for (endIndex = 0; endIndex < bufSize; endIndex++) {
                    if (inputBuf[endIndex] != '-' && inputBuf[endIndex] != ' ' && inputBuf[endIndex] != '.') continue;

                    addWord(inputBuf, beginIndex, endIndex);
                    beginIndex = endIndex + 1;
                }
                addWord(inputBuf, beginIndex, endIndex); // add the last word

                if (words.isEmpty()) return false;
                it = words.iterator();
            }

            // Emit the next buffered token.
            if (it.hasNext()) {
                WordUnit word = it.next();
                termAtt.append(word.word);
                termAtt.setLength(word.word.length());

                endPosition = word.start + word.length;
                offsetAtt.setOffset(correctOffset(word.start), correctOffset(endPosition));

                return true;
            }

            return false;
        }

        @Override
        public void reset() throws IOException {
            super.reset();

            it = null;
            endPosition = 0;
        }

        @Override
        public final void end() {
            int finalOffset = correctOffset(this.endPosition);
            offsetAtt.setOffset(finalOffset, finalOffset);
        }

        private void addWord(char[] inputBuf, int begin, int end) {
            if (end <= begin) return;

            String word = new String(inputBuf, begin, end - begin);
            words.add(new WordUnit(word, begin, end - begin));

            // In index/highlight mode (type 0), inject the synonym alongside
            // the original word. The synonym deliberately reuses the ORIGINAL
            // word's span ("hi" covers 2 chars, "hello" covers 5) so the
            // highlighter wraps the text actually present in the document.
            if (analyzerType == 0 && word.equals("hi")) words.add(new WordUnit("hello", begin, 2));
            if (analyzerType == 0 && word.equals("hello")) words.add(new WordUnit("hi", begin, 5));
        }
    }
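The core splitting-plus-synonym logic of `incrementToken`/`addWord` can be exercised outside Lucene. A minimal plain-Java sketch of the same algorithm (class and method names here are illustrative, not from the article's code):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {
    // A token: surface form plus the start/length it should highlight,
    // expressed against the ORIGINAL text.
    static final class Word {
        final String word;
        final int start;
        final int length;

        Word(String word, int start, int length) {
            this.word = word;
            this.start = start;
            this.length = length;
        }
    }

    // Mirrors MyTokenizer with analyzerType == 0: split on ' ', '-', '.'
    // and inject the hi/hello synonym, giving the synonym the offsets of
    // the source word so highlighting still wraps the document's own text.
    static List<Word> tokenize(String text) {
        List<Word> words = new ArrayList<>();
        int begin = 0;
        for (int end = 0; end <= text.length(); end++) {
            boolean atDelim = end == text.length()
                    || text.charAt(end) == ' '
                    || text.charAt(end) == '-'
                    || text.charAt(end) == '.';
            if (!atDelim) continue;
            if (end > begin) {
                String w = text.substring(begin, end);
                words.add(new Word(w, begin, end - begin));
                if (w.equals("hi"))    words.add(new Word("hello", begin, 2));
                if (w.equals("hello")) words.add(new Word("hi", begin, 5));
            }
            begin = end + 1;
        }
        return words;
    }

    public static void main(String[] args) {
        for (Word w : tokenize("I am hi China")) {
            System.out.println(w.word + " [" + w.start + "," + (w.start + w.length) + ")");
        }
        // "hi" at [5,7) is followed by the injected "hello", also at [5,7)
    }
}
```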

Analyzer type used when indexing: analyzerType=0;

Analyzer type used when searching: analyzerType=1;

Analyzer type used when highlighting: analyzerType=0;
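This split is what makes the synonym highlighting work: synonyms are expanded at index and highlight time (type 0) but not at query time (type 1), and because the injected synonym token carries the source word's offsets, the highlighter's `<em>` tags wrap the text actually stored in the document. A self-contained sketch of that last step (the helper name is illustrative; in real code the offsets would come from `OffsetAttribute`):

```java
public class HighlightSketch {
    // Wrap the matched span of the ORIGINAL text in <em> tags. When a
    // query for "hello" matches the injected synonym token emitted for
    // "hi", that token's offsets point at the literal "hi", so the
    // highlight lands on the document's own text.
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<em>" + text.substring(start, end) + "</em>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        // The "hello" synonym emitted for "hi" carries offsets (5, 7):
        System.out.println(highlight("I am hi China Chinese", 5, 7));
        // I am <em>hi</em> China Chinese
    }
}
```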

Searching for hello produces:

    Score doc 0 highlighted to: look <em>hello</em> on  
    Score doc 1 highlighted to: I am <em>hi</em> China Chinese  


As you can see, the document containing "hi" is also returned by the search, and it is highlighted as well.