Lucene 3.0 Self-Study (8): Filter

sustbeckham

Article link: https://blog.csdn.net/sustbeckham/article/details/5810109

TokenFilter is, I think, easy to understand: it filters out the tokens you do not need.

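For example, suppose tokenization produces the following:

[ what are you doing man ]

We may decide that the three words what, are, and you are too common to be meaningful in a search. They can then be removed during analysis, before anything is looked up, so that the index actually stores information only about [ doing man ]. That job is handed to a Filter.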
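The idea can be sketched in plain Java, without any Lucene classes. This is only an illustration of the filtering step: the class name `StopWordSketch`, the method `removeStopWords`, and the stop-word set are assumptions made for this example and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; Lucene's own filters work on a TokenStream.
class StopWordSketch {

    // Assumed stop-word set, taken from the example in the text.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("what", "are", "you"));

    // Returns the tokens that are not stop words, in their original order.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("what", "are", "you", "doing", "man");
        System.out.println(removeStopWords(tokens)); // prints [doing, man]
    }
}
```

A real TokenFilter does the same thing lazily, one token at a time, instead of building a list first.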

Below is LengthFilter, which removes tokens whose length falls outside the allowed range:

    package org.apache.lucene.analysis;

    import java.io.IOException;

    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    /**
     * Removes words that are too long or too short from the stream.
     */
    public final class LengthFilter extends TokenFilter {

      final int min;
      final int max;

      private TermAttribute termAtt;

      /**
       * Build a filter that removes words that are too long or too
       * short from the text.
       */
      public LengthFilter(TokenStream in, int min, int max) {
        super(in);
        this.min = min;
        this.max = max;
        termAtt = addAttribute(TermAttribute.class);
      }

      /**
       * Returns the next input Token whose term() is the right len
       */
      @Override
      public final boolean incrementToken() throws IOException {
        // return the first non-stop word found
        while (input.incrementToken()) {
          int len = termAtt.termLength();
          if (len >= min && len <= max) {
            return true;
          }
          // note: else we ignore it but should we index each part of it?
        }
        // reached EOS -- return false
        return false;
      }
    }

This is easy to follow. Here is an example that uses it:

    package com.fpi.lucene.studying.test;

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    import org.apache.lucene.analysis.LengthFilter;
    import org.apache.lucene.analysis.LetterTokenizer;

    public class JustTest2 {
        public static void main(String[] args) throws IOException {
            Reader read = new StringReader("what are you doing man?it's none of your bussiness!");
            // LetterTokenizer splits the text at every non-letter character
            LetterTokenizer token = new LetterTokenizer(read);
            // LengthFilter with a minimum length of 3 and a maximum length of 8
            LengthFilter filter = new LengthFilter(token, 3, 8);
            // Print the attributes of each token that passes the filter
            while (filter.incrementToken()) {
                System.out.println(filter.toString());
            }
        }
    }

Output:

(startOffset=0,endOffset=4,term=what)

(startOffset=5,endOffset=8,term=are)

(startOffset=9,endOffset=12,term=you)

(startOffset=13,endOffset=18,term=doing)

(startOffset=19,endOffset=22,term=man)

(startOffset=28,endOffset=32,term=none)

(startOffset=36,endOffset=40,term=your)

OK, every token longer than 8 characters or shorter than 3 has been removed: it, s, of, and bussiness no longer appear.
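To double-check which tokens are dropped, the same length test can be reproduced in plain Java. This is a rough sketch under stated assumptions: splitting on non-letter characters stands in for LetterTokenizer (ASCII letters only here), and the names `LengthCheckSketch` and `keepByLength` are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for LetterTokenizer + LengthFilter; not Lucene code.
class LengthCheckSketch {

    // Splits on runs of non-letters and keeps the tokens
    // whose length lies within [min, max], both bounds inclusive.
    static List<String> keepByLength(String text, int min, int max) {
        List<String> kept = new ArrayList<String>();
        for (String token : text.split("[^A-Za-z]+")) {
            int len = token.length();
            if (len >= min && len <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "what are you doing man?it's none of your bussiness!";
        System.out.println(keepByLength(text, 3, 8));
        // prints [what, are, you, doing, man, none, your]
    }
}
```

Both bounds are inclusive, matching the `len >= min && len <= max` check in LengthFilter, which is why the three-letter tokens survive.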

The other filters work in a similar way.