The project uses Lucene to build its index, but the company has its own word segmenter, and to stay consistent with the other modules the segmenter had to be integrated into Lucene. There are plenty of examples of this online, but many of them are incomplete, so I am posting a complete one here. The idea is simple: run the text through our own segmenter, join the resulting tokens with spaces, and then use Lucene's WhitespaceTokenizer to split them apart again.
The code is as follows:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;

import com.xx.xx.liantong.core.token.PreProcess;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

/**
 * Analyzer that delegates segmentation to the in-house tokenizer service,
 * joins the tokens with spaces, and re-splits them with Lucene's
 * WhitespaceTokenizer.
 *
 * @author xyl
 */
public final class SgiAnalyzer extends Analyzer {

    private PreProcess processor = null;

    public void setProcessor(PreProcess processor) {
        this.processor = processor;
    }

    /** Drains the Reader into a single String. */
    public String readerToString(Reader reader) throws IOException {
        BufferedReader br = new BufferedReader(reader);
        String sReader = null;
        StringBuffer sBuffer = new StringBuffer();
        while ((sReader = br.readLine()) != null) {
            sBuffer.append(sReader);
        }
        return sBuffer.toString();
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        String inputString = null;
        try {
            inputString = readerToString(reader);
        } catch (IOException e) {
            e.printStackTrace();
        }
        try {
            // Segment the text with the custom tokenizer service.
            List<String> resultList = processor.tokenize(inputString);
            if ((resultList == null) || (resultList.size() < 1)) {
                return null;
            }
            // Join the tokens with spaces, then let WhitespaceTokenizer
            // split them back out as a TokenStream.
            String words = StringUtils.join(resultList, " ");
            TokenStream results = new WhitespaceTokenizer(Version.LUCENE_36,
                    new StringReader(words));
            return results;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
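For completeness, here is a minimal sketch of how the analyzer might be wired into an IndexWriter under Lucene 3.6. This is not from the original project: the RAMDirectory, the field name "content", the sample text, and the way PreProcess is constructed are all assumptions for illustration.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SgiAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        SgiAnalyzer analyzer = new SgiAnalyzer();
        // Hypothetical construction; the real client would be wired up
        // against the thrift tokenizer service.
        analyzer.setProcessor(new PreProcess());

        Directory dir = new RAMDirectory();
        IndexWriterConfig config =
                new IndexWriterConfig(Version.LUCENE_36, analyzer);
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        doc.add(new Field("content", "some text to index",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}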
Here PreProcess is the client of our word-segmentation service: we wrapped the segmenter as a thrift service, on top of which fairly customized behavior can be built. Inside the tokenize method you can apply your own rules, such as stop-word removal; a sketch of what that might look like follows below.
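Since PreProcess is an internal class, the following is purely a hypothetical sketch of the contract SgiAnalyzer relies on. Only the class name and the tokenize signature are taken from the code above; the stop-word list and the whitespace split standing in for the real thrift call are invented for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the proprietary thrift-backed client.
public class PreProcess {

    // Example stop words; the real service would carry its own rules.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("的", "了", "和"));

    public List<String> tokenize(String input) {
        // In the real implementation this call would go over thrift to the
        // segmenter; here we simply split on whitespace to show the shape
        // of the input/output contract.
        List<String> result = new ArrayList<String>();
        for (String token : input.split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }
}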