This post walks through a simple Lucene analyzer, written against the 4.x API. The analyzer splits an input such as "abcd" into the tokens a, ab, abc, abcd. It is meant for search suggestions on contact names, article titles, application names, and the like: as the user types the first few characters (or the first few Pinyin letters), matching suggestions appear. For example, typing "liug" in a contact list can suggest contacts such as 刘刚 (Liu Gang) and 刘钢 (Liu Gang), which requires cutting the input one character at a time. The source code is on GitHub.
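To make the splitting scheme concrete before diving into the Lucene classes, here is a minimal sketch in plain Java (no Lucene involved; the class and method names are made up for this illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixDemo {
    // Emit every prefix of the input, shortest first: "abcd" -> a, ab, abc, abcd
    static List<String> splitPrefixes(String input) {
        List<String> prefixes = new ArrayList<>();
        for (int end = 1; end <= input.length(); end++) {
            prefixes.add(input.substring(0, end));
        }
        return prefixes;
    }

    public static void main(String[] args) {
        System.out.println(splitPrefixes("abcd")); // [a, ab, abc, abcd]
    }
}
```

Indexing these prefixes as separate terms is what lets a plain term query on a partial input like "liug" match the full stored name.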
I. Classes
There are only three classes: the main class, the TokenizerFactory, and the Tokenizer.
1. PrefixAnalyzer: since no filters are needed, the tokenizer alone does all the work.
public final class PrefixAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName,
            Reader reader) {
        final Tokenizer source = new PrefixTokenizer(reader);
        return new TokenStreamComponents(source);
    }
}
2. PrefixTokenizer: cut the input characters into prefixes of length 1, 2, ... up to the full length, and emit each one as a token.
public final class PrefixTokenizer extends Tokenizer {

    public PrefixTokenizer(Reader in) {
        super(in);
    }

    public PrefixTokenizer(AttributeSource source, Reader in) {
        super(source, in);
    }

    public PrefixTokenizer(AttributeFactory factory, Reader in) {
        super(factory, in);
    }

    private int offset = 0, dataLen = 0;
    private final static int MAX_WORD_LEN = 255;
    private final static int IO_BUFFER_SIZE = 1024;
    private final char[] ioBuffer = new char[IO_BUFFER_SIZE];
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        // Read from the input stream; once the stream has been consumed,
        // the next read sets dataLen to -1
        if (offset >= dataLen) {
            dataLen = input.read(ioBuffer);
            offset = 0;
        }
        if (dataLen == -1 || dataLen > MAX_WORD_LEN) {
            return false;
        }
        // Advance the offset by one and emit the prefix [0, offset) as the next term
        offset++;
        termAtt.copyBuffer(ioBuffer, 0, offset);
        offsetAtt.setOffset(correctOffset(0), correctOffset(offset));
        return true;
    }

    @Override
    public final void end() throws IOException {
        super.end();
        // set final offset
        final int finalOffset = correctOffset(offset);
        this.offsetAtt.setOffset(finalOffset, finalOffset);
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        offset = dataLen = 0;
    }
}
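To see how the buffer bookkeeping in incrementToken() plays out, the loop can be simulated outside Lucene. This sketch replaces the attribute calls with a plain list and drops the MAX_WORD_LEN guard for brevity; the class and method names are invented for the illustration:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenizeLoop {
    // Mirrors the incrementToken() loop: fill the buffer once, then emit
    // prefixes [0, offset) until offset catches up with dataLen; the next
    // read returns -1 and the loop ends.
    static List<String> collect(Reader in) throws IOException {
        List<String> terms = new ArrayList<>();
        char[] ioBuffer = new char[1024];
        int offset = 0, dataLen = 0;
        while (true) {
            if (offset >= dataLen) {
                dataLen = in.read(ioBuffer);
                offset = 0;
            }
            if (dataLen == -1) {
                return terms; // stream exhausted, no more tokens
            }
            offset++;
            terms.add(new String(ioBuffer, 0, offset));
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(collect(new StringReader("liug")));
        // [l, li, liu, liug]
    }
}
```

Note that this scheme assumes short suggestion inputs: if the input exceeds the buffer size, a second read restarts the prefixes from the new buffer's beginning, which is fine for names and titles but not for long text.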