An Analyzer is built from two core components: a Tokenizer and TokenFilters. The difference between them is that the former processes the stream at the character level, while the latter processes it at the token level. The Tokenizer is the first stage of an Analyzer, and its constructor takes a Reader; a TokenFilter is an interceptor-like wrapper whose constructor takes the TokenStream it decorates (a Tokenizer or another TokenFilter). First, the class diagram:
As the class diagram shows, the AttributeSource and TokenStream parts are the same as in the previous post; what's new is the hierarchy below Tokenizer.
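Before reading the source, here is a minimal sketch of how the two components compose. This is my own illustration, assuming the Lucene 4.x-era API used throughout this post (Tokenizers constructed from a Reader), with WhitespaceTokenizer and LowerCaseFilter standing in for any concrete Tokenizer/TokenFilter pair:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ChainDemo {
  public static void main(String[] args) throws IOException {
    // the Tokenizer stage works on characters: it consumes a Reader
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_46, new StringReader("Hello Lucene World"));
    // the TokenFilter stage works on tokens: it wraps another TokenStream
    TokenStream chain = new LowerCaseFilter(Version.LUCENE_46, source);
    CharTermAttribute term = chain.addAttribute(CharTermAttribute.class);
    chain.reset();
    while (chain.incrementToken()) {
      System.out.println(term); // prints: hello / lucene / world
    }
    chain.end();
    chain.close();
  }
}

The filter never sees the Reader; it only sees the tokens the tokenizer emits, which is exactly the character-level vs. token-level division of labor described above. Now, the Tokenizer source: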
package org.apache.lucene.analysis;

import org.apache.lucene.util.AttributeSource;

import java.io.Reader;
import java.io.IOException;

public abstract class Tokenizer extends TokenStream {
  /** The text source for this Tokenizer. */
  protected Reader input = ILLEGAL_STATE_READER;

  /** Pending reader: not actually assigned to input until reset() */
  private Reader inputPending = ILLEGAL_STATE_READER;

  /** Construct a token stream processing the given input. */
  protected Tokenizer(Reader input) {
    if (input == null) {
      throw new NullPointerException("input must not be null");
    }
    this.inputPending = input;
  }

  /** Construct a token stream processing the given input using the given AttributeFactory. */
  protected Tokenizer(AttributeFactory factory, Reader input) {
    super(factory);
    if (input == null) {
      throw new NullPointerException("input must not be null");
    }
    this.inputPending = input;
  }

  @Override
  public void close() throws IOException {
    input.close();
    // LUCENE-2387: don't hold onto Reader after close, so
    // GC can reclaim
    inputPending = input = ILLEGAL_STATE_READER;
  }

  protected final int correctOffset(int currentOff) {
    return (input instanceof CharFilter) ? ((CharFilter) input).correctOffset(currentOff) : currentOff;
  }

  public final void setReader(Reader input) throws IOException {
    if (input == null) {
      throw new NullPointerException("input must not be null");
    } else if (this.input != ILLEGAL_STATE_READER) {
      throw new IllegalStateException("TokenStream contract violation: close() call missing");
    }
    this.inputPending = input;
    assert setReaderTestPoint();
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    input = inputPending;
    inputPending = ILLEGAL_STATE_READER;
  }

  // only used by assert, for testing
  boolean setReaderTestPoint() {
    return true;
  }

  private static final Reader ILLEGAL_STATE_READER = new Reader() {
    @Override
    public int read(char[] cbuf, int off, int len) {
      throw new IllegalStateException("TokenStream contract violation: reset()/close() call missing, " +
          "reset() called multiple times, or subclass does not call super.reset(). " +
          "Please see Javadocs of TokenStream class for more information about the correct consuming workflow.");
    }

    @Override
    public void close() {}
  };
}
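It is worth seeing the contract that ILLEGAL_STATE_READER enforces: input becomes usable only after reset(), and setReader() is rejected until close() has run. A minimal sketch of the legal consume-and-reuse cycle, again assuming WhitespaceTokenizer as the concrete class:

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class ReuseDemo {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new WhitespaceTokenizer(Version.LUCENE_46, new StringReader("first pass"));
    tok.reset();                      // reset() promotes inputPending to input
    while (tok.incrementToken()) { }  // consuming is legal only between reset() and close()
    tok.end();
    tok.close();                      // close() swaps ILLEGAL_STATE_READER back in

    tok.setReader(new StringReader("second pass")); // allowed only after close()
    tok.reset();                      // and reset() is mandatory again before consuming
    while (tok.incrementToken()) { }
    tok.end();
    tok.close();
  }
}

Skip reset() and the first read hits ILLEGAL_STATE_READER and throws; call setReader() before close() and setReader() itself throws. That is the whole point of parking a poisoned Reader in both fields.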
The Tokenizer code itself is easy to follow. Next, let's look at CharTokenizer:
public abstract class CharTokenizer extends Tokenizer {
  public CharTokenizer(Version matchVersion, Reader input) {
    super(input);
    charUtils = CharacterUtils.getInstance(matchVersion);
  }

  public CharTokenizer(Version matchVersion, AttributeFactory factory, Reader input) {
    super(factory, input);
    charUtils = CharacterUtils.getInstance(matchVersion);
  }

  private int offset = 0, bufferIndex = 0, dataLen = 0, finalOffset = 0;
  private static final int MAX_WORD_LEN = 255;
  private static final int IO_BUFFER_SIZE = 4096;

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private final CharacterUtils charUtils;
  private final CharacterBuffer ioBuffer = CharacterUtils.newCharacterBuffer(IO_BUFFER_SIZE);

  protected abstract boolean isTokenChar(int c);

  // does nothing by default
  protected int normalize(int c) {
    return c;
  }

  /*
   * This is the most important method. It reads the input in chunks of 4096 chars.
   * When a chunk would end on the high half of a surrogate pair, CharacterUtils.fill
   * holds that char back and carries it over into the next refill, so a code point
   * is never split across two buffers. In essence, the tokenizer cuts the stream
   * wherever isTokenChar(c) returns false.
   */
  @Override
  public final boolean incrementToken() throws IOException {
    clearAttributes();
    int length = 0;
    int start = -1; // this variable is always initialized
    int end = -1;
    char[] buffer = termAtt.buffer();
    while (true) {
      if (bufferIndex >= dataLen) {
        offset += dataLen;
        charUtils.fill(ioBuffer, input); // read supplementary char aware with CharacterUtils
        if (ioBuffer.getLength() == 0) {
          dataLen = 0; // so next offset += dataLen won't decrement offset
          if (length > 0) {
            break;
          } else {
            finalOffset = correctOffset(offset);
            return false;
          }
        }
        dataLen = ioBuffer.getLength();
        bufferIndex = 0;
      }
      // use CharacterUtils here to support < 3.1 UTF-16 code unit behavior if the char based methods are gone
      final int c = charUtils.codePointAt(ioBuffer.getBuffer(), bufferIndex, ioBuffer.getLength());
      final int charCount = Character.charCount(c);
      bufferIndex += charCount;
      if (isTokenChar(c)) { // if it's a token char
        if (length == 0) { // start of token
          assert start == -1;
          start = offset + bufferIndex - charCount;
          end = start;
        } else if (length >= buffer.length - 1) { // check if a supplementary could run out of bounds
          buffer = termAtt.resizeBuffer(2 + length); // make sure a supplementary fits in the buffer
        }
        end += charCount;
        length += Character.toChars(normalize(c), buffer, length); // buffer it, normalized
        if (length >= MAX_WORD_LEN) // buffer overflow! make sure to check for >= surrogate pair could break == test
          break;
      } else if (length > 0) // at non-Letter w/ chars
        break; // return 'em
    }
    termAtt.setLength(length);
    assert start != -1;
    offsetAtt.setOffset(correctOffset(start), finalOffset = correctOffset(end));
    return true;
  }

  @Override
  public final void end() throws IOException {
    super.end();
    // set final offset
    offsetAtt.setOffset(finalOffset, finalOffset);
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
    finalOffset = 0;
    ioBuffer.reset(); // make sure to reset the IO buffer!!
  }
}
CharTokenizer is a simple, character-based tokenizer. It exposes two hooks:

isTokenChar(int c); // decides whether a character belongs in the current token
protected int normalize(int c); // transforms each character before it is appended to the token; the default does nothing and returns the character unchanged
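These two hooks are all a subclass needs to supply. As a sketch, here is a toy DigitTokenizer (my own example, not a Lucene class) that emits maximal runs of digits:

import java.io.Reader;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.Version;

// Toy example, not part of Lucene: emits maximal runs of decimal digits as tokens.
public class DigitTokenizer extends CharTokenizer {
  public DigitTokenizer(Version matchVersion, Reader input) {
    super(matchVersion, input);
  }

  @Override
  protected boolean isTokenChar(int c) {
    return Character.isDigit(c); // keep digits, split on everything else
  }
}

isTokenChar() defines the token boundaries; the inherited normalize() is already the identity, so nothing else needs overriding.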
Next, LetterTokenizer: a tokenizer that splits text at non-letter characters. It is a poor fit for most Asian languages, whose words are not delimited by spaces. Its core is a single method:
protected boolean isTokenChar(int c) {
  return Character.isLetter(c); // collect only characters that are letters
}
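To make the Asian-language limitation concrete, run it over mixed text. A sketch assuming the Lucene 4.x LetterTokenizer; note that Character.isLetter() is true for CJK ideographs too, so a run of Chinese characters comes out as one undivided token:

import java.io.StringReader;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class LetterDemo {
  public static void main(String[] args) throws Exception {
    LetterTokenizer tok = new LetterTokenizer(Version.LUCENE_46,
        new StringReader("Lucene实战 second edition"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      // prints "Lucene实战", "second", "edition" -- the CJK characters
      // are letters too, so the Chinese text is never split into words
      System.out.println(term);
    }
    tok.end();
    tok.close();
  }
}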
Finally, LowerCaseTokenizer: a Tokenizer that splits text at non-letters and lowercases the tokens, behaving like LetterTokenizer combined with LowerCaseFilter.
protected int normalize(int c) {
  return Character.toLowerCase(c); // lowercase each letter as it is buffered
}
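The equivalence is easy to check side by side; the dump() helper below is my own scaffolding around the two real classes:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.LowerCaseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class EquivalenceDemo {
  static void dump(TokenStream ts) throws Exception {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.print(term + " ");
    }
    ts.end();
    ts.close();
    System.out.println();
  }

  public static void main(String[] args) throws Exception {
    String text = "Hello LUCENE World";
    // one tokenizer doing both jobs...
    dump(new LowerCaseTokenizer(Version.LUCENE_46, new StringReader(text)));
    // ...versus the two-stage chain; both print: hello lucene world
    dump(new LowerCaseFilter(Version.LUCENE_46,
        new LetterTokenizer(Version.LUCENE_46, new StringReader(text))));
  }
}

The combined tokenizer lowercases while it buffers each character, doing both jobs in a single pass instead of adding a filter stage to the chain.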
After all this analysis, everything covered so far is unsuitable for Asian languages. So let me at least point at one that does better: StandardTokenizer, which is built on Unicode Text Segmentation (UAX #29). I haven't fully worked through that algorithm yet, so I'll leave it as a loose end to untangle in a future post.
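Until then, a minimal usage sketch (assuming the same 4.x-style Version-plus-Reader constructor as the other tokenizers in this post). Unlike LetterTokenizer, the UAX #29 rules at least emit each CJK ideograph as its own token rather than gluing a whole run together:

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StandardDemo {
  public static void main(String[] args) throws Exception {
    StandardTokenizer tok = new StandardTokenizer(Version.LUCENE_46,
        new StringReader("Lucene 4.6 标准分词 demo"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      // prints Lucene / 4.6 / 标 / 准 / 分 / 词 / demo:
      // each CJK ideograph becomes its own token instead of one long run
      System.out.println(term);
    }
    tok.end();
    tok.close();
  }
}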