In the previous two posts we got a general picture of the classes that StandardAnalyzer depends on; we jokingly called them its "left and right arms". Now it is time to look at the body itself. To make the analysis easier, here is its class diagram.
Let's start with StopwordAnalyzerBase.
public abstract class StopwordAnalyzerBase extends Analyzer {
//holds the stop words
protected final CharArraySet stopwords;
protected final Version matchVersion;
public CharArraySet getStopwordSet() {
return stopwords;
}
protected StopwordAnalyzerBase(final Version version, final CharArraySet stopwords) {
matchVersion = version;
// analyzers should use char array set for stopwords!
this.stopwords = stopwords == null ? CharArraySet.EMPTY_SET : CharArraySet
.unmodifiableSet(CharArraySet.copy(version, stopwords));
}
protected StopwordAnalyzerBase(final Version version) {
this(version, null);
}
protected static CharArraySet loadStopwordSet(File stopwords,
Version matchVersion) throws IOException {
Reader reader = null;
try {
//get a UTF-8 decoding reader, then load the stop words; this uses methods from IOUtils, covered in an earlier post
reader = IOUtils.getDecodingReader(stopwords, IOUtils.CHARSET_UTF_8);
return WordlistLoader.getWordSet(reader, matchVersion);
} finally {
IOUtils.close(reader);
}
}
//load the stop words from a Reader
protected static CharArraySet loadStopwordSet(Reader stopwords,
Version matchVersion) throws IOException {
try {
return WordlistLoader.getWordSet(stopwords, matchVersion);
} finally {
IOUtils.close(stopwords);
}
}
}
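Note how the constructor defensively copies the caller's stopword set and then wraps it as unmodifiable (`CharArraySet.unmodifiableSet(CharArraySet.copy(...))`), so later mutation of the caller's set cannot affect the analyzer, and the set returned by getStopwordSet() is read-only. The following is a minimal sketch of that same pattern using plain `java.util` sets rather than Lucene's CharArraySet; the class name is made up for illustration:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for StopwordAnalyzerBase's defensive-copy idiom.
public class DefensiveStopwords {
    private final Set<String> stopwords;

    // Copy the caller's set, then wrap it as unmodifiable,
    // mirroring CharArraySet.unmodifiableSet(CharArraySet.copy(...)).
    public DefensiveStopwords(Set<String> words) {
        this.stopwords = (words == null)
            ? Collections.emptySet()
            : Collections.unmodifiableSet(new HashSet<>(words));
    }

    public Set<String> getStopwordSet() {
        return stopwords;
    }

    public static void main(String[] args) {
        Set<String> mine = new HashSet<>(Arrays.asList("the", "a"));
        DefensiveStopwords d = new DefensiveStopwords(mine);
        mine.add("of");
        // The later mutation of the caller's set does not leak in:
        System.out.println(d.getStopwordSet().contains("of")); // false
        try {
            d.getStopwordSet().add("of"); // the exposed set is read-only
        } catch (UnsupportedOperationException e) {
            System.out.println("unmodifiable");
        }
    }
}
```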
Relatively speaking, once you understand the classes this code depends on, reading the class itself is quite straightforward.
Now look at the StandardAnalyzer class; its most important method is createComponents.
public final class StandardAnalyzer extends StopwordAnalyzerBase {
public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
private int maxTokenLength = DEFAULT_MAX_TOKEN_LENGTH;
public static final CharArraySet STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
public StandardAnalyzer(Version matchVersion, CharArraySet stopWords) {
super(matchVersion, stopWords);
}
public StandardAnalyzer(Version matchVersion) {
this(matchVersion, STOP_WORDS_SET);
}
public StandardAnalyzer(Version matchVersion, Reader stopwords) throws IOException {
this(matchVersion, loadStopwordSet(stopwords, matchVersion));
}
public void setMaxTokenLength(int length) {
maxTokenLength = length;
}
public int getMaxTokenLength() {
return maxTokenLength;
}
@Override
//inherited from Analyzer; this is where the Tokenizer and TokenFilters are actually assembled, which leads into a much bigger topic
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
//new up a Tokenizer subclass, StandardTokenizer; its implementation is covered below
final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
src.setMaxTokenLength(maxTokenLength);
//wrap it in three filters: StandardFilter, LowerCaseFilter, and StopFilter
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new StopFilter(matchVersion, tok, stopwords);
return new TokenStreamComponents(src, tok) {
@Override
protected void setReader(final Reader reader) throws IOException {
src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
super.setReader(reader);
}
};
}
}
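The chaining in createComponents is a decorator pattern: each TokenFilter wraps the previous TokenStream, and tokens flow tokenizer → StandardFilter → LowerCaseFilter → StopFilter. Below is a toy sketch of that composition, assuming a simplified one-method stream interface instead of Lucene's real TokenStream API; all names here are illustrative, not Lucene's:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for Lucene's TokenStream: each call returns the next token, or null at the end.
interface TokenSource {
    String next();
}

public class FilterChainSketch {
    // Tokenizer stand-in: splits the input on whitespace.
    static TokenSource tokenizer(String text) {
        Iterator<String> it = Arrays.asList(text.split("\\s+")).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    // LowerCaseFilter stand-in: wraps another stream, lowercasing each token.
    static TokenSource lowerCase(TokenSource in) {
        return () -> {
            String t = in.next();
            return t == null ? null : t.toLowerCase(Locale.ROOT);
        };
    }

    // StopFilter stand-in: wraps another stream, skipping stop words.
    static TokenSource stop(TokenSource in, Set<String> stopwords) {
        return () -> {
            String t;
            while ((t = in.next()) != null) {
                if (!stopwords.contains(t)) return t;
            }
            return null;
        };
    }

    public static List<String> analyze(String text, Set<String> stopwords) {
        // Same wrapping order as createComponents: tokenizer -> lowercase -> stop.
        TokenSource tok = stop(lowerCase(tokenizer(text)), stopwords);
        List<String> out = new ArrayList<>();
        for (String t = tok.next(); t != null; t = tok.next()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "a"));
        System.out.println(analyze("The Quick Fox", stops)); // [quick, fox]
    }
}
```

Because each filter only holds a reference to the stream it wraps, filters can be stacked in any order; Lucene puts LowerCaseFilter before StopFilter so that stop words match case-insensitively.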
The body turns out to be anything but simple, because it brings in two major areas: Tokenizer and TokenFilter. For now, let's call them the heart of the Analyzer.