public abstract class Analyzer {
    public abstract TokenStream tokenStream(String fieldName, Reader reader);

    /**
     * @param fieldName Field name being indexed.
     * @return position increment gap, added to the next token emitted from {@link #tokenStream(String,Reader)}
     */
    public int getPositionIncrementGap(String fieldName) {
        return 0;
    }
}
String content = "...";
StringReader reader = new StringReader(content);
Analyzer analyzer = new ....(); // instantiate a concrete Analyzer here
TokenStream ts = analyzer.tokenStream("", reader);
// start tokenizing
Token t = null;
while ((t = ts.next()) != null) {
    System.out.println(t.termText());
}
An analyzer is built from two kinds of components. One is the tokenizer, called Tokenizer; the other is the filter, TokenFilter. Both extend TokenStream. An analyzer is typically composed of one tokenizer followed by several filters.
public abstract class Tokenizer extends TokenStream {
    /** The text source for this Tokenizer. */
    protected Reader input;

    /** Construct a tokenizer with null input. */
    protected Tokenizer() {}

    /** Construct a token stream processing the given input. */
    protected Tokenizer(Reader input) {
        this.input = input;
    }

    /** By default, closes the input Reader. */
    public void close() throws IOException {
        input.close();
    }
}
public abstract class TokenFilter extends TokenStream {
    /** The source of tokens for this filter. */
    protected TokenStream input;

    /** Construct a token stream filtering the given input. */
    protected TokenFilter(TokenStream input) {
        this.input = input;
    }

    /** Close the input TokenStream. */
    public void close() throws IOException {
        input.close();
    }
}
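The two classes above form a decorator chain: a Tokenizer reads raw text, and each TokenFilter wraps another TokenStream and transforms its output. Below is a minimal, self-contained sketch of that same pattern. The class names (SimpleTokenStream, WhitespaceTokenizer, LowerCaseTokenFilter, Pipeline) are hypothetical simplifications for illustration, not Lucene's actual API; real Lucene streams emit Token objects, while this sketch emits plain Strings.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified analogue of Lucene's TokenStream: a pull-based stream of tokens.
abstract class SimpleTokenStream {
    // Returns the next token, or null when the stream is exhausted.
    abstract String next();
}

// Plays the role of a Tokenizer: produces tokens from raw text
// (here, by splitting on whitespace).
class WhitespaceTokenizer extends SimpleTokenStream {
    private final String[] parts;
    private int pos = 0;

    WhitespaceTokenizer(String text) {
        this.parts = text.trim().split("\\s+");
    }

    String next() {
        return pos < parts.length ? parts[pos++] : null;
    }
}

// Plays the role of a TokenFilter: wraps another stream and
// transforms each token it yields (here, lower-casing).
class LowerCaseTokenFilter extends SimpleTokenStream {
    private final SimpleTokenStream input;

    LowerCaseTokenFilter(SimpleTokenStream input) {
        this.input = input;
    }

    String next() {
        String t = input.next();
        return t == null ? null : t.toLowerCase();
    }
}

public class Pipeline {
    // Composes tokenizer and filter the same way an Analyzer's
    // tokenStream method does, then drains the stream.
    static List<String> tokenize(String text) {
        SimpleTokenStream ts = new LowerCaseTokenFilter(new WhitespaceTokenizer(text));
        List<String> out = new ArrayList<>();
        for (String t = ts.next(); t != null; t = ts.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello Lucene World"));
    }
}
```

Because each filter only holds a reference to the stream it wraps, filters can be stacked in any order without either side knowing the full chain, which is exactly what makes the tokenStream implementations below so compact.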
The tokenStream method of StandardAnalyzer, besides tokenizing with StandardTokenizer, applies three filters:
- StandardFilter: the standard filter; handles contractions (such as the 's in "He's") and acronyms separated by periods.
- LowerCaseFilter: case converter; turns uppercase letters into lowercase.
- StopFilter: stop-word filter; its constructor takes a set of words to discard.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopSet);
    return result;
}
stopSet is specified when constructing StandardAnalyzer; if the no-argument constructor is used, the default stop words provided by StopAnalyzer.ENGLISH_STOP_WORDS are applied.
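Conceptually, a stop filter just drops every token that appears in the stop set. The self-contained sketch below shows that behavior on a plain token list; the class name StopWords and the tiny DEFAULT_STOP_SET are hypothetical stand-ins (Lucene's real StopFilter operates on a TokenStream, and StopAnalyzer.ENGLISH_STOP_WORDS is a much larger list).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWords {
    // A tiny stand-in for StopAnalyzer.ENGLISH_STOP_WORDS (illustrative only).
    static final Set<String> DEFAULT_STOP_SET =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    // Keeps only the tokens that are not in the stop set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopSet) {
        return tokens.stream()
                     .filter(t -> !stopSet.contains(t))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");
        System.out.println(removeStopWords(tokens, DEFAULT_STOP_SET));
    }
}
```

Since StopFilter runs after LowerCaseFilter in StandardAnalyzer's chain, the tokens it sees are already lowercase, so a lowercase stop set is sufficient.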