2021SC@SDUSC
Today we continue the analysis of the Analysis module in Lucene.
The version read in this series is DotLucene 1.9.RC1.
1. Several concrete TokenStreams
At indexing time, when adding a field, you can specify an Analyzer to produce the TokenStream, or you can pass in a TokenStream directly:

```java
public Field(String name, TokenStream tokenStream);
```

Below we introduce two TokenStreams that are typically used on their own.
1) NumericTokenStream
In the previous article, when we introduced NumericRangeQuery, we saw that a NumericField uses a NumericTokenStream when it is indexed. Its incrementToken is as follows:
```java
public boolean incrementToken() {
  if (valSize == 0)
    throw new IllegalStateException("call set???Value() before usage");
  if (shift >= valSize)
    return false;
  clearAttributes();
  // Although NumericTokenStream is meant to hold a number, a Lucene Token can
  // only hold a string, so the number is encoded as a string before being indexed.
  final char[] buffer;
  switch (valSize) {
    // First allocate the term buffer, then encode the number into it as a string
    case 64:
      buffer = termAtt.resizeTermBuffer(NumericUtils.BUF_SIZE_LONG);
      termAtt.setTermLength(NumericUtils.longToPrefixCoded(value, shift, buffer));
      break;
    case 32:
      buffer = termAtt.resizeTermBuffer(NumericUtils.BUF_SIZE_INT);
      termAtt.setTermLength(NumericUtils.intToPrefixCoded((int) value, shift, buffer));
      break;
    default:
      throw new IllegalArgumentException("valSize must be 32 or 64");
  }
  typeAtt.setType((shift == 0) ? TOKEN_TYPE_FULL_PREC : TOKEN_TYPE_LOWER_PREC);
  posIncrAtt.setPositionIncrement((shift == 0) ? 1 : 0);
  shift += precisionStep;
  return true;
}
```
The actual encoding is done by NumericUtils.intToPrefixCoded:

```java
public static int intToPrefixCoded(final int val, final int shift, final char[] buffer) {
  if (shift > 31 || shift < 0)
    throw new IllegalArgumentException("Illegal shift value, must be 0..31");
  int nChars = (31 - shift) / 7 + 1, len = nChars + 1;
  buffer[0] = (char) (SHIFT_START_INT + shift);
  int sortableBits = val ^ 0x80000000;
  sortableBits >>>= shift;
  while (nChars >= 1) {
    // The int is split into 7-bit groups, each stored as one UTF-8 encodable
    // char, so that comparing the resulting strings gives exactly the same
    // order as comparing the original ints.
    buffer[nChars--] = (char) (sortableBits & 0x7f);
    sortableBits >>>= 7;
  }
  return len;
}
```
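To see where this prefix coding is used end to end, here is a minimal indexing-and-search sketch against the 2.9/3.x-era API. The class name NumericDemo, the field name price, and the precisionStep of 4 are my own illustrative choices, not anything from the source:

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.NumericRangeQuery;

public class NumericDemo {
  // NumericField drives a NumericTokenStream internally: incrementToken emits
  // one full-precision term (shift == 0) and lower-precision terms at
  // shift = 4, 8, ..., all at position increment 0.
  static void index(IndexWriter writer) throws IOException {
    Document doc = new Document();
    doc.add(new NumericField("price", 4, Field.Store.YES, true).setIntValue(30));
    writer.addDocument(doc);
  }

  // At search time the same trie terms let a range query cover whole subranges
  // with a few low-precision terms instead of enumerating every value.
  static NumericRangeQuery<Integer> query() {
    return NumericRangeQuery.newIntRange("price", 4, 10, 100, true, true);
  }
}
```

A smaller precisionStep produces more terms per indexed value but fewer terms per range query; 4 here is just a middle ground for illustration.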
2) SingleTokenTokenStream
As its name suggests, SingleTokenTokenStream is a TokenStream containing exactly one Token. It is mostly used to store information that occurs exactly once per document, such as an id or a time. Such values are often kept in the payload of the posting list of a special Token (such as ID:ID or TIME:TIME), so that the skip list can be used to speed up access.
Consequently, the Token returned by SingleTokenTokenStream is not the id or the time itself but the special token "ID:ID" or "TIME:TIME"; the actual id or time value goes into the payload.
```java
// At indexing time
int id = 0;                                 // the user's own document id
String tokenstring = "ID";
byte[] value = idToBytes(id);               // convert the id into a byte array
Token token = new Token(tokenstring, 0, tokenstring.length());
token.setPayload(new Payload(value));
SingleTokenTokenStream tokenstream = new SingleTokenTokenStream(token);
Document doc = new Document();
doc.add(new Field("ID", tokenstream));
……

// When we have a Lucene docid and want the user's document id
// without constructing a Document object
TermPositions tp = reader.termPositions(new Term("ID", "ID"));
boolean ret = tp.skipTo(docid);
tp.nextPosition();
int payloadlength = tp.getPayloadLength();
byte[] payloadBuffer = new byte[payloadlength];
tp.getPayload(payloadBuffer, 0);
int userId = bytesToID(payloadBuffer);      // convert payloadBuffer back into the user id
```
2. A Tokenizer is also a TokenStream
```java
public abstract class Tokenizer extends TokenStream {
  protected Reader input;

  protected Tokenizer(Reader input) {
    this.input = CharReader.get(input);
  }

  public void reset(Reader input) throws IOException {
    this.input = input;
  }
}
```
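To make the Tokenizer contract concrete before looking at the built-in implementations, here is a minimal, hypothetical sketch. The class name CommaTokenizer is my own invention, and offset tracking is omitted for brevity; it simply emits the text between commas as tokens:

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// A toy Tokenizer: everything between commas becomes one token.
public class CommaTokenizer extends Tokenizer {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final StringBuilder sb = new StringBuilder();

  public CommaTokenizer(Reader input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    clearAttributes();
    sb.setLength(0);
    int ci;
    while ((ci = input.read()) != -1) {
      if (ci == ',') {
        if (sb.length() > 0) break; // a non-empty token is complete
        continue;                   // skip empty segments such as ",,"
      }
      sb.append((char) ci);
    }
    if (sb.length() == 0)
      return false;                 // stream exhausted
    termAtt.setTermBuffer(sb.toString());
    return true;
  }
}
```

The pattern is always the same: register attributes once, then have incrementToken call clearAttributes(), fill the attributes for the next token, and return false when the input is exhausted.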
The important Tokenizers are listed below; we will analyze them one by one:
- CharTokenizer
- LetterTokenizer
- LowerCaseTokenizer
- WhitespaceTokenizer
- ChineseTokenizer
- CJKTokenizer
- EdgeNGramTokenizer
- KeywordTokenizer
- NGramTokenizer
- SentenceTokenizer
- StandardTokenizer
1) CharTokenizer
CharTokenizer is an abstract class for splitting a character stream into tokens.
Its constructor registers two attributes, TermAttribute and OffsetAttribute, which means that besides the token text, it also reports each token's offsets.
```java
offsetAtt = addAttribute(OffsetAttribute.class);
termAtt = addAttribute(TermAttribute.class);
```
Its incrementToken function is as follows:
```java
public final boolean incrementToken() throws IOException {
  clearAttributes();
  int length = 0;
  int start = bufferIndex;
  char[] buffer = termAtt.termBuffer();
  while (true) {
    // Keep reading characters from the reader into the buffer
    if (bufferIndex >= dataLen) {
      offset += dataLen;
      dataLen = input.read(ioBuffer);
      if (dataLen == -1) {
        dataLen = 0;
        if (length > 0)
          break;
        else
          return false;
      }
      bufferIndex = 0;
    }
    // Then walk through the buffered characters one by one
    final char c = ioBuffer[bufferIndex++];
    // If it is a token character, normalize it and continue with the next
    // character; otherwise the current token ends here.
    if (isTokenChar(c)) {
      if (length == 0)
        start = offset + bufferIndex - 1;
      else if (length == buffer.length)
        buffer = termAtt.resizeTermBuffer(1 + length);
      buffer[length++] = normalize(c);
      if (length == MAX_WORD_LEN)
        break;
    } else if (length > 0)
      break;
  }
  termAtt.setTermLength(length);
  offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
  return true;
}
```
As an abstract class, CharTokenizer leaves the isTokenChar and normalize functions to its subclasses.
Its subclass WhitespaceTokenizer implements isTokenChar as follows:
```java
// The current token ends when whitespace is encountered
protected boolean isTokenChar(char c) {
  return !Character.isWhitespace(c);
}
```
Its subclass LetterTokenizer implements isTokenChar as follows:
```java
protected boolean isTokenChar(char c) {
  return Character.isLetter(c);
}
```
LowerCaseTokenizer, a subclass of LetterTokenizer, implements the normalize function to convert each character to lowercase:
```java
protected char normalize(char c) {
  return Character.toLowerCase(c);
}
```
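A small usage sketch contrasting the two subclasses (the demo class is hypothetical, and the pre-3.1 constructors that take only a Reader are assumed):

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class CharTokenizerDemo {
  public static void main(String[] args) throws IOException {
    // Split on whitespace only: prints "Hello" and "WORLD-42"
    WhitespaceTokenizer ws = new WhitespaceTokenizer(new StringReader("Hello  WORLD-42"));
    TermAttribute term = ws.getAttribute(TermAttribute.class);
    while (ws.incrementToken()) {
      System.out.println(term.term());
    }

    // Keep only letters and lowercase them: prints "hello" and "world"
    // (the hyphen and the digits are not token chars, so they end the token)
    LowerCaseTokenizer lc = new LowerCaseTokenizer(new StringReader("Hello  WORLD-42"));
    term = lc.getAttribute(TermAttribute.class);
    while (lc.incrementToken()) {
      System.out.println(term.term());
    }
  }
}
```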
2) ChineseTokenizer
When initialized, it adds a TermAttribute and an OffsetAttribute.
Its incrementToken is implemented as follows:
```java
public boolean incrementToken() throws IOException {
  clearAttributes();
  length = 0;
  start = offset;
  while (true) {
    final char c;
    offset++;
    if (bufferIndex >= dataLen) {
      dataLen = input.read(ioBuffer);
      bufferIndex = 0;
    }
    if (dataLen == -1)
      return flush();
    else
      c = ioBuffer[bufferIndex++];
    switch (Character.getType(c)) {
      // English letters and digits belong to the same token and are pushed
      // into the buffer
      case Character.DECIMAL_DIGIT_NUMBER:
      case Character.LOWERCASE_LETTER:
      case Character.UPPERCASE_LETTER:
        push(c);
        if (length == MAX_WORD_LEN)
          return flush();
        break;
      // Chinese characters fall under OTHER_LETTER: when one appears, the
      // previous token ends, and the character itself becomes a one-char token
      case Character.OTHER_LETTER:
        if (length > 0) {
          bufferIndex--;
          offset--;
          return flush();
        }
        push(c);
        return flush();
      default:
        if (length > 0)
          return flush();
        break;
    }
  }
}
```
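A short sketch of the resulting behavior (the demo class is hypothetical; the contrib class org.apache.lucene.analysis.cn.ChineseTokenizer is assumed): runs of letters and digits stay together, while each Chinese character becomes a token of its own.

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.cn.ChineseTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ChineseTokenizerDemo {
  public static void main(String[] args) throws IOException {
    // Prints lucene, 全, 文, 检, 索 (one per line): the letter run stays
    // together, every Chinese character is emitted on its own
    ChineseTokenizer ct = new ChineseTokenizer(new StringReader("lucene全文检索"));
    TermAttribute term = ct.getAttribute(TermAttribute.class);
    while (ct.incrementToken()) {
      System.out.println(term.term());
    }
  }
}
```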
3) KeywordTokenizer
KeywordTokenizer returns the entire input as a single Token.
Its incrementToken function is as follows:
```java
public final boolean incrementToken() throws IOException {
  if (!done) {
    clearAttributes();
    done = true;
    int upto = 0;
    char[] buffer = termAtt.termBuffer();
    // Read the whole input into the buffer, then return it as one token
    while (true) {
      final int length = input.read(buffer, upto, buffer.length - upto);
      if (length == -1)
        break;
      upto += length;
      if (upto == buffer.length)
        buffer = termAtt.resizeTermBuffer(1 + buffer.length);
    }
    termAtt.setTermLength(upto);
    finalOffset = correctOffset(upto);
    offsetAtt.setOffset(correctOffset(0), finalOffset);
    return true;
  }
  return false;
}
```
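A brief usage sketch (the demo class is hypothetical); because the whole input becomes one token, this tokenizer suits fields such as ids, paths, or product codes that must never be split:

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class KeywordTokenizerDemo {
  public static void main(String[] args) throws IOException {
    // The entire input comes back as exactly one token: "C++ 2021SC@SDUSC"
    KeywordTokenizer kt = new KeywordTokenizer(new StringReader("C++ 2021SC@SDUSC"));
    TermAttribute term = kt.getAttribute(TermAttribute.class);
    while (kt.incrementToken()) {
      System.out.println(term.term());
    }
  }
}
```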
4) CJKTokenizer
Its incrementToken function is as follows:
```java
public boolean incrementToken() throws IOException {
  clearAttributes();
  while (true) {
    int length = 0;
    int start = offset;
    while (true) {
      // Get the current character and the Unicode block it belongs to
      char c;
      Character.UnicodeBlock ub;
      offset++;
      if (bufferIndex >= dataLen) {
        dataLen = input.read(ioBuffer);
        bufferIndex = 0;
      }
      if (dataLen == -1) {
        if (length > 0) {
          if (preIsTokened == true) {
            length = 0;
            preIsTokened = false;
          }
          break;
        } else {
          return false;
        }
      } else {
        c = ioBuffer[bufferIndex++];
        ub = Character.UnicodeBlock.of(c);
      }
      // If the current character is an ASCII character
      if ((ub == Character.UnicodeBlock.BASIC_LATIN)
          || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS)) {
        if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
          int i = (int) c;
          if (i >= 65281 && i <= 65374) {
            // Convert full-width forms to their plain ASCII equivalents
            i = i - 65248;
            c = (char) i;
          }
        }
        // If the current character is a letter, a digit, or one of "_" "+" "#"
        if (Character.isLetterOrDigit(c) || ((c == '_') || (c == '+') || (c == '#'))) {
          if (length == 0) {
            start = offset - 1;
          } else if (tokenType == DOUBLE_TOKEN_TYPE) {
            offset--;
            bufferIndex--;
            if (preIsTokened == true) {
              length = 0;
              preIsTokened = false;
              break;
            } else {
              break;
            }
          }
          // Append the current character to the buffer
          buffer[length++] = Character.toLowerCase(c);
          tokenType = SINGLE_TOKEN_TYPE;
          if (length == MAX_WORD_LEN) {
            break;
          }
        } else if (length > 0) {
          if (preIsTokened == true) {
            length = 0;
            preIsTokened = false;
          } else {
            break;
          }
        }
      } else {
        // Non-ASCII characters
        if (Character.isLetter(c)) {
          if (length == 0) {
            start = offset - 1;
            buffer[length++] = c;
            tokenType = DOUBLE_TOKEN_TYPE;
          } else {
            if (tokenType == SINGLE_TOKEN_TYPE) {
              offset--;
              bufferIndex--;
              break;
            } else {
              // Non-ASCII characters are emitted as overlapping two-character
              // tokens (e.g. "中华人民共和国" is tokenized into "中华", "华人",
              // "人民", "民共", "共和", "和国")
              buffer[length++] = c;
              tokenType = DOUBLE_TOKEN_TYPE;
              if (length == 2) {
                offset--;
                bufferIndex--;
                preIsTokened = true;
                break;
              }
            }
          }
        } else if (length > 0) {
          if (preIsTokened == true) {
            length = 0;
            preIsTokened = false;
          } else {
            break;
          }
        }
      }
    }
    if (length > 0) {
      termAtt.setTermBuffer(buffer, 0, length);
      offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
      typeAtt.setType(TOKEN_TYPE_NAMES[tokenType]);
      return true;
    } else if (dataLen == -1) {
      return false;
    }
  }
}
```
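A sketch of the bigram behavior described in the comments above (the demo class is hypothetical; the contrib class org.apache.lucene.analysis.cjk.CJKTokenizer is assumed):

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class CJKTokenizerDemo {
  public static void main(String[] args) throws IOException {
    // CJK text becomes overlapping bigrams, the ASCII run a single token:
    // prints 中华, 华人, 人民, 民共, 共和, 和国, abc (one per line)
    CJKTokenizer cjk = new CJKTokenizer(new StringReader("中华人民共和国abc"));
    TermAttribute term = cjk.getAttribute(TermAttribute.class);
    while (cjk.incrementToken()) {
      System.out.println(term.term());
    }
  }
}
```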
5) SentenceTokenizer
It splits sentences at the following punctuation marks (both full-width and half-width forms): "。,!?;,!?;"
Consider the following example:
```java
String s = "据纽约时报周三报道称,苹果已经超过微软成为美国最有价值的 科技公司。这是一个不容忽视的转折点。";
StringReader sr = new StringReader(s);
SentenceTokenizer tokenizer = new SentenceTokenizer(sr);
boolean hasnext = tokenizer.incrementToken();
while (hasnext) {
  TermAttribute ta = tokenizer.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = tokenizer.incrementToken();
}
```
The output is:

```
据纽约时报周三报道称,
苹果已经超过微软成为美国最有价值的 科技公司。
这是一个不容忽视的转折点。
```
Its incrementToken function is as follows:
```java
public boolean incrementToken() throws IOException {
  clearAttributes();
  buffer.setLength(0);
  int ci;
  char ch, pch;
  boolean atBegin = true;
  tokenStart = tokenEnd;
  ci = input.read();
  ch = (char) ci;
  while (true) {
    if (ci == -1) {
      break;
    } else if (PUNCTION.indexOf(ch) != -1) {
      // A punctuation mark ends the current sentence; return the current token
      buffer.append(ch);
      tokenEnd++;
      break;
    } else if (atBegin && Utility.SPACES.indexOf(ch) != -1) {
      tokenStart++;
      tokenEnd++;
      ci = input.read();
      ch = (char) ci;
    } else {
      buffer.append(ch);
      atBegin = false;
      tokenEnd++;
      pch = ch;
      ci = input.read();
      ch = (char) ci;
      // Two consecutive spaces, or a \r\n, also end the current sentence
      // and return the current token
      if (Utility.SPACES.indexOf(ch) != -1 && Utility.SPACES.indexOf(pch) != -1) {
        tokenEnd++;
        break;
      }
    }
  }
  if (buffer.length() == 0)
    return false;
  else {
    termAtt.setTermBuffer(buffer.toString());
    offsetAtt.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));
    typeAtt.setType("sentence");
    return true;
  }
}
```