Lucene Code Analysis 10

2021SC@SDUSC

Today we continue our analysis of the Analysis module in Lucene.

The DotLucene version being read is 1.9.RC1.

1. Several concrete TokenStream implementations

At indexing time, when adding a field, you can either specify an Analyzer that produces a TokenStream, or supply a TokenStream directly:

public Field(String name, TokenStream tokenStream);

Below we introduce two TokenStream implementations that are used on their own.

1. NumericTokenStream

In the previous section on NumericRangeQuery, we saw that when a NumericField is created it uses a NumericTokenStream; its incrementToken is as follows:

public boolean incrementToken() {

  if (valSize == 0)

    throw new IllegalStateException("call set???Value() before usage");

  if (shift >= valSize)

    return false;

  clearAttributes();

  //Although NumericTokenStream is meant to hold a number, a Lucene token can only hold a string, so the number must be encoded as a string before being stored in the index.

  final char[] buffer;

  switch (valSize) {

    //First allocate the term buffer, then encode the number as a string

    case 64:

      buffer = termAtt.resizeTermBuffer(NumericUtils.BUF_SIZE_LONG);

      termAtt.setTermLength(NumericUtils.longToPrefixCoded(value, shift, buffer));

      break;

    case 32:

      buffer = termAtt.resizeTermBuffer(NumericUtils.BUF_SIZE_INT);

      termAtt.setTermLength(NumericUtils.intToPrefixCoded((int) value, shift, buffer));

      break;

    default:

      throw new IllegalArgumentException("valSize must be 32 or 64");

  }

  typeAtt.setType((shift == 0) ? TOKEN_TYPE_FULL_PREC : TOKEN_TYPE_LOWER_PREC);

  posIncrAtt.setPositionIncrement((shift == 0) ? 1 : 0);

  shift += precisionStep;

  return true;

}

public static int intToPrefixCoded(final int val, final int shift, final char[] buffer) {

  if (shift>31 || shift<0)

    throw new IllegalArgumentException("Illegal shift value, must be 0..31");

  int nChars = (31-shift)/7 + 1, len = nChars+1;

  buffer[0] = (char)(SHIFT_START_INT + shift);

  int sortableBits = val ^ 0x80000000;

  sortableBits >>>= shift;

  while (nChars>=1) {

    //The int is split into 7-bit groups, each stored as one char (valid UTF-8), and the encoded strings compare in exactly the same order as the original ints.

    buffer[nChars--] = (char)(sortableBits & 0x7f);

    sortableBits >>>= 7;

  }

  return len;

}
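
To make the ordering property concrete, here is a minimal usage sketch. It assumes NumericUtils from org.apache.lucene.util, whose intToPrefixCoded(int, int, char[]) is the method shown above and whose BUF_SIZE_INT constant sizes the buffer; the specific values are only illustrative.

char[] buf = new char[NumericUtils.BUF_SIZE_INT];
String a = new String(buf, 0, NumericUtils.intToPrefixCoded(-5, 0, buf));
String b = new String(buf, 0, NumericUtils.intToPrefixCoded(3, 0, buf));
String c = new String(buf, 0, NumericUtils.intToPrefixCoded(1000, 0, buf));
// Because the sign bit is flipped and the bits are written high-to-low in 7-bit groups,
// the encoded strings compare in the same order as the original ints: -5 < 3 < 1000.
System.out.println(a.compareTo(b) < 0 && b.compareTo(c) < 0); // prints true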

2. SingleTokenTokenStream

SingleTokenTokenStream, as its name suggests, is a TokenStream that contains only a single token. It is often used to store information of which a document has exactly one, such as an id or a timestamp. Such information is typically stored in the payload of the posting list of a special token (for example ID:ID or TIME:TIME), so that the skip list can be used to speed up access.

So the token returned by SingleTokenTokenStream is not the id or time value itself but the special token ("ID:ID" or "TIME:TIME"); the actual id or time value is placed in the payload.

//At indexing time

int id = 0; //the application's own document number

String tokenstring = "ID";

byte[] value = idToBytes(id); //convert the id into a byte array

Token token = new Token(tokenstring, 0, tokenstring.length());

token.setPayload(new Payload(value));

SingleTokenTokenStream tokenstream = new SingleTokenTokenStream(token);

Document doc = new Document();

doc.add(new Field("ID", tokenstream));

……

//At search time: given Lucene's document number docid, retrieve the application's document number without constructing a Document object

TermPositions tp = reader.termPositions(new Term("ID", "ID"));

boolean ret = tp.skipTo(docid);

tp.nextPosition();

int payloadlength = tp.getPayloadLength();

byte[] payloadBuffer = new byte[payloadlength];

tp.getPayload(payloadBuffer, 0);

int id = bytesToID(payloadBuffer); //convert payloadBuffer back into the application's id
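
The helpers idToBytes and bytesToID above are application code, not part of Lucene. A minimal sketch of one possible (hypothetical) implementation packs the int id into four big-endian bytes for the payload:

// Hypothetical application-side helpers for the example above
static byte[] idToBytes(int id) {
  return new byte[] { (byte) (id >>> 24), (byte) (id >>> 16), (byte) (id >>> 8), (byte) id };
}

static int bytesToID(byte[] b) {
  return ((b[0] & 0xff) << 24) | ((b[1] & 0xff) << 16) | ((b[2] & 0xff) << 8) | (b[3] & 0xff);
}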

2. Tokenizer is also a TokenStream

public abstract class Tokenizer extends TokenStream {

  protected Reader input;

  protected Tokenizer(Reader input) {

    this.input = CharReader.get(input);

  }

  public void reset(Reader input) throws IOException {

    this.input = input;

  }

}

The important Tokenizer implementations are listed below; we will examine them one by one:

  • CharTokenizer
    • LetterTokenizer
      • LowerCaseTokenizer
    • WhitespaceTokenizer
  • ChineseTokenizer
  • CJKTokenizer
  • EdgeNGramTokenizer
  • KeywordTokenizer
  • NGramTokenizer
  • SentenceTokenizer
  • StandardTokenizer

1. CharTokenizer

CharTokenizer is an abstract class for tokenizing character streams.

In its constructor it creates two attributes, TermAttribute and OffsetAttribute, which means that besides the token text itself, each token also carries its offsets.

offsetAtt = addAttribute(OffsetAttribute.class);

termAtt = addAttribute(TermAttribute.class);

Its incrementToken function is as follows:

public final boolean incrementToken() throws IOException {

  clearAttributes();

  int length = 0;

  int start = bufferIndex;

  char[] buffer = termAtt.termBuffer();

  while (true) {

    //Keep reading characters from the reader into ioBuffer

    if (bufferIndex >= dataLen) {

      offset += dataLen;

      dataLen = input.read(ioBuffer);

      if (dataLen == -1) {

        dataLen = 0;

        if (length > 0)

          break;

        else

          return false;

      }

      bufferIndex = 0;

    }

    //Then walk through the characters in ioBuffer one by one

    final char c = ioBuffer[bufferIndex++];

    //If it is a token character, normalize it and move on to the next character; otherwise the current token ends.

    if (isTokenChar(c)) {

      if (length == 0)

        start = offset + bufferIndex - 1;

      else if (length == buffer.length)

        buffer = termAtt.resizeTermBuffer(1+length);

      buffer[length++] = normalize(c);

      if (length == MAX_WORD_LEN)

        break;

    } else if (length > 0)

      break;

  }

  termAtt.setTermLength(length);

  offsetAtt.setOffset(correctOffset(start), correctOffset(start+length));

  return true;

}

CharTokenizer is abstract; its isTokenChar and normalize functions are implemented by subclasses.

Its subclass WhitespaceTokenizer implements isTokenChar as follows:

//The current token ends when whitespace is encountered

protected boolean isTokenChar(char c) {

  return !Character.isWhitespace(c);

}

Its subclass LetterTokenizer implements isTokenChar like this:

protected boolean isTokenChar(char c) {

  return Character.isLetter(c);

}

LetterTokenizer's subclass LowerCaseTokenizer implements the normalize function, converting each character to lower case:

protected char normalize(char c) {

  return Character.toLowerCase(c);

}
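
A minimal usage sketch (assuming the Lucene 2.9/3.0-style attribute API used throughout this section; LowerCaseTokenizer lives in org.apache.lucene.analysis): LowerCaseTokenizer breaks the input at non-letter characters and lower-cases every token.

LowerCaseTokenizer tokenizer = new LowerCaseTokenizer(new StringReader("Hello Lucene, Hello WORLD"));
TermAttribute termAtt = tokenizer.getAttribute(TermAttribute.class);
while (tokenizer.incrementToken()) {
  System.out.println(termAtt.term());
}
// prints: hello, lucene, hello, world (one per line)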

2. ChineseTokenizer

During initialization it adds a TermAttribute and an OffsetAttribute.

Its incrementToken is implemented as follows:

public boolean incrementToken() throws IOException {

    clearAttributes();

    length = 0;

    start = offset;

    while (true) {

        final char c;

        offset++;

        if (bufferIndex >= dataLen) {

            dataLen = input.read(ioBuffer);

            bufferIndex = 0;

        }

        if (dataLen == -1) return flush();

        else

            c = ioBuffer[bufferIndex++];

        switch(Character.getType(c)) {

        //Digits and letters (upper or lower case) belong to the same token; push the character onto the buffer

        case Character.DECIMAL_DIGIT_NUMBER:

        case Character.LOWERCASE_LETTER:

        case Character.UPPERCASE_LETTER:

            push(c);

            if (length == MAX_WORD_LEN) return flush();

            break;

        //Chinese characters fall under OTHER_LETTER: if a token is already being built, back up and return it first; otherwise push this character and return it as a single-character token

        case Character.OTHER_LETTER:

            if (length>0) {

                bufferIndex--;

                offset--;

                return flush();

            }

            push(c);

            return flush();

        default:

            if (length>0) return flush();

            break;

        }

    }

}
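
A minimal usage sketch of this behavior (ChineseTokenizer is in the contrib analyzers, package org.apache.lucene.analysis.cn; the attribute API is assumed to be the same one used above): every Chinese character becomes its own single-character token, while a run of letters or digits stays together.

ChineseTokenizer tokenizer = new ChineseTokenizer(new StringReader("lucene全文检索"));
TermAttribute termAtt = tokenizer.getAttribute(TermAttribute.class);
while (tokenizer.incrementToken()) {
  System.out.println(termAtt.term());
}
// prints: lucene, 全, 文, 检, 索 (one per line)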

3. KeywordTokenizer

KeywordTokenizer returns the entire input as a single token.

Its incrementToken function is as follows:

public final boolean incrementToken() throws IOException {

  if (!done) {

    clearAttributes();

    done = true;

    int upto = 0;

    char[] buffer = termAtt.termBuffer();

    //Read the whole input into the buffer, then return it as one token.

    while (true) {

      final int length = input.read(buffer, upto, buffer.length-upto);

      if (length == -1) break;

      upto += length;

      if (upto == buffer.length)

        buffer = termAtt.resizeTermBuffer(1+buffer.length);

    }

    termAtt.setTermLength(upto);

    finalOffset = correctOffset(upto);

    offsetAtt.setOffset(correctOffset(0), finalOffset);

    return true;

  }

  return false;

}
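
A minimal usage sketch (KeywordTokenizer is in org.apache.lucene.analysis): the whole input comes back as exactly one token, which is why it is typically used for ids, paths and other fields that must not be split.

KeywordTokenizer tokenizer = new KeywordTokenizer(new StringReader("C++ in a Nutshell"));
TermAttribute termAtt = tokenizer.getAttribute(TermAttribute.class);
while (tokenizer.incrementToken()) {
  System.out.println(termAtt.term());
}
// prints the single token: C++ in a Nutshell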

4. CJKTokenizer

Its incrementToken function is as follows:

public boolean incrementToken() throws IOException {

    clearAttributes();

    while(true) {

      int length = 0;

      int start = offset;

      while (true) {

        //Get the current character and the Unicode block it belongs to

        char c;

        Character.UnicodeBlock ub;

        offset++;

        if (bufferIndex >= dataLen) {

            dataLen = input.read(ioBuffer);

            bufferIndex = 0;

        }

        if (dataLen == -1) {

            if (length > 0) {

                if (preIsTokened == true) {

                    length = 0;

                    preIsTokened = false;

                }

                break;

            } else {

                return false;

            }

        } else {

            c = ioBuffer[bufferIndex++];

            ub = Character.UnicodeBlock.of(c);

        }

        //If the current character is in the Basic Latin (ASCII) block or the half-width/full-width forms block

        if ((ub == Character.UnicodeBlock.BASIC_LATIN) || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS)) {

            if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {

              int i = (int) c;

              if (i >= 65281 && i <= 65374) {

                //Convert half-width/full-width form characters to their ordinary ASCII equivalents

                i = i - 65248;

                c = (char) i;

              }

            }

            //If the current character is a letter or digit, or one of '_', '+', '#'

            if (Character.isLetterOrDigit(c) || ((c == '_') || (c == '+') || (c == '#'))) {

                if (length == 0) {

                    start = offset - 1;

                } else if (tokenType == DOUBLE_TOKEN_TYPE) {

                    offset--;

                    bufferIndex--;

                    if (preIsTokened == true) {

                        length = 0;

                        preIsTokened = false;

                        break;

                    } else {

                        break;

                    }

                }

                //Append the current character to the buffer

                buffer[length++] = Character.toLowerCase(c);

                tokenType = SINGLE_TOKEN_TYPE;

                if (length == MAX_WORD_LEN) {

                    break;

                }

            } else if (length > 0) {

                if (preIsTokened == true) {

                    length = 0;

                    preIsTokened = false;

                } else {

                    break;

                }

            }

        } else {

            //Non-ASCII characters

            if (Character.isLetter(c)) {

                if (length == 0) {

                    start = offset - 1;

                    buffer[length++] = c;

                    tokenType = DOUBLE_TOKEN_TYPE;

                } else {

                  if (tokenType == SINGLE_TOKEN_TYPE) {

                        offset--;

                        bufferIndex--;

                        break;

                    } else {

                        //Non-ASCII (e.g. CJK) characters are grouped two at a time into one token

                        //(e.g. "中华人民共和国" is tokenized into "中华", "华人", "人民", "民共", "共和", "和国")

                        buffer[length++] = c;

                        tokenType = DOUBLE_TOKEN_TYPE;

                        if (length == 2) {

                            offset--;

                            bufferIndex--;

                            preIsTokened = true;

                            break;

                        }

                    }

                }

            } else if (length > 0) {

                if (preIsTokened == true) {

                    length = 0;

                    preIsTokened = false;

                } else {

                    break;

                }

            }

        }

    }

    if (length > 0) {

      termAtt.setTermBuffer(buffer, 0, length);

      offsetAtt.setOffset(correctOffset(start), correctOffset(start+length));

      typeAtt.setType(TOKEN_TYPE_NAMES[tokenType]);

      return true;

    } else if (dataLen == -1) {

      return false;

    }

  }

}
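
A minimal usage sketch of the bigram behavior (CJKTokenizer is in the contrib analyzers, package org.apache.lucene.analysis.cjk; the attribute API is assumed to be the same one used above):

CJKTokenizer tokenizer = new CJKTokenizer(new StringReader("中华人民共和国"));
TermAttribute termAtt = tokenizer.getAttribute(TermAttribute.class);
while (tokenizer.incrementToken()) {
  System.out.println(termAtt.term());
}
// prints: 中华, 华人, 人民, 民共, 共和, 和国 (one per line)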

5. SentenceTokenizer

It splits the input into sentences at the following punctuation marks: "。,!?;,!?;"

Let us look at the following example:

String s = "据纽约时报周三报道称,苹果已经超过微软成为美国最有价值的  科技公司。这是一个不容忽视的转折点。";

StringReader sr = new StringReader(s);

SentenceTokenizer tokenizer = new SentenceTokenizer(sr);

boolean hasnext = tokenizer.incrementToken();

while(hasnext){

  TermAttribute ta = tokenizer.getAttribute(TermAttribute.class);

  System.out.println(ta.term());

  hasnext = tokenizer.incrementToken();

}

The output is:

据纽约时报周三报道称,
苹果已经超过微软成为美国最有价值的
科技公司。
这是一个不容忽视的转折点。

Its incrementToken function is as follows:

public boolean incrementToken() throws IOException {

  clearAttributes();

  buffer.setLength(0);

  int ci;

  char ch, pch;

  boolean atBegin = true;

  tokenStart = tokenEnd;

  ci = input.read();

  ch = (char) ci;

  while (true) {

    if (ci == -1) {

      break;

    } else if (PUNCTION.indexOf(ch) != -1) {

      //A punctuation mark ends the current sentence; return the current token

      buffer.append(ch);

      tokenEnd++;

      break;

    } else if (atBegin && Utility.SPACES.indexOf(ch) != -1) {

      tokenStart++;

      tokenEnd++;

      ci = input.read();

      ch = (char) ci;

    } else {

      buffer.append(ch);

      atBegin = false;

      tokenEnd++;

      pch = ch;

      ci = input.read();

      ch = (char) ci;

      //Two consecutive whitespace characters (for example \r\n) also end the current sentence; return the current token

      if (Utility.SPACES.indexOf(ch) != -1

          && Utility.SPACES.indexOf(pch) != -1) {

        tokenEnd++;

        break;

      }

    }

  }

  if (buffer.length() == 0)

    return false;

  else {

    termAtt.setTermBuffer(buffer.toString());

    offsetAtt.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));

    typeAtt.setType("sentence");

    return true;

  }

}
