8. Dissecting Lucene's StandardAnalyzer --- the Heart (Tokenizer)

    An Analyzer is built from two core components: a Tokenizer and one or more TokenFilters. The difference between them is that a Tokenizer processes the stream at the character level, while a TokenFilter processes it at the token level. The Tokenizer is the first stage of analysis; its constructor takes a Reader. A TokenFilter acts like an interceptor: its constructor takes another TokenStream, which may be a Tokenizer or another TokenFilter. Let's start with the class diagram:

         As the class diagram shows, AttributeSource and TokenStream are the same as in the previous post; what's new begins below Tokenizer.
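Before digging into the source, here is a minimal sketch of how the two components compose inside an Analyzer. It assumes the Lucene 4.x API (the createComponents(String, Reader) signature and Version.LUCENE_47); the class name LowercaseLetterAnalyzer is invented for illustration:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;

// Illustrative only: a Tokenizer (character level) feeding a TokenFilter (token level).
public class LowercaseLetterAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new LetterTokenizer(Version.LUCENE_47, reader);   // character level: split into tokens
    TokenStream chain = new LowerCaseFilter(Version.LUCENE_47, source);  // token level: rewrite each token
    return new TokenStreamComponents(source, chain);
  }
}

Now to the Tokenizer source itself: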

package org.apache.lucene.analysis;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.util.AttributeSource;

public abstract class Tokenizer extends TokenStream {
  /** The text source for this Tokenizer. */
  protected Reader input = ILLEGAL_STATE_READER;
  
  /** Pending reader: not actually assigned to input until reset() */
  private Reader inputPending = ILLEGAL_STATE_READER;

  /** Construct a token stream processing the given input. */
  protected Tokenizer(Reader input) {
    if (input == null) {
      throw new NullPointerException("input must not be null");
    }
    this.inputPending = input;
  }
  
  /** Construct a token stream processing the given input using the given AttributeFactory. */
  protected Tokenizer(AttributeFactory factory, Reader input) {
    super(factory);
    if (input == null) {
      throw new NullPointerException("input must not be null");
    }
    this.inputPending = input;
  }

  @Override
  public void close() throws IOException {
    input.close();
    // LUCENE-2387: don't hold onto Reader after close, so
    // GC can reclaim
    inputPending = input = ILLEGAL_STATE_READER;
  }
  
  /** Corrects an offset through the CharFilter chain, if the input Reader is a CharFilter. */
  protected final int correctOffset(int currentOff) {
    return (input instanceof CharFilter) ? ((CharFilter) input).correctOffset(currentOff) : currentOff;
  }

  /** Sets a new Reader so this Tokenizer can be re-used; close() must have been called first. */
  public final void setReader(Reader input) throws IOException {
    if (input == null) {
      throw new NullPointerException("input must not be null");
    } else if (this.input != ILLEGAL_STATE_READER) {
      throw new IllegalStateException("TokenStream contract violation: close() call missing");
    }
    this.inputPending = input;
    assert setReaderTestPoint();
  }
  
  @Override
  public void reset() throws IOException {
    super.reset();
    input = inputPending;
    inputPending = ILLEGAL_STATE_READER;
  }

  // only used by assert, for testing
  boolean setReaderTestPoint() {
    return true;
  }
  
  private static final Reader ILLEGAL_STATE_READER = new Reader() {
    @Override
    public int read(char[] cbuf, int off, int len) {
      throw new IllegalStateException("TokenStream contract violation: reset()/close() call missing, " +
          "reset() called multiple times, or subclass does not call super.reset(). " +
          "Please see Javadocs of TokenStream class for more information about the correct consuming workflow.");
    }

    @Override
    public void close() {} 
  };
}
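The ILLEGAL_STATE_READER sentinel above enforces the consumer contract: reset() must be called before the first incrementToken(), and end()/close() afterwards; close() is also required before re-use via setReader(). Here is a minimal consuming sketch; WhitespaceTokenizer and Version.LUCENE_47 are my own choices for a concrete Lucene 4.x setup:

import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class ConsumeDemo {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new WhitespaceTokenizer(Version.LUCENE_47,
        new StringReader("hello lucene tokenizer"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    OffsetAttribute off = tok.addAttribute(OffsetAttribute.class);
    tok.reset();                   // swaps inputPending into input
    while (tok.incrementToken()) { // advance to the next token
      System.out.println(term + " [" + off.startOffset() + "," + off.endOffset() + "]");
    }
    tok.end();   // records the final offset
    tok.close(); // releases the Reader; required before setReader()
  }
}

Skip reset() and the first read hits ILLEGAL_STATE_READER, throwing exactly the IllegalStateException shown above.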

The Tokenizer code is easy enough to follow. Next, let's look at CharTokenizer:

public abstract class CharTokenizer extends Tokenizer {
  
  public CharTokenizer(Version matchVersion, Reader input) {
    super(input);
    charUtils = CharacterUtils.getInstance(matchVersion);
  }
  
  public CharTokenizer(Version matchVersion, AttributeFactory factory,
      Reader input) {
    super(factory, input);
    charUtils = CharacterUtils.getInstance(matchVersion);
  }
  
  private int offset = 0, bufferIndex = 0, dataLen = 0, finalOffset = 0;
  private static final int MAX_WORD_LEN = 255;
  private static final int IO_BUFFER_SIZE = 4096;
  
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  
  private final CharacterUtils charUtils;
  private final CharacterBuffer ioBuffer = CharacterUtils.newCharacterBuffer(IO_BUFFER_SIZE);
  
  protected abstract boolean isTokenChar(int c);

  // Default: a no-op that returns the character unchanged.
  protected int normalize(int c) {
    return c;
  }

  /*
   * This is the most important method in the class. It fills a 4096-char buffer
   * from the Reader. If a fill ends on the high (lead) half of a Unicode
   * surrogate pair, CharacterUtils carries that char over so the next fill
   * starts one position in, keeping the pair intact. In essence, the stream is
   * split into tokens wherever isTokenChar(c) returns false.
   */
  @Override
  public final boolean incrementToken() throws IOException {
    clearAttributes();
    int length = 0;
    int start = -1; // this variable is always initialized
    int end = -1;
    char[] buffer = termAtt.buffer();
    while (true) {
      if (bufferIndex >= dataLen) {
        offset += dataLen;
        charUtils.fill(ioBuffer, input); // read supplementary char aware with CharacterUtils
        if (ioBuffer.getLength() == 0) {
          dataLen = 0; // so next offset += dataLen won't decrement offset
          if (length > 0) {
            break;
          } else {
            finalOffset = correctOffset(offset);
            return false;
          }
        }
        dataLen = ioBuffer.getLength();
        bufferIndex = 0;
      }
      // use CharacterUtils here to support < 3.1 UTF-16 code unit behavior if the char based methods are gone
      final int c = charUtils.codePointAt(ioBuffer.getBuffer(), bufferIndex, ioBuffer.getLength());
      final int charCount = Character.charCount(c);
      bufferIndex += charCount;

      if (isTokenChar(c)) {               // if it's a token char
        if (length == 0) {                // start of token
          assert start == -1;
          start = offset + bufferIndex - charCount;
          end = start;
        } else if (length >= buffer.length-1) { // check if a supplementary could run out of bounds
          buffer = termAtt.resizeBuffer(2+length); // make sure a supplementary fits in the buffer
        }
        end += charCount;
        length += Character.toChars(normalize(c), buffer, length); // buffer it, normalized
        if (length >= MAX_WORD_LEN) // buffer overflow! make sure to check for >= surrogate pair could break == test
          break;
      } else if (length > 0)             // at non-Letter w/ chars
        break;                           // return 'em
    }

    termAtt.setLength(length);
    assert start != -1;
    offsetAtt.setOffset(correctOffset(start), finalOffset = correctOffset(end));
    return true;
    
  }
  
  @Override
  public final void end() throws IOException {
    super.end();
    // set final offset
    offsetAtt.setOffset(finalOffset, finalOffset);
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
    finalOffset = 0;
    ioBuffer.reset(); // make sure to reset the IO buffer!!
  }
}

CharTokenizer is a simple, character-based Tokenizer. Subclasses steer it through two hooks:

 isTokenChar(int c) // decides whether a character belongs inside a token
 protected int normalize(int c) // transforms each character before it is appended to the token; the default is a no-op that returns the character unchanged
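To make the two hooks concrete, here is a hedged sketch of a custom subclass; the class name AlnumTokenizer is invented, and the Version-taking constructor assumes Lucene 4.x:

import java.io.Reader;

import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical example: tokens are maximal runs of letters or digits, lowercased on the fly.
public class AlnumTokenizer extends CharTokenizer {
  public AlnumTokenizer(Version matchVersion, Reader input) {
    super(matchVersion, input);
  }

  @Override
  protected boolean isTokenChar(int c) {
    return Character.isLetterOrDigit(c); // keep digits as well as letters inside tokens
  }

  @Override
  protected int normalize(int c) {
    return Character.toLowerCase(c);     // lowercase each character as it is buffered
  }
}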

Next, look at LetterTokenizer: a Tokenizer that splits text at non-letter characters. It is a poor fit for most Asian languages, whose words are generally not delimited by whitespace. Its core is:
 

  protected boolean isTokenChar(int c) {
    return Character.isLetter(c); // only letters count as token characters
  }

Finally, LowerCaseTokenizer: a Tokenizer that splits text at non-letter characters and lowercases every token, behaving like LetterTokenizer composed with LowerCaseFilter.

  protected int normalize(int c) {
    return Character.toLowerCase(c); // lowercase each character as it is buffered
  }

We've spent all this time dissecting tokenizers that don't suit Asian languages. For one that does, look at StandardTokenizer, which implements the word-break rules of Unicode Text Segmentation (UAX #29). I haven't fully worked out that algorithm yet, so I'll leave it as a loose end to untangle in a future post.
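Even without its internals, StandardTokenizer can be driven exactly like the tokenizers above. A minimal sketch, again assuming the Lucene 4.x constructor; if I read the UAX #29 rules right, each CJK ideograph should come back as its own token typed <IDEOGRAPHIC>:

import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class StandardTokenizerDemo {
  public static void main(String[] args) throws Exception {
    StandardTokenizer tok = new StandardTokenizer(Version.LUCENE_47,
        new StringReader("Lucene 4.7 分词 demo"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    TypeAttribute type = tok.addAttribute(TypeAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term + " -> " + type.type()); // e.g. Lucene -> <ALPHANUM>, 分 -> <IDEOGRAPHIC>
    }
    tok.end();
    tok.close();
  }
}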
