Lucene Code Analysis 13

2021SC@SDUSC

Today we continue our analysis of the Analysis module in Lucene.

The DotLucene version being read is 1.9.RC1.

TokenFilter

6.ReverseStringFilter

public boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    int len = termAtt.termLength();
    if (marker != NOMARKER) {
      // append the marker character at the end of the term;
      // after reversal it will become the leading character
      len++;
      termAtt.resizeTermBuffer(len);
      termAtt.termBuffer()[len - 1] = marker;
    }
    // reverse the token in place
    reverse(termAtt.termBuffer(), len);
    termAtt.setTermLength(len);
    return true;
  } else {
    return false;
  }
}

// overload called from incrementToken above: reverse the whole buffer from offset 0
public static void reverse(char[] buffer, int len) {
  reverse(buffer, 0, len);
}

public static void reverse(char[] buffer, int start, int len) {
  if (len <= 1) return;
  int num = len >> 1;
  for (int i = start; i < (start + num); i++) {
    // swap buffer[i] with its mirror position inside the window [start, start + len)
    char c = buffer[i];
    buffer[i] = buffer[start * 2 + len - i - 1];
    buffer[start * 2 + len - i - 1] = c;
  }
}
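
The index expression in the swap is the mirror position of i within the window [start, start + len): start + (len - 1 - (i - start)), which simplifies to start * 2 + len - i - 1. As a quick standalone check (my own illustration, not from the original article), reversing only a sub-range of a buffer:

char[] buf = "abcdef".toCharArray();
// reverse only the middle four characters: "bcde" -> "edcb"
ReverseStringFilter.reverse(buf, 1, 4);
System.out.println(new String(buf));   // prints "aedcbf"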

Example:

String s = "Tokenization is the process of breaking a stream of text up into meaningful elements called tokens.";
StringReader sr = new StringReader(s);
LowerCaseTokenizer lt = new LowerCaseTokenizer(sr);
ReverseStringFilter filter = new ReverseStringFilter(lt);

boolean hasnext = filter.incrementToken();
while (hasnext) {
  TermAttribute ta = filter.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = filter.incrementToken();
}

The output is:

noitazinekot
si
eht
ssecorp
fo
gnikaerb
a
maerts
fo
txet
pu
otni
lufgninaem
stnemele
dellac
snekot

7.SnowballFilter

SnowballFilter contains a member variable SnowballProgram stemmer. SnowballProgram is an abstract class whose subclasses include EnglishStemmer, PorterStemmer, and others.
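
Before looking at how the filter uses it, here is a minimal sketch of driving such a stemmer directly (my own illustration, assuming the usual Snowball API of setCurrent/stem/getCurrent and the class org.tartarus.snowball.ext.EnglishStemmer):

SnowballProgram stemmer = new EnglishStemmer();
stemmer.setCurrent("breaking");             // feed one word to the stemmer
stemmer.stem();                             // run the Snowball stemming algorithm
System.out.println(stemmer.getCurrent());   // prints "break"

The filter's incrementToken wraps exactly this pattern around each incoming token: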

public final boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    String originalTerm = termAtt.term();
    stemmer.setCurrent(originalTerm);
    stemmer.stem();
    String finalTerm = stemmer.getCurrent();
    // only write the term back if stemming actually changed it
    if (!originalTerm.equals(finalTerm))
      termAtt.setTermBuffer(finalTerm);
    return true;
  } else {
    return false;
  }
}

Example:

String s = "Tokenization is the process of breaking a stream of text up into meaningful elements called tokens.";
StringReader sr = new StringReader(s);
LowerCaseTokenizer lt = new LowerCaseTokenizer(sr);
SnowballFilter filter = new SnowballFilter(lt, new EnglishStemmer());

boolean hasnext = filter.incrementToken();
while (hasnext) {
  TermAttribute ta = filter.getAttribute(TermAttribute.class);
  System.out.println(ta.term());
  hasnext = filter.incrementToken();
}

The output is:

token
is
the
process
of
break
a
stream
of
text
up
into
meaning
element
call
token

8.TeeSinkTokenFilter

TeeSinkTokenFilter allows all or some of the Tokens that have already been produced by tokenization to be saved and used to build another TokenStream, which can then be stored in a different field.

We can create a TeeSinkTokenFilter with a statement like:

TeeSinkTokenFilter source = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader));

Then use newSinkTokenStream() or newSinkTokenStream(SinkFilter filter) to create a SinkTokenStream:

TeeSinkTokenFilter.SinkTokenStream sink = source.newSinkTokenStream();

Inside newSinkTokenStream(SinkFilter filter), the newly created SinkTokenStream is stored in the TeeSinkTokenFilter's member variable sinks.
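
A simplified sketch of what newSinkTokenStream(SinkFilter) does, based on the description above (the exact method body in the Lucene source may differ in details):

public SinkTokenStream newSinkTokenStream(SinkFilter filter) {
  // the sink gets a clone of this stream's attributes, so that states
  // captured from the tee can later be restored into it
  SinkTokenStream sink = new SinkTokenStream(this.cloneAttributes(), filter);
  this.sinks.add(new WeakReference<SinkTokenStream>(sink));   // register the sink
  return sink;
}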

In TeeSinkTokenFilter's incrementToken method:

public boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    // for each Token, iterate over the member variable sinks
    AttributeSource.State state = null;
    for (WeakReference<SinkTokenStream> ref : sinks) {
      final SinkTokenStream sink = ref.get();
      if (sink != null) {
        // for each SinkTokenStream, call accept() first; if the Token is accepted,
        // capture the current attribute state and add it to that SinkTokenStream
        if (sink.accept(this)) {
          if (state == null) {
            state = this.captureState();
          }
          sink.addState(state);
        }
      }
    }
    return true;
  }
  return false;
}

SinkTokenStream.accept delegates to SinkFilter.accept; the default ACCEPT_ALL_FILTER accepts every Token:

private static final SinkFilter ACCEPT_ALL_FILTER = new SinkFilter() {
  @Override
  public boolean accept(AttributeSource source) {
    return true;
  }
};

This way the SinkTokenStream ends up holding every Token produced by the WhitespaceTokenizer.
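
On the sink side, addState simply appends the captured state to an internal list, and SinkTokenStream.incrementToken replays those states one by one. Roughly (a simplified sketch, not the verbatim source):

private final List<AttributeSource.State> cachedStates = new LinkedList<AttributeSource.State>();
private Iterator<AttributeSource.State> it = null;

void addState(AttributeSource.State state) {
  cachedStates.add(state);
}

public final boolean incrementToken() throws IOException {
  if (it == null) {
    it = cachedStates.iterator();   // start replaying once consumption begins
  }
  if (!it.hasNext()) {
    return false;
  }
  restoreState(it.next());   // copy the captured attribute values back into this stream
  return true;
}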

When a fairly complex analysis pipeline is used, tokenizing an article can take quite a long time, and re-tokenizing it when the tokens are needed again would be wasteful. With the approach above, the tokens can simply be kept in another TokenStream.

For example:

String s = "this is a book";

StringReader reader = new StringReader(s);

TeeSinkTokenFilter source = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader));

TeeSinkTokenFilter.SinkTokenStream sink = source.newSinkTokenStream();

boolean hasnext = source.incrementToken();

while(hasnext){

  TermAttribute ta = source.getAttribute(TermAttribute.class);

  System.out.println(ta.term());

  hasnext = source.incrementToken();

}

System.out.println("---------------------------------------------");

hasnext = sink.incrementToken();

while(hasnext){

  TermAttribute ta = sink.getAttribute(TermAttribute.class);

  System.out.println(ta.term());

  hasnext = sink.incrementToken();

}

The output is:

this
is
a
book
---------------------------------------------
this
is
a
book

Of course, sometimes we want to pick out certain entities of interest from the sequence of Tokens and save only those.

For example:

  String s = "Japan will always balance its national interests between China and America.";

  StringReader reader = new StringReader(s);

  TeeSinkTokenFilter source = new TeeSinkTokenFilter(new LowerCaseTokenizer(reader));

  //一个集合,保存所有的国家名称

  final HashSet<String> countryset = new HashSet<String>();

  countryset.add("japan");

  countryset.add("china");

  countryset.add("america");

  countryset.add("korea");

  SinkFilter countryfilter = new SinkFilter() {

    @Override

    public boolean accept(AttributeSource source) {

      TermAttribute ta = source.getAttribute(TermAttribute.class);

      //如果在国家名称列表中,则保留

      if(countryset.contains(ta.term())){

        return true;

      }

      return false;

    }

  };

  TeeSinkTokenFilter.SinkTokenStream sink = source.newSinkTokenStream(countryfilter);

  //由LowerCaseTokenizer对语句进行分词,并把其中的国家名称保存在SinkTokenStream中

  boolean hasnext = source.incrementToken();

  while(hasnext){

    TermAttribute ta = source.getAttribute(TermAttribute.class);

    System.out.println(ta.term());

    hasnext = source.incrementToken();

  }

  System.out.println("---------------------------------------------");

  hasnext = sink.incrementToken();

  while(hasnext){

    TermAttribute ta = sink.getAttribute(TermAttribute.class);

    System.out.println(ta.term());

    hasnext = sink.incrementToken();

  }

}

The output is:

japan
will
always
balance
its
national
interests
between
china
and
america
---------------------------------------------
japan
china
america
