2021SC@SDUSC
Today I continue analyzing the Analysis module in Lucene.
The DotLucene version being read is 1.9.RC1.
TokenFilter
6.ReverseStringFilter
```java
public boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    int len = termAtt.termLength();
    if (marker != NOMARKER) {
      len++;
      termAtt.resizeTermBuffer(len);
      termAtt.termBuffer()[len - 1] = marker;
    }
    // reverse the token in place
    reverse(termAtt.termBuffer(), len);
    termAtt.setTermLength(len);
    return true;
  } else {
    return false;
  }
}
```
```java
public static void reverse(char[] buffer, int start, int len) {
  if (len <= 1) return;
  int num = len >> 1;
  for (int i = start; i < (start + num); i++) {
    char c = buffer[i];
    buffer[i] = buffer[start * 2 + len - i - 1];
    buffer[start * 2 + len - i - 1] = c;
  }
}
```
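The swap logic above pairs each position `i` with its mirror position `start * 2 + len - i - 1` and only walks the first half of the range. Outside of Lucene, it can be sketched as a standalone method (`ReverseDemo` is a hypothetical class name for illustration, not part of Lucene):

```java
public class ReverseDemo {
    // In-place reversal of len chars starting at start, mirroring the swap
    // logic of ReverseStringFilter.reverse: buffer[i] is exchanged with the
    // character at the mirror position start * 2 + len - i - 1.
    public static void reverse(char[] buffer, int start, int len) {
        if (len <= 1) return;
        int num = len >> 1;                  // only the first half needs swapping
        for (int i = start; i < start + num; i++) {
            char c = buffer[i];
            buffer[i] = buffer[start * 2 + len - i - 1];
            buffer[start * 2 + len - i - 1] = c;
        }
    }

    public static void main(String[] args) {
        char[] buf = "tokenization".toCharArray();
        reverse(buf, 0, buf.length);
        System.out.println(new String(buf)); // noitazinekot
    }
}
```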
Example:
```java
String s = "Tokenization is the process of breaking a stream of text up into meaningful elements called tokens.";
StringReader sr = new StringReader(s);
LowerCaseTokenizer lt = new LowerCaseTokenizer(sr);
ReverseStringFilter filter = new ReverseStringFilter(lt);
boolean hasnext = filter.incrementToken();
while (hasnext) {
    TermAttribute ta = filter.getAttribute(TermAttribute.class);
    System.out.println(ta.term());
    hasnext = filter.incrementToken();
}
```
The result (first token shown; each subsequent token is reversed in the same way): noitazinekot
7.SnowballFilter
SnowballFilter holds a member variable SnowballProgram stemmer. SnowballProgram is an abstract class whose subclasses include EnglishStemmer, PorterStemmer, and others.
```java
public final boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    String originalTerm = termAtt.term();
    stemmer.setCurrent(originalTerm);
    stemmer.stem();
    String finalTerm = stemmer.getCurrent();
    if (!originalTerm.equals(finalTerm))
      termAtt.setTermBuffer(finalTerm);
    return true;
  } else {
    return false;
  }
}
```
Example:
```java
String s = "Tokenization is the process of breaking a stream of text up into meaningful elements called tokens.";
StringReader sr = new StringReader(s);
LowerCaseTokenizer lt = new LowerCaseTokenizer(sr);
SnowballFilter filter = new SnowballFilter(lt, new EnglishStemmer());
boolean hasnext = filter.incrementToken();
while (hasnext) {
    TermAttribute ta = filter.getAttribute(TermAttribute.class);
    System.out.println(ta.term());
    hasnext = filter.incrementToken();
}
```
The result (first token shown): token
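The structure of incrementToken above is a decorator: the filter pulls a token from its upstream source, lets the stemmer rewrite it, and only replaces the term when the stem differs. The sketch below shows that flow without Lucene; the suffix-stripping rule in toyStem is a deliberately crude illustration, not the real Snowball algorithm, and all names here are hypothetical:

```java
import java.util.Arrays;
import java.util.Iterator;

public class StemFilterDemo {
    // Toy stand-in for SnowballProgram: strips an "ization" ending and a
    // plural "s". Real Snowball stemmers apply many ordered rule steps.
    static String toyStem(String term) {
        if (term.endsWith("ization")) return term.substring(0, term.length() - 7);
        if (term.endsWith("s") && term.length() > 3) return term.substring(0, term.length() - 1);
        return term;
    }

    public static void main(String[] args) {
        // stands in for the upstream TokenStream
        Iterator<String> input = Arrays.asList("tokenization", "tokens", "is").iterator();
        while (input.hasNext()) {                 // analogous to input.incrementToken()
            String original = input.next();
            String stemmed = toyStem(original);   // setCurrent / stem / getCurrent
            // only replace the term when the stem differs, as the filter does
            System.out.println(original.equals(stemmed) ? original : stemmed);
        }
    }
}
```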
8.TeeSinkTokenFilter
TeeSinkTokenFilter allows all or some of the tokens that have already been produced to be saved, so that they can back another TokenStream and be stored in a different field.
We can create a TeeSinkTokenFilter as follows:
```java
TeeSinkTokenFilter source = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader));
```
Then call newSinkTokenStream() or newSinkTokenStream(SinkFilter filter) to create a SinkTokenStream:
```java
TeeSinkTokenFilter.SinkTokenStream sink = source.newSinkTokenStream();
```
In newSinkTokenStream(SinkFilter filter), the newly created SinkTokenStream is stored in TeeSinkTokenFilter's member variable sinks.
In TeeSinkTokenFilter's incrementToken method:
```java
public boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    // for each token, iterate over the member variable sinks
    AttributeSource.State state = null;
    for (WeakReference<SinkTokenStream> ref : sinks) {
      final SinkTokenStream sink = ref.get();
      if (sink != null) {
        // for each SinkTokenStream, first call accept to see whether it
        // accepts the token; if so, add the token to that SinkTokenStream
        if (sink.accept(this)) {
          if (state == null) {
            state = this.captureState();
          }
          sink.addState(state);
        }
      }
    }
    return true;
  }
  return false;
}
```
SinkTokenStream.accept delegates to SinkFilter.accept; the default ACCEPT_ALL_FILTER accepts every token:
```java
private static final SinkFilter ACCEPT_ALL_FILTER = new SinkFilter() {
  @Override
  public boolean accept(AttributeSource source) {
    return true;
  }
};
```
In this way, the SinkTokenStream retains every token produced by the WhitespaceTokenizer.
When a more complex tokenization system is used, analyzing an article can take a long time, and re-tokenizing the text every time the tokens are needed again would be wasteful. So, as in the example above, the already-produced tokens can simply be saved in a TokenStream for reuse.
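The tee/sink idea can be sketched without Lucene: tokens stream through once, each registered sink with an accepting filter records a copy (analogous to captureState/addState), and the recorded tokens can be replayed later without re-tokenizing. All names in this sketch are illustrative, not Lucene's API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class TeeSinkDemo {
    final List<List<String>> sinks = new ArrayList<>();
    final List<Predicate<String>> filters = new ArrayList<>();

    // analogous to newSinkTokenStream(SinkFilter): register a sink with its filter
    List<String> newSink(Predicate<String> accept) {
        List<String> sink = new ArrayList<>();
        sinks.add(sink);
        filters.add(accept);
        return sink;
    }

    // analogous to draining incrementToken(): one pass over the source,
    // copying each accepted token into every matching sink
    void consume(Iterable<String> tokens) {
        for (String token : tokens) {
            for (int i = 0; i < sinks.size(); i++) {
                if (filters.get(i).test(token)) {   // SinkFilter.accept
                    sinks.get(i).add(token);        // sink.addState(state)
                }
            }
        }
    }

    public static void main(String[] args) {
        TeeSinkDemo tee = new TeeSinkDemo();
        List<String> all = tee.newSink(t -> true);  // like ACCEPT_ALL_FILTER
        List<String> countries = tee.newSink(
            t -> t.equals("japan") || t.equals("china") || t.equals("america"));
        tee.consume(Arrays.asList("japan", "will", "china", "america"));
        System.out.println(all);        // [japan, will, china, america]
        System.out.println(countries);  // [japan, china, america]
    }
}
```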
For example:
```java
String s = "this is a book";
StringReader reader = new StringReader(s);
TeeSinkTokenFilter source = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader));
TeeSinkTokenFilter.SinkTokenStream sink = source.newSinkTokenStream();
boolean hasnext = source.incrementToken();
while (hasnext) {
    TermAttribute ta = source.getAttribute(TermAttribute.class);
    System.out.println(ta.term());
    hasnext = source.incrementToken();
}
System.out.println("---------------------------------------------");
hasnext = sink.incrementToken();
while (hasnext) {
    TermAttribute ta = sink.getAttribute(TermAttribute.class);
    System.out.println(ta.term());
    hasnext = sink.incrementToken();
}
```
The result (first token shown; both passes print the same four tokens): this
Of course, sometimes we want to extract only certain entities of interest from the stream of tokens and save those.
For example:
```java
String s = "Japan will always balance its national interests between China and America.";
StringReader reader = new StringReader(s);
TeeSinkTokenFilter source = new TeeSinkTokenFilter(new LowerCaseTokenizer(reader));
// a set holding all the country names
final HashSet<String> countryset = new HashSet<String>();
countryset.add("japan");
countryset.add("china");
countryset.add("america");
countryset.add("korea");
SinkFilter countryfilter = new SinkFilter() {
    @Override
    public boolean accept(AttributeSource source) {
        TermAttribute ta = source.getAttribute(TermAttribute.class);
        // keep the token if it appears in the country name list
        if (countryset.contains(ta.term())) {
            return true;
        }
        return false;
    }
};
TeeSinkTokenFilter.SinkTokenStream sink = source.newSinkTokenStream(countryfilter);
// LowerCaseTokenizer tokenizes the sentence, and the country names
// among the tokens are saved in the SinkTokenStream
boolean hasnext = source.incrementToken();
while (hasnext) {
    TermAttribute ta = source.getAttribute(TermAttribute.class);
    System.out.println(ta.term());
    hasnext = source.incrementToken();
}
System.out.println("---------------------------------------------");
hasnext = sink.incrementToken();
while (hasnext) {
    TermAttribute ta = sink.getAttribute(TermAttribute.class);
    System.out.println(ta.term());
    hasnext = sink.incrementToken();
}
```
The result (first token shown; after the separator, the sink prints only japan, china, and america): japan