Lucene Source Code Analysis: Analyzer

Original source:

http://lqgao.spaces.live.com/blog/cns!3BB36966ED98D3E5!437.entry?_c11_blogpart_blogpart=blogview&_c=blogpart#permalink



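This article looks at the input side of Lucene: the Analyzer. Why is an Analyzer needed? By analogy, when the human body digests food, it first has to break the food down. In the gut, food is broken into glucose, amino acids, fats, and so on, and only these small pieces are easily absorbed and used. Lucene goes through a similar process: it breaks text into smaller units such as words, punctuation marks, separators, and even things like website names. The Analyzer is like the body's gut: its job is to cut the input text into small units.

Let's start with some code: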
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

#0001 public static void TestAnalyzer()
#0002 {
#0003     Analyzer[] analyzers = new Analyzer[4];
#0004     analyzers[0] = new WhitespaceAnalyzer();
#0005     analyzers[1] = new SimpleAnalyzer();
#0006     analyzers[2] = new StopAnalyzer();
#0007     analyzers[3] = new StandardAnalyzer();
#0008     String text = "This is a test document. For more info, please visit Victor's Blog: http://lqgao.spaces.msn.com. ";
#0009     for (int i = 0; i < analyzers.Length; i++)   // run the same text through every analyzer
#0010     {
#0011         DumpAnalyzer(analyzers[i], new StringReader(text));
#0012     }
#0013 }
#0014
#0015 public static void DumpAnalyzer(Analyzer analyzer, TextReader reader)
#0016 {
#0017     TokenStream stream = analyzer.TokenStream(reader);   // the analyzer creates the token stream
#0018     Token token;
#0019     System.Console.WriteLine(analyzer + " :");
#0020     while ((token = stream.Next()) != null)   // Next() returns null at end of stream
#0021     {
#0022         System.Console.Write("[" + token.TermText() + "]");
#0023     }
#0024     System.Console.WriteLine();
#0025     System.Console.WriteLine();
#0026 }

The output is as follows:

Lucene.Net.Analysis.WhitespaceAnalyzer :
[This][is][a][test][document.][For][more][info,][please][visit][Victor's][Blog:][http://lqgao.spaces.msn.com.]

Lucene.Net.Analysis.SimpleAnalyzer :
[this][is][a][test][document][for][more][info][please][visit][victor][s][blog][http][lqgao][spaces][msn][com]

Lucene.Net.Analysis.StopAnalyzer :
[test][document][more][info][please][visit][victor][blog][http][lqgao][spaces][msn][com]

Lucene.Net.Analysis.Standard.StandardAnalyzer :
[test][document][more][info][please][visit][victor][blog][http][lqgaospacesmsncom]
OK, let's go through this. Lucene ships with four Analyzers by default: SimpleAnalyzer, StandardAnalyzer, StopAnalyzer, and WhitespaceAnalyzer. Here is how they differ.
WhitespaceAnalyzer does almost nothing: it simply splits the text on whitespace, which is the cheapest and simplest approach.
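As a rough illustration of whitespace-only splitting, here is a minimal, self-contained C# sketch. It does not use Lucene.Net; the class name `WhitespaceSketch` and its `Tokenize` method are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of Lucene.Net.
public static class WhitespaceSketch
{
    // Splits text on whitespace only; case and punctuation are left untouched.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (!char.IsWhiteSpace(c))
            {
                current.Append(c);              // still inside a token
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString()); // whitespace ends the token
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());     // flush the last token
        }
        return tokens;
    }
}
```

Because only whitespace separates tokens, punctuation stays attached: `Blog:` and `document.` come out as single tokens, just as in the WhitespaceAnalyzer output.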
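SimpleAnalyzer goes a step further than WhitespaceAnalyzer: it converts all letters to lowercase. The benefit is obvious: whether the input is This, tHis, THIS, or thIS, it is normalized to this, which makes matching easier. Besides normalizing case, SimpleAnalyzer also deals with punctuation, or rather, it splits words at punctuation (as the output shows, at any non-letter character, which is why "Victor's" yields 'victor' and 's'). For example, 'document.' becomes 'document' in SimpleAnalyzer's output.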
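The same idea can be sketched in a few lines of plain C#: treat every maximal run of letters as a token and lowercase it. This is an illustration only, not Lucene.Net code; `SimpleSketch` and `Tokenize` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of Lucene.Net.
public static class SimpleSketch
{
    // Emits lowercased runs of letters; every non-letter acts as a separator.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetter(c))
            {
                current.Append(char.ToLowerInvariant(c)); // normalize case
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());           // non-letter ends the token
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

With this rule the apostrophe in "Victor's" and the dots and slashes in the URL all act as separators, which is consistent with the SimpleAnalyzer output listed earlier.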
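StopAnalyzer looks very similar to SimpleAnalyzer, except that some words are missing from the result, such as 'this', 'is', 'a', and 'for'. Such words, which occur very frequently but carry little meaning, are usually called stop words, and they are dropped rather than added to the index. They are numerous but do little to distinguish one document's content from another's, so removing them reduces the size of the index.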
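Stop-word removal is essentially a set lookup applied to each token. The sketch below assumes the tokens have already been split and lowercased. The stop list contains only the four words named above, so it is deliberately incomplete and is not Lucene's actual list; `StopWordSketch` and `RemoveStopWords` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.Net.
public static class StopWordSketch
{
    // Illustrative stop list: just the four words mentioned in the text.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "this", "is", "a", "for" };

    // Returns the tokens that are not stop words, preserving their order.
    public static List<string> RemoveStopWords(IEnumerable<string> tokens)
    {
        var kept = new List<string>();
        foreach (string token in tokens)
        {
            if (!StopWords.Contains(token))
            {
                kept.Add(token);
            }
        }
        return kept;
    }
}
```

A hash set keeps each lookup cheap, so the filter costs roughly one probe per token regardless of how long the stop list grows.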
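StandardAnalyzer is more sophisticated. A word like "Victor's" becomes 'victor', without the "'s", and the URL is handled as well. We will analyze what StandardAnalyzer does later. The inheritance relationships among these Analyzers are shown in Figure 1.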

Figure 1. Class hierarchy of the analyzers

Now let's look back at how the Analyzers work (#0015~#0026). Analyzer is really a factory (Factory Pattern), see #0017: to use it, you ask it to create a TokenStream object. A TokenStream, as the name suggests, is a stream of tokens, that is, a sequence of tokens, each of type Token. Lines #0020~#0023 show how a TokenStream is consumed.

Next, let's unfold the details of the Analyzers step by step. Since Token is the element a TokenStream is made of, let's look at it first.

#0001 public sealed class Token
#0002 {
#0003     internal System.String termText; // the text of the term
#0004     internal int startOffset; // start in source text
#0005     internal int endOffset; // end in source text
#0006     internal System.String type_Renamed_Field = "word"; // lexical type
#0007
#0008     private int positionIncrement = 1;
#0009
#0010     public Token(System.String text, int start, int end)
#0011     public Token(System.String text, int start, int end, System.String typ)
#0012
#0013     public void SetPositionIncrement(int positionIncrement)
#0014     public int GetPositionIncrement()
#0015     public System.String TermText()
#0016     public int StartOffset()
#0017     public int EndOffset()
#0018     public System.String Type()
#0019 }

As you can see, Token stores the term's string (#0003), records its start and end offsets (#0004~#0005), and also carries type information (#0006). DumpAnalyzer calls TermText() to get the string.

Next, TokenStream:

#0001 public abstract class TokenStream
#0002 {
#0003     /// <summary>Returns the next token in the stream, or null at EOS. </summary>
#0004     public abstract Token Next();
#0005
#0006     /// <summary>Releases resources associated with this stream. </summary>
#0007     public virtual void Close()
#0008     {}
#0009 }

TokenStream is an abstract class with only two methods: Next() and Close(). Next() returns the current token and advances to the next one; when there are no more tokens it returns null.

Analyzer is also an abstract class. The default TokenStream(reader) constructs and returns a TokenStream object by delegating (#0005) to a two-argument overload that is not shown in this excerpt.

#0001 public abstract class Analyzer
#0002 {
#0003     public virtual TokenStream TokenStream(System.IO.TextReader reader)
#0004     {
#0005         return TokenStream(null, reader);
#0006     }
#0007 }

Now look at WhitespaceTokenizer, one of the TokenStream subclasses (by way of CharTokenizer and Tokenizer):

#0001 public class WhitespaceTokenizer:CharTokenizer
#0002 {
#0003     public WhitespaceTokenizer(System.IO.TextReader in_Renamed):base(in_Renamed)
#0004     {}
#0005     protected internal override bool IsTokenChar(char c)
#0006     {
#0007         return !System.Char.IsWhiteSpace(c);
#0008     }
#0009 }
#0010
#0021 public abstract class CharTokenizer : Tokenizer
#0022 {
#0023     public CharTokenizer(System.IO.TextReader input) : base(input)
#0024     {}
#0025
#0026     private int offset = 0, bufferIndex = 0, dataLen = 0;
#0027     private const int MAX_WORD_LEN = 255;
#0028     private const int IO_BUFFER_SIZE = 1024;
#0029     private char[] buffer = new char[MAX_WORD_LEN];
#0030     private char[] ioBuffer = new char[IO_BUFFER_SIZE];
#0031
#0032     protected internal abstract bool IsTokenChar(char c);
#0033     protected internal virtual char Normalize(char c)
#0034     {
#0035         return c;
#0036     }
#0037     public override Token Next()
#0038     {
#0039
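The CharTokenizer listing breaks off inside Next(), so the real loop is not shown here. The following is a simplified sketch, not Lucene.Net's implementation, of how subclass hooks like IsTokenChar and Normalize can drive tokenization, and of how a caller drains a stream until Next() returns null. All the `Sketch...` names are hypothetical. The sketch also makes two simplifying assumptions: it works on an in-memory string instead of buffering from a TextReader, and it returns plain strings instead of Token objects.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for TokenStream: Next() yields a term, or null at end of stream.
public abstract class SketchTokenStream
{
    public abstract string Next();
}

// Hypothetical stand-in for CharTokenizer: subclasses decide what counts as a token character.
public abstract class SketchCharTokenizer : SketchTokenStream
{
    private readonly string input;
    private int position = 0;

    protected SketchCharTokenizer(string input)
    {
        this.input = input;
    }

    protected abstract bool IsTokenChar(char c);

    // Default normalization leaves the character unchanged.
    protected virtual char Normalize(char c)
    {
        return c;
    }

    public override string Next()
    {
        var term = new StringBuilder();
        while (position < input.Length)
        {
            char c = input[position++];
            if (IsTokenChar(c))
            {
                term.Append(Normalize(c));  // extend the current term
            }
            else if (term.Length > 0)
            {
                break;                      // a separator ends the current term
            }
            // otherwise: skip separators that precede the term
        }
        return term.Length > 0 ? term.ToString() : null;
    }
}

public sealed class SketchWhitespaceTokenizer : SketchCharTokenizer
{
    public SketchWhitespaceTokenizer(string input) : base(input) { }

    protected override bool IsTokenChar(char c)
    {
        return !char.IsWhiteSpace(c);
    }
}

public static class SketchDump
{
    // Same consumption pattern as DumpAnalyzer: call Next() until it returns null.
    public static List<string> Drain(SketchTokenStream stream)
    {
        var terms = new List<string>();
        string term;
        while ((term = stream.Next()) != null)
        {
            terms.Add(term);
        }
        return terms;
    }
}
```

The base class owns the scanning loop, and each tokenizer overrides only the character test (and optionally the normalization). This is why WhitespaceTokenizer in the listing needs little more than a one-line IsTokenChar.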
