Analyzers in Lucene

Before reading this article, it is recommended to first read 随风's DotLucene源码浅读笔记(1) : Lucene.Net.Analysis to get an overview of Lucene's Analyzer.

The analyzers that ship with Lucene do not meet our business requirements, so a custom Analyzer is needed; as a starting point I studied the implementations of the built-in analyzers. While doing so, I found that KeywordAnalyzer can be simplified.

1. Modifying KeywordAnalyzer

The tokenization code in Analysis\KeywordTokenizer.cs:
public override Token Next()
{
    // Relies on fields declared elsewhere in KeywordTokenizer:
    // a bool done flag and a char[] buffer sized with DEFAULT_BUFFER_SIZE by default.
    if (!done)
    {
        done = true;
        System.Text.StringBuilder buffer = new System.Text.StringBuilder();
        int length;
        while (true)
        {
            length = input.Read((System.Char[]) this.buffer, 0, this.buffer.Length);
            if (length <= 0)
                break;

            buffer.Append(this.buffer, 0, length);
        }

        System.String text = buffer.ToString();
        return new Token(text, 0, text.Length);
    }

    return null;
}

private const int DEFAULT_BUFFER_SIZE = 256;

During tokenization the code checks the size of the term: when KeywordAnalyzer is used as the analyzer, the buffer only holds 256 characters per read. In ordinary use a term longer than 256 characters is hardly ever tokenized with KeywordAnalyzer, so the check can be removed (note: the term to be tokenized must then not exceed 256 characters). The simplified code follows:
using System;
using Lucene.Net.Analysis;

/// <summary>
/// A simple implementation of KeywordAnalyzer (assumes each keyword is within 255 characters)
/// <remark>Custom KeywordAnalyzer</remark>
/// </summary>
public class SimpleKeywordAnalyzer : Analyzer
{
    public override TokenStream TokenStream(String fieldName, System.IO.TextReader reader)
    {
        return new CustomCharTokenizer(reader);
    }

    // CharTokenizer ships with Lucene and performs simple character-based tokenization
    public class CustomCharTokenizer : CharTokenizer
    {
        public CustomCharTokenizer(System.IO.TextReader reader)
            : base(reader)
        {
        }

        // Every character counts as a token character, so the whole input is returned as one token
        protected internal override bool IsTokenChar(char c)
        {
            return true;
        }
    }
}
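To check that the whole input really comes back as a single token, here is a minimal sketch (not part of the original implementation) that runs a string through SimpleKeywordAnalyzer and prints every token it produces. It only uses the TokenStream/Token calls already shown in this article (TokenStream(fieldName, reader), Token.Next(), Token.TermText()); the demo class name is made up for illustration:

using System;
using System.IO;
using Lucene.Net.Analysis;

public class SimpleKeywordAnalyzerDemo
{
    public static void Main()
    {
        Analyzer analyzer = new SimpleKeywordAnalyzer();
        // Everything read from the TextReader should come back as one single token.
        TokenStream stream = analyzer.TokenStream("name", new StringReader("BenQ Joybook R23E (103)"));
        for (Token token = stream.Next(); token != null; token = stream.Next())
        {
            // Expected: exactly one line containing the entire original string
            Console.WriteLine(token.TermText());
        }
    }
}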


2. A custom Analyzer

In this project we need to search IT product names. Product names are complicated and full of model numbers.
For example, notebooks: BenQ Joybook R23E (103), HP Pavilion dv1617.
Users almost never type the full model number; they are far more likely to search on fragments such as r23 or dv (searches are restricted to more than one character, since a single character carries almost no meaning). The analyzers that ship with Lucene clearly cannot handle this, so a custom analyzer is required.

Writing a custom analyzer generally means:
1. Defining the tokenization rules (implementing a Tokenizer)
2. Defining the token filtering rules (implementing a TokenFilter)

Choosing the tokenization rule:
First I looked at how large e-commerce sites such as taobao handle product search. Taobao returns results for practically any input; the rule appears to be that a product is listed as long as its name contains the characters that were typed. For example, searching for "件硬" simply returns the products whose names contain those characters, which strongly suggests single-character tokenization with no semantic analysis.
Given how complex product names are and how unpredictable user input is, this rule fits product search very well, so I adopted it here as well.

Rule: tokenize character by character; each letter and digit becomes a single token, and all other characters are filtered out.
For example, BenQ Joybook R23E (103) tokenizes to: b e n q j o y b o o k r 2 3 e 1 0 3

Implementation:
using System;
using System.IO;
using System.Collections;
using System.Globalization;
using Lucene.Net.Analysis;

/// <summary>
/// Title: ProductAnalyzer
/// Description:
///   Subclass of org.apache.lucene.analysis.Analyzer
///   built from a ProductTokenizer, filtered with ProductFilter.
/// Copyright:   Copyright (c) 2006.07.19
/// @author try
/// </summary>
public class ProductAnalyzer : Analyzer
{
    public ProductAnalyzer()
    {
    }

    /// <summary>
    /// Creates a TokenStream which tokenizes all the text in the provided Reader.
    /// </summary>
    /// <returns>A TokenStream built from a ProductTokenizer filtered with ProductFilter.</returns>
    public override sealed TokenStream TokenStream(String fieldName, TextReader reader)
    {
        TokenStream result = new ProductTokenizer(reader);
        return new ProductFilter(result);
    }
}

public sealed class ProductTokenizer : Tokenizer
{
    public ProductTokenizer(TextReader _in)
    {
        input = _in;
    }

    private int offset = 0, bufferIndex = 0, dataLen = 0;
    private static int MAX_WORD_LEN = 255;
    private static int IO_BUFFER_SIZE = 1024;
    private char[] buffer = new char[MAX_WORD_LEN];
    private char[] ioBuffer = new char[IO_BUFFER_SIZE];
    private int length;
    private int start;

    private void Push(char c)
    {
        if (length == 0) start = offset - 1;       // start of token
        buffer[length++] = Char.ToLower(c);        // buffer it
    }

    private Token Flush()
    {
        if (length > 0)
        {
            return new Token(new String(buffer, 0, length), start, start + length);
        }
        else
            return null;
    }

    public override Token Next()
    {
        length = 0;
        start = offset;
        while (true)
        {
            char c;
            offset++;
            if (bufferIndex >= dataLen)
            {
                dataLen = input.Read(ioBuffer, 0, ioBuffer.Length);
                bufferIndex = 0;
            }
            if (dataLen == 0)
                return Flush();
            else
            {
                c = ioBuffer[bufferIndex++];
            }

            switch (Char.GetUnicodeCategory(c))
            {
                case UnicodeCategory.DecimalDigitNumber:
                    Push(c);
                    return Flush();
                case UnicodeCategory.LowercaseLetter:
                case UnicodeCategory.UppercaseLetter:
                    Push(c);
                    return Flush();
                case UnicodeCategory.OtherLetter:
                    if (length > 0)
                    {
                        bufferIndex--;
                        offset--;
                        return Flush();
                    }
                    Push(c);
                    return Flush();
                default:
                    if (length > 0) return Flush();
                    break;
            }
        }
    }
}

public sealed class ProductFilter : TokenFilter
{
    // Only English now, Chinese to be added later.
    public static String[] STOP_WORDS =
    {
        "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"
    };

    private Hashtable stopTable;

    public ProductFilter(TokenStream _in)
        : base(_in)
    {
        stopTable = new Hashtable(STOP_WORDS.Length);

        for (int i = 0; i < STOP_WORDS.Length; i++)
            stopTable[STOP_WORDS[i]] = STOP_WORDS[i];
    }

    public override Token Next()
    {
        for (Token token = input.Next(); token != null; token = input.Next())
        {
            String text = token.TermText();
            if (stopTable[text] == null)
            {
                switch (Char.GetUnicodeCategory(text[0]))
                {
                    case UnicodeCategory.LowercaseLetter:
                    case UnicodeCategory.UppercaseLetter:
                        return token;
                    case UnicodeCategory.OtherLetter:
                        return token;
                    case UnicodeCategory.DecimalDigitNumber:
                        return token;
                }
            }
        }
        return null;
    }
}
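As a quick sanity check (a sketch added for illustration, not from the original post), the following snippet runs a product name through ProductAnalyzer and prints the resulting tokens. With the rule above, every letter and digit should come out as its own lowercase token and the spaces and parentheses should be dropped; the same TokenStream/Token calls used earlier are assumed, and the demo class and field names are made up:

using System;
using System.IO;
using Lucene.Net.Analysis;

public class ProductAnalyzerDemo
{
    public static void Main()
    {
        Analyzer analyzer = new ProductAnalyzer();
        TokenStream stream = analyzer.TokenStream("productName", new StringReader("BenQ Joybook R23E (103)"));
        // Expected output: b e n q j o y b o o k r 2 3 e 1 0 3
        for (Token token = stream.Next(); token != null; token = stream.Next())
        {
            Console.Write(token.TermText() + " ");
        }
        Console.WriteLine();
    }
}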

In actual use, the search results have been quite satisfactory.




Reposted from: https://www.cnblogs.com/try/archive/2006/10/22/534950.html
