Implementing Chinese Word Segmentation in a Search Engine

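The Lucene.net standard tokenizer works very well for English text. For example, it handles e-mail addresses, IP addresses, and punctuation correctly. Unfortunately, it does not support segmenting Chinese text into words. So I modified its core code to extend it with Chinese word segmentation.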

Goal: add Chinese word segmentation to the tokenizer.

Result:

Input sentence: "我是中国人!I am chiness!Email:youpeizun126@126.com;IP:172.17.34.168"

Segmentation result:

我/是/中国人/中国/中/国/人/Email/youpeizun126@126.com/IP/172.17.34.168

Tasks to complete:

1. Load the dictionary

2. Extract a run of consecutive Chinese characters

3. Segment that run into words
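
Tasks 1 and 2 can be sketched independently of Lucene.net. The following is a minimal illustration, not the tokenizer code itself: `ChineseRuns`, `LoadWords`, `IsChinese`, and `ExtractRuns` are hypothetical names, the word list is assumed to hold one word per line, and a character is treated as Chinese when its Unicode category is `OtherLetter`, which also matches other scripts such as Japanese kana.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;

public static class ChineseRuns
{
    // Task 1: read a word list (one word per line) into a set for fast lookups.
    // Assumption: the caller supplies the reader, e.g. a StreamReader over words.txt.
    public static HashSet<string> LoadWords(TextReader reader)
    {
        var words = new HashSet<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            if (line.Length > 0)
                words.Add(line);   // HashSet silently ignores duplicates
        }
        return words;
    }

    // A character counts as Chinese here when its category is OtherLetter (Lo).
    public static bool IsChinese(char c)
    {
        return char.GetUnicodeCategory(c) == UnicodeCategory.OtherLetter;
    }

    // Task 2: return every maximal run of consecutive Chinese characters in text.
    public static List<string> ExtractRuns(string text)
    {
        var runs = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (IsChinese(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                runs.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
            runs.Add(current.ToString());
        return runs;
    }
}
```

For the sample sentence, `ChineseRuns.ExtractRuns("我是中国人!I am chiness!")` yields a single run, `我是中国人`. The sketch places no upper limit on the length of a run.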



The flowchart below shows the design for extending the Lucene.net standard tokenizer to support Chinese word segmentation.



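Next is the core code I wrote to extend the Lucene.net standard tokenizer. It consists mainly of three functions, which load the dictionary, extract a run of consecutive Chinese characters, and perform the Chinese word segmentation.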
 


/*
 
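#region Load the Chinese dictionary
        // Loads the word list (one word per line) into the directory hashtable.
        // NOTE: the path argument is not used; the file name "words.txt" is hardcoded.
        public void LoadDirectory(string path)
        {
            if (!File.Exists("words.txt"))
                return;
            // Load only once; later calls keep the existing table.
            if (directory != null)
                return;
            directory = new System.Collections.Hashtable();
            System.Diagnostics.Debug.Write("begin read words");
            try
            {
                // The using block closes the file even if reading fails.
                using (TextReader tr_words = new StreamReader("words.txt", System.Text.Encoding.Default))
                {
                    string word = null;
                    while ((word = tr_words.ReadLine()) != null)
                    {
                        // Skip duplicates; Hashtable.Add throws on an existing key.
                        if (directory[word] == null)
                        {
                            directory.Add(word, word);
                        }
                    }
                }
            }
            catch (SystemException ex)
            {
                // Keep whatever was loaded so far, but do not hide the failure.
                System.Diagnostics.Debug.WriteLine(ex.Message);
            }
        }
#endregion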
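#region Extract a run of consecutive Chinese characters
        // Starting from the current token, collects consecutive Chinese tokens into
        // chinesstext (fewer than 256 entries) and saves the first non-Chinese token
        // in cn_end_token so that normal tokenizing can resume from it.
        private void InitChinessText()
        {
            textlengh = 0;
            cn_index = 0;
            chinesstext[0] = token.image;
            textlengh++;
            cn_start = token.beginColumn;
            isCnToken = true;
            bool isCN = true;

            // NOTE: if the loop stops at the 255 limit, cn_end_token is not updated.
            while (isCN && textlengh < 255)
            {
                token = token_source.GetNextToken();
                if (token.kind != 0)
                {
                    // Chinese characters have the Unicode category OtherLetter.
                    isCN = Char.GetUnicodeCategory(token.image, 0).Equals(System.Globalization.UnicodeCategory.OtherLetter);
                }
                else
                    isCN = false;
                if (isCN)
                {
                    chinesstext[textlengh] = token.image;
                    textlengh++;
                }
                else
                {
                    cn_end_token = token;
                }
            }
            // Candidate words are at most 4 characters long.
            if (textlengh >= 4)
            {
                wordlengh = 4;
            }
            else
                wordlengh = textlengh;
        }
#endregion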
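#region Chinese word segmentation algorithm
        // Returns the next piece of the run collected by InitChinessText. Starting at
        // cn_index, it tries the 4-, 3- and 2-character candidates in turn and returns
        // each one found in the dictionary; then it returns the single character and
        // advances cn_index.
        private string GetNextTokenText()
        {
            string text = null;

            if (wordlengh == 4)
            {
                text = chinesstext[cn_index] + chinesstext[cn_index + 1] + chinesstext[cn_index + 2] + chinesstext[cn_index + 3];
                wordlengh--;
                if (directory[text] != null)
                {
                    goto return_;
                }
            }
            if (wordlengh == 3)
            {
                text = chinesstext[cn_index] + chinesstext[cn_index + 1] + chinesstext[cn_index + 2];
                wordlengh--;
                if (directory[text] != null)
                {
                    goto return_;
                }
            }
            if (wordlengh == 2)
            {
                text = chinesstext[cn_index] + chinesstext[cn_index + 1];
                wordlengh--;
                if (directory[text] != null)
                {
                    goto return_;
                }
            }
            if (wordlengh == 1)
            {
                // The single character is always returned, whether or not it is in the dictionary.
                text = chinesstext[cn_index];
                cn_index++;
                if ((textlengh - cn_index) >= 4)
                {
                    wordlengh = 4;
                }
                else if ((textlengh - cn_index) == 0)
                {
                    // Run exhausted: leave Chinese mode and resume from the saved token.
                    isCnToken = false;
                    jj_ntk = cn_end_token.kind;
                    token = new Token();
                    token.next = cn_end_token;
                }
                else
                {
                    wordlengh = textlengh - cn_index;
                }
            }
            return_:
            return text;
        }

#endregion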
        
*/
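
The stateful logic in `GetNextTokenText` may be easier to follow as a plain loop. The sketch below is a standalone approximation, not a drop-in replacement: `Segmenter.Segment` is a hypothetical name, the dictionary is passed in as a set, and the Lucene.net token bookkeeping (`jj_ntk`, `cn_end_token`) is left out. It assumes that 4-character dictionary words should be emitted in the same way as 3- and 2-character ones. At each position it emits every dictionary word of length 4, 3, and 2 that starts there, longest first, and then the single character.

```csharp
using System;
using System.Collections.Generic;

public static class Segmenter
{
    // Same upper bound on word length as wordlengh in the code above.
    private const int MaxWordLength = 4;

    // run: a string of consecutive Chinese characters; dictionary: the known words.
    public static List<string> Segment(string run, ISet<string> dictionary)
    {
        var pieces = new List<string>();
        for (int start = 0; start < run.Length; start++)
        {
            int longest = Math.Min(MaxWordLength, run.Length - start);
            // Multi-character candidates are emitted only if the dictionary has them.
            for (int len = longest; len >= 2; len--)
            {
                string candidate = run.Substring(start, len);
                if (dictionary.Contains(candidate))
                    pieces.Add(candidate);
            }
            // The single character is always emitted.
            pieces.Add(run.Substring(start, 1));
        }
        return pieces;
    }
}
```

With a dictionary containing `中国` and `中国人`, `Segmenter.Segment("我是中国人", dictionary)` returns 我, 是, 中国人, 中国, 中, 国, 人, which matches the Chinese part of the segmentation result shown earlier.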


That's all. Thank you for reading.
