Introduction to the SharpICTCLAS Word Segmentation System (1): Reading the Dictionary Files

princes_fan

Original article: http://www.cnblogs.com/zhenyulu/articles/668024.html

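The overall ICTCLAS segmentation pipeline consists of five steps: 1) initial segmentation; 2) part-of-speech (POS) tagging; 3) person-name and place-name recognition; 4) re-segmentation; 5) POS re-tagging. The first step, initial segmentation, is itself divided into three sub-steps: 1) atomic segmentation; 2) finding every possible way of combining the atoms into words; 3) rough segmentation of Chinese words with the N-shortest-paths method.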

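Of everything involved, reading the dictionary files is the most basic function. ICTCLAS stores its dictionaries in the Data directory. The commonly used ones are coreDict.dct (the core dictionary), BigramDict.dct (word-to-word association data), nr.dct (person names), ns.dct (place names), and tr.dct (transliterated person names). They all share exactly the same file format and are all parsed by the CDictionary class. For a deeper look at the ICTCLAS dictionary structure, see sinboy's article 《ICTCLAS分词系统研究（二）--词典结构》 ("A Study of the ICTCLAS Segmentation System (2): Dictionary Structure"), which describes it in detail. Here I only present the SharpICTCLAS implementation.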

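First come the definitions of the basic elements. SharpICTCLAS adjusts some of the original names to make them more meaningful and closer to C# conventions. The code is as follows: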

WordDictionaryElement.cs:

 
    using System;
    using System.Collections.Generic;
    using System.Text;

    namespace SharpICTCLAS
    {
        //==================================================
        // Originally predefined in the DynamicArray.h file
        //==================================================
        public class ArrayChainItem
        {
            public int col, row;     // row and column
            public double value;     // the value of the array element
            public int nPOS;
            public int nWordLen;
            public string sWord;
            // The possible POS of the word related to the segmentation graph
            public ArrayChainItem next;
        }

        public class WordResult
        {
            // The word
            public string sWord;
            // The POS of the word
            public int nPOS;
            // The -log(frequency/MAX)
            public double dValue;
        }

        //--------------------------------------------------
        // Data structure for a word item
        //--------------------------------------------------
        public class WordItem
        {
            public int nWordLen;
            // The word
            public string sWord;
            // The process or information handle of the word
            public int nPOS;
            // The number of times the word appears
            public int nFrequency;
        }

        //--------------------------------------------------
        // Data structure for a dictionary index table item
        //--------------------------------------------------
        public class IndexTableItem
        {
            // The number of words whose initial character is sInit
            public int nCount;
            // The head of the word items
            public WordItem[] WordItems;
        }

        //--------------------------------------------------
        // Data structure for a word item chain
        //--------------------------------------------------
        public class WordChain
        {
            public WordItem data;
            public WordChain next;
        }

        //--------------------------------------------------
        // Data structure for a dictionary modify table item
        //--------------------------------------------------
        public class ModifyTableItem
        {
            // The number of words whose initial character is sInit
            public int nCount;
            // The number of deleted items in the index table
            public int nDelete;
            // The head of the word items
            public WordChain pWordItemHead;
        }
    }

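ModifyTableItem is the building block of ModifyTable. During actual segmentation, however, the dictionary is usually in a read-only state, so ModifyTable, which exists for modifying the dictionary, plays little role. For that reason I leave the ModifyTable code out of what follows.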

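With the basic elements defined, the next step is the dictionary class. In the original C++ code every class name starts with a capital "C", and the dictionary class is named CDictionary. In SharpICTCLAS I dropped the leading "C" and, to avoid a clash with the framework's Dictionary class, named it WordDictionary. This class is mainly responsible for reading, writing, and searching the dictionary. Let's see how it reads a dictionary file: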

Reading the dictionary file:

 
    using System;
    using System.IO;
    using System.Text;

    public class WordDictionary
    {
        public bool bReleased = true;
        public IndexTableItem[] indexTable;
        public ModifyTableItem[] modifyTable;

        public bool Load(string sFilename)
        {
            return Load(sFilename, false);
        }

        public bool Load(string sFilename, bool bReset)
        {
            int frequency, wordLength, pos;   // frequency, word length, POS
            bool isSuccess = true;
            FileStream fileStream = null;
            BinaryReader binReader = null;

            try
            {
                fileStream = new FileStream(sFilename, FileMode.Open, FileAccess.Read);
                if (fileStream == null)
                    return false;

                binReader = new BinaryReader(fileStream, Encoding.GetEncoding("gb2312"));
                indexTable = new IndexTableItem[Predefine.CC_NUM];
                bReleased = false;

                for (int i = 0; i < Predefine.CC_NUM; i++)
                {
                    // Read how many words start with this Chinese character
                    indexTable[i] = new IndexTableItem();
                    indexTable[i].nCount = binReader.ReadInt32();
                    if (indexTable[i].nCount <= 0)
                        continue;

                    indexTable[i].WordItems = new WordItem[indexTable[i].nCount];
                    for (int j = 0; j < indexTable[i].nCount; j++)
                    {
                        indexTable[i].WordItems[j] = new WordItem();
                        frequency = binReader.ReadInt32();    // read the frequency
                        wordLength = binReader.ReadInt32();   // read the word length
                        pos = binReader.ReadInt32();          // read the POS

                        if (wordLength > 0)
                            indexTable[i].WordItems[j].sWord = Utility.ByteArray2String(binReader.ReadBytes(wordLength));
                        else
                            indexTable[i].WordItems[j].sWord = "";

                        // Reset the frequency
                        if (bReset)
                            indexTable[i].WordItems[j].nFrequency = 0;
                        else
                            indexTable[i].WordItems[j].nFrequency = frequency;

                        indexTable[i].WordItems[j].nWordLen = wordLength;
                        indexTable[i].WordItems[j].nPOS = pos;
                    }
                }
            }
            catch (Exception e)
            {
                Console.WriteLine(e.Message);
                isSuccess = false;
            }
            finally
            {
                if (binReader != null)
                    binReader.Close();
                if (fileStream != null)
                    fileStream.Close();
            }

            return isSuccess;
        }

        //......
    }
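To make the on-disk layout that Load walks through more concrete, here is a minimal sketch that writes one small unit in the same shape (a count, then frequency, byte length, POS, and raw bytes for each word) to a MemoryStream and reads it back. The names Entry, DictLayoutDemo, WriteUnit, ReadUnit, and RoundTrip are hypothetical, and the bytes are decoded as ASCII rather than GB2312 purely to keep the sample self-contained. It is an illustration of the record layout, not SharpICTCLAS code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper type for illustration; not part of SharpICTCLAS.
public class Entry
{
    public int Frequency;
    public int Pos;
    public string Rest = "";   // the word without its first character
}

public static class DictLayoutDemo
{
    // Writes one unit: count, then (frequency, byte length, POS, bytes) per entry.
    public static void WriteUnit(BinaryWriter writer, IList<Entry> entries)
    {
        writer.Write(entries.Count);
        foreach (Entry e in entries)
        {
            byte[] bytes = Encoding.ASCII.GetBytes(e.Rest);
            writer.Write(e.Frequency);
            writer.Write(bytes.Length);
            writer.Write(e.Pos);
            writer.Write(bytes);   // raw bytes, no length prefix
        }
    }

    // Reads one unit back in the same field order that Load uses.
    public static List<Entry> ReadUnit(BinaryReader reader)
    {
        int count = reader.ReadInt32();
        var result = new List<Entry>();
        for (int j = 0; j < count; j++)
        {
            int frequency = reader.ReadInt32();
            int length = reader.ReadInt32();
            int pos = reader.ReadInt32();
            string rest = length > 0
                ? Encoding.ASCII.GetString(reader.ReadBytes(length))
                : "";
            result.Add(new Entry { Frequency = frequency, Pos = pos, Rest = rest });
        }
        return result;
    }

    // Round-trips a list of entries through an in-memory stream.
    public static List<Entry> RoundTrip(IList<Entry> entries)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new BinaryWriter(stream, Encoding.ASCII, true))
            {
                WriteUnit(writer, entries);
            }
            stream.Position = 0;
            using (var reader = new BinaryReader(stream, Encoding.ASCII, true))
            {
                return ReadUnit(reader);
            }
        }
    }
}
```

Because the word bytes carry no terminator, the length field has to be read before the bytes, which is why Load reads frequency, length, and POS first and only then calls ReadBytes.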

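The listing below is excerpted from the dictionary units with CCID 2, 3, 4, and 5. CCID takes 6768 values, one per Chinese character (0 to 6767 when used as the zero-based index into indexTable, as the loader above does). Every word that starts with a given character is recorded in that character's unit. The stored words omit their first character (I have added it back in parentheses), because the first character is simply the one the unit belongs to. Each entry records the word length, frequency, POS, and the word itself. The length counts the bytes of the stored remainder, so each GB2312 Chinese character counts as 2 and a single-character word has length 0.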

Note in particular that within a unit the words are sorted by CCID! This is crucial for the analysis later on.

Excerpt from the ICTCLAS dictionary:

    Character: 埃, ID: 2
    Length  Freq  POS  Word
         0   128  h    (埃)
         0     0  j    (埃)
         2     4  n    (埃)镑
         2    28  ns   (埃)镑
         4     4  n    (埃)菲尔
         2   511  ns   (埃)及
         4     4  ns   (埃)克森
         6     2  ns   (埃)拉特湾
         4     4  nr   (埃)里温
         6     2  nz   (埃)默鲁市
         2    27  n    (埃)塞
         8    64  ns   (埃)塞俄比亚
        22     2  ns   (埃)塞俄比亚联邦民主共和国
         4     3  ns   (埃)塞萨
         4     4  ns   (埃)舍德
         6     2  nr   (埃)斯特角
         4     2  ns   (埃)松省
         4     3  nr   (埃)特纳
         6     2  nz   (埃)因霍温
    ====================================
    Character: 挨, ID: 3
    Length  Freq  POS  Word
         0    56  h    (挨)
         2     1  j    (挨)次
         2    19  n    (挨)打
         2     3  ns   (挨)冻
         2     1  n    (挨)斗
         2     9  ns   (挨)饿
         2     4  ns   (挨)个
         4     2  ns   (挨)个儿
         6    17  nr   (挨)家挨户
         2     1  nz   (挨)近
         2     0  n    (挨)骂
         6     1  ns   (挨)门挨户
         2     1  ns   (挨)批
         2     0  ns   (挨)整
         2    12  ns   (挨)着
         2     0  nr   (挨)揍
    ====================================
    Character: 哎, ID: 4
    Length  Freq  POS  Word
         0    10  h    (哎)
         2     3  j    (哎)呀
         2     2  n    (哎)哟
    ====================================
    Character: 唉, ID: 5
    Length  Freq  POS  Word
         0     9  h    (唉)
         6     4  j    (唉)声叹气
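The IDs in this excerpt line up with each character's position in the GB2312 code table. A minimal sketch of that mapping follows, assuming the usual GB2312 layout in which the Chinese characters start at lead byte 0xB0 and trail byte 0xA1, with 94 trail bytes per lead byte. The class and method names are hypothetical, and the formula is my reading of the numbering rather than code quoted from SharpICTCLAS.

```csharp
using System;

// Hypothetical helper; assumes GB2312 Chinese characters start at 0xB0 0xA1
// and that each lead byte covers 94 trail bytes (0xA1..0xFE).
public static class CcidDemo
{
    public const int CharsPerRow = 94;
    public const int RowCount = 72;                       // lead bytes 0xB0..0xF7
    public const int SlotCount = RowCount * CharsPerRow;  // 6768

    // Maps the two GB2312 bytes of a Chinese character to its slot number,
    // or returns -1 when the bytes fall outside the Chinese character area.
    public static int FromGb2312Bytes(byte lead, byte trail)
    {
        if (lead < 0xB0 || lead > 0xF7 || trail < 0xA1 || trail > 0xFE)
            return -1;
        return (lead - 0xB0) * CharsPerRow + (trail - 0xA1);
    }
}
```

With the GB2312 bytes of 埃 (0xB0 0xA3) this gives 2, and with those of 唉 (0xB0 0xA6) it gives 5, which agrees with the unit IDs shown in the excerpt.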

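Also note that a word may have more than one part of speech, so the same word can appear several times in the dictionary with different POS values. To locate an entry uniquely, you must specify both the word and its POS.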

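The operation used most often in the WordDictionary class is word lookup, implemented by the FindInOriginalTable method. In the original ICTCLAS code this method serves several lookup needs at once, which makes its structure and code fairly complex. In SharpICTCLAS I overloaded it, writing a separate FindInOriginalTable for each lookup purpose, which simplifies both the interface and the code. One of these overloads is shown below; it checks whether a word with a given POS exists.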

One overload of the FindInOriginalTable method:

 
    private bool FindInOriginalTable(int nInnerCode, string sWord, int nPOS)
    {
        WordItem[] pItems = indexTable[nInnerCode].WordItems;

        int nStart = 0, nEnd = indexTable[nInnerCode].nCount - 1;
        int nMid = (nStart + nEnd) / 2, nCmpValue;

        // Binary search; nPOS == -1 matches any POS
        while (nStart <= nEnd)
        {
            nCmpValue = Utility.CCStringCompare(pItems[nMid].sWord, sWord);

            if (nCmpValue == 0 && (pItems[nMid].nPOS == nPOS || nPOS == -1))
                return true;   // found it
            else if (nCmpValue < 0 || (nCmpValue == 0 && pItems[nMid].nPOS < nPOS && nPOS != -1))
                nStart = nMid + 1;
            else if (nCmpValue > 0 || (nCmpValue == 0 && pItems[nMid].nPOS > nPOS && nPOS != -1))
                nEnd = nMid - 1;

            nMid = (nStart + nEnd) / 2;
        }

        return false;
    }
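The lookup above only works because the entries in a unit are sorted, first by word and then by POS. The following self-contained sketch shows the same idea on a small array. DemoWordItem and BinarySearchDemo are hypothetical names, and string.CompareOrdinal stands in for Utility.CCStringCompare, whose exact ordering is not shown in this article, so treat the comparison as an assumption.

```csharp
using System;

// Minimal stand-in for the article's WordItem; only the fields used here.
public class DemoWordItem
{
    public string sWord = "";
    public int nPOS;
}

public static class BinarySearchDemo
{
    // Returns true when an item with the given word and POS exists.
    // Passing -1 as nPOS matches any POS, as in the article's overload.
    // Assumes items are sorted by word (ordinal order), then by POS;
    // string.CompareOrdinal stands in for Utility.CCStringCompare.
    public static bool Contains(DemoWordItem[] items, string sWord, int nPOS)
    {
        int start = 0, end = items.Length - 1;
        while (start <= end)
        {
            int mid = start + (end - start) / 2;
            int cmp = string.CompareOrdinal(items[mid].sWord, sWord);

            // Same word: fall back to comparing POS unless any POS is accepted.
            if (cmp == 0 && nPOS != -1)
                cmp = items[mid].nPOS.CompareTo(nPOS);

            if (cmp == 0)
                return true;
            if (cmp < 0)
                start = mid + 1;
            else
                end = mid - 1;
        }
        return false;
    }
}
```

Folding the POS comparison into a single cmp value keeps the loop to one three-way branch; the behaviour for a sorted input is intended to match the article's overload, including the -1 wildcard.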

The remaining features are not covered here.

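Summary

1. The WordDictionary class implements reading, writing, modifying, and searching the dictionary.

2. The dictionary records the words that start with each of the 6768 Chinese characters, together with their POS and frequency; it is worth understanding this structure in detail.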
