THUCTC Source Code Walkthrough (Part 2)

multiangle

Article link: https://blog.csdn.net/u014595019/article/details/51446226

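Having gotten a first feel for how THUCTC is used through the demo, it is time to dig into its structure and see how it is implemented. Only by understanding the code structure can we understand the underlying principles and optimizations, and it also makes it easier to build our own improvements on top of it.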

The main principle behind THUCTC

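First, the training text is segmented into words and term frequencies are counted. Using each term's frequency and the number of documents that contain it, a chi-square test selects the feature terms, and these are then used to build a document vector (DocumentVector). Each entry of the document vector stands for one feature (a term), and its value is the corresponding weight. The weight is usually computed with TF-IDF: a term's importance grows with the number of times it occurs in the document and shrinks as the number of documents containing it grows. Once the document vectors are built, training and classification are handed off to an existing library (liblinear or libsvm). In that sense, what THUCTC mainly does is vectorize documents.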
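To make the two statistics above concrete, here is a minimal sketch using the textbook formulas: the chi-square statistic of a 2x2 term/class contingency table, and TF-IDF as tf * log(N / df). The class and method names are made up for illustration, and THUCTC's own code may use different variants (smoothing, normalization), so treat this as an assumption rather than a copy of its implementation.

```java
// Hypothetical helper, not part of THUCTC: textbook formulas only.
class FeatureWeightSketch {
    /**
     * Chi-square statistic for one (term, class) pair.
     * a = docs in the class that contain the term
     * b = docs outside the class that contain the term
     * c = docs in the class that do not contain the term
     * d = docs outside the class that do not contain the term
     */
    static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double diff = (double) a * d - (double) b * c;
        double denom = (double) (a + b) * (c + d) * (a + c) * (b + d);
        return denom == 0 ? 0.0 : n * diff * diff / denom;
    }

    /** TF-IDF weight: count of the term in this document times log(numDocs / df). */
    static double tfIdf(int tfInDoc, long numDocs, int df) {
        return tfInDoc * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        // A term concentrated in one class scores high ...
        System.out.println(chiSquare(40, 10, 10, 40)); // 36.0
        // ... while a term spread evenly over both classes scores 0.
        System.out.println(chiSquare(25, 25, 25, 25)); // 0.0
        System.out.println(tfIdf(3, 1000, 10));
    }
}
```

Terms with a high chi-square score for some class are the ones worth keeping as features; TF-IDF then decides how much each kept term weighs in a given document.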

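The Word class

Word is one of the most basic classes in THUCTC. Its job is simple: it stores the information for a single word and is the basic unit of the lexicon.
It stores:

id the id of the word in the lexicon (Lexicon)
name the word itself, e.g. "中国"
tf term frequency: the number of times the word occurs across all documents
df document frequency: the number of documents in which the word occurs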
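The Word source itself is not quoted in this post, so the following is only a sketch of what such a data holder could look like. The field types are my assumption, inferred from how Lexicon uses them below (ids and counts parsed with Integer.parseInt, plus calls to getName() and clone()); the class name WordSketch is hypothetical.

```java
// Sketch of the data holder described above; field types are assumptions
// inferred from how Lexicon uses them, not copied from THUCTC.
class WordSketch implements Cloneable {
    int id;       // id of the word in the lexicon
    String name;  // the word itself
    int tf;       // term frequency: total occurrences across all documents
    int df;       // document frequency: number of documents containing the word

    String getName() {
        return name;
    }

    @Override
    public Object clone() {
        try {
            return super.clone(); // a shallow copy is enough: String is immutable
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen, the class is Cloneable
        }
    }
}
```

Because clone() returns an independent copy, changing the id of the copy leaves the original entry untouched, which is what the map method at the end of this post relies on.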

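The Lexicon class

Lexicon is the dictionary class. As the name suggests, it handles dictionary-related operations, and its basic building block is the Word object. Its fields are:

Hashtable<Integer, Word> idHash     // maps id to Word, so a Word can be looked up by id
Hashtable<String, Word> nameHash    // maps name to Word, so a Word can be looked up by its text
long numDocs        // number of documents processed so far
boolean locked      // when locked=true, no new Word can be added to the lexicon

Set<Integer> termSet // private; holds the ids of the words seen in the current document, used to count df

Next, let's look at the methods of the Lexicon class. The main ones are: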

[1] public Lexicon()
[2] public Word getWord( int id )
[3] public Word getWord( String name )
[4] public void addDocument ( String [] doc ) 
[5] public Word [] convertDocument ( String [] doc )
[6] protected Word buildWord ( String termString )
[7] public boolean loadFromFile( File f )
[8] public boolean loadFromInputStream(InputStream input)
[9] public Lexicon map( Map<Integer, Integer> translation )

Let's go through them one by one.

[1] public Lexicon()

public Lexicon () {
    idHash = new Hashtable<Integer, Word>(50000);
    nameHash = new Hashtable<String, Word>(50000);
    locked = false;
    numDocs = 0;
  }

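When a Lexicon object is constructed:
idHash and nameHash are created empty, with an initial capacity of 50000
locked is set to false, so new words can be added
numDocs is set to 0, meaning no documents have been processed yet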

[2]public Word getWord( int id ), [3] public Word getWord( String name )

  public Word getWord( int id ) {
    return idHash.get ( id );
  }
  public Word getWord( String name ) {
    return nameHash.get ( name );
  }

[2] and [3] can be read together: the former looks up a Word by id, the latter by name.

[4] public void addDocument ( String [] doc )

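This method processes a new document and updates the state of the lexicon (its word entries, tf and df values, and so on).
The input doc must be a String array, i.e. the document has already been segmented and each string in the array is one word.

public void addDocument ( String [] doc ) {
    termSet.clear(); // clear termSet to get ready for the new document
    for ( String token : doc ) {
      Word t = nameHash.get(token); // look up the Word object for each token via nameHash
      if ( t == null ) {  
        if ( locked )  
          continue;     // the word is unknown and locked=true: skip it, no new word is added
        t = new Word(); // locked=false: create a new word
        t.name = token;
        t.id = nameHash.size();
        t.tf = 0;       
        t.df = 0;
        nameHash.put(t.name, t); // register the new word in both nameHash and idHash
        idHash.put(t.id, t);
      }
      t.tf += 1;  // term frequency of this word goes up by 1
      if ( ! termSet.contains(t.id) ) {
        termSet.add(t.id); // first occurrence in this document: df goes up by 1; otherwise nothing happens
        t.df++;
      }
    }
    numDocs ++ ;  // one more document has been processed
}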
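The difference between tf and df is easiest to see on a tiny input. The sketch below is a simplified stand-in for addDocument, not THUCTC code: it keeps counts in a plain HashMap keyed by token, and every name in it is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for Lexicon.addDocument; all names here are hypothetical.
class TfDfSketch {
    // counts.get(token)[0] is tf, counts.get(token)[1] is df
    final Map<String, int[]> counts = new HashMap<>();
    long numDocs = 0;

    void addDocument(String[] doc) {
        Set<String> seen = new HashSet<>(); // plays the role of termSet
        for (String token : doc) {
            int[] c = counts.computeIfAbsent(token, k -> new int[2]);
            c[0]++;                 // tf counts every occurrence
            if (seen.add(token)) {  // add() returns true only the first time
                c[1]++;             // df counts each document at most once
            }
        }
        numDocs++;
    }

    int tf(String token) {
        int[] c = counts.get(token);
        return c == null ? 0 : c[0];
    }

    int df(String token) {
        int[] c = counts.get(token);
        return c == null ? 0 : c[1];
    }

    public static void main(String[] args) {
        TfDfSketch lex = new TfDfSketch();
        lex.addDocument(new String[] {"china", "beijing", "china"});
        lex.addDocument(new String[] {"china", "shanghai"});
        System.out.println(lex.tf("china") + " " + lex.df("china")); // 3 2
    }
}
```

"china" occurs three times in total but in only two documents, so tf is 3 and df is 2.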

[5] public Word [] convertDocument ( String [] doc )

Converts a segmented document into an array of Word objects. When a new word is encountered, it is added to the lexicon if locked=false and skipped if locked=true.

public Word [] convertDocument ( String [] doc ) {
    Word [] terms = new Word[doc.length];  // size the Word array by the length of doc
    int n = 0;
    for ( int i = 0 ; i < doc.length ; i++ ) {
      String token = doc[i];
      Word t = nameHash.get( token );  
      if ( t == null ) {
        if ( locked ) 
          continue;
        t = new Word ();
        t.name = token;
        t.tf = 1;
        t.df = 1;
        t.id = nameHash.size();
        nameHash.put(t.name, t);
        idHash.put(t.id, t);
      }
      terms[n++] = t; // up to here this is much like [4] addDocument,
      // but note that new words are skipped when locked=true,
      // so the Word array may end up shorter than doc
    }
    if ( n < terms.length ) {
      Word [] finalterms = new Word[n]; // handle the case mentioned above: trim the array to n entries
      for ( int i = 0 ; i < n ; i++ ) {
        finalterms[i] = terms[i];
      }
      terms = finalterms;
    }
    return terms;
}
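The locked case is the one that matters at classification time, when the vocabulary is fixed. Here is a small sketch of just that branch, with a plain Map standing in for nameHash and ids standing in for Word objects; the names are hypothetical and it is not THUCTC code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of the locked=true path of convertDocument; names are hypothetical.
class LockedConvertSketch {
    /** Maps tokens to ids, dropping tokens that the fixed vocabulary does not know. */
    static int[] convert(Map<String, Integer> vocab, String[] doc) {
        int[] ids = new int[doc.length];
        int n = 0;
        for (String token : doc) {
            Integer id = vocab.get(token);
            if (id == null) {
                continue; // locked: unknown tokens are skipped
            }
            ids[n++] = id;
        }
        return Arrays.copyOf(ids, n); // trim the unused tail
    }

    static int[] demo() {
        Map<String, Integer> vocab = new HashMap<>();
        vocab.put("china", 0);
        vocab.put("beijing", 1);
        return convert(vocab, new String[] {"china", "tokyo", "beijing", "china"});
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // [0, 1, 0]
    }
}
```

Four tokens go in and three ids come out, because "tokyo" is not in the vocabulary. Arrays.copyOf does the same trimming as the manual loop at the end of convertDocument.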

[6] protected Word buildWord ( String termString )

This method is protected, so it is not part of the public interface. It serves the loading code below: [8] loadFromInputStream calls it for every line it reads.
It builds a new Word object from one line of text.

  protected Word buildWord ( String termString ) {
      // example:  26163:附赠:114:94
    Word t = null;
    String [] parts = termString.split(":"); // split the string on ':' into an array of fields
    if ( parts.length == 4 ) {
      t = new Word(); // the fields are, in order: id, name, tf, df
      t.id = Integer.parseInt(parts[0]);
      t.name = parts[1].replace(COLON_REPLACER, ":");
      t.tf = Integer.parseInt(parts[2]);
      t.df = Integer.parseInt(parts[3]);
    }
    return t;
  }
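The replace(COLON_REPLACER, ":") call hints at why the name field needs special care: a raw ':' inside a word would break the four-field layout. The sketch below shows only the splitting step on made-up sample lines. The value of COLON_REPLACER and the code that writes the file are not shown in this post, so neither is reproduced here.

```java
// Sketch of the line-splitting step of buildWord; sample lines are made up.
class LexiconLineSketch {
    /** Splits one "id:name:tf:df" line; returns null unless it has exactly 4 fields. */
    static String[] splitLine(String line) {
        String[] parts = line.split(":");
        return parts.length == 4 ? parts : null;
    }

    public static void main(String[] args) {
        String[] ok = splitLine("26163:gift:114:94");
        System.out.println(ok[1] + " tf=" + Integer.parseInt(ok[2])); // gift tf=114

        // A raw ':' inside the name yields 5 fields, so the line is rejected.
        System.out.println(splitLine("7:a:b:5:3") == null); // true
    }
}
```

My reading is that whatever writes the lexicon file swaps ':' in a word for a placeholder first, and buildWord swaps it back after splitting.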

[7] public boolean loadFromFile( File f )

This method and loadFromInputStream below do similar things: both load a previously built lexicon, here from a local file.
It is straightforward: it opens the file as a FileInputStream and then calls [8] loadFromInputStream.

  public boolean loadFromFile( File f ) {
    FileInputStream fis;
    try {
      fis = new FileInputStream(f); // FileInputStream reads the file as a byte stream
    } catch (FileNotFoundException e) {
      return false;
    }
    return loadFromInputStream(fis);
  }

[8] public boolean loadFromInputStream(InputStream input)

Loads a previously built lexicon from a byte stream.

public boolean loadFromInputStream(InputStream input) {
    nameHash.clear();  // first clear the existing nameHash and idHash
    idHash.clear();
    try {
        // BufferedReader( InputStreamReader( FileInputStream( File ) ) )
        // FileInputStream reads the file as a byte stream
        // InputStreamReader decodes those bytes into characters using the given charset
        // (UTF-8 here, where a Chinese character usually takes 3 bytes)
        // BufferedReader adds buffering and can read a whole line at a time
      BufferedReader reader =
        new BufferedReader( new InputStreamReader( input, "UTF-8") );

      String termString;
      numDocs = Integer.parseInt(reader.readLine()); // the first line of the file is numDocs
      while ( (termString = reader.readLine()) != null ) {
          // each following line of the lexicon file holds id, name, tf, df
        Word t = buildWord( termString );  // call [6] buildWord
        if ( t != null ) {
          nameHash.put( t.name, t); // store every word in both nameHash and idHash
          idHash.put( t.id, t);
        }
      }
      reader.close();   // close the reader
    } catch (UnsupportedEncodingException e) {
      return false;
    } catch (IOException e) {
      return false;
    }
    return true;
  }
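The file layout implied by this method is a numDocs header line followed by one id:name:tf:df line per word. To try it without touching the disk, the sketch below reads that layout from an in-memory stream. It is a standalone illustration with hypothetical names and made-up sample data, and it uses try-with-resources and StandardCharsets.UTF_8 instead of the original's explicit close() and charset name.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch of reading the lexicon file layout; names and sample data are hypothetical.
class LexiconFormatSketch {
    /** Reads the numDocs header, then counts the well-formed entry lines. */
    static String summarize(InputStream input) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(input, StandardCharsets.UTF_8))) {
            long numDocs = Long.parseLong(reader.readLine().trim()); // header line
            int entries = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.split(":").length == 4) {
                    entries++; // same 4-field check as buildWord
                }
            }
            return numDocs + " docs, " + entries + " entries";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String demo() {
        String file = "2\n0:china:3:2\n1:beijing:1:1\nbroken line\n";
        return summarize(new ByteArrayInputStream(file.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2 docs, 2 entries
    }
}
```

The malformed third line is silently ignored, which matches the original: buildWord returns null for it and the loop just moves on. Passing a Charset constant also removes the need for the UnsupportedEncodingException branch.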

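[9] public Lexicon map( Map<Integer, Integer> translation )

Builds a new lexicon from a mapping table.

/**
   * Compacts the lexicon: using a map, the word numbered key is renumbered to value,
   * and words whose id is not among the keys are dropped
   * @param translation the mapping table
   */
  public Lexicon map( Map<Integer, Integer> translation ) {
      // build a new lexicon from the mapping table
      // each key is the word's id in the current lexicon
      // each value is the word's id in the new lexicon
      // note: the size of the new lexicon depends only on the size of the mapping table,
      // not on the size of the old lexicon
    Lexicon newlex = new Lexicon();
    Hashtable<Integer, Word> newIdHash = new Hashtable<Integer, Word>();
    Hashtable<String, Word> newNameHash = new Hashtable<String, Word>();

    for ( Entry<Integer, Integer> e : translation.entrySet()){
      Word w = idHash.get(e.getKey());
      // note: a key that is missing from idHash leaves w null, and the next line then throws NullPointerException
      Word nw = (Word) w.clone();
      nw.id = e.getValue();
      newIdHash.put(nw.id, nw);
      newNameHash.put(nw.getName(), nw);
    }
    newlex.idHash = newIdHash;
    newlex.nameHash = newNameHash;
    newlex.numDocs = this.numDocs;
    return newlex;
  }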
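A typical use of such a mapping, I assume, is after feature selection: the selected words get new consecutive ids and everything else is dropped. The sketch below shows the renumbering on a plain id-to-name table. It is not THUCTC code, the names are hypothetical, and unlike the original it skips keys that are missing from the old table.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the id renumbering done by Lexicon.map; names are hypothetical.
class LexiconMapSketch {
    /** Keeps only the ids listed as keys in translation and renumbers them to the values. */
    static Map<Integer, String> remap(Map<Integer, String> idToName,
                                      Map<Integer, Integer> translation) {
        Map<Integer, String> result = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : translation.entrySet()) {
            String name = idToName.get(e.getKey());
            if (name == null) {
                continue; // guard added in this sketch only
            }
            result.put(e.getValue(), name);
        }
        return result;
    }

    static Map<Integer, String> demo() {
        Map<Integer, String> old = new HashMap<>();
        old.put(0, "china");
        old.put(1, "the");
        old.put(2, "beijing");

        Map<Integer, Integer> translation = new HashMap<>();
        translation.put(0, 0); // keep "china" as id 0
        translation.put(2, 1); // keep "beijing", renumbered to id 1
        return remap(old, translation);
    }

    public static void main(String[] args) {
        Map<Integer, String> compact = demo();
        System.out.println(compact.get(0) + " " + compact.get(1)); // china beijing
    }
}
```

The old id 1 has no entry in the mapping table, so "the" disappears and the remaining ids are contiguous again.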