Learning Weka: Filter (2) - StringToWordVector

helen_PhDing

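To show more concretely how a Filter is used and how it works, we analyze a Filter named StringToWordVector. It is one of the most commonly used classes in text mining. It converts string attributes into a set of word attributes, and the kind of value stored in each word attribute is chosen through the filter's parameters: a 0/1 indicator (whether the word occurs in the instance), the term frequency, log(1 + term frequency), or the TF-IDF value.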

Below is the source code of StringToWordVector's input method:

  /**
   * Input an instance for filtering. Filter requires all
   * training instances be read before producing output.
   *
   * @param instance the input instance.
   * @return true if the filtered instance may now be
   * collected with output().
   * @throws IllegalStateException if no input structure has been defined.
   */
  public boolean input(Instance instance) throws Exception {

    if (getInputFormat() == null) {
      throw new IllegalStateException("No input instance format defined");
    }
    if (m_NewBatch) {
      resetQueue();
      m_NewBatch = false;
    }
    if (isFirstBatchDone()) {
      // Dictionary already built: convert and output this instance right away
      FastVector fv = new FastVector();
      int firstCopy = convertInstancewoDocNorm(instance, fv);
      Instance inst = (Instance) fv.elementAt(0);
      if (m_filterType != FILTER_NONE) {
        normalizeInstance(inst, firstCopy);
      }
      push(inst);
      return true;
    } else {
      // First batch: only buffer the instance until batchFinished() is called
      bufferInput(instance);
      return false;
    }
  }

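This method supports feeding instances incrementally. For the first batch, input() only calls bufferInput, which appends the instance to the dataset held by the inputFormat. Once all instances have been added, we move on to the batchFinished method: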

/**
   * Signify that this batch of input to the filter is finished.
   * If the filter requires all instances prior to filtering,
   * output() may now be called to retrieve the filtered instances.
   *
   * @return true if there are instances pending output.
   * @throws IllegalStateException if no input structure has been defined.
   */
  public boolean batchFinished() throws Exception {
 
    if (getInputFormat() == null) {
      throw new IllegalStateException("No input instance format defined");
    }
 
    // We only need to do something in this method
    // if the first batch hasn't been processed. Otherwise
    // input() has already done all the work.
    if (!isFirstBatchDone()) {
 
      // Determine the dictionary from the first batch (training data)
      determineDictionary();
 
      // Convert all instances w/o normalization
      FastVector fv = new FastVector();
      int firstCopy = 0;
      for (int i = 0; i < m_NumInstances; i++) {
        firstCopy = convertInstancewoDocNorm(getInputFormat().instance(i), fv);
      }

      // Need to compute average document length if necessary
      if (m_filterType != FILTER_NONE) {
        m_AvgDocLength = 0;
        for (int i = 0; i < fv.size(); i++) {
          Instance inst = (Instance) fv.elementAt(i);
          double docLength = 0;
          for (int j = 0; j < inst.numValues(); j++) {
            if (inst.index(j) >= firstCopy) {
              docLength += inst.valueSparse(j) * inst.valueSparse(j);
            }
          }
          m_AvgDocLength += Math.sqrt(docLength);
        }
        m_AvgDocLength /= m_NumInstances;
      }

      // Perform normalization if necessary.
      if (m_filterType == FILTER_NORMALIZE_ALL) {
        for (int i = 0; i < fv.size(); i++) {
          normalizeInstance((Instance) fv.elementAt(i), firstCopy);
        }
      }

      // Push all instances into the output queue
      for (int i = 0; i < fv.size(); i++) {
        push((Instance) fv.elementAt(i));
      }
    }
 
    // Flush the input
    flushInput();
 
    m_NewBatch = true;
    m_FirstBatchDone = true;
    return (numPendingOutput() != 0);
  }

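Note the determineDictionary() method. Its main tasks are:

1. Determine the stop-word list;

2. For each string attribute value that is to be converted, split it into tokens with the configured tokenizer, and record for every word its per-class term frequency and per-class document count (that is, how many documents of that class contain the word);

3. Filter the words by the minimum term frequency (m_minTermFreq) and the maximum number of words to keep per class (m_WordsToKeep);

4. Collect the attributes that are not converted as new attributes;

5. Collect the words from step 2 that satisfy these conditions as new attributes;

6. Count how many documents each word appears in, and store the counts in the m_DocsCounts array;

7. Record the <word, new attribute index> pairs in the TreeMap member variable m_Dictionary;

8. Set the structure of the new attributes as the outputFormat.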
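
The core of steps 2, 3, 6 and 7 can be sketched in plain Java. This is a simplified illustration under stated assumptions, not Weka's code: the class DictionarySketch, its method names, the whitespace tokenizer, and the single global threshold are all hypothetical, and the sketch leaves out stop words, per-class counts, and the m_WordsToKeep limit.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simplified sketch of dictionary building; NOT Weka's determineDictionary().
class DictionarySketch {

    /** Maps every kept word to its new attribute index, in sorted word order. */
    static TreeMap<String, Integer> buildDictionary(List<String> docs, int minTermFreq, int firstCopy) {
        // Total term frequency of each word over all documents.
        TreeMap<String, Integer> termFreq = new TreeMap<>();
        for (String doc : docs) {
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    termFreq.merge(token, 1, Integer::sum);
                }
            }
        }
        // Keep words that occur at least minTermFreq times. Word attributes are
        // numbered after the non-converted attributes (indices 0..firstCopy-1).
        TreeMap<String, Integer> dictionary = new TreeMap<>();
        int index = firstCopy;
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            if (e.getValue() >= minTermFreq) {
                dictionary.put(e.getKey(), index++);
            }
        }
        return dictionary;
    }

    /** Number of documents containing each dictionary word (the role played by m_DocsCounts). */
    static int[] countDocs(List<String> docs, Map<String, Integer> dictionary, int firstCopy) {
        int[] docsCounts = new int[firstCopy + dictionary.size()];
        for (String doc : docs) {
            // A set, so that each word is counted at most once per document.
            Set<String> seen = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
            for (String word : seen) {
                Integer idx = dictionary.get(word);
                if (idx != null) {
                    docsCounts[idx]++;
                }
            }
        }
        return docsCounts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog sat", "the cat ran");
        TreeMap<String, Integer> dictionary = buildDictionary(docs, 2, 1);
        System.out.println(dictionary);
        System.out.println(Arrays.toString(countDocs(docs, dictionary, 1)));
    }
}
```

Using a TreeMap keeps the words sorted, so the attribute indices are assigned in a deterministic order.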

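The conversion loop in batchFinished (the for loop right after the call to determineDictionary()) calls convertInstancewoDocNorm(Instance instance, FastVector v) on every instance. That method does the following: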

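1. Record every non-zero value of the attributes that are not converted as a <new attribute index, value> pair in contained (a TreeMap). When this step ends, the variable firstCopy equals the number of non-converted attributes, which is the 0-based index of the first word attribute;

2. For every attribute value that is converted:

1) tokenize it;

2) convert the tokens to lower case and stem them (when those options are enabled);

3) put <new attribute index, term frequency> into contained (or a 0/1 indicator of whether the word occurs, if m_OutputCounts == false); the term frequency is accumulated during this pass.

3. If m_TFTransform is true, update every entry of contained whose key is greater than or equal to firstCopy to val = Math.log(val + 1), which turns the recorded term frequency fij into log(fij + 1). Note that this result is obtained only when m_TFTransform and m_OutputCounts are both set to true.

4. If m_IDFTransform is true, update every entry of contained whose key is greater than or equal to firstCopy to val = val * Math.log(m_NumInstances / (double) m_DocsCounts[index.intValue()]), which turns the recorded term frequency fij into fij * log(number of documents / number of documents containing the word). This is the familiar TF-IDF. Note that this result is obtained only when m_IDFTransform and m_OutputCounts are both set to true and m_TFTransform stays false (otherwise the two logarithms are multiplied together).

5. Convert the collected indices and values, that is, contained, into the values and indices arrays, build a SparseInstance from them, add it to the vector, and return firstCopy.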
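
A minimal sketch of steps 2 to 4 for a single document follows. It is an illustration under stated assumptions rather than Weka's implementation: the class ConvertSketch, the whitespace tokenizer, and the plain Map standing in for m_Dictionary are hypothetical, and stemming and the non-converted attributes are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Simplified sketch of the per-document conversion; NOT Weka's convertInstancewoDocNorm().
class ConvertSketch {

    /** Returns the sparse <attribute index, value> map for one document. */
    static TreeMap<Integer, Double> convert(String doc, Map<String, Integer> dictionary,
            int[] docsCounts, int numInstances, int firstCopy,
            boolean outputCounts, boolean tfTransform, boolean idfTransform) {
        TreeMap<Integer, Double> contained = new TreeMap<>();
        for (String token : doc.toLowerCase().split("\\s+")) {
            Integer index = dictionary.get(token);
            if (index == null) {
                continue; // word is not in the dictionary
            }
            if (outputCounts) {
                contained.merge(index, 1.0, Double::sum); // accumulate the term frequency
            } else {
                contained.put(index, 1.0);                // 0/1 presence indicator
            }
        }
        if (tfTransform) {
            // f_ij -> log(f_ij + 1), only for word attributes
            contained.replaceAll((k, v) -> k >= firstCopy ? Math.log(v + 1) : v);
        }
        if (idfTransform) {
            // f_ij -> f_ij * log(numInstances / documents containing the word)
            contained.replaceAll((k, v) -> k >= firstCopy
                    ? v * Math.log(numInstances / (double) docsCounts[k]) : v);
        }
        return contained;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("cat", 1);
        dictionary.put("sat", 2);
        dictionary.put("the", 3);
        int[] docsCounts = {0, 2, 2, 3}; // index 0 is a non-converted attribute
        System.out.println(convert("the cat the", dictionary, docsCounts, 3, 1, true, false, true));
    }
}
```

With outputCounts set to false every stored value is 1, so the TF transform yields the constant log(2) and the IDF transform yields the bare IDF weight, which is why the counts must be enabled to get the results described in steps 3 and 4.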

This completes the convertInstancewoDocNorm loop. firstCopy holds the index of the first word attribute produced by the conversion, and fv holds all the converted SparseInstance objects.

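Next, batchFinished decides whether to normalize by document length: if m_filterType is not FILTER_NONE it computes the average document length, and if it is FILTER_NORMALIZE_ALL it normalizes every converted instance. It then pushes all the converted SparseInstance objects onto the output queue, sets m_NewBatch and m_FirstBatchDone to true, and returns whether the queue is non-empty (numPendingOutput() != 0) rather than the queue length.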
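
The document-length computation can be sketched on plain arrays. The averaging follows the batchFinished source quoted earlier; the source of normalizeInstance is not shown in this article, so the scaling rule below (multiply each word value by average length / document length) is an assumption, as are the class DocLengthSketch and the zero-length guard.

```java
import java.util.Arrays;

// Sketch of document-length normalization on sparse (indices, values) arrays.
// The scaling rule in normalize() is an assumption, not quoted from Weka.
class DocLengthSketch {

    /** Euclidean length of the word-attribute part (index >= firstCopy) of one sparse vector. */
    static double docLength(int[] indices, double[] values, int firstCopy) {
        double sumSquares = 0;
        for (int j = 0; j < indices.length; j++) {
            if (indices[j] >= firstCopy) {
                sumSquares += values[j] * values[j];
            }
        }
        return Math.sqrt(sumSquares);
    }

    /** Mean document length over all vectors, as computed in batchFinished. */
    static double averageDocLength(int[][] allIndices, double[][] allValues, int firstCopy) {
        double total = 0;
        for (int i = 0; i < allIndices.length; i++) {
            total += docLength(allIndices[i], allValues[i], firstCopy);
        }
        return total / allIndices.length;
    }

    /** Rescales the word values in place so the document has the average length (assumed rule). */
    static void normalize(int[] indices, double[] values, int firstCopy, double avgDocLength) {
        double length = docLength(indices, values, firstCopy);
        if (length == 0) {
            return; // guard added for this sketch: nothing to scale
        }
        for (int j = 0; j < indices.length; j++) {
            if (indices[j] >= firstCopy) {
                values[j] = values[j] * avgDocLength / length;
            }
        }
    }

    public static void main(String[] args) {
        int[][] indices = {{0, 1, 2}, {1}};
        double[][] values = {{5, 3, 4}, {10}};
        double avg = averageDocLength(indices, values, 1);
        normalize(indices[0], values[0], 1, avg);
        System.out.println(avg + " " + Arrays.toString(values[0]));
    }
}
```

Only entries with an index of at least firstCopy take part, so the attributes that were not converted keep their original values.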

This is the end of batchFinished.

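Now we return to the place where batchFinished was first called, the static function useFilter. There the SparseInstance objects are simply added to the dataset of the outputFormat, and that dataset is then returned.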
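
The whole input / batchFinished / output protocol can be imitated with a toy filter in plain Java. Everything here is hypothetical and only mirrors the control flow described in this article: the class BatchProtocolSketch, the MeanCenterFilter that subtracts the mean of the first batch, and the useFilter helper are not Weka classes, and the queue reset that the real input() performs at the start of a new batch is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy illustration of the input/batchFinished/output protocol; NOT Weka code.
class BatchProtocolSketch {

    /** Toy filter: subtracts the mean of the first batch from every value. */
    static class MeanCenterFilter {
        private final List<Double> buffer = new ArrayList<>();
        private final Deque<Double> queue = new ArrayDeque<>();
        private boolean firstBatchDone = false;
        private double mean = 0;

        /** Buffers during the first batch; afterwards converts immediately. */
        boolean input(double x) {
            if (firstBatchDone) {
                queue.add(x - mean);
                return true;
            }
            buffer.add(x);
            return false;
        }

        /** On the first batch: learn the statistic, then convert everything buffered. */
        boolean batchFinished() {
            if (!firstBatchDone) {
                double sum = 0;
                for (double x : buffer) {
                    sum += x;
                }
                mean = buffer.isEmpty() ? 0 : sum / buffer.size();
                for (double x : buffer) {
                    queue.add(x - mean);
                }
                buffer.clear();
                firstBatchDone = true;
            }
            return !queue.isEmpty();
        }

        /** Next converted value, or null when the queue is empty. */
        Double output() {
            return queue.poll();
        }
    }

    /** Same shape as the flow described above: feed all, finish the batch, drain the queue. */
    static List<Double> useFilter(double[] data, MeanCenterFilter filter) {
        for (double x : data) {
            filter.input(x);
        }
        filter.batchFinished();
        List<Double> result = new ArrayList<>();
        Double processed;
        while ((processed = filter.output()) != null) {
            result.add(processed);
        }
        return result;
    }

    public static void main(String[] args) {
        MeanCenterFilter filter = new MeanCenterFilter();
        System.out.println(useFilter(new double[] {1, 2, 3}, filter)); // first batch (training)
        System.out.println(useFilter(new double[] {5}, filter));       // later batch reuses the mean
    }
}
```

Passing the same filter object to a second call reuses what was learned from the first batch, which is the same reason StringToWordVector builds its dictionary only once.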