Weka 3.6.13-SNAPSHOT: what the StringToWordVector filter parameters mean

云聪

Category: weka. Tags: weka, StringToWordVector

Article link: https://blog.csdn.net/l294265421/article/details/52829174

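IDFTransform: When true, converts the number of times a word occurs in a document, val (or the value already produced by TFTransform), to
$val \times Math.log( doc\_total\_num / doc\_num\_contain\_this\_word )$
Here doc\_total\_num is the total number of documents and doc\_num\_contain\_this\_word is the number of documents that contain the word.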

TFTransform: When true, converts the number of times a word occurs, val, to
$Math.log(val + 1)$.
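
The two formulas above can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not Weka's code: the class `TfIdfSketch` and its method names are hypothetical, and the example numbers are made up.

```java
// Hypothetical helper class; the names are illustrative and not part of Weka.
class TfIdfSketch {
    // TFTransform: log(val + 1), using the natural logarithm (Math.log).
    static double tf(double val) {
        return Math.log(val + 1);
    }

    // IDFTransform: val * log(docTotalNum / docNumContainThisWord).
    static double idf(double val, int docTotalNum, int docNumContainThisWord) {
        return val * Math.log((double) docTotalNum / docNumContainThisWord);
    }

    public static void main(String[] args) {
        double count = 3;      // the word occurs 3 times in the document
        int totalDocs = 100;   // documents in the collection
        int docsWithWord = 10; // documents that contain the word

        System.out.println("TF only:     " + tf(count));
        System.out.println("IDF only:    " + idf(count, totalDocs, docsWithWord));
        // With both options on, IDF is applied to the TF-transformed value.
        System.out.println("TF then IDF: " + idf(tf(count), totalDocs, docsWithWord));
    }
}
```

Note that a word contained in every document gets the value 0 under IDFTransform, because the logarithm of 1 is 0.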

attributeIndices: Specifies which attributes of the document instances are selected for conversion.

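attributeNamePrefix: The prefix for the generated attribute names. For example, if the word "美丽" becomes an attribute, the name of that attribute is the value of attributeNamePrefix followed by "美丽".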

doNotOperateOnPerClassBasis: When true, "wordsToKeep" and "minTermFreq" are applied over all documents rather than over the documents of each class separately.

invertSelection: When true, inverts the attribute selection made by "attributeIndices".

lowerCaseTokens: When true, all words are converted to lower case.

minTermFreq: Sets the minimum word frequency (per class, when a class attribute is present).

normalizeDocLength: Normalizes the attribute values by document length, using the formula:
$attribute\_value \times AvgDocLength / docLength$
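
A minimal sketch of this normalization follows. The formula does not say how docLength is measured; the sketch assumes it is the Euclidean length of the document's word vector and that AvgDocLength is the mean of those lengths. Both are assumptions of this example, and the class `DocLengthNormSketch` is hypothetical.

```java
// Hypothetical helper class; not part of Weka.
class DocLengthNormSketch {
    // Assumed definition of docLength: Euclidean length of the word vector.
    static double docLength(double[] doc) {
        double sum = 0;
        for (double v : doc) {
            sum += v * v;
        }
        return Math.sqrt(sum);
    }

    // Returns normalized copies: value * avgDocLength / docLength.
    static double[][] normalize(double[][] docs) {
        double avg = 0;
        for (double[] d : docs) {
            avg += docLength(d);
        }
        avg /= docs.length;

        double[][] out = new double[docs.length][];
        for (int i = 0; i < docs.length; i++) {
            double len = docLength(docs[i]);
            out[i] = new double[docs[i].length];
            for (int j = 0; j < docs[i].length; j++) {
                // An all-zero document stays all zero (avoids dividing by 0).
                out[i][j] = (len == 0) ? 0 : docs[i][j] * avg / len;
            }
        }
        return out;
    }
}
```

Under these assumptions every non-empty document ends up with the same length, the average one, so long and short documents become comparable.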

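outputWordCounts: When true, records the number of times each word occurs in a document instead of a boolean value that only marks whether the word is present.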
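
The difference between counts and boolean values can be shown with a small sketch. It splits on whitespace, which is a simplifying assumption of this example, and the class `WordCountSketch` is hypothetical. The second flag mimics lowerCaseTokens.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; not part of Weka.
class WordCountSketch {
    // Maps each word to its count (outputWordCounts = true)
    // or to 1 if it is present at all (outputWordCounts = false).
    static Map<String, Integer> vectorize(String text,
                                          boolean outputWordCounts,
                                          boolean lowerCaseTokens) {
        Map<String, Integer> vector = new TreeMap<>();
        // Whitespace tokenization is an assumption of this sketch.
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (lowerCaseTokens) {
                token = token.toLowerCase();
            }
            if (outputWordCounts) {
                vector.merge(token, 1, Integer::sum); // count occurrences
            } else {
                vector.put(token, 1);                 // presence only
            }
        }
        return vector;
    }
}
```

For the text "the cat saw The cat", counting with lower-casing gives the=2, cat=2, saw=1, while the boolean form without lower-casing gives 1 for each of The, cat, saw and the.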

periodicPruning: Specifies how often the dictionary is pruned.

The code that implements periodicPruning:

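// Number of documents between two pruning passes.
long pruneRate =
    Math.round((m_PeriodicPruningRate / 100.0) * getInputFormat().numInstances());
if (pruneRate > 0) {
  // i is the index of the current document.
  if (i % pruneRate == 0 && i > 0) {
    // Prune every dictionary in dictionaryArr.
    for (int z = 0; z < values; z++) {
      // Collect the words to drop first, so that the map is not
      // modified while it is being iterated.
      Vector d = new Vector(1000);
      Iterator it = dictionaryArr[z].keySet().iterator();
      while (it.hasNext()) {
        String word = (String) it.next();
        Count count = (Count) dictionaryArr[z].get(word);
        if (count.count <= 1) {
          d.add(word);
        }
      }
      // Remove the collected words from the dictionary.
      Iterator iter = d.iterator();
      while (iter.hasNext()) {
        String word = (String) iter.next();
        dictionaryArr[z].remove(word);
      }
    }
  }
}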

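Here m_PeriodicPruningRate equals the periodicPruning setting, which the code treats as a percentage of the number of input documents. Every time pruneRate documents have been processed, the words whose count in the dictionary is 1 or less are deleted. This option takes effect before wordsToKeep.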
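
To try the pruning rule outside Weka, here is a self-contained sketch with the same pruneRate computation and the same "count <= 1" rule on a single dictionary. The class `PeriodicPruningSketch`, the whitespace tokenization, and the choice to prune right after document i has been counted are assumptions of this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; not part of Weka.
class PeriodicPruningSketch {
    // Builds a word-count dictionary and prunes it every pruneRate documents.
    static Map<String, Integer> buildDictionary(String[] docs, double periodicPruning) {
        Map<String, Integer> dictionary = new HashMap<>();
        // Same formula as in the excerpt: a percentage of the document count.
        long pruneRate = Math.round((periodicPruning / 100.0) * docs.length);

        for (int i = 0; i < docs.length; i++) {
            // Count the words of document i (whitespace split is an assumption).
            for (String word : docs[i].split("\\s+")) {
                if (!word.isEmpty()) {
                    dictionary.merge(word, 1, Integer::sum);
                }
            }
            // Assumed placement: prune after document i has been counted.
            if (pruneRate > 0 && i % pruneRate == 0 && i > 0) {
                dictionary.values().removeIf(count -> count <= 1);
            }
        }
        return dictionary;
    }
}
```

With the four documents "a b", "a c", "a d", "b e" and periodicPruning set to 50, pruneRate is 2, so the dictionary is pruned once at i = 2. The word b is dropped at that point and re-enters with a count of 1 from the last document, although it occurs twice overall. Pruning therefore trades exact counts for a smaller dictionary.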

stemmer: Sets the stemming algorithm.

stopwords: Sets the stop words. The stop words are kept in a file with one stop word per line; lines that start with # are treated as comments.
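
The file format described above can be parsed with a few lines of Java. This is a sketch of the format only, not Weka's reader: the class `StopwordFileSketch` is hypothetical, and trimming whitespace and skipping blank lines are assumptions of this example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; not part of Weka.
class StopwordFileSketch {
    // Parses the contents of a stop word file: one word per line,
    // lines starting with '#' are comments.
    static Set<String> parse(String fileContents) {
        Set<String> stopwords = new HashSet<>();
        for (String line : fileContents.split("\\r?\\n")) {
            line = line.trim();                 // assumption: ignore padding
            if (line.isEmpty() || line.startsWith("#")) {
                continue;                       // skip blanks and comments
            }
            stopwords.add(line);
        }
        return stopwords;
    }
}
```

For example, a file containing the lines "# articles", "the", "a" and "# conjunctions", "and" yields the set {the, a, and}.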

useStoplist: Whether stop words are used. When false, the stop words defined by the "stopwords" option are not used.

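wordsToKeep: Sets how many words are kept (per class, when a class attribute is present). Within each class, the wordsToKeep words with the highest occurrence counts are kept.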
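
One way to combine wordsToKeep and minTermFreq for the dictionary of a single class is a count threshold, sketched below. This is a simplified illustration and not necessarily how Weka selects words; the class `WordsToKeepSketch` and the tie handling are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class; not part of Weka.
class WordsToKeepSketch {
    // Returns the words of one class whose count reaches the threshold.
    static Set<String> select(Map<String, Integer> counts,
                              int wordsToKeep, int minTermFreq) {
        List<Integer> sorted = new ArrayList<>(counts.values());
        Collections.sort(sorted); // ascending order

        int threshold = minTermFreq;
        if (wordsToKeep > 0 && sorted.size() >= wordsToKeep) {
            // Count of the wordsToKeep-th most frequent word,
            // but never below minTermFreq.
            threshold = Math.max(minTermFreq,
                                 sorted.get(sorted.size() - wordsToKeep));
        }

        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

Because the cut is a threshold on the count, words tied with the last kept word are kept too, so the result can contain more than wordsToKeep words. With per-class operation, the selection would be run once per class dictionary and the kept words combined.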
