Creating the dictionary and tf-vectors
The implementing class is DictionaryVectorizer.
The entry point is the createTermFrequencyVectors method. Its parameters are:
input, output, tfVectorsFolderName, baseConf -- these four are self-explanatory
minSupport -- the minimum number of times a term must occur before it is put into the sparse vectors; the default is 2
maxNGramSize -- the maximum n-gram size; when it is greater than 1 the collocation job runs and LLR (log-likelihood ratio) scoring comes into play: a value of 3 generates 3-grams, 2-grams and 1-grams, a value of 2 generates 2-grams and 1-grams; the default is 1
minLLRValue -- the minimum LLR value; n-grams scoring above it are treated as words that genuinely occur together; the default is 1, and the default is usually fine
normPower -- normalization is a way to cancel out the effect of document length; in statistics this is the p-norm (see section 8.4 of Mahout in Action); the default is 0, and it is worth changing, e.g. to 2
logNormalize -- whether to apply log normalization (in general this damps large term counts, e.g. replacing tf with log(1 + tf)); I am not yet sure how much it matters here
chunkSizeInMegabytes -- this setting takes some care and has a big impact on performance; I don't fully understand it yet and will come back to it in a dedicated post
The remaining parameters are self-explanatory. A sketch of a full call follows below.
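To make the parameter list concrete, here is a hedged sketch of a driver calling the method. The values are only illustrative, and the exact signature (parameter order, package name, trailing boolean flags such as sequential-access or named vectors) differs between Mahout releases, so check the DictionaryVectorizer of your own version before copying this.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.vectorizer.DictionaryVectorizer; // package name varies across Mahout versions

public class TfVectorDriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration baseConf = new Configuration();
    Path input = new Path("output/tokenized-documents");  // illustrative paths
    Path output = new Path("output");

    // Hedged sketch -- argument order follows the discussion above.
    DictionaryVectorizer.createTermFrequencyVectors(
        input,          // tokenized docs: key = document name, value = StringTuple of tokens
        output,         // parent folder for the dictionary chunks and tf-vectors
        "tf-vectors",   // tfVectorsFolderName
        baseConf,
        2,              // minSupport
        1,              // maxNGramSize: 1 = unigrams only, no collocation job
        1.0f,           // minLLRValue, only relevant when maxNGramSize > 1
        2.0f,           // normPower: 2 = Euclidean (L2) norm
        false,          // logNormalize
        1,              // numReducers
        64,             // chunkSizeInMegabytes
        false,          // sequentialAccessOutput -- present in some versions
        false);         // namedVectors -- present in some versions
  }
}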
dictionary job
Path dictionaryJobPath = new Path(output, DICTIONARY_JOB_FOLDER);
int[] maxTermDimension = new int[1];
List<Path> dictionaryChunks;
if (maxNGramSize == 1) {
  // Unigrams only: run a plain word-count job, then split the surviving terms into dictionary chunks.
  startWordCounting(input, dictionaryJobPath, baseConf, minSupport);
  dictionaryChunks =
      createDictionaryChunks(dictionaryJobPath, output, baseConf, chunkSizeInMegabytes, maxTermDimension);
} else {
  // n-grams requested: run the collocation (LLR) job and build the dictionary from its n-gram output.
  CollocDriver.generateAllGrams(input, dictionaryJobPath, baseConf, maxNGramSize,
                                minSupport, minLLRValue, numReducers);
  dictionaryChunks =
      createDictionaryChunks(new Path(new Path(output, DICTIONARY_JOB_FOLDER),
                                      CollocDriver.NGRAM_OUTPUT_DIRECTORY),
                             output,
                             baseConf,
                             chunkSizeInMegabytes,
                             maxTermDimension);
}
Let's look at the n-gram = 1 case first. The first call is startWordCounting, which is yet another Hadoop job.
startWordCounting
The Mapper is TermCountMapper, and both the Combiner and the Reducer are TermCountReducer. Let's step into these two classes and see what they do.
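For orientation, this is roughly how such a word-count job gets wired up with the new Hadoop API. This is a generic sketch, not the actual Mahout startWordCounting source (which, among other things, also puts minSupport into the job configuration), and the Mahout class packages shown in the imports vary by version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.vectorizer.term.TermCountMapper;   // package names vary across Mahout versions
import org.apache.mahout.vectorizer.term.TermCountReducer;

public class WordCountJobSketch {
  public static Job buildJob(Configuration conf, Path input, Path output) throws Exception {
    Job job = new Job(conf, "term count sketch");
    job.setJarByClass(WordCountJobSketch.class);

    // Tokenized documents are SequenceFiles (document name -> StringTuple of tokens).
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    job.setMapperClass(TermCountMapper.class);     // emits (token, per-document count)
    job.setCombinerClass(TermCountReducer.class);  // per the walkthrough above
    job.setReducerClass(TermCountReducer.class);   // sums counts and applies minSupport

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);
    return job;
  }
}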
First, TermCountMapper:
@Override
protected void map(Text key, StringTuple value, final Context context) throws IOException, InterruptedException {
  // Count each token locally within this document before emitting anything.
  OpenObjectLongHashMap<String> wordCount = new OpenObjectLongHashMap<String>();
  for (String word : value.getEntries()) {
    if (wordCount.containsKey(word)) {
      wordCount.put(word, wordCount.get(word) + 1);
    } else {
      wordCount.put(word, 1);
    }
  }
  // Emit one (token, local count) pair per distinct token in the document.
  wordCount.forEachPair(new ObjectLongProcedure<String>() {
    @Override
    public boolean apply(String first, long second) {
      try {
        context.write(new Text(first), new LongWritable(second));
      } catch (IOException e) {
        context.getCounter("Exception", "Output IO Exception").increment(1);
      } catch (InterruptedException e) {
        context.getCounter("Exception", "Interrupted Exception").increment(1);
      }
      return true;
    }
  });
}
Recall that in the first, tokenization step the output key is the document name and the value is a StringTuple of its tokens -- exactly the types in the map signature above.
The mapper's output key is a token and the value is that token's count within the document; for example, a document tokenized to (a, b, a) emits (a, 2) and (b, 1).
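If you want to check that logic without a Hadoop cluster, the per-document counting boils down to this plain-Java equivalent (a standalone illustration, not Mahout code):

import java.util.HashMap;
import java.util.Map;

public class PerDocumentCount {
  public static void main(String[] args) {
    String[] tokens = {"a", "b", "a"};  // one tokenized document
    Map<String, Long> wordCount = new HashMap<String, Long>();
    for (String word : tokens) {
      Long current = wordCount.get(word);
      wordCount.put(word, current == null ? 1L : current + 1L);
    }
    // Prints a=2, b=1 -- the pairs the mapper would emit for this document.
    for (Map.Entry<String, Long> e : wordCount.entrySet()) {
      System.out.println(e.getKey() + "=" + e.getValue());
    }
  }
}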
Now look at TermCountReducer:
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context)
    throws IOException, InterruptedException {
  long sum = 0;
  for (LongWritable value : values) {
    sum += value.get();
  }
  // Keep only terms whose total count reaches minSupport.
  if (sum >= minSupport) {
    context.write(key, new LongWritable(sum));
  }
}
Nothing special here: it simply sums up the counts and keeps only terms that reach minSupport.
With that, the word-count stage is complete.
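After the word count, createDictionaryChunks turns the surviving terms into the dictionary: each term gets an integer id, and the mapping is written out in size-bounded chunks (this is what chunkSizeInMegabytes limits). In the Mahout layouts I have seen, these chunks are SequenceFiles of Text keys and IntWritable values named dictionary.file-0, dictionary.file-1, and so on under the output folder. The sketch below reads one such chunk and rests on that assumption, so adjust the path to your own setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpDictionaryChunk {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path chunk = new Path("output/dictionary.file-0");  // assumed chunk location
    FileSystem fs = FileSystem.get(chunk.toUri(), conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, chunk, conf);
    try {
      Text term = new Text();
      IntWritable termId = new IntWritable();
      // Each record maps a term (or n-gram) to its integer dimension in the tf-vectors.
      while (reader.next(term, termId)) {
        System.out.println(termId.get() + "\t" + term);
      }
    } finally {
      reader.close();
    }
  }
}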