Now we come to the second-to-last step: generating the DF (document frequency) counts.
calculateDF
The method invoked is TFIDFConverter.calculateDF.
Its input is the tf-vectors directory produced by the previous step, where the key is the document id and the value is that document's term-frequency vector.
The real work happens in startDFCounting, yet another Hadoop job, with TermDocumentCountMapper as the mapper and TermDocumentCountReducer as the reducer.
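As a reference point, here is a minimal sketch of how a job like this gets wired up with the Hadoop API. This is illustrative, not the exact Mahout code: the DFCountJob class and the path arguments are made up for this example, and setting the reducer as a combiner is my reading of what the Mahout driver does (summing longs is safely combinable).

// Sketch of a DF-counting job setup, modeled on TFIDFConverter.startDFCounting.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
// Package location of these two classes varies across Mahout versions:
import org.apache.mahout.vectorizer.term.TermDocumentCountMapper;
import org.apache.mahout.vectorizer.term.TermDocumentCountReducer;

public class DFCountJob {
  public static void run(Configuration conf, Path tfVectors, Path dfCount) throws Exception {
    Job job = Job.getInstance(conf, "df-count");
    job.setJarByClass(DFCountJob.class);

    // Input: SequenceFile of (document id, term-frequency VectorWritable)
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, tfVectors);

    job.setMapperClass(TermDocumentCountMapper.class);
    // The reducer just sums longs, so it can double as a combiner
    job.setCombinerClass(TermDocumentCountReducer.class);
    job.setReducerClass(TermDocumentCountReducer.class);

    // Output: SequenceFile of (term index, document frequency)
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, dfCount);

    job.waitForCompletion(true);
  }
}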
First, let's look at TermDocumentCountMapper:
@Override
protected void map(WritableComparable<?> key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector vector = value.get();
  // Walk only the non-zero entries of the TF vector
  Iterator<Vector.Element> it = vector.iterateNonZero();
  while (it.hasNext()) {
    Vector.Element e = it.next();
    // Emit (term index, 1): this term occurs in this document
    context.write(new IntWritable(e.index()), ONE);
  }
  // One extra (TOTAL_COUNT, 1) per document; summed up, it gives the corpus size
  context.write(TOTAL_COUNT, ONE);
}
The map input value is the document's term-frequency vector. For every term index with a non-zero entry, the mapper emits a 1 keyed by that index; at the end it emits one more 1 under the special TOTAL_COUNT key, so TOTAL_COUNT effectively counts the total number of documents.
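To make the emission pattern concrete, here is a tiny standalone simulation over two made-up documents, using the same Vector API the mapper uses. The -1 sentinel mirrors Mahout's TOTAL_COUNT constant (an IntWritable(-1) in the source, as far as I can tell):

// Toy simulation of TermDocumentCountMapper's output for two documents.
import java.util.Iterator;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class MapperTrace {
  public static void main(String[] args) {
    // Two TF vectors over a 5-term dictionary (values are term frequencies)
    Vector doc0 = new RandomAccessSparseVector(5);
    doc0.set(0, 2.0); // term 0 appears twice in doc0
    doc0.set(3, 1.0);
    Vector doc1 = new RandomAccessSparseVector(5);
    doc1.set(3, 4.0);
    doc1.set(4, 1.0);

    for (Vector doc : new Vector[] {doc0, doc1}) {
      Iterator<Vector.Element> it = doc.iterateNonZero();
      while (it.hasNext()) {
        // DF only cares that the term occurs, not how often: always emit 1
        System.out.println("(" + it.next().index() + ", 1)");
      }
      System.out.println("(-1, 1)"); // TOTAL_COUNT key: one per document
    }
    // Prints (0,1) (3,1) (-1,1) for doc0 and (3,1) (4,1) (-1,1) for doc1
    // (order within a document may vary: the sparse vector is hash-based)
  }
}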
Now the reducer, TermDocumentCountReducer:
@Override
protected void reduce(IntWritable key, Iterable<LongWritable> values, Context context)
    throws IOException, InterruptedException {
  long sum = 0;
  // Add up all the 1s emitted for this term index
  for (LongWritable value : values) {
    sum += value.get();
  }
  context.write(key, new LongWritable(sum));
}
It simply sums the 1s emitted for each index, which tells us, for each term index, how many documents contain that term: the document frequency.
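Continuing the toy example: grouping those emitted pairs by key and summing gives exactly the reducer's output (plain in-memory aggregation, nothing Mahout-specific):

// In-memory equivalent of the shuffle + TermDocumentCountReducer sum.
import java.util.HashMap;
import java.util.Map;

public class ReduceTrace {
  public static void main(String[] args) {
    // Pairs emitted by the mapper for the two toy documents above
    int[][] emitted = {{0, 1}, {3, 1}, {-1, 1}, {3, 1}, {4, 1}, {-1, 1}};
    Map<Integer, Long> df = new HashMap<>();
    for (int[] pair : emitted) {
      df.merge(pair[0], (long) pair[1], Long::sum);
    }
    // Prints {-1=2, 0=1, 3=2, 4=1}: term 3 occurs in 2 documents,
    // and key -1 (TOTAL_COUNT) holds the total document count.
    System.out.println(df);
  }
}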
The output ends up in the df-count directory.
createDictionaryChunks
This one is fairly mundane: it just rewrites the df-count content as chunked sequence files, presumably for performance, so that each chunk can be loaded into memory on its own by the downstream jobs.
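A rough sketch of the chunking idea, under the assumption that Mahout rolls over to a new frequency.file-N whenever the current chunk passes the configured size; the ChunkWriter helper below is hypothetical and only illustrates the rollover pattern:

// Sketch of chunked SequenceFile writing using the standard Hadoop API.
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class ChunkWriter {
  public static void writeChunks(Configuration conf, Path outDir,
                                 Iterable<Map.Entry<Integer, Long>> dfCounts,
                                 long chunkSizeBytes) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    int chunkIndex = 0;
    long written = 0;
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        new Path(outDir, "frequency.file-" + chunkIndex),
        IntWritable.class, LongWritable.class);
    for (Map.Entry<Integer, Long> e : dfCounts) {
      if (written > chunkSizeBytes) { // chunk is full: roll over to the next file
        writer.close();
        chunkIndex++;
        written = 0;
        writer = SequenceFile.createWriter(fs, conf,
            new Path(outDir, "frequency.file-" + chunkIndex),
            IntWritable.class, LongWritable.class);
      }
      writer.append(new IntWritable(e.getKey()), new LongWritable(e.getValue()));
      written += 12; // rough estimate: 4-byte int key + 8-byte long value
    }
    writer.close();
  }
}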