Mahout SparseVectorsFromSequenceFiles in Detail (6)

This part covers the generation of the TF vectors.

First, the partial vectors are generated, one PartialVectors output per dictionaryChunk:

    int partialVectorIndex = 0;
    Collection<Path> partialVectorPaths = Lists.newArrayList();
    for (Path dictionaryChunk : dictionaryChunks) {
      Path partialVectorOutputPath = new Path(output, VECTOR_OUTPUT_FOLDER + partialVectorIndex++);
      partialVectorPaths.add(partialVectorOutputPath);
      makePartialVectors(input, baseConf, maxNGramSize, dictionaryChunk, partialVectorOutputPath,
        maxTermDimension[0], sequentialAccess, namedVectors, numReducers);
    }

makePartialVectors

Parameters

dimension  -- obtained earlier when the dictionaryChunks were computed

The other parameters are self-explanatory.

makePartialVectors is a Hadoop job, so it suffices to look at its mapper and reducer classes. The mapper is the default one; the reducer is TFPartialVectorReducer, so let's go straight to its implementation.

TFPartialVectorReducer

Note that the file input path of makePartialVectors is tokenizedPath, so the key is the document path and the value is the tuple (StringTuple) of all tokens in the document. Understanding this is essential for understanding the reduce operation. Before that, it is also worth looking at the reducer's setup function: the setup functions of a mapper and reducer perform initialization, and all variables needed in the map and reduce phases are set up there.

First, take out the token tuple:
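To make the role of setup concrete: Mahout's actual TFPartialVectorReducer.setup() reads the dictionary chunk (a SequenceFile of term-to-id pairs, distributed via the DistributedCache) into an in-memory map. The following is only a minimal sketch of that idea, with the chunk simulated by a plain string array; the class and method names are illustrative, not Mahout's.

```java
import java.util.HashMap;
import java.util.Map;

public class ReducerSetupSketch {

    // Sketch of what setup() does conceptually: turn the dictionary chunk's
    // (term, id) entries into a lookup map used during reduce.
    // In Mahout the entries come from a SequenceFile; here they are hard-coded.
    static Map<String, Integer> loadDictionaryChunk(String[][] chunkEntries) {
        Map<String, Integer> dictionary = new HashMap<>();
        for (String[] entry : chunkEntries) {
            dictionary.put(entry[0], Integer.parseInt(entry[1]));
        }
        return dictionary;
    }

    public static void main(String[] args) {
        String[][] chunk = {{"mahout", "0"}, {"hadoop", "1"}};
        System.out.println(loadDictionaryChunk(chunk));
    }
}
```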

    Iterator<StringTuple> it = values.iterator();
    if (!it.hasNext()) {
      return;
    }
    StringTuple value = it.next();

Create the vector:

    Vector vector = new RandomAccessSparseVector(dimension, value.length());

The code below takes a different strategy depending on the n-gram size; let's first study the case where n-gram is 1:

      for (String term : value.getEntries()) {
        if (!term.isEmpty() && dictionary.containsKey(term)) { // unigram
          int termId = dictionary.get(term);
          vector.setQuick(termId, vector.getQuick(termId) + 1);
        }
      }

For each token that is non-empty and present in the dictionary, the count at its dimension in the vector is incremented. Every token does exist in the full dictionary, but each reducer only holds a single dictionary chunk, so tokens belonging to other chunks are skipped here; they are counted when the partial vectors for those chunks are built.
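The counting loop can be sketched in isolation. Below, a HashMap stands in for Mahout's RandomAccessSparseVector (mapping dimension to count); the class and method names are illustrative only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermFrequency {

    // Mirror of the unigram loop in TFPartialVectorReducer: for each token
    // present in this reducer's dictionary chunk, increment the count at
    // the token's dimension. Tokens outside the chunk are skipped.
    static Map<Integer, Double> termFrequencies(List<String> tokens,
                                                Map<String, Integer> dictionary) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String term : tokens) {
            if (!term.isEmpty() && dictionary.containsKey(term)) {
                int termId = dictionary.get(term);
                // equivalent of vector.setQuick(termId, vector.getQuick(termId) + 1)
                vector.merge(termId, 1.0, Double::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> chunkDictionary = Map.of("mahout", 0, "vector", 1);
        // "hadoop" is not in this chunk's dictionary, so it is skipped
        System.out.println(termFrequencies(
                List.of("mahout", "vector", "mahout", "hadoop"), chunkDictionary));
    }
}
```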

Finally, after a series of transformations, the vector is written to the output:

    if (sequentialAccess) {
      vector = new SequentialAccessSparseVector(vector);
    }

    if (namedVector) {
      vector = new NamedVector(vector, key.toString());
    }
    if (vector.getNumNondefaultElements() > 0) {
      VectorWritable vectorWritable = new VectorWritable(vector);
      context.write(key, vectorWritable);
    } else {
      context.getCounter("TFParticalVectorReducer", "emptyVectorCount").increment(1);
    }

Merging the partial vectors

This is done by PartialVectorMerger.java. Nothing novel here; go straight to its reducer, PartialVectorMergeReducer:

  @Override
  protected void reduce(WritableComparable<?> key, Iterable<VectorWritable> values, Context context) throws IOException,
      InterruptedException {

    Vector vector = new RandomAccessSparseVector(dimension, 10);
    for (VectorWritable value : values) {
      vector.assign(value.get(), Functions.PLUS);
    }
    if (normPower != PartialVectorMerger.NO_NORMALIZING) {
      if (logNormalize) {
        vector = vector.logNormalize(normPower);
      } else {
        vector = vector.normalize(normPower);
      }
    }
    if (sequentialAccess) {
      vector = new SequentialAccessSparseVector(vector);
    }

    if (namedVector) {
      vector = new NamedVector(vector, key.toString());
    }

    VectorWritable vectorWritable = new VectorWritable(vector);
    context.write(key, vectorWritable);
  }

Nothing new here either: the term frequencies from the individual chunks are simply summed together.
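The core of the merge is the element-wise sum performed by vector.assign(value.get(), Functions.PLUS). A minimal stand-alone sketch, again with HashMaps standing in for Mahout's sparse vectors (names illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartialVectorMerge {

    // Element-wise sum of sparse vectors, mimicking the reduce loop in
    // PartialVectorMergeReducer: start from an empty vector and fold in
    // every partial vector that arrived for the same document key.
    static Map<Integer, Double> merge(List<Map<Integer, Double>> partials) {
        Map<Integer, Double> merged = new HashMap<>();
        for (Map<Integer, Double> partial : partials) {
            // equivalent of vector.assign(value.get(), Functions.PLUS)
            partial.forEach((dim, count) -> merged.merge(dim, count, Double::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two partial vectors for the same document, built from different
        // dictionary chunks, so their dimensions are disjoint.
        Map<Integer, Double> fromChunk0 = Map.of(0, 2.0, 1, 1.0);
        Map<Integer, Double> fromChunk1 = Map.of(5, 3.0);
        System.out.println(merge(List.of(fromChunk0, fromChunk1)));
    }
}
```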
