For an introduction to MinHash, see http://rdc.taobao.com/team/jm/archives/2434
Initialization
Configuration conf = getConf();
conf.setInt(MinhashOptionCreator.MIN_CLUSTER_SIZE, minClusterSize);
conf.setInt(MinhashOptionCreator.MIN_VECTOR_SIZE, minVectorSize);
conf.set(MinhashOptionCreator.HASH_TYPE, hashType);
conf.setInt(MinhashOptionCreator.NUM_HASH_FUNCTIONS, numHashFunctions);
conf.setInt(MinhashOptionCreator.KEY_GROUPS, keyGroups);
conf.setBoolean(MinhashOptionCreator.DEBUG_OUTPUT, debugOutput);
Set the default parameters
Set the Hadoop job parameters
job.setMapperClass(MinHashMapper.class);
job.setReducerClass(MinHashReducer.class);
MinHashMapper
The setup method first reads the option parameters, then obtains the hash functions according to hashType.
In the tf-idf sequence file, the key is a string identifying the document, and the value is a vector; each vector element's key is a term index and its value is that index's tf-idf weight. Understanding these values is necessary to understand the mapper.
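The record layout can be illustrated with a plain-Java stand-in (the document id and the weights below are made-up sample values, and a Map is used here in place of Mahout's VectorWritable):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of one record in the tf-idf sequence file:
//   key   = Text holding the document id string,
//   value = VectorWritable holding a sparse vector of term index -> tf-idf weight.
public class TfIdfRecordDemo {
    // Stand-in for the sparse tf-idf vector of a single document (sample values).
    static Map<Integer, Double> sampleVector() {
        Map<Integer, Double> vector = new LinkedHashMap<>();
        vector.put(12, 0.31);   // term with index 12, tf-idf weight 0.31
        vector.put(847, 1.72);
        vector.put(5031, 0.08);
        return vector;
    }

    public static void main(String[] args) {
        String key = "/doc-0001.txt"; // document id string (sample value)
        System.out.println(key + " => " + sampleVector());
    }
}
```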
Get the features:
Vector featureVector = features.get();
Initialize minHashValues:
for (int i = 0; i < numHashFunctions; i++) {
  minHashValues[i] = Integer.MAX_VALUE;
}
Compute the document's minhash:
for (int i = 0; i < numHashFunctions; i++) {
  for (Vector.Element ele : featureVector) {
    int value = (int) ele.get();
    bytesToHash[0] = (byte) (value >> 24);
    bytesToHash[1] = (byte) (value >> 16);
    bytesToHash[2] = (byte) (value >> 8);
    bytesToHash[3] = (byte) value;
    int hashIndex = hashFunction[i].hash(bytesToHash);
    // if our new hash value is less than the old one, replace the old one
    if (minHashValues[i] > hashIndex) {
      minHashValues[i] = hashIndex;
    }
  }
}
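The loop above is the core of MinHash: each hash function's signature value is the minimum hash over all features. A stand-alone sketch of the same idea (plain Java, no Hadoop; the multiplicative hash below is a hypothetical stand-in for the functions Mahout's HashFactory produces):

```java
import java.util.Arrays;

// Minimal stand-alone MinHash signature computation.
public class MinHashSketch {

    // Cheap seeded multiplicative hash (an assumption, not Mahout's hash).
    static int hash(int seed, int value) {
        int h = seed * 0x9E3779B1 + value;
        h ^= h >>> 16;
        return h & 0x7FFFFFFF; // keep the result non-negative
    }

    // Signature = per-hash-function minimum over all feature values.
    static int[] signature(int[] features, int numHashFunctions) {
        int[] minHashValues = new int[numHashFunctions];
        Arrays.fill(minHashValues, Integer.MAX_VALUE);
        for (int i = 0; i < numHashFunctions; i++) {
            for (int value : features) {
                int hashIndex = hash(i + 1, value);
                if (minHashValues[i] > hashIndex) {
                    minHashValues[i] = hashIndex;
                }
            }
        }
        return minHashValues;
    }

    public static void main(String[] args) {
        int[] docA = {3, 7, 11, 42};
        int[] docB = {3, 7, 11, 99}; // mostly-overlapping feature set
        System.out.println(Arrays.toString(signature(docA, 4)));
        System.out.println(Arrays.toString(signature(docB, 4)));
    }
}
```

Documents with heavily overlapping feature sets tend to agree on many signature positions, which is what makes the signatures usable as cluster keys.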
Mapper output:
for (int i = 0; i < numHashFunctions; i++) {
  StringBuilder clusterIdBuilder = new StringBuilder();
  for (int j = 0; j < keyGroups; j++) {
    clusterIdBuilder.append(minHashValues[(i + j) % numHashFunctions]).append('-');
  }
  // remove the last dash
  clusterIdBuilder.deleteCharAt(clusterIdBuilder.length() - 1);
  Text cluster = new Text(clusterIdBuilder.toString());
  Writable point;
  if (debugOutput) {
    point = new VectorWritable(featureVector.clone());
  } else {
    point = new Text(item.toString());
  }
  context.write(cluster, point);
}
Here you need to understand the meaning of keyGroups: it concatenates several hash values into one key, so that two items only land in the same cluster when multiple hash values match at once. This reduces spurious collisions between two items and makes the result more reliable; see:
http://mail-archives.apache.org/mod_mbox/mahout-user/201111.mbox/%3CB3AAE5F4-207A-40BA-9312-F8211483D651@apache.org%3E
The mapper's output key is the keyGroups string built above, and the value is the document id.
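For example, with numHashFunctions = 4 and keyGroups = 2, each signature yields keys that join two adjacent hash values, wrapping around at the end. A small sketch mirroring the mapper's loop (the signature values are made up):

```java
// Builds a cluster key by joining keyGroups consecutive minhash values
// with '-', wrapping modulo the signature length, as in the mapper above.
public class KeyGroupsDemo {
    static String clusterId(int[] minHashValues, int start, int keyGroups) {
        StringBuilder clusterIdBuilder = new StringBuilder();
        int n = minHashValues.length;
        for (int j = 0; j < keyGroups; j++) {
            clusterIdBuilder.append(minHashValues[(start + j) % n]).append('-');
        }
        clusterIdBuilder.deleteCharAt(clusterIdBuilder.length() - 1); // drop trailing dash
        return clusterIdBuilder.toString();
    }

    public static void main(String[] args) {
        int[] sig = {10, 20, 30, 40}; // pretend minhash signature
        for (int i = 0; i < sig.length; i++) {
            System.out.println(clusterId(sig, i, 2));
        }
        // prints: 10-20, 20-30, 30-40, 40-10
    }
}
```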
MinHashReducer
@Override
protected void reduce(Text cluster, Iterable<Writable> points, Context context)
    throws IOException, InterruptedException {
  Collection<Writable> pointList = Lists.newArrayList();
  for (Writable point : points) {
    if (debugOutput) {
      Vector pointVector = ((VectorWritable) point).get().clone();
      Writable writablePointVector = new VectorWritable(pointVector);
      pointList.add(writablePointVector);
    } else {
      Writable pointText = new Text(point.toString());
      pointList.add(pointText);
    }
  }
  if (pointList.size() >= minClusterSize) {
    context.getCounter(Clusters.ACCEPTED).increment(1);
    for (Writable point : pointList) {
      context.write(cluster, point);
    }
  } else {
    context.getCounter(Clusters.DISCARDED).increment(1);
  }
}
Once keyGroups is understood, the code above is easy to follow: clusters smaller than minClusterSize are discarded, and for the rest the output key is the keyGroups cluster id and the value is the document id.
This also shows that MinHash's output is not the final result; to obtain final clusters you still need to post-process the output yourself.
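One possible post-processing step, sketched here under the assumption that each output line is a cluster id and a document id separated by a tab, is simply grouping document ids by cluster id (the input lines are made up; a real job would read the MinHash output files):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Groups (clusterId, docId) lines by cluster id.
public class GroupClusters {
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> clusters = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2); // "clusterId<TAB>docId"
            clusters.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "10-20\tdocA", "10-20\tdocB", "30-40\tdocC");
        System.out.println(group(lines));
    }
}
```

Since a document is emitted once per hash function, a real post-processing pass would typically also deduplicate or merge overlapping clusters.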