SpatialHadoop 2.3 Source Code Reading (7): The Sampler Class

The main job of this class is to draw a random sample from the input files. The Sampler class supports three sampling modes: by ratio, by size, and by record count. Sampling by ratio has both a local implementation and a MapReduce implementation; the other two modes only have local implementations. The dispatching code is as follows:

public static void sample(Path[] inputFiles,
      ResultCollector<? extends TextSerializable> output, OperationsParams params)
      throws IOException {
    if (params.get("ratio") != null) {
      if (params.getBoolean("local", false))
        sampleLocalWithRatio(inputFiles, output, params);
      else
        sampleMapReduceWithRatio(inputFiles, output, params);
    } else if (params.get("size") != null) {
      sampleLocalWithSize(inputFiles, output, params);
    } else if (params.get("count") != null){
      // The only way to sample by count is using the local sampler
      sampleLocalByCount(inputFiles, output, params);
    } else {
      throw new RuntimeException("Must provide one of three options 'size', 'ratio' or 'count'");
    }
  }
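
As a usage sketch (not taken from the source: the no-argument OperationsParams constructor, the Text2 collector type, and the input path are assumptions made purely for illustration), a caller that sets only "ratio" and leaves "local" unset would be routed to sampleMapReduceWithRatio:

// Assumptions: OperationsParams behaves like a Hadoop Configuration,
// Text2 implements TextSerializable, and ResultCollector declares a
// single collect(T) callback. The path below is a placeholder.
OperationsParams params = new OperationsParams();
params.setFloat("ratio", 0.01f);       // keep roughly 1% of the records
// params.setBoolean("local", true);   // would route to sampleLocalWithRatio instead

Sampler.sample(new Path[] { new Path("hdfs:///data/points.txt") },
    new ResultCollector<Text2>() {
      @Override
      public void collect(Text2 sampledLine) {
        // handle one sampled record; here we just print it
        System.out.println(sampledLine);
      }
    }, params);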

The rest of this post focuses on the MapReduce implementation, i.e. the sampleMapReduceWithRatio method.

1. The Map implementation

The Map class defines the following member fields:

    /**Ratio of lines to sample*/
    private double sampleRatio;
    
    /**Random number generator to use*/
    private Random random;

    /**The key assigned to all output records to reduce shuffle overhead*/
    private IntWritable key = new IntWritable((int) (Math.random() * Integer.MAX_VALUE));
    
    /**Shape instance used to parse input lines*/
    private Shape inShape;
    
    enum Conversion {None, ShapeToPoint, ShapeToRect};
    Conversion conversion;

sampleRatio is the fraction of input lines to sample.

random is Java's built-in random number generator, used to decide which records to keep.

key is the key attached to every record the map emits; here it is a random integer, and all key-value pairs emitted by the map share this same key, which reduces shuffle overhead.

inShape is the shape type used to parse input lines.

conversion describes how the input shape type is converted to the output shape type: no conversion, shape to point, or shape to rectangle.


Before any calls to map, the Map class first runs its configure method, implemented as follows:

@Override
    public void configure(JobConf job) {
      sampleRatio = job.getFloat("ratio", 0.01f);
      random = new Random(job.getLong("seed", System.currentTimeMillis()));
      
      TextSerializable inObj = OperationsParams.getTextSerializable(job, "shape", new Text2());
      TextSerializable outObj = OperationsParams.getTextSerializable(job, "outshape", new Text2());

      if (inObj.getClass() == outObj.getClass()) {
        conversion = Conversion.None;
      } else {
        if (inObj instanceof Shape && outObj instanceof Point) {
          inShape = (Shape) inObj;
          conversion = Conversion.ShapeToPoint;
        } else if (inObj instanceof Shape && outObj instanceof Rectangle) {
          inShape = (Shape) inObj;
          conversion = Conversion.ShapeToRect;
        } else if (outObj instanceof Text) {
          conversion = Conversion.None;
        } else {
          throw new RuntimeException("Don't know how to convert from: "+
              inObj.getClass()+" to "+outObj.getClass());
        }
      }
    }
This method performs the initialization: it reads sampleRatio, seeds random, and resolves the input and output types from the "shape" and "outshape" parameters, then derives conversion from that pair of types. For example, if the input type is a Shape (say a rectangle) and the output type is Point, conversion is set to ShapeToPoint.


The map method is then called once per input record; its implementation is as follows:

public void map(Rectangle cell, Text line,
        OutputCollector<IntWritable, Text> output, Reporter reporter)
            throws IOException {
      if (random.nextFloat() < sampleRatio) {
        switch (conversion) {
        case None:
          output.collect(key, line);
          break;
        case ShapeToPoint:
          inShape.fromText(line);
          Rectangle mbr = inShape.getMBR();
          if (mbr != null) {
            Point center = mbr.getCenterPoint();
            line.clear();
            center.toText(line);
            output.collect(key, line);
          }
          break;
        case ShapeToRect:
          inShape.fromText(line);
          mbr = inShape.getMBR();
          if (mbr != null) {
            line.clear();
            mbr.toText(line);
            output.collect(key, line);
          }
          break;
        }
      }
    }
The check random.nextFloat() < sampleRatio at the top of map is what produces a sample of the desired size. random.nextFloat() returns a value uniformly distributed in [0.0, 1.0), so each record is kept independently with probability sampleRatio, and the expected fraction of sampled records equals sampleRatio.
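
As a quick standalone sanity check (a toy program, not SpatialHadoop code), applying the same random.nextFloat() < sampleRatio test to a million synthetic records keeps a count close to the expected total * sampleRatio:

import java.util.Random;

public class RatioSamplingDemo {
  public static void main(String[] args) {
    float sampleRatio = 0.01f;        // plays the role of the "ratio" parameter
    Random random = new Random(42L);  // fixed seed, like the "seed" parameter

    int total = 1000000;
    int kept = 0;
    for (int i = 0; i < total; i++) {
      // keep each record independently with probability sampleRatio
      if (random.nextFloat() < sampleRatio) {
        kept++;
      }
    }
    // expected value is total * sampleRatio = 10000; the observed count fluctuates around it
    System.out.println("kept " + kept + " of " + total + " records");
  }
}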

A kept record is then converted according to the value of conversion (a small standalone illustration follows the list):

None: the line is emitted unchanged.

ShapeToPoint: parse the line, compute the minimum bounding rectangle (MBR) of the shape, take the MBR's center point, and emit that point.

ShapeToRect: parse the line, compute the MBR of the shape, and emit the rectangle.
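
The following toy sketch (simplified parsing; these are not the SpatialHadoop Point/Rectangle classes) shows what each conversion would produce for one made-up line that serializes a rectangle as x1,y1,x2,y2:

public class ConversionDemo {
  public static void main(String[] args) {
    String line = "100,100,200,300";  // made-up input: a rectangle as x1,y1,x2,y2
    String[] p = line.split(",");
    double x1 = Double.parseDouble(p[0]), y1 = Double.parseDouble(p[1]);
    double x2 = Double.parseDouble(p[2]), y2 = Double.parseDouble(p[3]);

    // None: the line is emitted unchanged
    System.out.println("None:         " + line);

    // ShapeToRect: the MBR of a rectangle is the rectangle itself
    System.out.println("ShapeToRect:  " + x1 + "," + y1 + "," + x2 + "," + y2);

    // ShapeToPoint: the center point of the MBR
    System.out.println("ShapeToPoint: " + (x1 + x2) / 2 + "," + (y1 + y2) / 2);
  }
}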



2. The Reduce implementation

 public static class Reduce extends MapReduceBase implements
  Reducer<IntWritable, Text, NullWritable, Text> {
    @Override
    public void reduce(IntWritable dummy, Iterator<Text> values,
        OutputCollector<NullWritable, Text> output, Reporter reporter)
            throws IOException {
      while (values.hasNext()) {
        Text x = values.next();
        output.collect(NullWritable.get(), x);
      }
    }
  }
The reduce method simply replaces the map output key with NullWritable and emits every value unchanged.



Now for the sampleMapReduceWithRatio driver method itself.

The first half of the method configures the MapReduce job:

JobConf job = new JobConf(params, Sampler.class);
    
    Path outputPath;
    FileSystem outFs = FileSystem.get(job);
    do {
      outputPath = new Path(files[0].toUri().getPath()+
          ".sample_"+(int)(Math.random()*1000000));
    } while (outFs.exists(outputPath));
    
    job.setJobName("Sample");
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    ClusterStatus clusterStatus = new JobClient(job).getClusterStatus();
    job.setNumMapTasks(clusterStatus.getMaxMapTasks() * 5);
    // Number of reduces can be set to zero. However, setting it to a reasonable
    // number ensures that number of output files is limited to that number
    job.setNumReduceTasks(
        Math.max(1, (int)Math.round(clusterStatus.getMaxReduceTasks() * 0.9)));
    
    job.setInputFormat(ShapeLineInputFormat.class);
    job.setOutputFormat(TextOutputFormat.class);
    
    ShapeLineInputFormat.setInputPaths(job, files);
    TextOutputFormat.setOutputPath(job, outputPath);
    
    // Submit the job
    RunningJob run_job = JobClient.runJob(job);

The second half, which post-processes the job's result, is the more interesting part:

 Counters counters = run_job.getCounters();
    Counter outputRecordCounter = counters.findCounter(Task.Counter.MAP_OUTPUT_RECORDS);
    final long resultCount = outputRecordCounter.getValue();
    
    Counter outputSizeConter = counters.findCounter(Task.Counter.MAP_OUTPUT_BYTES);
    final long sampleSize = outputSizeConter.getValue();
    
    LOG.info("resultSize: "+sampleSize);
    LOG.info("resultCount: "+resultCount);

    Counter inputBytesCounter = counters.findCounter(Task.Counter.MAP_INPUT_BYTES);
    Sampler.sizeOfLastProcessedFile = inputBytesCounter.getValue();

    // Ratio of records to return from output based on the threshold
    // Note that any number greater than or equal to one will cause all
    // elements to be returned
    long desiredSampleSize = job.getLong("size", 0);
    // Fraction of drawn sample to return
    float selectRatio = desiredSampleSize <= 0? 2.0f : (float)desiredSampleSize / sampleSize;

    // Read job result
    int result_size = 0;
    if (selectRatio > 1.0f) {
      // Return all records from the output
      ShapeLineInputFormat inputFormat = new ShapeLineInputFormat();
      ShapeLineInputFormat.setInputPaths(job, outputPath);
      InputSplit[] splits = inputFormat.getSplits(job, 1);
      for (InputSplit split : splits) {
        RecordReader<Rectangle, Text> reader = inputFormat.getRecordReader(split, job, null);
        Rectangle key = reader.createKey();
        Text value = reader.createValue();
        T outObj = (T) OperationsParams.getTextSerializable(params, "outshape", new Text2());
        while (reader.next(key, value)) {
          outObj.fromText(value);
          output.collect(outObj);
        }
        reader.close();
      }
    } else {
      if (output != null) {
        OperationsParams params2 = new OperationsParams(params);
        params2.setFloat("ratio", selectRatio);
        params2.set("shape", params.get("outshape"));
        params2.set("outshape", params.get("outshape"));
        if (selectRatio > 0.1) {
          LOG.info("Local return "+selectRatio+" of "+resultCount+" records");
          // Keep a copy of sizeOfLastProcessedFile because we don't want it changed
          long tempSize = sizeOfLastProcessedFile;
          // Return a (small) ratio of the result using a MapReduce job
          // In this case, the files are very big and we need just a small ratio
          // of them. It is better to do it in parallel
          result_size = sampleLocalWithRatio(new Path[] { outputPath},
              output, params2);
          sizeOfLastProcessedFile = tempSize;
        } else {
          LOG.info("MapReduce return "+selectRatio+" of "+resultCount+" records");
          // Keep a copy of sizeOfLastProcessedFile because we don't want it changed
          long tempSize = sizeOfLastProcessedFile;
          // Return a (small) ratio of the result using a MapReduce job
          // In this case, the files are very big and we need just a small ratio
          // of them. It is better to do it in parallel
          result_size = sampleMapReduceWithRatio(new Path[] { outputPath},
              output, params2);
          sizeOfLastProcessedFile = tempSize;
        }
      }
    }

    outFs.delete(outputPath, true);
    
    return result_size;

The first block reads the job counters: the number of records emitted by the map phase (MAP_OUTPUT_RECORDS), the number of bytes emitted (MAP_OUTPUT_BYTES, i.e. the size of the drawn sample), and the number of input bytes processed (MAP_INPUT_BYTES, stored into sizeOfLastProcessedFile).

Next, selectRatio is computed as the ratio between the sample size the user requested via the "size" parameter and the size of the sample the job actually drew; any value greater than or equal to one means the entire drawn sample should be returned.
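
A quick numeric illustration with made-up byte counts:

public class SelectRatioExample {
  public static void main(String[] args) {
    long desiredSampleSize = 20L * 1024 * 1024;  // user asked for 20 MB via "size"
    long sampleSize = 100L * 1024 * 1024;        // MAP_OUTPUT_BYTES of the drawn sample
    float selectRatio = desiredSampleSize <= 0 ?
        2.0f : (float) desiredSampleSize / sampleSize;
    // selectRatio = 0.2: only a fifth of the drawn sample is needed, so the output
    // is re-sampled (locally, since 0.2 > 0.1). A desired size of 200 MB would give
    // selectRatio = 2.0 > 1.0, and every output record would be returned.
    System.out.println("selectRatio = " + selectRatio);
  }
}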

If the requested size is larger than the drawn sample (selectRatio > 1), the method scans the job's output directory, parses every record as the output shape type ("outshape"), and passes it to the result collector.

Otherwise the drawn sample is larger than needed, so the records in the job's output directory are sampled a second time.

If this re-sampling ratio is greater than 0.1, the re-sampling is done locally with sampleLocalWithRatio.

If it is 0.1 or less, the re-sampling is done with another MapReduce job via sampleMapReduceWithRatio.

(Note: this looks like a mistake to me; it should be the other way around, sampling locally when the ratio is below 0.1 and with MapReduce when it is above 0.1.)

Finally, the temporary output directory is deleted and result_size is returned.
