The main purpose of this class is to draw a random sample from the input files. The Sampler class supports three sampling modes: by ratio, by size, and by record count. Of these, ratio-based sampling has both a local and a MapReduce implementation; the other two modes have only local implementations. The dispatch code is as follows:
public static void sample(Path[] inputFiles,
    ResultCollector<? extends TextSerializable> output, OperationsParams params)
    throws IOException {
  if (params.get("ratio") != null) {
    if (params.getBoolean("local", false))
      sampleLocalWithRatio(inputFiles, output, params);
    else
      sampleMapReduceWithRatio(inputFiles, output, params);
  } else if (params.get("size") != null) {
    sampleLocalWithSize(inputFiles, output, params);
  } else if (params.get("count") != null) {
    // The only way to sample by count is using the local sampler
    sampleLocalByCount(inputFiles, output, params);
  } else {
    throw new RuntimeException("Must provide one of three options 'size', 'ratio' or 'count'");
  }
}
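As a quick illustration, the parameter-driven dispatch above can be mirrored in a stand-alone sketch. This is not SpatialHadoop code: a plain Map<String, String> stands in for OperationsParams, and the method only returns the name of the sampler that would be chosen:

```java
import java.util.HashMap;
import java.util.Map;

public class SampleDispatchDemo {
    // Mirrors the three-way dispatch in Sampler.sample, with a plain Map
    // standing in for OperationsParams (an assumption for illustration).
    static String chooseSampler(Map<String, String> params) {
        if (params.get("ratio") != null) {
            boolean local = Boolean.parseBoolean(params.getOrDefault("local", "false"));
            return local ? "sampleLocalWithRatio" : "sampleMapReduceWithRatio";
        } else if (params.get("size") != null) {
            return "sampleLocalWithSize";
        } else if (params.get("count") != null) {
            return "sampleLocalByCount";
        }
        throw new RuntimeException("Must provide one of 'size', 'ratio' or 'count'");
    }

    public static void main(String[] args) {
        Map<String, String> p = new HashMap<>();
        p.put("ratio", "0.01");
        System.out.println(chooseSampler(p)); // sampleMapReduceWithRatio
    }
}
```

Note how a missing "local" flag defaults to false, so ratio-based sampling runs as a MapReduce job unless local execution is requested explicitly.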
The rest of this section focuses on the MapReduce implementation, i.e. the sampleMapReduceWithRatio method.
1. The Map implementation
The Map class defines the following member variables:
/**Ratio of lines to sample*/
private double sampleRatio;
/**Random number generator to use*/
private Random random;
/**The key assigned to all output records to reduce shuffle overhead*/
private IntWritable key = new IntWritable((int) (Math.random() * Integer.MAX_VALUE));
/**Shape instance used to parse input lines*/
private Shape inShape;
enum Conversion {None, ShapeToPoint, ShapeToRect};
Conversion conversion;
sampleRatio is the sampling ratio.
random is an instance of Java's built-in random number generator, used to decide which records to sample.
key is the key emitted by the map method. It is a single random number, and every key-value pair the map emits carries this same key, which reduces shuffle overhead.
inShape is the shape type of the input data.
conversion specifies how the input shape type is converted to the output shape type: no conversion, shape to point, or shape to rectangle.
Before the map method runs, the framework first calls the configure method, implemented as follows:
@Override
public void configure(JobConf job) {
  sampleRatio = job.getFloat("ratio", 0.01f);
  random = new Random(job.getLong("seed", System.currentTimeMillis()));
  TextSerializable inObj = OperationsParams.getTextSerializable(job, "shape", new Text2());
  TextSerializable outObj = OperationsParams.getTextSerializable(job, "outshape", new Text2());
  if (inObj.getClass() == outObj.getClass()) {
    conversion = Conversion.None;
  } else {
    if (inObj instanceof Shape && outObj instanceof Point) {
      inShape = (Shape) inObj;
      conversion = Conversion.ShapeToPoint;
    } else if (inObj instanceof Shape && outObj instanceof Rectangle) {
      inShape = (Shape) inObj;
      conversion = Conversion.ShapeToRect;
    } else if (outObj instanceof Text) {
      conversion = Conversion.None;
    } else {
      throw new RuntimeException("Don't know how to convert from: "+
          inObj.getClass()+" to "+outObj.getClass());
    }
  }
}
This method performs the initialization: it reads sampleRatio and the random seed, resolves the input and output shape types, and sets conversion according to that pair of types.
The framework then invokes the map method once per input record; its implementation is as follows:
public void map(Rectangle cell, Text line,
    OutputCollector<IntWritable, Text> output, Reporter reporter)
    throws IOException {
  if (random.nextFloat() < sampleRatio) {
    switch (conversion) {
    case None:
      output.collect(key, line);
      break;
    case ShapeToPoint:
      inShape.fromText(line);
      Rectangle mbr = inShape.getMBR();
      if (mbr != null) {
        Point center = mbr.getCenterPoint();
        line.clear();
        center.toText(line);
        output.collect(key, line);
      }
      break;
    case ShapeToRect:
      inShape.fromText(line);
      mbr = inShape.getMBR();
      if (mbr != null) {
        line.clear();
        mbr.toText(line);
        output.collect(key, line);
      }
      break;
    }
  }
}
The random.nextFloat() < sampleRatio check in map ensures that the sampled data matches the sampleRatio proportion in expectation: random.nextFloat() returns a value uniformly distributed between 0.0 and 1.0, so each record passes the check independently with probability sampleRatio.
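To see this in action, the following stand-alone sketch (not part of the Sampler source) draws Bernoulli samples the same way the map method does and shows that the observed fraction stays close to the ratio:

```java
import java.util.Random;

public class BernoulliSampleDemo {
    public static void main(String[] args) {
        float sampleRatio = 0.01f;        // same default the Mapper reads from "ratio"
        Random random = new Random(42L);  // fixed seed, analogous to the "seed" parameter
        int total = 1_000_000;
        int kept = 0;
        for (int i = 0; i < total; i++) {
            // Each record is kept independently with probability sampleRatio,
            // exactly like the check guarding the switch in map()
            if (random.nextFloat() < sampleRatio) {
                kept++;
            }
        }
        // The observed fraction is close to 0.01 for any large input
        System.out.printf("kept %d of %d records (%.4f)%n", kept, total, (double) kept / total);
    }
}
```

Because each record is decided independently, the sample size only equals sampleRatio times the input count in expectation; the actual count fluctuates around it, which is one reason sampleMapReduceWithRatio later reads the real sample size from the job counters instead of assuming an exact value.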
A sampled record is then converted according to the conversion value:
None: the record is emitted unchanged.
ShapeToPoint: compute the minimum bounding rectangle (MBR) of the input shape, then emit its center point.
ShapeToRect: compute the MBR of the input shape and emit it.
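Geometrically both conversions are simple. The sketch below uses simplified stand-ins for SpatialHadoop's Point and Rectangle (the real classes also implement Shape and TextSerializable) to show the center-point computation behind ShapeToPoint:

```java
public class MbrCenterDemo {
    // Simplified stand-ins for SpatialHadoop's Point and Rectangle;
    // the real classes carry much more functionality.
    static class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }
    static class Rectangle {
        final double x1, y1, x2, y2; // lower-left and upper-right corners
        Rectangle(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        // Center of the MBR, as used by the ShapeToPoint conversion
        Point getCenterPoint() {
            return new Point((x1 + x2) / 2, (y1 + y2) / 2);
        }
    }

    public static void main(String[] args) {
        Rectangle mbr = new Rectangle(0, 0, 4, 2);
        Point c = mbr.getCenterPoint();
        System.out.println(c.x + "," + c.y); // 2.0,1.0
    }
}
```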
2. The Reduce implementation
public static class Reduce extends MapReduceBase implements
    Reducer<IntWritable, Text, NullWritable, Text> {
  @Override
  public void reduce(IntWritable dummy, Iterator<Text> values,
      OutputCollector<NullWritable, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      Text x = values.next();
      output.collect(NullWritable.get(), x);
    }
  }
}
The reduce method simply drops the map output key, emitting every value with a NullWritable key.
Now to the sampleMapReduceWithRatio method itself.
The first half of the method configures the MapReduce job:
JobConf job = new JobConf(params, Sampler.class);
Path outputPath;
FileSystem outFs = FileSystem.get(job);
do {
  outputPath = new Path(files[0].toUri().getPath()+
      ".sample_"+(int)(Math.random()*1000000));
} while (outFs.exists(outputPath));
job.setJobName("Sample");
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
ClusterStatus clusterStatus = new JobClient(job).getClusterStatus();
job.setNumMapTasks(clusterStatus.getMaxMapTasks() * 5);
// Number of reduces can be set to zero. However, setting it to a reasonable
// number ensures that number of output files is limited to that number
job.setNumReduceTasks(
    Math.max(1, (int)Math.round(clusterStatus.getMaxReduceTasks() * 0.9)));
job.setInputFormat(ShapeLineInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
ShapeLineInputFormat.setInputPaths(job, files);
TextOutputFormat.setOutputPath(job, outputPath);
// Submit the job
RunningJob run_job = JobClient.runJob(job);
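One detail worth noting in the configuration above is the do-while loop that probes for an unused output path. The same idea can be sketched stand-alone; here a Set of existing names stands in for FileSystem.exists (an assumption for illustration):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class UniquePathDemo {
    // Mirrors the do-while loop in sampleMapReduceWithRatio: keep drawing
    // random suffixes until the candidate path does not exist yet.
    static String uniqueSamplePath(String base, Set<String> existing, Random rnd) {
        String candidate;
        do {
            candidate = base + ".sample_" + rnd.nextInt(1000000);
        } while (existing.contains(candidate));
        return candidate;
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>();
        existing.add("/data/input.sample_0"); // pretend a previous run left this behind
        String p = uniqueSamplePath("/data/input", existing, new Random(7L));
        System.out.println(p); // some unused /data/input.sample_NNNNNN
    }
}
```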
The second half of the method post-processes the MapReduce result and deserves closer attention:
Counters counters = run_job.getCounters();
Counter outputRecordCounter = counters.findCounter(Task.Counter.MAP_OUTPUT_RECORDS);
final long resultCount = outputRecordCounter.getValue();
Counter outputSizeConter = counters.findCounter(Task.Counter.MAP_OUTPUT_BYTES);
final long sampleSize = outputSizeConter.getValue();
LOG.info("resultSize: "+sampleSize);
LOG.info("resultCount: "+resultCount);
Counter inputBytesCounter = counters.findCounter(Task.Counter.MAP_INPUT_BYTES);
Sampler.sizeOfLastProcessedFile = inputBytesCounter.getValue();

// Ratio of records to return from output based on the threshold
// Note that any number greater than or equal to one will cause all
// elements to be returned
long desiredSampleSize = job.getLong("size", 0);
// Fraction of drawn sample to return
float selectRatio = desiredSampleSize <= 0? 2.0f : (float)desiredSampleSize / sampleSize;

// Read job result
int result_size = 0;
if (selectRatio > 1.0f) {
  // Return all records from the output
  ShapeLineInputFormat inputFormat = new ShapeLineInputFormat();
  ShapeLineInputFormat.setInputPaths(job, outputPath);
  InputSplit[] splits = inputFormat.getSplits(job, 1);
  for (InputSplit split : splits) {
    RecordReader<Rectangle, Text> reader = inputFormat.getRecordReader(split, job, null);
    Rectangle key = reader.createKey();
    Text value = reader.createValue();
    T outObj = (T) OperationsParams.getTextSerializable(params, "outshape", new Text2());
    while (reader.next(key, value)) {
      outObj.fromText(value);
      output.collect(outObj);
    }
    reader.close();
  }
} else {
  if (output != null) {
    OperationsParams params2 = new OperationsParams(params);
    params2.setFloat("ratio", selectRatio);
    params2.set("shape", params.get("outshape"));
    params2.set("outshape", params.get("outshape"));
    if (selectRatio > 0.1) {
      LOG.info("Local return "+selectRatio+" of "+resultCount+" records");
      // Keep a copy of sizeOfLastProcessedFile because we don't want it changed
      long tempSize = sizeOfLastProcessedFile;
      // Return a (small) ratio of the result using a MapReduce job
      // In this case, the files are very big and we need just a small ratio
      // of them. It is better to do it in parallel
      result_size = sampleLocalWithRatio(new Path[] { outputPath},
          output, params2);
      sizeOfLastProcessedFile = tempSize;
    } else {
      LOG.info("MapReduce return "+selectRatio+" of "+resultCount+" records");
      // Keep a copy of sizeOfLastProcessedFile because we don't want it changed
      long tempSize = sizeOfLastProcessedFile;
      // Return a (small) ratio of the result using a MapReduce job
      // In this case, the files are very big and we need just a small ratio
      // of them. It is better to do it in parallel
      result_size = sampleMapReduceWithRatio(new Path[] { outputPath},
          output, params2);
      sizeOfLastProcessedFile = tempSize;
    }
  }
}
outFs.delete(outputPath, true);
return result_size;
The post-processing proceeds as follows:
First, the job counters are read: the number of records the map phase emitted (MAP_OUTPUT_RECORDS), the number of bytes it emitted (MAP_OUTPUT_BYTES), and the number of input bytes processed (MAP_INPUT_BYTES).
Next, selectRatio is computed: the ratio between the size the user requested and the size of the drawn sample.
If the requested size is at least the sample size (selectRatio > 1.0), the whole MapReduce output directory is read and every record is returned.
Otherwise the sample in the output directory must be sampled again:
If the re-sampling ratio is greater than 0.1, the re-sampling is done locally via sampleLocalWithRatio.
If it is at most 0.1, the re-sampling is done with another MapReduce job via sampleMapReduceWithRatio.
(Note: this looks questionable to me; intuitively, ratios below 0.1 should be sampled locally and ratios above 0.1 with MapReduce. The in-code comment, however, argues that extracting a small fraction of very large files is better done in parallel.)
Finally, the temporary output directory is deleted.
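The selectRatio arithmetic and the two thresholds (1.0 for "return everything", 0.1 for local versus MapReduce re-sampling) can be traced with a stand-alone sketch that mirrors the expression from the source:

```java
public class SelectRatioDemo {
    // Mirrors: desiredSampleSize <= 0 ? 2.0f : (float) desiredSampleSize / sampleSize
    static float selectRatio(long desiredSampleSize, long sampleSize) {
        return desiredSampleSize <= 0 ? 2.0f : (float) desiredSampleSize / sampleSize;
    }

    public static void main(String[] args) {
        // No "size" parameter set: 2.0 > 1.0, so every sampled record is returned
        System.out.println(selectRatio(0, 5_000_000));          // 2.0
        // Want 1 MB out of a 20 MB sample: 0.05 <= 0.1, re-sample via MapReduce
        System.out.println(selectRatio(1_000_000, 20_000_000)); // 0.05
        // Want 15 MB out of a 20 MB sample: 0.75 > 0.1, re-sample locally
        System.out.println(selectRatio(15_000_000, 20_000_000)); // 0.75
    }
}
```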