The main purpose of this class is to draw a random sample from the input files. The Sampler class supports three sampling modes: by ratio, by size, and by record count. Of these, ratio-based sampling has both a local and a MapReduce implementation; the other two modes have only local implementations. The dispatch code is as follows:
public static void sample(Path[] inputFiles,
    ResultCollector<? extends TextSerializable> output, OperationsParams params)
    throws IOException {
  if (params.get("ratio") != null) {
    if (params.getBoolean("local", false))
      sampleLocalWithRatio(inputFiles, output, params);
    else
      sampleMapReduceWithRatio(inputFiles, output, params);
  } else if (params.get("size") != null) {
    sampleLocalWithSize(inputFiles, output, params);
  } else if (params.get("count") != null) {
    // The only way to sample by count is using the local sampler
    sampleLocalByCount(inputFiles, output, params);
  } else {
    throw new RuntimeException("Must provide one of three options 'size', 'ratio' or 'count'");
  }
}
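As a quick illustration, the parameter-driven dispatch above can be mirrored in a stand-alone sketch. This is not SpatialHadoop code: a plain Map<String, String> stands in for OperationsParams, and the method only returns the name of the sampler that would be chosen:

```java
import java.util.HashMap;
import java.util.Map;

public class SampleDispatchDemo {
    // Mirrors the three-way dispatch in Sampler.sample, with a plain Map
    // standing in for OperationsParams (an assumption for illustration).
    static String chooseSampler(Map<String, String> params) {
        if (params.get("ratio") != null) {
            boolean local = Boolean.parseBoolean(params.getOrDefault("local", "false"));
            return local ? "sampleLocalWithRatio" : "sampleMapReduceWithRatio";
        } else if (params.get("size") != null) {
            return "sampleLocalWithSize";
        } else if (params.get("count") != null) {
            return "sampleLocalByCount";
        }
        throw new RuntimeException("Must provide one of 'size', 'ratio' or 'count'");
    }

    public static void main(String[] args) {
        Map<String, String> p = new HashMap<>();
        p.put("ratio", "0.01");
        System.out.println(chooseSampler(p)); // sampleMapReduceWithRatio
    }
}
```

Note how a missing "local" flag defaults to false, so ratio-based sampling runs as a MapReduce job unless local execution is requested explicitly.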
The rest of this section focuses on the MapReduce implementation, i.e. the sampleMapReduceWithRatio method.
1. The Map implementation
The Map class defines the following member variables:
/**Ratio of lines to sample*/
private double sampleRatio;
/**Random number generator to use*/
private Random random;
/**The key assigned to all output records to reduce shuffle overhead*/
private IntWritable key = new IntWritable((int) (Math.random() * Integer.MAX_VALUE));
/**Shape instance used to parse input lines*/
private Shape inShape;
enum Conversion {None, ShapeToPoint, ShapeToRect};
Conversion conversion;
sampleRatio is the sampling ratio.
random is an instance of Java's built-in random number generator, used to decide which records to sample.
key is the key emitted by the map method. It is a single random number, and every key-value pair the map emits carries this same key, which reduces shuffle overhead.
inShape is the shape type of the input data.
conversion specifies how the input shape type is converted to the output shape type: no conversion, shape to point, or shape to rectangle.
Before the map method runs, the framework first calls the configure method, implemented as follows:
@Override
public void configure(JobConf job) {
  sampleRatio = job.getFloat("ratio", 0.01f);
  random = new Random(job.getLong("seed", System.currentTimeMillis()));
  TextSerializable inObj = OperationsParams.getTextSerializable(job, "shape", new Text2());
  TextSerializable outObj = OperationsParams.getTextSerializable(job, "outshape", new Text2());
  if (inObj.getClass() == outObj.getClass()) {
    conversion = Conversion.None;
  } else {
    if (inObj instanceof Shape && outObj instanceof Point) {
      inShape = (Shape) inObj;
      conversion = Conversion.ShapeToPoint;
    } else if (inObj instanceof Shape && outObj instanceof Rectangle) {
      inShape = (Shape) inObj;
      conversion = Conversion.ShapeToRect;
    } else if (outObj instanceof Text) {
      conversion = Conversion.None;
    } else {
      throw new RuntimeException("Don't know how to convert from: "+
          inObj.getClass()+" to "+outObj.getClass());
    }
  }
}
This method performs the initialization: it reads sampleRatio and the random seed, resolves the input and output shape types, and sets conversion according to that pair of types.
The framework then invokes the map method once per input record; its implementation is as follows:
public void map(Rectangle cell, Text line,
    OutputCollector<IntWritable, Text> output, Reporter reporter)
    throws IOException {
  if (random.nextFloat() < sampleRatio) {
    switch (conversion) {
    case None:
      output.collect(key, line);
      break;
    case ShapeToPoint:
      inShape.fromText(line);
      Rectangle mbr = inShape.getMBR();
      if (mbr != null) {
        Point center = mbr.getCenterPoint();
        line.clear();
        center.toText(line);
        output.collect(key, line);
      }
      break;
    case ShapeToRect:
      inShape.fromText(line);
      mbr = inShape.getMBR();
      if (mbr != null) {
        line.clear();
        mbr.toText(line);
        output.collect(key, line);
      }
      break;
    }
  }
}
The random.nextFloat() < sampleRatio check in map ensures that the sampled data matches the sampleRatio proportion in expectation: random.nextFloat() returns a value uniformly distributed between 0.0 and 1.0, so each record passes the check independently with probability sampleRatio.
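To see this in action, the following stand-alone sketch (not part of the Sampler source) draws Bernoulli samples the same way the map method does and shows that the observed fraction stays close to the ratio:

```java
import java.util.Random;

public class BernoulliSampleDemo {
    public static void main(String[] args) {
        float sampleRatio = 0.01f;        // same default the Mapper reads from "ratio"
        Random random = new Random(42L);  // fixed seed, analogous to the "seed" parameter
        int total = 1_000_000;
        int kept = 0;
        for (int i = 0; i < total; i++) {
            // Each record is kept independently with probability sampleRatio,
            // exactly like the check guarding the switch in map()
            if (random.nextFloat() < sampleRatio) {
                kept++;
            }
        }
        // The observed fraction is close to 0.01 for any large input
        System.out.printf("kept %d of %d records (%.4f)%n", kept, total, (double) kept / total);
    }
}
```

Because each record is decided independently, the sample size only equals sampleRatio times the input count in expectation; the actual count fluctuates around it, which is one reason sampleMapReduceWithRatio later reads the real sample size from the job counters instead of assuming an exact value.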
A sampled record is then converted according to the conversion value:
None: the record is emitted unchanged.
ShapeToPoint: compute the minimum bounding rectangle (MBR) of the input shape, then emit its center point.
ShapeToRect: compute the MBR of the input shape and emit it.
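Geometrically both conversions are simple. The sketch below uses simplified stand-ins for SpatialHadoop's Point and Rectangle (the real classes also implement Shape and TextSerializable) to show the center-point computation behind ShapeToPoint:

```java
public class MbrCenterDemo {
    // Simplified stand-ins for SpatialHadoop's Point and Rectangle;
    // the real classes carry much more functionality.
    static class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }
    static class Rectangle {
        final double x1, y1, x2, y2; // lower-left and upper-right corners
        Rectangle(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        // Center of the MBR, as used by the ShapeToPoint conversion
        Point getCenterPoint() {
            return new Point((x1 + x2) / 2, (y1 + y2) / 2);
        }
    }

    public static void main(String[] args) {
        Rectangle mbr = new Rectangle(0, 0, 4, 2);
        Point c = mbr.getCenterPoint();
        System.out.println(c.x + "," + c.y); // 2.0,1.0
    }
}
```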
2. The Reduce implementation
public static class Reduce extends MapReduceBase implements
    Reducer<IntWritable, Text, NullWritable, Text> {
  @Override
  public void reduce(IntWritable dummy, Iterator<Text> values,
      OutputCollector<NullWritable, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      Text x = values.next();
      output.collect(NullWritable.get(), x);
    }
  }
}
The reduce method simply drops the map output key, emitting every value with a NullWritable key.
Now to the sampleMapReduceWithRatio method itself.
The first half of the method configures the MapReduce job:
JobConf job = new JobConf(params, Sampler.class);
Path outputPath;
FileSystem outFs = FileSystem.get(job);
do {
  outputPath = new Path(files[0].toUri().getPath()+
      ".sample_"+(int)(Math.random()*1000000));
} while (outFs.exists(outputPath));
job.setJobName("Sample");
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
ClusterStatus clusterStatus = new JobClient(job).getClusterStatus();
job.setNumMapTasks(clusterStatus.getMaxMapTasks() * 5);
// Number of reduces can be set to zero. However, setting it to a reasonable
// number ensures that number of output files is limited to that number
job.setNumReduceTasks(
    Math.max(1, (int)Math.round(clusterStatus.getMaxReduceTasks() * 0.9)));
job.setInputFormat(ShapeLineInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
ShapeLineInputFormat.setInputPaths(job, files);
TextOutputFormat.setOutputPath(job, outputPath);
// Submit the job
RunningJob run_job = JobClient.runJob(job);
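One detail worth noting in the configuration above is the do-while loop that probes for an unused output path. The same idea can be sketched stand-alone; here a Set of existing names stands in for FileSystem.exists (an assumption for illustration):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class UniquePathDemo {
    // Mirrors the do-while loop in sampleMapReduceWithRatio: keep drawing
    // random suffixes until the candidate path does not exist yet.
    static String uniqueSamplePath(String base, Set<String> existing, Random rnd) {
        String candidate;
        do {
            candidate = base + ".sample_" + rnd.nextInt(1000000);
        } while (existing.contains(candidate));
        return candidate;
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>();
        existing.add("/data/input.sample_0"); // pretend a previous run left this behind
        String p = uniqueSamplePath("/data/input", existing, new Random(7L));
        System.out.println(p); // some unused /data/input.sample_NNNNNN
    }
}
```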
The second half of the method post-processes the MapReduce result and deserves closer attention:
Counters counters = run_job.getCounters();
Counter outputRecordCounter = counters.findCounter(Task.Counter.MAP_OUTPUT_RECORDS);
final long resultCount = outputRecordCounter.getValue();
Counter outputSizeConter = counters.findCounter(Task.Counter.MAP_OUTPUT_BYTES);
final long sampleSize = outputSizeConter.getValue();
LOG.info("resultSize: "+sampleSize);
LOG.info("resultCount: "+resultCount);
Counter inputBytesCounter = counters.findCounter(Task.Counter.MAP_INPUT_BYTES);
Sampler.sizeOfLastProcessedFile = inputBytesCounter.getValue();

// Ratio of records to return from output based on the threshold
// Note that any number greater than or equal to one will cause all
// elements to be returned
long desiredSampleSize = job.getLong("size", 0);
// Fraction of drawn sample to return
float selectRatio = desiredSampleSize <= 0? 2.0f : (float)desiredSampleSize / sampleSize;

// Read job result
int result_size = 0;
if (selectRatio > 1.0f) {
  // Return all records from the output
  ShapeLineInputFormat inputFormat = new ShapeLineInputFormat();
  ShapeLineInputFormat.setInputPaths(job, outputPath);
  InputSplit[] splits = inputFormat.getSplits(job, 1);
  for (InputSplit split : splits) {
    RecordReader<Rectangle, Text> reader = inputFormat.getRecordReader(split, job, null);
    Rectangle key = reader.createKey();
    Text value = reader.createValue();
    T outObj = (T) OperationsParams.getTextSerializable(params, "outshape", new Text2());
    while (reader.next(key, value)) {
      outObj.fromText(value);
      output.collect(outObj);
    }
    reader.close();
  }
} else {
  if (output != null) {
    OperationsParams params2 = new OperationsParams(params);
    params2.setFloat("ratio", selectRatio);
    params2.set("shape", params.get("outshape"));
    params2.set("outshape", params.get("outshape"));
    if (selectRatio > 0.1) {
      LOG.info("Local return "+selectRatio+" of "+resultCount+" records");
      // Keep a copy of sizeOfLastProcessedFile because we don't want it changed
      long tempSize = sizeOfLastProcessedFile;
      // Return a (small) ratio of the result using a MapReduce job
      // In this case, the files are very big and we need just a small ratio
      // of them. It is better to do it in parallel
      result_size = sampleLocalWithRatio(new Path[] { outputPath},
          output, params2);
      sizeOfLastProcessedFile = tempSize;
    } else {
      LOG.info("MapReduce return "+selectRatio+" of "+resultCount+" records");
      // Keep a copy of sizeOfLastProcessedFile because we don't want it changed
      long tempSize = sizeOfLastProcessedFile;
      // Return a (small) ratio of the result using a MapReduce job
      // In this case, the files are very big and we need just a small ratio
      // of them. It is better to do it in parallel
      result_size = sampleMapReduceWithRatio(new Path[] { outputPath},
          output, params2);
      sizeOfLastProcessedFile = tempSize;
    }
  }
}
outFs.delete(outputPath, true);
return result_size;
The post-processing proceeds as follows:
First, the job counters are read: the number of records the map phase emitted (MAP_OUTPUT_RECORDS), the number of bytes it emitted (MAP_OUTPUT_BYTES), and the number of input bytes processed (MAP_INPUT_BYTES).
Next, selectRatio is computed: the ratio between the size the user requested and the size of the drawn sample.
If the requested size is at least the sample size (selectRatio > 1.0), the whole MapReduce output directory is read and every record is returned.
Otherwise the sample in the output directory must be sampled again:
If the re-sampling ratio is greater than 0.1, the re-sampling is done locally via sampleLocalWithRatio.
If it is at most 0.1, the re-sampling is done with another MapReduce job via sampleMapReduceWithRatio.
(Note: this looks questionable to me; intuitively, ratios below 0.1 should be sampled locally and ratios above 0.1 with MapReduce. The in-code comment, however, argues that extracting a small fraction of very large files is better done in parallel.)
Finally, the temporary output directory is deleted.
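The selectRatio arithmetic and the two thresholds (1.0 for "return everything", 0.1 for local versus MapReduce re-sampling) can be traced with a stand-alone sketch that mirrors the expression from the source:

```java
public class SelectRatioDemo {
    // Mirrors: desiredSampleSize <= 0 ? 2.0f : (float) desiredSampleSize / sampleSize
    static float selectRatio(long desiredSampleSize, long sampleSize) {
        return desiredSampleSize <= 0 ? 2.0f : (float) desiredSampleSize / sampleSize;
    }

    public static void main(String[] args) {
        // No "size" parameter set: 2.0 > 1.0, so every sampled record is returned
        System.out.println(selectRatio(0, 5_000_000));          // 2.0
        // Want 1 MB out of a 20 MB sample: 0.05 <= 0.1, re-sample via MapReduce
        System.out.println(selectRatio(1_000_000, 20_000_000)); // 0.05
        // Want 15 MB out of a 20 MB sample: 0.75 > 0.1, re-sample locally
        System.out.println(selectRatio(15_000_000, 20_000_000)); // 0.75
    }
}
```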