Partitioning: by default each partition is handled by one reducer, so a job with N partitions writes N output files on HDFS, named part-r-00000, part-r-00001, part-r-00002, and so on.
Sorting: sorting happens three times. Each mapper's intermediate output is sorted (twice on the map side: once when spilling, once when merging the spills), and each reducer then merge-sorts the mapper outputs for its partition, so every partition's final output is sorted as well.
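The reduce-side step above is a k-way merge of already-sorted runs. A minimal plain-Java sketch of that idea (illustrative only, not Hadoop's actual implementation; the class name `SortedRunMerge` is made up):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedRunMerge {
    // K-way merge of already-sorted runs (one per mapper), mimicking the
    // reduce-side merge. A plain-Java sketch, not Hadoop's real code.
    public static List<String> merge(List<List<String>> runs) {
        // Heap entries are {runIndex, positionInRun}, ordered by the
        // element they currently point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> runs.get(e[0]).get(e[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[]{i, 0});
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(runs.get(e[0]).get(e[1]));
            if (e[1] + 1 < runs.get(e[0]).size()) {
                heap.add(new int[]{e[0], e[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two sorted "mapper runs" merge into one sorted sequence.
        System.out.println(merge(List.of(
                List.of("4 1", "5 1"), List.of("4 1", "6 4"))));
        // prints [4 1, 4 1, 5 1, 6 4]
    }
}
```

Because every run is already sorted, the merge only ever compares the heads of the runs, which is why the reducer does not need to re-sort everything from scratch.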
A secondary sort example
1. Input data
4 1
5 1
6 4
7 4
4 1
5 1
6 4
7 4
4 1
5 1
6 4
7 4
4 1
5 1
6 4
7 4
2. Mapper
public class TwriceSortMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Plain long/String values cannot be transferred between Hadoop nodes,
    // so we use Hadoop's serializable wrappers: long => LongWritable, String => Text.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the whole line as both key and value: the composite key
        // "first second" is what the partitioner and sort comparator see.
        context.write(value, value);
    }
}
3. Partitioner
With numReduceTasks=1, all records go to a single partition.
With numReduceTasks=4, all "4 1" records land in one partition, all "5 1" in another, all "6 4" in another, and all "7 4" in another.
public class KeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Partition on the first field only; masking with Integer.MAX_VALUE
        // keeps the result non-negative even for negative hash codes.
        return (key.toString().split(" ")[0].hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
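Plugging the sample keys into the partitioner's arithmetic shows why four reducers separate the four key groups. Since `Text.toString()` returns a plain Java `String`, the hash codes below are just `String.hashCode()` values (the class name `PartitionDemo` is made up for this sketch):

```java
public class PartitionDemo {
    // Same arithmetic as KeyPartitioner.getPartition: hash the first field,
    // clear the sign bit, then take it mod the number of reducers.
    public static int partition(String key, int numReduceTasks) {
        return (key.split(" ")[0].hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // "4".hashCode() == 52, "5" == 53, "6" == 54, "7" == 55, so with
        // 4 reducers: "4 1" -> 0, "5 1" -> 1, "6 4" -> 2, "7 4" -> 3
        for (String key : new String[]{"4 1", "5 1", "6 4", "7 4"}) {
            System.out.println(key + " -> partition " + partition(key, 4));
        }
    }
}
```

With numReduceTasks=1 the modulo sends every key to partition 0, which matches the single-partition case described above.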
4. Sort comparator
// Orders the map output keys: first by the first field, then by the second.
public class SortComparator extends WritableComparator {
    public SortComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable key1, WritableComparable key2) {
        String[] f1 = key1.toString().split(" ");
        String[] f2 = key2.toString().split(" ");
        int cmp = Integer.compare(Integer.parseInt(f1[0]), Integer.parseInt(f2[0]));
        return cmp != 0 ? cmp : Integer.compare(Integer.parseInt(f1[1]), Integer.parseInt(f2[1]));
    }
}
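The two-level comparison can be checked outside Hadoop by restating it as a plain `Comparator<String>` (a sketch only; the job itself uses the `WritableComparator` above, and the key "4 2" below is made up to exercise the second-field tiebreak):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class KeyOrderDemo {
    // Two-level comparison: first field numerically, then second field.
    public static final Comparator<String> KEY_ORDER = (a, b) -> {
        String[] f1 = a.split(" ");
        String[] f2 = b.split(" ");
        int cmp = Integer.compare(Integer.parseInt(f1[0]), Integer.parseInt(f2[0]));
        return cmp != 0 ? cmp : Integer.compare(Integer.parseInt(f1[1]), Integer.parseInt(f2[1]));
    };

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(List.of("7 4", "4 2", "4 1", "5 1"));
        keys.sort(KEY_ORDER);
        System.out.println(keys); // prints [4 1, 4 2, 5 1, 7 4]
    }
}
```

Note that parsing the fields as integers matters: a plain lexicographic string sort would put "10 1" before "4 1", while this comparator orders them numerically.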
5. Reducer
public class TwriceSortReducer extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> v2s, Context context)
            throws IOException, InterruptedException {
        // Keys arrive already secondary-sorted; just emit each occurrence.
        for (Text value : v2s) {
            context.write(NullWritable.get(), value);
        }
    }
}
6. Main (number of reducers = 1)
public class TwriceSortMain {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Build the job object
        Job job = Job.getInstance(new Configuration());
        // Note: this must be the class containing the main method
        job.setJarByClass(TwriceSortMain.class);
        // Mapper settings
        job.setMapperClass(TwriceSortMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // FileInputFormat.setInputPaths(job, new Path("D:words.txt"));
        job.setSortComparatorClass(SortComparator.class);
        // Reducer settings
        job.setReducerClass(TwriceSortReducer.class);
        job.setPartitionerClass(KeyPartitioner.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setNumReduceTasks(1);
        // FileOutputFormat.setOutputPath(job, new Path("D:wcout510"));
        // Submit the job and wait for completion
        job.waitForCompletion(true);
    }
}
7. Output and explanation
With numReduceTasks=1, all records go to a single partition, so all output appears in one file, part-r-00000.
With numReduceTasks=4, all "4 1" records land in one partition, all "5 1" in another, all "6 4" in another, and all "7 4" in another.
Each partition is handled by one reducer and produces one output file (part-r-00000 through part-r-00003), and each partition is secondary-sorted independently.
4 1
4 1
4 1
4 1
5 1
5 1
5 1
5 1
6 4
6 4
6 4
6 4
7 4
7 4
7 4
7 4
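As a sanity check, the numReduceTasks=1 case can be simulated in plain Java: with one reducer there is one partition, so sorting every input line with the two-level comparison reproduces part-r-00000 (the class name `PipelineDemo` is illustrative, not part of the job):

```java
import java.util.ArrayList;
import java.util.List;

public class PipelineDemo {
    // One reducer => one partition; sorting all lines with the two-level
    // comparison reproduces the contents of part-r-00000.
    public static List<String> simulate(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort((a, b) -> {
            String[] f1 = a.split(" ");
            String[] f2 = b.split(" ");
            int cmp = Integer.compare(Integer.parseInt(f1[0]), Integer.parseInt(f2[0]));
            return cmp != 0 ? cmp : Integer.compare(Integer.parseInt(f1[1]), Integer.parseInt(f2[1]));
        });
        return sorted;
    }

    public static void main(String[] args) {
        // The sample input: four copies of each of the four lines.
        List<String> input = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            input.addAll(List.of("4 1", "5 1", "6 4", "7 4"));
        }
        simulate(input).forEach(System.out::println);
    }
}
```

Running this prints four copies of each line in key order, matching the output listed above.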