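Secondary sort: Partitioner, SortComparator, GroupingComparator
Partitioner: decides which reducer each key is sent to; override getPartition().
Similarities and differences between SortComparator and GroupingComparator:
Same: both extend the WritableComparator class, pass the bean class to the parent constructor, and override compare().
Different: SortComparator implements the secondary sort, so its compare() defines the order of the bean objects; GroupingComparator implements grouping, so its compare() decides which bean objects end up in the same reduce() call.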
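The division of work between the two comparators can be tried out without a cluster. What follows is only a minimal sketch in plain Java: a hypothetical Pair class stands in for the bean, and java.util.Comparator stands in for WritableComparator. It illustrates what each compare() is responsible for, not how Hadoop invokes them internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortVsGroupSketch {
    // Hypothetical stand-in for the bean: just the two ints.
    static final class Pair {
        final int int1;
        final int int2;

        Pair(int int1, int int2) {
            this.int1 = int1;
            this.int2 = int2;
        }

        @Override
        public String toString() {
            return "(" + int1 + "," + int2 + ")";
        }
    }

    // Plays the role of the sort comparator: int2 first, then int1.
    static final Comparator<Pair> SORT =
            Comparator.<Pair>comparingInt(p -> p.int2).thenComparingInt(p -> p.int1);

    // Plays the role of the grouping comparator: int2 only.
    static final Comparator<Pair> GROUP = Comparator.comparingInt(p -> p.int2);

    // Sorts the pairs, then starts a new group wherever GROUP reports a difference.
    static List<List<Pair>> sortAndGroup(List<Pair> input) {
        List<Pair> sorted = new ArrayList<>(input);
        sorted.sort(SORT);
        List<List<Pair>> groups = new ArrayList<>();
        for (Pair p : sorted) {
            if (groups.isEmpty()
                    || GROUP.compare(groups.get(groups.size() - 1).get(0), p) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(p);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Pair> input = Arrays.asList(
                new Pair(7, 2), new Pair(3, 1), new Pair(5, 2), new Pair(9, 1));
        // prints [[(3,1), (9,1)], [(5,2), (7,2)]]
        System.out.println(sortAndGroup(input));
    }
}
```

Each inner list corresponds to one reduce() call: the order inside it comes from SORT, and its boundaries come from GROUP.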
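Requirements:
1. Each record is a pair of integers (int1,int2); int1 ranges from 1 to 100000 and int2 from 1 to 100.
2. Sort by int2 first, then by int1.
3. Use at least five reducers, and the reduce output must be totally ordered.
Bean object: it must implement WritableComparable. Here compareTo() is overridden to do the sorting, which is the same job a SortComparator would do.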
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MyBean implements WritableComparable<MyBean> {
    private int int1;
    private int int2;

    // No-arg constructor: Hadoop creates instances by reflection when deserializing
    public MyBean() {
    }

    public MyBean(int int1, int int2) {
        this.int1 = int1;
        this.int2 = int2;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(int1);
        out.writeInt(int2);
    }

    // Fields must be read in the same order write() wrote them
    @Override
    public void readFields(DataInput in) throws IOException {
        this.int1 = in.readInt();
        this.int2 = in.readInt();
    }

    @Override
    public int compareTo(MyBean o) { // secondary sort: int2 first, then int1
        if (this.int2 == o.getInt2()) {
            return Integer.compare(this.int1, o.getInt1());
        } else {
            return Integer.compare(this.int2, o.getInt2());
        }
    }

    @Override
    public String toString() {
        return "(" + int1 + "," + int2 + ")";
    }

    public int getInt1() {
        return int1;
    }

    public void setInt1(int int1) {
        this.int1 = int1;
    }

    public int getInt2() {
        return int2;
    }

    public void setInt2(int int2) {
        this.int2 = int2;
    }
}
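The write()/readFields() pair only works if both sides agree on field order. As a rough check that needs no Hadoop jars, the sketch below round-trips a record through java.io data streams. PairRecord and roundTrip are hypothetical names used for illustration; PairRecord mirrors the bean's two methods but does not implement the Hadoop interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class RoundTripSketch {
    // Hypothetical stand-in for the bean, without the Hadoop interface.
    static final class PairRecord {
        int int1;
        int int2;

        void write(DataOutput out) throws IOException {
            out.writeInt(int1);
            out.writeInt(int2);
        }

        void readFields(DataInput in) throws IOException {
            int1 = in.readInt(); // same order as write()
            int2 = in.readInt();
        }
    }

    // Serializes a record to bytes and reads it back into a fresh instance.
    static PairRecord roundTrip(PairRecord original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                original.write(out);
            }
            PairRecord copy = new PairRecord();
            try (DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PairRecord r = new PairRecord();
        r.int1 = 42;
        r.int2 = 7;
        PairRecord copy = roundTrip(r);
        System.out.println(copy.int1 + "," + copy.int2); // prints 42,7
    }
}
```

Swapping the two readInt() calls would still run, but the fields would come back exchanged, which is the kind of mistake this check is meant to catch.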
map
public static class MyMapper extends Mapper<LongWritable, Text, MyBean, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString(); // (int1,int2)
        String[] fields = line.split(",");
        String num1 = fields[0].substring(1);                        // drop the leading "("
        String num2 = fields[1].substring(0, fields[1].length() - 1); // drop the trailing ")"
        MyBean bean = new MyBean(Integer.parseInt(num1), Integer.parseInt(num2));
        // All the information is in the key, so the value is NullWritable
        context.write(bean, NullWritable.get());
    }
}
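The mapper above assumes every input line is a well-formed (int1,int2) record. If that cannot be guaranteed, the parsing step could be made defensive along the following lines. This is only a sketch: the helper name parse and the choice to signal a bad line with null are assumptions, not part of the original job.

```java
import java.util.Arrays;

final class LineParseSketch {
    // Parses one "(int1,int2)" line into {int1, int2}.
    // Returns null when the line does not have the expected shape.
    static int[] parse(String line) {
        String s = line.trim();
        // The shortest valid record, "(1,1)", is five characters long.
        if (s.length() < 5 || s.charAt(0) != '(' || s.charAt(s.length() - 1) != ')') {
            return null;
        }
        String[] fields = s.substring(1, s.length() - 1).split(",");
        if (fields.length != 2) {
            return null;
        }
        try {
            return new int[] {
                Integer.parseInt(fields[0].trim()),
                Integer.parseInt(fields[1].trim())
            };
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("(123,45)"))); // prints [123, 45]
        System.out.println(Arrays.toString(parse("oops")));     // prints null
    }
}
```

In the mapper, a null result would then mean "skip this line" instead of failing the task with an exception.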
partitioner
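public static class MyPartitioner extends Partitioner<MyBean, NullWritable> {
    @Override
    public int getPartition(MyBean key, NullWritable value, int numPartitions) {
        int int2 = key.getInt2();
        // Range partitioning on int2: 1-20 -> 0, 21-40 -> 1, ..., 81-100 -> 4.
        // Smaller int2 values go to lower-numbered reducers, so the reducer
        // outputs read in order form a total order.
        return (int2 - 1) / 20;
    }
}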
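Dividing by 20 hard-codes five partitions over the range 1-100. The sketch below shows one way the same range split could follow numPartitions instead. It is written as a plain static method so that it runs without Hadoop; the name rangePartition and the maxInt2 parameter are assumptions made for illustration.

```java
final class RangePartitionSketch {
    // Maps int2 in [1, maxInt2] to a partition in [0, numPartitions - 1].
    // The result never decreases as int2 grows, so every value in partition p
    // is smaller than every value in partition p + 1.
    static int rangePartition(int int2, int maxInt2, int numPartitions) {
        int width = (maxInt2 + numPartitions - 1) / numPartitions; // ceil(maxInt2 / numPartitions)
        return Math.min((int2 - 1) / width, numPartitions - 1);
    }

    public static void main(String[] args) {
        // With five partitions this matches (int2 - 1) / 20.
        for (int v : new int[] {1, 20, 21, 100}) {
            System.out.println(v + " -> " + rangePartition(v, 100, 5));
        }
        // prints:
        // 1 -> 0
        // 20 -> 0
        // 21 -> 1
        // 100 -> 4
    }
}
```

The property that matters for a totally ordered output is monotonicity in int2; how evenly the ranges are sized only affects load balance.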
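groupingComparator: groups by int2 only
public static class GroupingComparator extends WritableComparator {
    public GroupingComparator() {
        // true: let WritableComparator create MyBean instances to deserialize into
        super(MyBean.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        MyBean beanA = (MyBean) a;
        MyBean beanB = (MyBean) b;
        // int1 is ignored: keys with the same int2 belong to the same group
        return Integer.compare(beanA.getInt2(), beanB.getInt2());
    }
}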
reducer
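One note: keys that compare as equal go to the same reduce() call. Because the grouping comparator looks only at int2, the keys (bean objects) in one group are not necessarily identical, and several distinct keys can share a group. Two cases arise:
1. int2 is the same, int1 differs.
2. int2 is the same and int1 is the same too.
The value type here is NullWritable, so the only way to see the different bean objects is to iterate over values: the key object is refreshed on each step of the iteration. Without iterating, you only ever see the first key (bean object) of the group.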
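public static class MyReducer extends Reducer<MyBean, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(MyBean key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Output line format: int2:int1,int1,...
        // StringBuilder avoids repeated String copies for large groups
        StringBuilder sb = new StringBuilder();
        sb.append(key.getInt2()).append(':');
        for (NullWritable value : values) {
            // key is updated as values is iterated, so getInt1() changes per step
            sb.append(key.getInt1()).append(',');
        }
        sb.setLength(sb.length() - 1); // drop the trailing comma
        context.write(new Text(sb.toString()), NullWritable.get());
    }
}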
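To get a feel for what a single reducer writes, the grouping can be replayed over an already-sorted list in plain Java. This is a sketch under stated assumptions: int[] pairs of the form {int1, int2} stand in for the bean, and the method name format is hypothetical. The sample input covers both cases listed above, including a duplicated pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReduceGroupSketch {
    // Takes {int1, int2} pairs already sorted by int2, then int1, and produces
    // one line per int2 group in the form "int2:int1,int1,...".
    static List<String> format(List<int[]> sortedPairs) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = null;
        int currentInt2 = 0;
        for (int[] pair : sortedPairs) {
            if (current == null || pair[1] != currentInt2) {
                // int2 changed: close the previous group and open a new one
                if (current != null) {
                    lines.add(current.toString());
                }
                currentInt2 = pair[1];
                current = new StringBuilder().append(currentInt2).append(':').append(pair[0]);
            } else {
                current.append(',').append(pair[0]);
            }
        }
        if (current != null) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<int[]> sorted = Arrays.asList(
                new int[] {3, 1}, new int[] {3, 1}, new int[] {9, 1}, new int[] {5, 2});
        System.out.println(format(sorted)); // prints [1:3,3,9, 2:5]
    }
}
```

The duplicated (3,1) shows up twice in its line because every map output record contributes its own value to the group, even when the keys are identical.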
driver
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    Configuration configuration = new Configuration();
    Job job = Job.getInstance(configuration);
    job.setJarByClass(MyGroupingComparator.class);

    // Partitioning, grouping, and the number of reducers
    job.setPartitionerClass(MyPartitioner.class);
    job.setGroupingComparatorClass(GroupingComparator.class);
    job.setNumReduceTasks(5);

    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);

    // Map output types differ from the final output types, so set both
    job.setMapOutputKeyClass(MyBean.class);
    job.setMapOutputValueClass(NullWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    boolean result = job.waitForCompletion(true);
    System.exit(result ? 0 : 1);
}
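The mapper, reducer, and driver all live in the same class:
public class MyGroupingComparator {}
The code above covers partitioning, sorting, and grouping. The sorting can be implemented in two ways:
1. The version above uses the compareTo() method of the bean, which implements WritableComparable.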
@Override
public int compareTo(MyBean o) { // secondary sort: int2 first, then int1
    if (this.int2 == o.getInt2()) {
        return this.int1 - o.getInt1();
    } else {
        return this.int2 - o.getInt2();
    }
}
2. Alternatively, write a comparator class that extends WritableComparator (named SortComparator here) and register it in the driver:
job.setSortComparatorClass(SortComparator.class); // in the driver
public class SortComparator extends WritableComparator {
    public SortComparator() {
        super(MyBean.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        MyBean beanA = (MyBean) a;
        MyBean beanB = (MyBean) b;
        // Same order as MyBean.compareTo(): int2 first, then int1
        if (beanA.getInt2() == beanB.getInt2()) {
            return Integer.compare(beanA.getInt1(), beanB.getInt1());
        } else {
            return Integer.compare(beanA.getInt2(), beanB.getInt2());
        }
    }
}
Appendix: generator class for the (int1,int2) data
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Random;

/*
 * Uses random numbers to generate text files of (int1,int2) records,
 * with no fewer than 100 files
 * and no fewer than 100,000 records per file,
 * where int1 is a random number in 1-100000 and int2 is a random number in 1-100.
 */
public class InitRandom {
    public static void main(String[] args) throws IOException {
        int int1 = 100000;          // upper bound of int1
        int int2 = 100;             // upper bound of int2
        int numOfFiles = 100;
        int numOfRecords = 100000;
        String path = args[0];      // output directory, used later as the job's input path
        Random random = new Random();
        for (int i = 1; i <= numOfFiles; i++) {
            System.out.println("writing file#" + i);
            // try-with-resources closes the stream even if a write fails
            try (PrintStream pStream = new PrintStream(new BufferedOutputStream(
                    new FileOutputStream(new File(path, "file" + i))))) {
                for (int j = 0; j < numOfRecords; j++) {
                    // nextInt(n) returns 0..n-1, so add 1 to get 1..n
                    pStream.println("(" + (random.nextInt(int1) + 1) + "," + (random.nextInt(int2) + 1) + ")");
                }
            }
        }
    }
}