Secondary Sort in MapReduce

Sunrise0929

Article link: https://blog.csdn.net/xiaoqinggao/article/details/9944455

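Hadoop MapReduce ships with an example program, SecondarySort, that demonstrates secondary sorting. Secondary sort supports tasks such as computing the highest temperature for each year. If the values for each year reached the reducer already sorted by temperature in descending order, we would not need to iterate over them to find the maximum; we could take the first value for each year and ignore the rest. This is not the most efficient way to solve this particular problem, but it shows how secondary sort works. To get that ordering, make the key composite, consisting of the year and the temperature, and sort first by year ascending and then by temperature descending. A composite key alone does not guarantee that records for the same year go to the same reducer, so a partitioner must be set that partitions on the year part of the key only. Even so, the partitioner does not change the fact that the reducer groups records by the whole key within a partition, so grouping must be controlled as well: by grouping values in the reducer on the year part of the key, all records for the same year land in the same reduce group. Because they are sorted by temperature in descending order, the first one is the highest temperature.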
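The idea can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class `MaxTemperatureSketch`, the `YearTemp` type, and the sample readings are all hypothetical. It sorts composite keys by year ascending and temperature descending, then keeps the first record per year, which is what the sort plus grouping achieve in the real job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MaxTemperatureSketch {
    // Hypothetical composite key: year plus temperature.
    static final class YearTemp {
        final int year;
        final int temperature;

        YearTemp(int year, int temperature) {
            this.year = year;
            this.temperature = temperature;
        }
    }

    // Sort by year ascending, then temperature descending, and keep only the
    // first record seen for each year (the role grouping plays in the reducer).
    static Map<Integer, Integer> maxPerYear(List<YearTemp> records) {
        List<YearTemp> sorted = new ArrayList<>(records);
        sorted.sort((a, b) -> a.year != b.year
                ? Integer.compare(a.year, b.year)
                : Integer.compare(b.temperature, a.temperature));
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (YearTemp r : sorted) {
            result.putIfAbsent(r.year, r.temperature);
        }
        return result;
    }

    public static void main(String[] args) {
        List<YearTemp> input = Arrays.asList(
                new YearTemp(1950, 22), new YearTemp(1949, 111),
                new YearTemp(1950, -11), new YearTemp(1949, 78),
                new YearTemp(1950, 0));
        System.out.println(maxPerYear(input)); // prints {1949=111, 1950=22}
    }
}
```

In the real job the sort happens during the shuffle, so the reducer only has to read the first value of each group.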

The rest of this post walks through the SecondarySort example source that ships with MapReduce:

(1) Custom key

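In MapReduce, every key must be comparable so that it can be sorted, and keys are effectively ordered in two stages: first by partition, then by key comparison within the partition. This example also compares twice: records are ordered by the first field, and records with the same first field are then ordered by the second field. To support this, we build a composite class, IntPair, with two fields; the partitioner routes records by the first field, and the comparison within each partition orders them by the first field and then the second. Every custom key should implement the WritableComparable interface, because a key must be both serializable and comparable.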

// A custom key class should implement the WritableComparable interface.
// Needs: java.io.DataInput, java.io.DataOutput, java.io.IOException,
// org.apache.hadoop.io.WritableComparable
public static class IntPair implements WritableComparable<IntPair> {
    int first;
    int second;

    public void set(int left, int right) {
        first = left;
        second = right;
    }

    public int getFirst() {
        return first;
    }

    public int getSecond() {
        return second;
    }

    // Deserialization: rebuild the IntPair from the binary stream
    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Serialization: write the IntPair as binary data to the stream
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Key comparison: by first, then by second
    @Override
    public int compareTo(IntPair o) {
        if (first != o.first) {
            return first < o.first ? -1 : 1;
        } else if (second != o.second) {
            return second < o.second ? -1 : 1;
        } else {
            return 0;
        }
    }

    // A newly defined key class should also override the following two methods.
    // The hashCode() method is used by the HashPartitioner (the default partitioner in MapReduce)
    @Override
    public int hashCode() {
        return first * 157 + second;
    }

    @Override
    public boolean equals(Object right) {
        if (right == null)
            return false;
        if (this == right)
            return true;
        if (right instanceof IntPair) {
            IntPair r = (IntPair) right;
            return r.first == first && r.second == second;
        } else {
            return false;
        }
    }
}
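To see the serialization contract and the ordering in action without Hadoop on the classpath, here is a rough standalone sketch. The `Pair` class is a hypothetical stand-in for IntPair that uses only `java.io`, and `roundTrip` is an illustrative helper, not part of the Hadoop example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class IntPairRoundTrip {
    // Hypothetical stand-in for IntPair, without the Hadoop interfaces.
    static final class Pair implements Comparable<Pair> {
        int first;
        int second;

        Pair(int first, int second) {
            this.first = first;
            this.second = second;
        }

        // Same field order as readFields, so the two methods mirror each other.
        void write(DataOutput out) throws IOException {
            out.writeInt(first);
            out.writeInt(second);
        }

        void readFields(DataInput in) throws IOException {
            first = in.readInt();
            second = in.readInt();
        }

        @Override
        public int compareTo(Pair o) {
            int c = Integer.compare(first, o.first);
            return c != 0 ? c : Integer.compare(second, o.second);
        }
    }

    // Serialize a pair to bytes and read it back into a fresh instance.
    static Pair roundTrip(Pair p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            Pair copy = new Pair(0, 0);
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Pair copy = roundTrip(new Pair(3, -7));
        System.out.println(copy.first + " " + copy.second);                // prints 3 -7
        System.out.println(new Pair(1, 9).compareTo(new Pair(2, 0)) < 0);  // prints true
        System.out.println(new Pair(1, 2).compareTo(new Pair(1, 5)) < 0);  // prints true
    }
}
```

The point to notice is that `write` and `readFields` must handle the fields in the same order, otherwise the key is corrupted on the way to the reducer.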

(2) Partitioner class

This is the first stage applied to the key: records are assigned to partitions using only the first field.

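// Needs: org.apache.hadoop.mapreduce.Partitioner, org.apache.hadoop.io.IntWritable
public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
    @Override
    public int getPartition(IntPair key, IntWritable value, int numPartitions) {
        // Partition on the first field only, so that all keys sharing a first
        // value go to the same reducer. The sign bit is masked off instead of
        // calling Math.abs, because Math.abs(Integer.MIN_VALUE) is negative and
        // would produce an invalid partition number.
        return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
    }
}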
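The partitioning arithmetic is easy to check in isolation. Below is a small sketch, assuming a hypothetical helper `partition` that applies the same multiply-and-modulo formula to a plain int. It masks off the sign bit to keep the result non-negative, since `Math.abs(Integer.MIN_VALUE)` is itself negative.

```java
public class PartitionSketch {
    // Hypothetical helper: the partitioner's formula applied to a plain int.
    static int partition(int first, int numPartitions) {
        return ((first * 127) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // Keys (5, 1) and (5, 99) share first == 5, so both map to the same partition.
        System.out.println(partition(5, reducers));            // prints 3
        System.out.println(partition(6, reducers));            // prints 2
        // A negative first field still yields a valid partition number.
        System.out.println(partition(-5, reducers) >= 0);      // prints true
        // Why Math.abs alone is not a safe guard:
        System.out.println(Math.abs(Integer.MIN_VALUE) < 0);   // prints true
    }
}
```

Because the second field never enters the formula, every key with the same first field is guaranteed to reach the same reducer.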

(3) Grouping comparator class

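In the reduce phase, when the framework builds the value iterator for a key, all keys with the same first field belong to the same group and share one value iterator. The grouping class is a comparator, so it must extend WritableComparator.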

// Extends WritableComparator (org.apache.hadoop.io.WritableComparator)
public static class GroupingComparator extends WritableComparator {
    protected GroupingComparator() {
        // true: have the comparator create IntPair instances for deserialized keys
        super(IntPair.class, true);
    }

    // Compare two WritableComparables by their first field only.
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        IntPair ip1 = (IntPair) w1;
        IntPair ip2 = (IntPair) w2;
        int l = ip1.getFirst();
        int r = ip2.getFirst();
        return l == r ? 0 : (l < r ? -1 : 1);
    }
}
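The effect of grouping on the first field can be sketched in plain Java. This is only a simulation of what the framework does with a grouping comparator: the class `GroupingSketch`, the method `groupByFirst`, and the use of `int[]` pairs in place of IntPair are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GroupingSketch {
    // Walk keys that are already sorted by (first, second) and start a new
    // group whenever the first field changes, mirroring a grouping comparator
    // that looks only at the first field.
    static List<List<int[]>> groupByFirst(List<int[]> sortedKeys) {
        List<List<int[]>> groups = new ArrayList<>();
        int[] previous = null;
        for (int[] key : sortedKeys) {
            if (previous == null || Integer.compare(previous[0], key[0]) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> sorted = Arrays.asList(
                new int[] {1, 2}, new int[] {1, 5},
                new int[] {2, 0}, new int[] {2, 3}, new int[] {2, 9});
        for (List<int[]> group : groupByFirst(sorted)) {
            StringBuilder line = new StringBuilder("first=" + group.get(0)[0] + " seconds:");
            for (int[] key : group) {
                line.append(' ').append(key[1]);
            }
            System.out.println(line);
        }
        // prints:
        // first=1 seconds: 2 5
        // first=2 seconds: 0 3 9
    }
}
```

Each inner list corresponds to one call to the reduce method, and the second fields inside it arrive in sorted order.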

(4) Settings in the main function

public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    // Load the Hadoop configuration
    Configuration conf = new Configuration();
    // Create the job (Job.getInstance replaces the deprecated Job constructor)
    Job job = Job.getInstance(conf, "secondarysort");
    job.setJarByClass(Sort.class);
    // Mapper class
    job.setMapperClass(Map.class);
    // No combiner: Reduce emits <Text, IntWritable>, which does not match
    // the <IntPair, IntWritable> map output that the reducer consumes
    //job.setCombinerClass(Reduce.class);
    // Reducer class
    job.setReducerClass(Reduce.class);
    // Partitioner
    job.setPartitionerClass(FirstPartitioner.class);
    // Grouping comparator
    job.setGroupingComparatorClass(GroupingComparator.class);
    // Map output key type
    job.setMapOutputKeyClass(IntPair.class);
    // Map output value type
    job.setMapOutputValueClass(IntWritable.class);
    // Reduce output key type: Text, written out as a line of text by TextOutputFormat
    job.setOutputKeyClass(Text.class);
    // Reduce output value type
    job.setOutputValueClass(IntWritable.class);
    // Splits the input data set into smaller chunks (splits) and provides a RecordReader implementation
    job.setInputFormatClass(TextInputFormat.class);
    // Provides a RecordWriter implementation that handles the output
    job.setOutputFormatClass(TextOutputFormat.class);
    // Input path on HDFS
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    // Output path on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Submit the job and wait for it to finish
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
