Partitioning: by default each partition is handled by one reducer, so a job with N partitions writes N output files on HDFS, named part-r-00000, part-r-00001, part-r-00002, and so on.
Sorting: sorting happens three times. Each mapper's intermediate output is sorted (twice on the map side: once when spilling, once when merging the spills), and each reducer then merge-sorts the mapper outputs for its partition, so every partition's final output is sorted as well.
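The reduce-side step above is a k-way merge of already-sorted runs. A minimal plain-Java sketch of that idea (illustrative only, not Hadoop's actual implementation; the class name `SortedRunMerge` is made up):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedRunMerge {
    // K-way merge of already-sorted runs (one per mapper), mimicking the
    // reduce-side merge. A plain-Java sketch, not Hadoop's real code.
    public static List<String> merge(List<List<String>> runs) {
        // Heap entries are {runIndex, positionInRun}, ordered by the
        // element they currently point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> runs.get(e[0]).get(e[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[]{i, 0});
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(runs.get(e[0]).get(e[1]));
            if (e[1] + 1 < runs.get(e[0]).size()) {
                heap.add(new int[]{e[0], e[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two sorted "mapper runs" merge into one sorted sequence.
        System.out.println(merge(List.of(
                List.of("4 1", "5 1"), List.of("4 1", "6 4"))));
        // prints [4 1, 4 1, 5 1, 6 4]
    }
}
```

Because every run is already sorted, the merge only ever compares the heads of the runs, which is why the reducer does not need to re-sort everything from scratch.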
A secondary sort example
1. Input data
4 1
5 1
6 4
7 4
4 1
5 1
6 4
7 4
4 1
5 1
6 4
7 4
4 1
5 1
6 4
7 4
2. Mapper
public class TwriceSortMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Plain long/String values cannot be transferred between Hadoop nodes,
    // so we use Hadoop's serializable wrappers: long => LongWritable, String => Text.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the whole line as both key and value: the composite key
        // "first second" is what the partitioner and sort comparator see.
        context.write(value, value);
    }
}
3. Partitioner
With numReduceTasks=1, all records go to a single partition.
With numReduceTasks=4, all "4 1" records land in one partition, all "5 1" in another, all "6 4" in another, and all "7 4" in another.
public class KeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Partition on the first field only; masking with Integer.MAX_VALUE
        // keeps the result non-negative even for negative hash codes.
        return (key.toString().split(" ")[0].hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
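Plugging the sample keys into the partitioner's arithmetic shows why four reducers separate the four key groups. Since `Text.toString()` returns a plain Java `String`, the hash codes below are just `String.hashCode()` values (the class name `PartitionDemo` is made up for this sketch):

```java
public class PartitionDemo {
    // Same arithmetic as KeyPartitioner.getPartition: hash the first field,
    // clear the sign bit, then take it mod the number of reducers.
    public static int partition(String key, int numReduceTasks) {
        return (key.split(" ")[0].hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // "4".hashCode() == 52, "5" == 53, "6" == 54, "7" == 55, so with
        // 4 reducers: "4 1" -> 0, "5 1" -> 1, "6 4" -> 2, "7 4" -> 3
        for (String key : new String[]{"4 1", "5 1", "6 4", "7 4"}) {
            System.out.println(key + " -> partition " + partition(key, 4));
        }
    }
}
```

With numReduceTasks=1 the modulo sends every key to partition 0, which matches the single-partition case described above.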
4. Sort comparator
// Orders the map output keys: first by the first field, then by the second.
public class SortComparator extends WritableComparator {
    public SortComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable key1, WritableComparable key2) {
        String[] f1 = key1.toString().split(" ");
        String[] f2 = key2.toString().split(" ");
        int cmp = Integer.compare(Integer.parseInt(f1[0]), Integer.parseInt(f2[0]));
        return cmp != 0 ? cmp : Integer.compare(Integer.parseInt(f1[1]), Integer.parseInt(f2[1]));
    }
}
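The two-level comparison can be checked outside Hadoop by restating it as a plain `Comparator<String>` (a sketch only; the job itself uses the `WritableComparator` above, and the key "4 2" below is made up to exercise the second-field tiebreak):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class KeyOrderDemo {
    // Two-level comparison: first field numerically, then second field.
    public static final Comparator<String> KEY_ORDER = (a, b) -> {
        String[] f1 = a.split(" ");
        String[] f2 = b.split(" ");
        int cmp = Integer.compare(Integer.parseInt(f1[0]), Integer.parseInt(f2[0]));
        return cmp != 0 ? cmp : Integer.compare(Integer.parseInt(f1[1]), Integer.parseInt(f2[1]));
    };

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(List.of("7 4", "4 2", "4 1", "5 1"));
        keys.sort(KEY_ORDER);
        System.out.println(keys); // prints [4 1, 4 2, 5 1, 7 4]
    }
}
```

Note that parsing the fields as integers matters: a plain lexicographic string sort would put "10 1" before "4 1", while this comparator orders them numerically.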
5. Reducer
public class TwriceSortReducer extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> v2s, Context context)
            throws IOException, InterruptedException {
        // Keys arrive already secondary-sorted; just emit each occurrence.
        for (Text value : v2s) {
            context.write(NullWritable.get(), value);
        }
    }
}
6. Main (number of reducers = 1)
public class TwriceSortMain {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Build the job object
        Job job = Job.getInstance(new Configuration());
        // Note: this must be the class containing the main method
        job.setJarByClass(TwriceSortMain.class);
        // Mapper settings
        job.setMapperClass(TwriceSortMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // FileInputFormat.setInputPaths(job, new Path("D:words.txt"));
        job.setSortComparatorClass(SortComparator.class);
        // Reducer settings
        job.setReducerClass(TwriceSortReducer.class);
        job.setPartitionerClass(KeyPartitioner.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setNumReduceTasks(1);
        // FileOutputFormat.setOutputPath(job, new Path("D:wcout510"));
        // Submit the job and wait for completion
        job.waitForCompletion(true);
    }
}
7. Output and explanation
With numReduceTasks=1, all records go to a single partition, so all output appears in one file, part-r-00000.
With numReduceTasks=4, all "4 1" records land in one partition, all "5 1" in another, all "6 4" in another, and all "7 4" in another.
Each partition is handled by one reducer and produces one output file (part-r-00000 through part-r-00003), and each partition is secondary-sorted independently.
4 1
4 1
4 1
4 1
5 1
5 1
5 1
5 1
6 4
6 4
6 4
6 4
7 4
7 4
7 4
7 4
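As a sanity check, the numReduceTasks=1 case can be simulated in plain Java: with one reducer there is one partition, so sorting every input line with the two-level comparison reproduces part-r-00000 (the class name `PipelineDemo` is illustrative, not part of the job):

```java
import java.util.ArrayList;
import java.util.List;

public class PipelineDemo {
    // One reducer => one partition; sorting all lines with the two-level
    // comparison reproduces the contents of part-r-00000.
    public static List<String> simulate(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort((a, b) -> {
            String[] f1 = a.split(" ");
            String[] f2 = b.split(" ");
            int cmp = Integer.compare(Integer.parseInt(f1[0]), Integer.parseInt(f2[0]));
            return cmp != 0 ? cmp : Integer.compare(Integer.parseInt(f1[1]), Integer.parseInt(f2[1]));
        });
        return sorted;
    }

    public static void main(String[] args) {
        // The sample input: four copies of each of the four lines.
        List<String> input = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            input.addAll(List.of("4 1", "5 1", "6 4", "7 4"));
        }
        simulate(input).forEach(System.out::println);
    }
}
```

Running this prints four copies of each line in key order, matching the output listed above.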