Total Sort in Hadoop

皮皮的雅客

Categories: 【Hadoop】, 【大数据】 (Big Data). Tags: hadoop, total sort

Article link: https://blog.csdn.net/king123456man/article/details/81530476

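One of Hadoop's important capabilities is cleaning and sorting the data it processes, turning unordered data into ordered data. The MapReduce framework sorts by default, but only partially: each reducer's output is sorted by key, while the outputs of different reducers are not ordered relative to each other. This article introduces the first kind of custom sort, the total sort, in which the whole output is sorted by key.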
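To make the difference between a partial and a total sort concrete, here is a small plain-Java simulation with no Hadoop dependency. It assumes the default behaviour of hash partitioning, `(hash & Integer.MAX_VALUE) % numPartitions`, with an int key hashing to its own value. The class and method names are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PartialSortDemo {

    // Mimics default hash partitioning; an int key is assumed to hash to itself.
    static int hashPartition(int key, int numPartitions) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numPartitions;
    }

    // Distributes keys over partitions, sorts each partition on its own, and
    // returns the partitions concatenated in partition order.
    static List<Integer> concatenatedOutput(List<Integer> keys, int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int key : keys) {
            parts.get(hashPartition(key, numPartitions)).add(key);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);   // each partition is sorted individually
            out.addAll(part);
        }
        return out;
    }

    // True if the list is in non-decreasing order.
    static boolean isSorted(List<Integer> xs) {
        for (int i = 1; i < xs.size(); i++) {
            if (xs.get(i - 1) > xs.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> years = Arrays.asList(1975, 1970, 1973, 1971, 1974, 1972);
        System.out.println(concatenatedOutput(years, 3)); // three partitions: partial sort
        System.out.println(concatenatedOutput(years, 1)); // one partition: total sort
    }
}
```

With three partitions each partition is sorted, but the concatenation is not. With a single partition the output is totally sorted, which is why the single-reducer approach works at the cost of parallelism.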

Ways to implement a total sort

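Use a single reducer; the output is then totally sorted by default
Write a custom partition function (set the range boundaries yourself)
Use Hadoop's sampling mechanism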
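The second approach can be sketched in plain Java as follows. The boundaries 2003 and 2036 are hypothetical values picked by hand for years 1970 to 2069 and three reducers. In a real job, logic like `rangePartition` would sit inside the `getPartition(key, value, numPartitions)` method of a `Partitioner<IntWritable, IntWritable>` subclass. The class and method names here are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class RangePartitionSketch {

    // Hypothetical hand-picked boundaries for 3 partitions:
    // partition 0: year < 2003, partition 1: 2003 <= year < 2036, partition 2: year >= 2036.
    static final int[] BOUNDARIES = {2003, 2036};

    // Returns the index of the range that contains the key.
    static int rangePartition(int year, int[] boundaries) {
        int p = 0;
        while (p < boundaries.length && year >= boundaries[p]) {
            p++;
        }
        return p;
    }

    // Range-partitions the keys, sorts each partition, and concatenates them in order.
    static List<Integer> concatenatedOutput(List<Integer> keys, int[] boundaries) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            parts.add(new ArrayList<>());
        }
        for (int key : keys) {
            parts.get(rangePartition(key, boundaries)).add(key);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> years = Arrays.asList(2050, 1970, 2003, 2036, 1999, 2035, 2069);
        System.out.println(concatenatedOutput(years, BOUNDARIES));
    }
}
```

Every key in partition i is smaller than every key in partition i + 1, so concatenating the sorted partitions gives a totally sorted result. The drawback is that hand-picked boundaries can split the data unevenly, and that is the problem the sampling approach addresses.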

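The rest of this article focuses on implementing a total sort with Hadoop's sampling mechanism.

Total sort with Hadoop's sampling mechanism

Preparation

The sampler reads keys through the job's input format, so the input keys need to have the same type as the map output keys. Here the input is a binary SequenceFile (.seq), which has to be prepared first. The code below generates such a file. Each record is a (year, temperature) pair, and the goal of the job is to find the highest temperature for each year.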

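import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

// Writes 6000 random (year, temperature) records to a local SequenceFile.
public void save() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "file:///");   // use the local file system
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("F:/hadoop/temp.seq");
    Random random = new Random();           // one generator, reused for every record
    // try-with-resources closes the writer even if append() throws
    try (SequenceFile.Writer writer =
                 SequenceFile.createWriter(fs, conf, p, IntWritable.class, IntWritable.class)) {
        for (int i = 0; i < 6000; i++) {
            int year = 1970 + random.nextInt(100);   // 1970..2069
            int temp = -30 + random.nextInt(100);    // -30..69
            writer.append(new IntWritable(year), new IntWritable(temp));
        }
    }
}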

Mapper

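import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Identity mapper: passes each (year, temperature) record through unchanged.
public class MaxTempMapper extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {

    @Override
    protected void map(IntWritable key, IntWritable value, Context context) throws IOException, InterruptedException {
        context.write(key, value);
    }
}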

Reducer

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Emits the highest temperature seen for each year.
public class MaxTempReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable iw : values) {
            max = Math.max(max, iw.get());
        }

        context.write(key, new IntWritable(max));
    }
}

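Main function

Note: the map and reduce output types, and the number of reduce tasks, must be set on the job before the sampler is invoked, otherwise the job fails. InputSampler.writePartitionFile reads the map output key class and the reducer count from the job when it writes the partition file.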

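import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

/**
 * Total sort: find the highest temperature of each year.
 *  > The simplest total sort needs no special handling: just use a single reducer.
 *
 *  > This version uses a sampler.
 */
public class MaxTempApp {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf);

        // Set the job's properties
        job.setJobName("MaxTempApp");
        job.setJarByClass(MaxTempApp.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);         // input format

        // Add the input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Set the output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);

        // Map output types
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Reduce output types
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        // Number of reduce tasks
        job.setNumReduceTasks(3);

        // Create the random sampler
        // freq: probability with which each key is selected
        // numSamples: total number of samples to draw
        // maxSplitsSampled: maximum number of input splits to sample
        InputSampler.Sampler<IntWritable, IntWritable> sampler =
                new InputSampler.RandomSampler<>(0.5, 3000, 3);

        // Set the partition file on the job's own configuration object,
        // not on the conf created above (Job.getInstance copies it)
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("F:/hadoop/temp/par.lst"));

        // Use the total-order partitioner
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // Sample the input and write the split points to the partition file
        InputSampler.writePartitionFile(job, sampler);

        // Wait for the job to finish (true = print progress), then exit with its status
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}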
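To give a feel for what the partition file contains, here is a simplified plain-Java sketch of the idea behind it: sort the sampled keys, take numPartitions - 1 split points at evenly spaced positions, and then route each key to a partition by binary search over those split points. This is an approximation for illustration only, not Hadoop's actual code. In particular it does not skip duplicate split points, and the class and method names are made up.

```java
import java.util.Arrays;

final class SplitPointSketch {

    // Picks numPartitions - 1 split points at evenly spaced positions of the sorted sample.
    // Simplification: duplicate split points are not skipped here.
    static int[] splitPoints(int[] samples, int numPartitions) {
        if (numPartitions < 1 || samples.length < numPartitions) {
            throw new IllegalArgumentException("need at least one sample per partition");
        }
        int[] sorted = samples.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        float step = sorted.length / (float) numPartitions;
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[Math.round(step * i)];
        }
        return splits;
    }

    // Routes a key by binary search over the (distinct, sorted) split points:
    // keys below splits[0] go to partition 0, and a key equal to splits[j]
    // goes to partition j + 1.
    static int partitionFor(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        int[] samples = {1990, 1971, 2060, 2010, 2031, 1985, 2045, 2001, 2022};
        int[] splits = splitPoints(samples, 3);
        System.out.println(Arrays.toString(splits));
        System.out.println(partitionFor(1970, splits));
        System.out.println(partitionFor(2069, splits));
    }
}
```

Because the split points come from a sample of the real keys, each reducer should receive roughly the same number of records when the sample is representative, and the reducer outputs concatenated in partition order form a totally sorted result.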
