Total-Order Sorting of Year and Temperature Data with MapReduce, While Avoiding Data Skew (Detailed)

Latest recommended article published on 2024-05-16 13:02:17

Chenway丶


Article link: https://blog.csdn.net/a8330508/article/details/81257096


What is a total-order sort? It means that when the part-r-xxxxx output files are concatenated in order, the data is still sorted.

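Let's start with a simple sort of the year and temperature data:

1. The following is the year and temperature data:

....

2. Compute the highest temperature of each year from the data above.

3. Code:

① Write a class that extends Mapper and override the map method: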

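import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTempMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each line holds a year and a temperature separated by a single space.
        String[] arr = value.toString().split(" ");
        context.write(new IntWritable(Integer.parseInt(arr[0])), new IntWritable(Integer.parseInt(arr[1])));
    }
}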

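Here the key is the byte offset at which each line starts in the input file, and the value is the line itself. Each line is split, and the year and temperature are written out as an (IntWritable, IntWritable) pair.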
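The parsing step can be tried outside Hadoop. The sketch below is not the author's code: the class name RecordParseSketch is hypothetical, and because the sample data is elided above, the "<year> <temperature>" line layout is an assumption. It splits on any run of whitespace, which is slightly more tolerant than split(" ").

```java
import java.util.Arrays;

class RecordParseSketch {
    /** Parses a line such as "1970 32" into {year, temperature} (assumed layout). */
    static int[] parse(String line) {
        // Split on any run of whitespace so tabs or repeated spaces are tolerated.
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected '<year> <temp>', got: " + line);
        }
        return new int[] { Integer.parseInt(fields[0]), Integer.parseInt(fields[1]) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("1970 32"))); // prints [1970, 32]
    }
}
```

In a real mapper, a malformed line would more likely be counted and skipped than allowed to fail the task; that choice is left out of this sketch.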

② Write a class that extends Reducer and override the reduce method:

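import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTempReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    /**
     * Receives one year together with all temperatures recorded for it
     * and writes out the highest one.
     */
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable iw : values) {
            max = max > iw.get() ? max : iw.get();
        }
        context.write(key, new IntWritable(max));
    }
}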

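After the intermediate partitioning and sorting, each call to reduce receives one year and an iterator over that year's temperatures.

The loop compares the temperatures one by one with max = max > iw.get() ? max : iw.get(); and writes out the highest.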
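To see the grouping and the maximum computation in isolation, here is a small plain-Java sketch (no Hadoop involved). The class name MaxPerYearSketch and the sample records are made up for illustration; a TreeMap stands in for the sorted, grouped view that the framework hands to the reducer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MaxPerYearSketch {
    /** Groups {year, temperature} pairs by year and keeps the highest temperature per year. */
    static Map<Integer, Integer> maxPerYear(List<int[]> records) {
        // TreeMap keeps the years sorted, like the key order a single reducer sees.
        Map<Integer, Integer> max = new TreeMap<>();
        for (int[] r : records) {
            max.merge(r[0], r[1], Integer::max);
        }
        return max;
    }

    public static void main(String[] args) {
        List<int[]> records = Arrays.asList(
                new int[] {1971, 12}, new int[] {1970, 30},
                new int[] {1971, 25}, new int[] {1970, -3});
        System.out.println(maxPerYear(records)); // prints {1970=30, 1971=25}
    }
}
```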

③ Write the main class:

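import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTempApp {
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS","file:///");                 // use the local file system

        Job job = Job.getInstance(conf);

        // Set the job's properties
        job.setJobName("MaxTempApp");                        // job name
        job.setJarByClass(MaxTempApp.class);                 // class used to locate the jar
        job.setInputFormatClass(TextInputFormat.class);      // input format

        // Add the input path
        FileInputFormat.addInputPath(job,new Path(args[0]));
        // Set the output path
        FileOutputFormat.setOutputPath(job,new Path(args[1]));

        //job.setPartitionerClass(YearPartitioner.class);

        job.setMapperClass(MaxTempMapper.class);             // mapper class
        job.setReducerClass(MaxTempReducer.class);           // reducer class

        job.setNumReduceTasks(3);                            // number of reduce tasks

        job.setMapOutputKeyClass(IntWritable.class);         // map output key type
        job.setMapOutputValueClass(IntWritable.class);       // map output value type

        job.setOutputKeyClass(IntWritable.class);            // final output key type
        job.setOutputValueClass(IntWritable.class);          // final output value type

        // Exit with a non-zero status if the job fails
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}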

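The project needs a core-site.xml file.

That core-site.xml already specifies a file system path. If you do not want to modify it, you can call

conf.set("fs.defaultFS","file:///"); to use the local file system instead.

④ Finally, set the input and output paths. The output path must not already exist, otherwise the job fails with an error:

⑤ After the job runs, an out folder is produced:

The out folder: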

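Because 3 reduce tasks were configured, there are 3 part-r-0000* files. Each file is sorted on its own, but the three files concatenated are not sorted, because the default HashPartitioner assigns years to reducers by hash code rather than by range.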
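The effect can be reproduced without a cluster. The sketch below is a plain-Java simulation, not Hadoop code: HashPartitionSketch and its methods are hypothetical names. It uses the same formula as Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, and relies on the fact that the hash code of an IntWritable is its int value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class HashPartitionSketch {
    /** Same formula as the default HashPartitioner, applied to a plain int key. */
    static int hashPartition(int year, int numPartitions) {
        return (Integer.hashCode(year) & Integer.MAX_VALUE) % numPartitions;
    }

    /** Sorts each partition on its own, then concatenates part 0, part 1, ... in order. */
    static List<Integer> concatenatedOutput(List<Integer> years, int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int year : years) {
            parts.get(hashPartition(year, numPartitions)).add(year);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> years = Arrays.asList(1970, 1971, 1972, 1973, 1974, 1975);
        System.out.println(concatenatedOutput(years, 3));
        // prints [1971, 1974, 1972, 1975, 1970, 1973]
    }
}
```

Each partition is sorted, yet neighbouring years land in different partitions, so the concatenation jumps back and forth.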

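4. Modify the program above so that the concatenated part-r-xxxxx files are still sorted. There are several approaches.

Approach ①: set job.setNumReduceTasks(1);. Only one output file is produced, so the result is totally ordered. However, every record then passes through a single reducer, which is the most extreme form of data skew and removes all parallelism from the reduce phase.

Approach ②: the data always falls within some range, so we can divide that range into 3 partitions by writing a class that extends Partitioner: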

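import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class YearPartitioner extends Partitioner<IntWritable, IntWritable> {

    // parts is the number of reduce tasks (3 here)
    @Override
    public int getPartition(IntWritable year, IntWritable temp, int parts) {
        int y = year.get() - 1970;   // offset from the first year, 1970
        if (y < 33) {
            return 0;
        }
        else if (y < 66) {
            return 1;
        }
        else {
            return 2;
        }
    }
}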

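Then uncomment the line //job.setPartitionerClass(YearPartitioner.class); in the main class above.

Because all the data falls within a 99-year range starting at 1970, the year offsets can be split into three ranges: 0-32, 33-65, and 66-98. Every year in a lower-numbered partition is smaller than every year in a higher-numbered one, so the concatenated output is totally ordered.

However, one range may hold a large amount of data while another holds very little, which again causes data skew.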
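Both properties of approach ② can be checked with a plain-Java sketch. RangePartitionSketch is a hypothetical name and the year list is invented to show a skewed case; the partition rule is the same as in YearPartitioner above.

```java
import java.util.Arrays;

class RangePartitionSketch {
    /** Same rule as YearPartitioner: fixed ranges over the offset from 1970. */
    static int rangePartition(int year) {
        int offset = year - 1970;
        if (offset < 33) {
            return 0;
        }
        if (offset < 66) {
            return 1;
        }
        return 2;
    }

    /** Counts how many records each of the 3 partitions receives. */
    static int[] countPerPartition(int[] years) {
        int[] counts = new int[3];
        for (int year : years) {
            counts[rangePartition(year)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented data: most records fall in the first 33 years.
        int[] years = {1970, 1971, 1972, 1975, 1980, 1999, 2010, 2050};
        System.out.println(Arrays.toString(countPerPartition(years))); // prints [6, 1, 1]
    }
}
```

The partition index never decreases as the year grows, which is what keeps the concatenated files sorted, but the fixed boundaries send 6 of the 8 records to the first reducer.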

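Approach ③: use Hadoop's sampling mechanism. The input is randomly sampled, and partition boundaries are derived from the sample so that the partitions are roughly equal in size. In approach ②, the boundaries are chosen subjectively, so one range may hold most of the data while the others hold little, which causes skew. In approach ③, the boundaries follow the sampled distribution: for example, if the ranges 0-22 and 77-99 hold most of the data, it produces partitions such as 0-22, 23-76, and 77-99. The implementation is as follows: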

package com.it18zhang.hdfs.maxtemp.allsort;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

/**
 * Max-temperature job with a total-order sort based on input sampling.
 */
public class MaxTempApp {
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS","file:///");                 // use the local file system

        Job job = Job.getInstance(conf);

        // Set the job's properties
        job.setJobName("MaxTempApp");                        // job name
        job.setJarByClass(MaxTempApp.class);                 // class used to locate the jar
        // The sampler reads keys straight from the input, so the input must be a
        // SequenceFile whose key type (IntWritable) matches the map output key type.
        // The MaxTempMapper used here must accept that input, unlike the text-based mapper above.
        job.setInputFormatClass(SequenceFileInputFormat.class);

        // Add the input path
        FileInputFormat.addInputPath(job,new Path(args[0]));
        // Set the output path
        FileOutputFormat.setOutputPath(job,new Path(args[1]));

        job.setMapperClass(MaxTempMapper.class);             // mapper class
        job.setReducerClass(MaxTempReducer.class);           // reducer class

        job.setMapOutputKeyClass(IntWritable.class);         // map output key type
        job.setMapOutputValueClass(IntWritable.class);       // map output value type

        job.setOutputKeyClass(IntWritable.class);            // final output key type
        job.setOutputValueClass(IntWritable.class);          // final output value type

        // Create the random sampler
        // freq: probability with which each key is selected
        // numSamples: total number of samples to take
        // maxSplitsSampled: maximum number of input splits to sample
        InputSampler.Sampler<IntWritable, IntWritable> sampler =
                new InputSampler.RandomSampler<IntWritable, IntWritable>(0.5, 3000, 3);

        job.setNumReduceTasks(3);                            // number of reduce tasks

        // Location of the partition file that will hold the sampled boundaries
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),new Path("e:/data/par.lst"));
        // Use the total-order partitioner
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // Sample the input and write the partition file; this must come after the configuration above
        InputSampler.writePartitionFile(job, sampler);

        // Exit with a non-zero status if the job fails
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

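The computed partition boundaries are written to the sequence file par.lst, which can be inspected with hdfs dfs -text file:///e:/data/par.lst

For example:

Note that the sampling code (InputSampler.writePartitionFile) must come last. It reads the job's input format, input paths, number of reduce tasks, and map output key class, so calling it before these are set causes errors.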

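// Set the partition file on the job's own configuration object, not on the earlier conf:
// Job.getInstance(conf) works on a copy, so later changes to conf are not seen by the job.
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),new Path("e:/data/par.lst"));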

With this, the data is totally ordered and data skew is kept in check.
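The idea behind the sampled boundaries can be illustrated in plain Java. This is a simplified sketch, not Hadoop's actual implementation: SampledSplitPointsSketch and its methods are hypothetical names, the sample is invented, and the lookup is a linear scan, which is enough for two boundaries.

```java
import java.util.Arrays;

class SampledSplitPointsSketch {
    /** Picks numPartitions - 1 boundaries at evenly spaced positions of the sorted sample. */
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return splits;
    }

    /** Keys below splits[0] go to partition 0, keys in [splits[0], splits[1]) to partition 1, and so on. */
    static int partitionFor(int key, int[] splits) {
        int p = 0;
        while (p < splits.length && key >= splits[p]) {
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        // Invented sample in which most years cluster near 1970.
        int[] sample = {1970, 1971, 1971, 1972, 1973, 1975, 1980, 1999, 2010};
        int[] splits = splitPoints(sample, 3);
        System.out.println(Arrays.toString(splits)); // prints [1972, 1980]

        int[] counts = new int[3];
        for (int year : sample) {
            counts[partitionFor(year, splits)]++;
        }
        System.out.println(Arrays.toString(counts)); // prints [3, 3, 3]
    }
}
```

With the fixed ranges of approach ②, the same nine years would split 8 / 1 / 0; boundaries taken from the sample follow the data, so each partition receives three records while the partition index still grows with the year.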
