Hadoop 二次排序实现

最新推荐文章于 2021-07-15 10:45:07 发布

it小奋

最新推荐文章于 2021-07-15 10:45:07 发布

阅读量1.3k

点赞数

分类专栏： Hadoop 文章标签： Hadoop 二次排序交通数据处理

本文链接：https://blog.csdn.net/u010820702/article/details/58584270

版权

Hadoop 专栏收录该内容

11 篇文章 0 订阅

订阅专栏

业务场景:通常情况下,在MR操作中到达Reduce中的key值都是按照指定的规则进行排序,在单一key的情况下一切都进行的很自然,直到我们要求数据不再单纯的按key进行排序,以如下数据举例:

Key -> value:

100 -> 2017-02-27 19:21:31,45,67,68

50 -> 2017-02-27 19:22:04,89,90,56

90 -> 2017-02-27 19:22:27,90,89,99

50 -> 2017-02-27 19:20:42,88,45,89

现要求对结果数据进行分组,key值相同的为一组且组内有序

处理方式:

原理:借助MR排序的优势,提供可扩展的二次排序操作

流程

Map ->(复合主键,value) à自定义分区函数àReduce-

实例:

数据目录如下,其中.loz结尾的为真正的数据文件,以其文件名为前缀.writed为后缀的问文件标识当前.loz文件写状态,只有处于.writed状态的loz文件为有效可读文件

数据格式如下:VERSION=1.0,PASSTIME=2016-11-30 00:00:39 000,CARSTATE=1,CARPLATE=无,PLATETYPE=99,SPEED=0,PLATECOLOR=4,LOCATIONID=-1,DEVICEID=-1,DRIVEWAY=2,DRIVEDIR=4,CAPTUREDIR=1,CARCOLOR=10,CARBRAND=99,CARBRANDZW=其它,TGSID=1125,PLATECOORD=0,0,0,0,CABCOORD=0,0,0,0,IMGID1=http://11.110.248.59:9099/image/dhdfs/2016-11-30/archivefile-2016-11-30-000040-00677B0200000001:5750848/308059.jpg,IMGID2=,IMGID3=,

数据量>=2G

要求如下:统计全部数据中每辆车按时间序列经过的卡口信息(卡口字段为元数据中TGSID列) 数据输出格式为,文件名=号牌+辅助字段,内容=过车时间(PASSTIME)-卡口编号(TGSID),…

编程实现:

1. CarOrder.class,自定义组合键,借助MR的KEY排序操作实现Map内按key-time排序操作,要求实现两序列化接口

classCarOrder implements Writable, WritableComparable<CarOrder> {

//号牌

private Text carPlate;

//过车时间

private Text day;

 

public Text getDay() {

           return day;

}

 

public void setDay(Text day) {

           this.day = day;

}

 

public CarOrder() {

           carPlate = new Text();

           day = new Text();

}

 

public CarOrder(Text carPlate, Text day){

           super();

           this.carPlate = carPlate;

           this.day = day;

}

 

public int compareTo(CarOrder co){

           int compareValue = this.carPlate.compareTo(co.carPlate);

           // 相等

           if (compareValue == 0) {

                    compareValue = this.day.compareTo(co.day);

           }

           return compareValue;

}

 

public void write(DataOutput out)throws IOException {

           this.carPlate.write(out);

           this.day.write(out);

}

 

public void readFields(DataInputin) throws IOException {

           this.carPlate.readFields(in);

           this.day.readFields(in);

}

 

public Text getCarPlate() {

           return carPlate;

}

 

public void setCarPlate(TextcarPlate) {

           this.carPlate = carPlate;

}

 

@Override

public String toString() {

           return "CarOrder[carPlate=" + carPlate + ", day=" + day + "]";

}

 

}

2. CarComparator.java,定义分组比较器,决定在MR SHUFFLE过程中对数据分组的依据,要求号牌相同时间不同的为同一组

classCarComparator extends WritableComparator {

public CarComparator() {

           // 指定Key值

           super(CarOrder.class, true);

}

@SuppressWarnings("rawtypes")

@Override

public int compare(WritableComparable a,WritableComparable b) {

           CarOrder car1 = (CarOrder) a;

           CarOrder car2 = (CarOrder) b;

           return car1.getCarPlate().compareTo(car2.getCarPlate());

}

}

3. CarPartitioner.class,自定义分区函数,Mapper操作中间结果分区依据,将数据均匀划分.

classCarPartitioner extends Partitioner<CarOrder, Text> {

@Override

public int getPartition(CarOrder key, Textvalue, int numPartitions) {

           return Math.abs(key.getCarPlate().hashCode())% numPartitions;

}

}

4. CarMap.java Mapper函数,此处做简单的数据解析工作,文件过滤在驱动器节点完成,Mapper操作封装复合键CarOrder输出(CarOrder,”时间-卡口编号”)

classCarMap extends Mapper<LongWritable, Text, CarOrder, Text> {

@Override

protected void map(LongWritable key, Textvalue, Context context) throws IOException, InterruptedException {

           String temp = value.toString();

           if (temp.length() > 13) {

                    temp = temp.substring(12);

                    String[] items =temp.split(",");

                    if (items.length > 10) {

                             // CarPlate As Key

                             if(!items[2].endsWith("无牌")){

                                       try {

                                                CarOrderco = new CarOrder(new Text(items[2].substring(9)), newText(items[0].substring(9)));

                                                //time + tgsid

                                                context.write(co,new Text(items[0].substring(9)+"-"+items[14].substring(6)));

                                       } catch(Exception e) {

                                                e.printStackTrace();

                                       }

                             }

                    }

           }

}

}

5. CarCombine.java ,考虑到Map端数据可能会偏大的情况,使用MR的Combine特性对Map的结果预先进行处理,减少Mapper端输出,降低IO操作提高程序性能.

classCarCombine extends Reducer<CarOrder, Text, CarOrder, Text> {

@Override

protected void reduce(CarOrder co,Iterable<Text> values, Reducer<CarOrder, Text, CarOrder,Text>.Context context)

                    throws IOException,InterruptedException {

           StringBuffer buf = newStringBuffer();

           String before = null;

           String current = null;

           for (Text text : values) {

                    current = text.toString();

                    if (current.equals(before)){

                             continue;

                    }

                    buf.append(current);

                    buf.append(',');

                    before = current;

           }

           if(buf.length()==0){

                    return ;

           }

           context.write(co, newText(buf.toString()));

}

}

6. OutputByCarPlate.java ,Reduce操作,由于要求结果按好配进行分类,经测试发现,在MR中使用大量的文件输出并不是好事,举例来说如果有50万车辆则可能需要50万个文件来存储,性能开销极大容易造成MR假死,甚至内存溢出! 未避免此问题,此处仍然采用多文件输出只不过降级为:按所有数据以天为文件划分依据进行存储,单个车辆存储一行。

classOutputByCarPlate extends Reducer<CarOrder, Text, NullWritable, Text> {

 

MultipleOutputs<NullWritable, Text> mo;

@Override

protected void setup(Reducer<CarOrder, Text,NullWritable, Text>.Context context)

                    throws IOException,InterruptedException {

           mo = newMultipleOutputs<NullWritable, Text>(context);

}

 

@Override

protected void reduce(CarOrder key,Iterable<Text> values, Context context)

                    throws IOException,InterruptedException {

           StringBuffer buf = newStringBuffer();

           for (Text text : values) {

                    buf.append(text.toString());

                    buf.append(',');

           }

           String value = buf.toString();

           String[] flows =value.split(",");

           if (flows.length >= 3) {

                    String prefix =key.getDay().toString().replaceAll("[-\\s:]", "");

                    mo.write(NullWritable.get(),new Text(value.substring(0, value.length() - 1)), prefix.substring(0, 8));

           }

}

@Override

protected void cleanup(Reducer<CarOrder,Text, NullWritable, Text>.Context context)

                    throws IOException,InterruptedException {

           mo.close();

}

}

7. Main.java ,驱动器节点,组织MR作业,预处理文件,由于小文件数量较多此处采用CombineTextInputFormat输入文本文件(因源数据使用LOZ压缩算法,经测试此处需要明确标识输入块的大小否则文件合并无效,即: CombineTextInputFormat.setMaxInputSplitSize(job, 67108864);),MR提供的TextInputFormat.setInputPathFilter有局限,只能过滤已经被识别的子目录文件,不支持动态修改,此处使用在Mapper之外过滤以CombineTextInputFormat.addInputPath(job, temp);方式追加输入,以实现按特定要求去输入文件.

public class Main{

public static void main(String[] args) throwsException {

           Configuration conf = newConfiguration();

           conf.set("mapreduce.reduce.memory.mb","4096");

           Job job = Job.getInstance(conf,"TRACK_BY_TIME_TGSID");

           // 小文件合并

           job.setInputFormatClass(CombineTextInputFormat.class);

           job.setJarByClass(cn.com.zjf.MR_04.Car1.class);

           job.setOutputKeyClass(Text.class);

           job.setOutputValueClass(Text.class);

           job.setMapOutputKeyClass(CarOrder.class);

           job.setMapOutputValueClass(Text.class);

           job.setMapperClass(CarMap.class);

           job.setReducerClass(OutputByCarPlate.class);

           // 组合键的排序规则

           //job.setSortComparatorClass(ResultComparator.class);

           // 自定义分区

           job.setPartitionerClass(CarPartitioner.class);

           // 自定义比较器-->二次排序的依据

           job.setGroupingComparatorClass(CarComparator.class);

           FileSystem fs = FileSystem.get(conf);

           Path input = new Path(args[0]);

           /**

            * 预处理文件 .只读取写完毕的文件 .writed结尾 .只读取文件大小大于0的文件

            */

           {

                    FileStatus childs[] =fs.globStatus(input, new PathFilter() {

                             public booleanaccept(Path path) {

                                       if(path.toString().endsWith(".writed")) {

                                                returntrue;

                                       }

                                       returnfalse;

                             }

                    });

                    Path temp = null;

                    for (FileStatus file :childs) {

                             temp = newPath(file.getPath().toString().replaceAll(".writed", ""));

                             if (fs.listStatus(temp)[0].getLen()> 0) {

                                       CombineTextInputFormat.addInputPath(job,temp);

                             }

                    }

           }

           CombineTextInputFormat.setMaxInputSplitSize(job,67108864);

 

           Path output = new Path(args[1]);

           if (fs.exists(output)) {

                    fs.delete(output, true);

           }

           FileOutputFormat.setOutputPath(job,output);

           if (!job.waitForCompletion(true))

                    return;

}

}