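1. Problem statement
The statistics must be written to different files (partitions) according to some condition. For example: write the results to different files (partitions) according to the province each phone number belongs to.
2. The default Partitioner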
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
The default partition is the key's hashCode modulo the number of ReduceTasks. The user has no control over which key ends up in which partition.
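To make the arithmetic concrete, here is a minimal plain-Java sketch of the same expression. The class name HashPartitionSketch and the helper partitionFor are hypothetical names for this illustration, and nothing here depends on Hadoop. It also shows why the mask with Integer.MAX_VALUE is there: without it, a negative hash code would give a negative remainder.

```java
class HashPartitionSketch {
    // Same expression as in the HashPartitioner above:
    // clear the sign bit, then take the remainder by the number of ReduceTasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 5;

        // Phone numbers taken from the sample data; the resulting partition
        // depends only on String.hashCode(), not on the prefix.
        for (String phone : new String[] {"13736230513", "13846544121", "13956435636"}) {
            System.out.println(phone + " -> partition " + partitionFor(phone, numReduceTasks));
        }

        // An Integer's hash code is its own value, so -7 has a negative hash code.
        Integer negativeKey = -7;
        System.out.println("without mask: " + (negativeKey.hashCode() % numReduceTasks));
        System.out.println("with mask:    " + partitionFor(negativeKey, numReduceTasks));
    }
}
```

Because the result depends only on the hash code, two numbers with the same prefix can land in different partitions, which is why the requirement below needs a custom Partitioner.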
3. Steps for a custom Partitioner
(1) Define a class that extends Partitioner and override getPartition()
public class CustomPartitioner extends Partitioner<Text, FlowBean> {
    @Override
    public int getPartition(Text key, FlowBean value, int numPartitions) {
        int partition = 0; // placeholder; the partitioning logic decides the real value
        // partitioning logic goes here
        return partition;
    }
}
(2) In the Job driver, register the custom Partitioner
job.setPartitionerClass(CustomPartitioner.class);
(3) After defining a custom Partitioner, set the number of ReduceTasks to match its logic
job.setNumReduceTasks(5);
4. Requirement
(1) Write the statistics to different files (partitions) according to the province each phone number belongs to
Data:
1 13736230513 192.196.100.1 www.atguigu.com 2481 24681 200
2 13846544121 192.196.100.2 264 0 200
3 13956435636 192.196.100.3 132 1512 200
4 13966251146 192.168.100.1 240 0 404
5 18271575951 192.168.100.2 www.atguigu.com 1527 2106 200
6 84188413 192.168.100.3 www.atguigu.com 4116 1432 200
7 13590439668 192.168.100.4 1116 954 200
8 15910133277 192.168.100.5 www.hao123.com 3156 2936 200
9 13729199489 192.168.100.6 240 0 200
10 13630577991 192.168.100.7 www.shouhu.com 6960 690 200
11 15043685818 192.168.100.8 www.baidu.com 3659 3538 200
12 15959002129 192.168.100.9 www.atguigu.com 1938 180 500
13 13560439638 192.168.100.10 918 4938 200
14 13470253144 192.168.100.11 180 180 200
15 13682846555 192.168.100.12 www.qq.com 1938 2910 200
16 13992314666 192.168.100.13 www.gaga.com 3008 3720 200
17 13509468723 192.168.100.14 www.qinghua.com 7335 110349 404
18 18390173782 192.168.100.15 www.sogou.com 9531 2412 200
19 13975057813 192.168.100.16 www.baidu.com 11058 48243 200
20 13768778790 192.168.100.17 120 120 200
21 13568436656 192.168.100.18 www.alibaba.com 2481 24681 200
22 13568436656 192.168.100.19 1116 954 200
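(2) Expected output
Phone numbers starting with 136, 137, 138, and 139 each go to their own file (4 files in total), and numbers with any other prefix go to one shared file.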
Write the bean class
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class FlowBean implements Writable {
    private long upFlow;   // upstream traffic
    private long downFlow; // downstream traffic
    private long sumFlow;  // total traffic

    // Deserialization calls the no-arg constructor through reflection, so it is required
    public FlowBean() { }

    public FlowBean(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    // Serialization
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    // Deserialization: fields must be read in the same order they were written
    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    // One tab-separated line per record in the output file
    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    // Sets both flows and recomputes the total in one call
    public void set(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }
}
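The rule that readFields must read in the same order that write wrote can be checked without a cluster. The sketch below is only an illustration: FlowRecord is a hypothetical stand-in that copies FlowBean's three fields and its write/readFields bodies but does not implement Hadoop's Writable, so that it runs on a plain JDK. It serializes a record into a byte array and reads it back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class FlowRoundTripSketch {
    // Hypothetical stand-in for FlowBean (assumption: same fields, same field order).
    static final class FlowRecord {
        long upFlow;
        long downFlow;
        long sumFlow;

        void write(DataOutput out) throws IOException {
            out.writeLong(upFlow);
            out.writeLong(downFlow);
            out.writeLong(sumFlow);
        }

        // Reads the three longs back in exactly the order write() emitted them.
        void readFields(DataInput in) throws IOException {
            upFlow = in.readLong();
            downFlow = in.readLong();
            sumFlow = in.readLong();
        }
    }

    // Serializes a record to bytes and deserializes it into a fresh instance.
    static FlowRecord roundTrip(long upFlow, long downFlow) {
        FlowRecord original = new FlowRecord();
        original.upFlow = upFlow;
        original.downFlow = downFlow;
        original.sumFlow = upFlow + downFlow;
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));

            FlowRecord copy = new FlowRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FlowRecord copy = roundTrip(2481, 24681);
        System.out.println(copy.upFlow + "\t" + copy.downFlow + "\t" + copy.sumFlow);
    }
}
```

If readFields read downFlow before upFlow, the copy would silently come back with the two values swapped, since the byte stream carries no field names.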
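Write the mapper class
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    // Reused across map() calls to avoid allocating new objects per record
    Text k = new Text();
    FlowBean v = new FlowBean();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Read the line and split it on tabs
        String[] splits = value.toString().split("\t");
        // The phone number is the second field
        k.set(splits[1]);
        // Index from the end of the array, because the URL column is missing on some lines:
        // upstream traffic is third from last, downstream traffic is second from last
        long upFlow = Long.parseLong(splits[splits.length - 3]);
        long downFlow = Long.parseLong(splits[splits.length - 2]);
        // set() fills in upFlow, downFlow and sumFlow
        v.set(upFlow, downFlow);
        // Emit (phone number, FlowBean)
        context.write(k, v);
    }
}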
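The mapper counts fields from the end of the line because some rows in the sample data have no URL column. A small plain-Java sketch of that parsing step follows; FlowLineParserSketch and Parsed are hypothetical names, and the sketch assumes the fields are separated by tabs, as in the mapper.

```java
class FlowLineParserSketch {
    // Hypothetical holder for the three values the mapper extracts from a line.
    static final class Parsed {
        final String phone;
        final long upFlow;
        final long downFlow;

        Parsed(String phone, long upFlow, long downFlow) {
            this.phone = phone;
            this.upFlow = upFlow;
            this.downFlow = downFlow;
        }
    }

    // Same indexing as FlowCountMapper: the phone number is the second field,
    // the flows are the third-from-last and second-from-last fields.
    static Parsed parse(String line) {
        String[] fields = line.split("\t");
        return new Parsed(
                fields[1],
                Long.parseLong(fields[fields.length - 3]),
                Long.parseLong(fields[fields.length - 2]));
    }

    public static void main(String[] args) {
        // Row 1 of the sample data has a URL column, row 2 does not.
        String withUrl = "1\t13736230513\t192.196.100.1\twww.atguigu.com\t2481\t24681\t200";
        String withoutUrl = "2\t13846544121\t192.196.100.2\t264\t0\t200";
        for (String line : new String[] {withUrl, withoutUrl}) {
            Parsed p = parse(line);
            System.out.println(p.phone + "\t" + p.upFlow + "\t" + p.downFlow);
        }
    }
}
```

Indexing from the front (fields[4] and fields[5]) would read the wrong columns on the rows without a URL, which is what the end-relative indexes avoid.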
Write the reducer class
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowCountReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
    FlowBean v = new FlowBean();

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
        // Accumulators for this phone number
        long sumUpFlow = 0;
        long sumDownFlow = 0;
        // Iterate over all beans and add up the upstream and downstream traffic separately
        for (FlowBean value : values) {
            sumUpFlow += value.getUpFlow();
            sumDownFlow += value.getDownFlow();
        }
        // Fill the output bean; set() also recomputes the total
        v.set(sumUpFlow, sumDownFlow);
        // Emit (phone number, totals)
        context.write(key, v);
    }
}
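As a worked example of what the reducer computes, the phone number 13568436656 appears twice in the sample data (rows 21 and 22). The plain-Java sketch below adds those two rows up the same way; FlowSumSketch and its add helper are hypothetical names, and a Map stands in for the grouping by key that the framework performs before reduce() is called.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FlowSumSketch {
    // Adds one (upFlow, downFlow) pair to the running totals of a phone number.
    // Each map value holds {upFlow, downFlow, sumFlow}.
    static void add(Map<String, long[]> totals, String phone, long upFlow, long downFlow) {
        long[] t = totals.computeIfAbsent(phone, k -> new long[3]);
        t[0] += upFlow;
        t[1] += downFlow;
        t[2] = t[0] + t[1];
    }

    public static void main(String[] args) {
        Map<String, long[]> totals = new LinkedHashMap<>();
        // Rows 21 and 22 of the sample data share the same phone number.
        add(totals, "13568436656", 2481, 24681);
        add(totals, "13568436656", 1116, 954);

        for (Map.Entry<String, long[]> e : totals.entrySet()) {
            long[] t = e.getValue();
            System.out.println(e.getKey() + "\t" + t[0] + "\t" + t[1] + "\t" + t[2]);
        }
    }
}
```

For these two rows the totals are 3597 upstream, 25635 downstream, and 29232 overall.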
Write the partitioner class
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartitioner extends Partitioner<Text, FlowBean> {
    @Override
    public int getPartition(Text key, FlowBean value, int numPartitions) {
        // key is the phone number, value is its FlowBean
        String prefix = key.toString().substring(0, 3);
        // Five partitions in total: 0-3 for the four known prefixes, 4 for everything else
        int partition = 4;
        if ("136".equals(prefix)) {
            partition = 0;
        } else if ("137".equals(prefix)) {
            partition = 1;
        } else if ("138".equals(prefix)) {
            partition = 2;
        } else if ("139".equals(prefix)) {
            partition = 3;
        }
        return partition;
    }
}
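The prefix rule can be tried on a few numbers from the sample data without running a job. This is a minimal sketch, not the partitioner itself: PrefixPartitionSketch and partitionFor are hypothetical names, the method takes a plain String instead of Text, and it uses startsWith so that a key shorter than three characters falls into the catch-all partition instead of raising an exception from substring.

```java
class PrefixPartitionSketch {
    // Mirrors ProvincePartitioner's mapping: 136 -> 0, 137 -> 1, 138 -> 2, 139 -> 3, anything else -> 4.
    static int partitionFor(String phone) {
        if (phone.startsWith("136")) {
            return 0;
        } else if (phone.startsWith("137")) {
            return 1;
        } else if (phone.startsWith("138")) {
            return 2;
        } else if (phone.startsWith("139")) {
            return 3;
        }
        return 4;
    }

    public static void main(String[] args) {
        // Phone numbers taken from the sample data.
        String[] phones = {"13630577991", "13736230513", "13846544121", "13956435636", "18271575951", "84188413"};
        for (String phone : phones) {
            System.out.println(phone + " -> part-r-0000" + partitionFor(phone));
        }
    }
}
```

With five ReduceTasks, partition n is written to the output file part-r-0000n, so the four prefixes and the catch-all group each get their own file.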
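Write the Driver class
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowsumDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Create the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Locate the jar through the driver class
        job.setJarByClass(FlowsumDriver.class);
        // Set the Mapper and Reducer classes
        job.setMapperClass(FlowCountMapper.class);
        job.setReducerClass(FlowCountReducer.class);
        // Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        // Set the partitioner class and a matching number of ReduceTasks (partitions 0-4)
        job.setPartitionerClass(ProvincePartitioner.class);
        job.setNumReduceTasks(5);
        // Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        // Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}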
Finally, run the job locally or on a cluster and check the output files.
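5. Partitioning summary
(1) If the number of ReduceTasks > the number of partitions produced by getPartition, a few extra empty output files (part-r-000xx) are created.
(2) If 1 < the number of ReduceTasks < the number of partitions produced by getPartition, some partitions have nowhere to go and the job fails with an Exception.
(3) If the number of ReduceTasks = 1, then no matter how many partitions the MapTask side produces, everything is handed to this single ReduceTask, and only one result file, part-r-00000, is created.
(4) Partition numbers must start at zero and increase by one.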
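6. Case analysis
For example, suppose the custom Partitioner produces 5 partitions. Then:
(1) job.setNumReduceTasks(1);
The job runs normally, but produces only one output file.
(2) job.setNumReduceTasks(2);
The job fails with an error.
(3) job.setNumReduceTasks(6);
This is greater than 5, so the job runs normally but produces an empty file.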
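The three cases can be summarized in a small decision function. This is only a model of the rules stated in the summary, not Hadoop's own code: ReduceTaskCountSketch and outcome are hypothetical names, the returned strings are made up for the illustration, and the sketch assumes every partition number from 0 to numPartitions - 1 actually receives data.

```java
class ReduceTaskCountSketch {
    // Describes what the rules in the summary predict for a given combination of
    // partition count (from the custom Partitioner) and ReduceTask count.
    static String outcome(int numPartitions, int numReduceTasks) {
        if (numPartitions < 1 || numReduceTasks < 1) {
            // Values below 1 are outside the scope of this sketch.
            throw new IllegalArgumentException("both counts must be at least 1");
        }
        if (numReduceTasks == 1) {
            // Rule (3): everything goes to the single ReduceTask.
            return "ok: 1 output file";
        }
        if (numReduceTasks < numPartitions) {
            // Rule (2): some partition numbers have no ReduceTask to receive them.
            return "error: illegal partition";
        }
        // Rule (1): one file per ReduceTask, the surplus ones stay empty.
        int emptyFiles = numReduceTasks - numPartitions;
        return "ok: " + numReduceTasks + " output files, " + emptyFiles + " empty";
    }

    public static void main(String[] args) {
        int numPartitions = 5;
        for (int numReduceTasks : new int[] {1, 2, 5, 6}) {
            System.out.println("setNumReduceTasks(" + numReduceTasks + ") -> "
                    + outcome(numPartitions, numReduceTasks));
        }
    }
}
```

Setting the ReduceTask count equal to the partition count, as the driver does with 5, is the only choice that yields one non-empty file per partition.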