Partitioning in the shuffle phase:
MapReduce provides an abstract class called Partitioner, and the default implementation is HashPartitioner; its source code shows the partitioning logic.
From the source, the partition formula is (key.hashCode() & 2147483647) % numReduceTasks, i.e. the remainder after dividing by numReduceTasks (the bitwise AND with 2147483647, Integer.MAX_VALUE, clears the sign bit so the result is never negative).
If numReduceTasks = 4, then (key.hashCode() & 2147483647) % numReduceTasks can only evaluate to 0, 1, 2 or 3, so there are 4 partitions. In other words, the number of partitions matches the number of reduce tasks, which is exactly what the purpose of a partition suggests.
Because the result depends on key.hashCode(), users cannot control which keys land in which partition. We can, however, define our own partitioning logic.
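For reference, the default partitioner is essentially just the following (a sketch of org.apache.hadoop.mapreduce.lib.partition.HashPartitioner; details may differ slightly between Hadoop versions):

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    // Mask off the sign bit, then take the remainder modulo the number of reduce tasks.
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}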
Requirement: based on the mobile phone traffic data, write the records for different phone numbers into different files:
phone numbers starting with 135 go into one file,
phone numbers starting with 136 go into one file,
phone numbers starting with 137 go into one file,
phone numbers starting with 138 go into one file,
phone numbers starting with 139 go into one file,
phone numbers with any other prefix go into one file.
Splitting phone numbers into different files is exactly what partitioning does: the records are routed to different reduce tasks, and each reduce task writes its own part-r-* output file. Partitioning is therefore equivalent to writing the results to different part-r-* files.
So we define our own partitioner (a partition class) that returns a different partition number depending on the phone number prefix.
To use it, the partitioner must also be registered on the Job in the driver.
And because the number of partitions matches the number of reduce tasks, the driver has to set the number of reduce tasks accordingly (six here; if fewer reduce tasks than partition numbers are configured, the job typically fails with an "Illegal partition" error).
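In the driver (FlowMain, shown further below), these two settings amount to roughly the following calls; the value 6 follows from the six partitions in this example:

job.setPartitionerClass(PartitionOwn.class);
job.setNumReduceTasks(6);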
The custom partitioner:

package com.jimmy.day05;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PartitionOwn extends Partitioner<Text, FlowBean> {
    @Override
    public int getPartition(Text text, FlowBean flowBean, int numPartitions) {
        String phoneNum = text.toString();
        // Route each phone number to a partition by prefix; null or empty
        // keys fall into the same partition as the "other" prefixes.
        if (null != phoneNum && !phoneNum.equals("")) {
            if (phoneNum.startsWith("135")) {
                return 0;
            } else if (phoneNum.startsWith("136")) {
                return 1;
            } else if (phoneNum.startsWith("137")) {
                return 2;
            } else if (phoneNum.startsWith("138")) {
                return 3;
            } else if (phoneNum.startsWith("139")) {
                return 4;
            } else {
                return 5;
            }
        } else {
            return 5;
        }
    }
}
The FlowBean used as the map output value, a custom Writable:

package com.jimmy.day05;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlowBean implements Writable {
    private Integer upFlow;
    private Integer downFlow;
    private Integer upCountFlow;
    private Integer downCountFlow;

    /**
     * Serialization method: write the four counters to the output stream.
     * @param out
     * @throws IOException
     */
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(upFlow);
        out.writeInt(downFlow);
        out.writeInt(upCountFlow);
        out.writeInt(downCountFlow);
    }

    /**
     * Deserialization method: read the counters back in the same order they were written.
     * @param in
     * @throws IOException
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        this.upFlow = in.readInt();
        this.downFlow = in.readInt();
        this.upCountFlow = in.readInt();
        this.downCountFlow = in.readInt();
    }

    public Integer getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(Integer upFlow) {
        this.upFlow = upFlow;
    }

    public Integer getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(Integer downFlow) {
        this.downFlow = downFlow;
    }

    public Integer getUpCountFlow() {
        return upCountFlow;
    }

    public void setUpCountFlow(Integer upCountFlow) {
        this.upCountFlow = upCountFlow;
    }

    public Integer getDownCountFlow() {
        return downCountFlow;
    }

    public void setDownCountFlow(Integer downCountFlow) {
        this.downCountFlow = downCountFlow;
    }

    @Override
    public String toString() {
        return "FlowBean{" +
                "upFlow=" + upFlow +
                ", downFlow=" + downFlow +
                ", upCountFlow=" + upCountFlow +
                ", downCountFlow=" + downCountFlow +
                '}';
    }
}
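Note that readFields must read the fields in exactly the same order in which write wrote them; if the order differs, the values are silently mixed up after deserialization.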
The mapper: extract the phone number and the four flow counters from each tab-separated input line, and emit the phone number as the key and a FlowBean as the value:

package com.jimmy.day05;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    private FlowBean flowBean;
    private Text text;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Create the output key/value objects once and reuse them for every record.
        flowBean = new FlowBean();
        text = new Text();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Fields are tab-separated: index 1 is the phone number, indexes 6-9 are the flow counters.
        String[] split = value.toString().split("\t");
        String phoneNum = split[1];
        String upFlow = split[6];
        String downFlow = split[7];
        String upCountFlow = split[8];
        String downCountFlow = split[9];
        text.set(phoneNum);
        flowBean.setUpFlow(Integer.parseInt(upFlow));
        flowBean.setDownFlow(Integer.parseInt(downFlow));
        flowBean.setUpCountFlow(Integer.parseInt(upCountFlow));
        flowBean.setDownCountFlow(Integer.parseInt(downCountFlow));
        context.write(text, flowBean);
    }
}
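Reusing a single Text and FlowBean instance across map() calls is safe because context.write serializes the key and value immediately; it simply avoids allocating two new objects for every input record.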
The reducer: sum the four counters for every phone number and write the totals as tab-separated text:

package com.jimmy.day05;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class FlowReducer extends Reducer<Text, FlowBean, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
        int upFlow = 0;
        int downFlow = 0;
        int upCountFlow = 0;
        int downCountFlow = 0;
        // Accumulate all records for this phone number.
        for (FlowBean value : values) {
            upFlow += value.getUpFlow();
            downFlow += value.getDownFlow();
            upCountFlow += value.getUpCountFlow();
            downCountFlow += value.getDownCountFlow();
        }
        context.write(key, new Text(upFlow + "\t" + downFlow + "\t" + upCountFlow + "\t" + downCountFlow));
    }
}
The driver:

package com.jimmy.day05;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FlowMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Get the Job instance