Hadoop Serialization
- Serialization overview
- What serialization and deserialization are:
① Serialization converts an in-memory object into a byte sequence (or another data-transfer format) so it can be stored on disk (persisted) or transmitted over the network
② Deserialization converts a received byte sequence (or other data-transfer format), or data persisted on disk, back into an in-memory object
- Why not use Java serialization:
Java serialization is a heavyweight framework; a serialized object carries a lot of extra baggage (checksum information, the inheritance hierarchy, and so on), which makes it inefficient for network transfer
- Characteristics of Hadoop serialization:
① Compact: uses storage space efficiently
② Fast: little extra overhead when reading and writing data
③ Extensible: can evolve as the communication protocol is upgraded
④ Interoperable: supports interaction across multiple languages
- Common data serialization types:
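The stock mappings between Java types and Hadoop Writable types are:
boolean → BooleanWritable; byte → ByteWritable; int → IntWritable; long → LongWritable; float → FloatWritable; double → DoubleWritable; String → Text; map → MapWritable; array → ArrayWritable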
- Making a custom bean serializable (the Writable interface)
- Implement the Writable interface
- Deserialization instantiates the bean via reflection, which calls the no-arg constructor, so a no-arg constructor must be provided
- Override the serialization method write()
- Override the deserialization method readFields() (fields must be read in exactly the same order they were written)
- To make the results readable in the output file, override toString(); separating fields with "\t" makes later processing easier
- If the custom bean is transferred as a key, it must also implement Comparable (in practice, WritableComparable), because the MapReduce shuffle requires keys to be sortable; a sketch follows this list
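A minimal sketch of a key-capable bean, assuming a descending sort by total traffic (the ordering here is an illustrative assumption, not from the original notes):
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

//Sketch only: same fields as the full FlowBean below, but implementing
//WritableComparable (Writable + Comparable) instead of plain Writable
public class FlowBean implements WritableComparable<FlowBean> {
    private long upFlow;
    private long downFlow;
    private long sumFlow;

    public FlowBean() {
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }

    //the shuffle sorts keys with compareTo; descending-by-total is an assumed example
    public int compareTo(FlowBean o) {
        return Long.compare(o.sumFlow, this.sumFlow);
    }
}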
- Serialization case study:
- Input data format: tab-separated lines; the phone number is the second field, and the upstream and downstream traffic are the third- and second-to-last fields (see the Mapper below)
- Output data format: phone number, then upstream traffic, downstream traffic, and total traffic, tab-separated (the bean's toString)
- Case analysis:
- Map phase:
① Read one line of data and split it into fields
② Extract the phone number, the upstream traffic, and the downstream traffic
③ Emit the phone number as the key and the bean as the value (context.write(phone, bean))
④ For the bean to be transferred, it must implement the serialization interface
- Reduce phase:
Accumulate the upstream and downstream traffic to get the total traffic
- Code implementation:
- The flow-statistics bean (FlowBean)
import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

//Implements Writable so Hadoop can serialize the bean
public class FlowBean implements Writable {

    private long upFlow;
    private long downFlow;
    private long sumFlow;

    //No-arg constructor, required because deserialization instantiates the bean via reflection
    public FlowBean() {
    }

    public FlowBean(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = downFlow + upFlow;
    }

    //Serialization: write the fields to the output stream
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(downFlow);
        dataOutput.writeLong(upFlow);
        dataOutput.writeLong(sumFlow);
    }

    //Deserialization: read the fields back
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        //must read in the same order the fields were written
        this.downFlow = dataInput.readLong();
        this.upFlow = dataInput.readLong();
        this.sumFlow = dataInput.readLong();
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    //Convenience setter for object reuse in the Mapper; also recomputes the total
    public void set(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }
}
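A quick way to sanity-check that write() and readFields() agree is a local round trip through Java's data streams (hypothetical test code, not part of the original notes):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FlowBeanRoundTrip {
    public static void main(String[] args) throws IOException {
        //serialize a bean into an in-memory byte buffer
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        FlowBean in = new FlowBean(100L, 200L);
        in.write(new DataOutputStream(buffer));
        //deserialize it back and verify the fields survived
        FlowBean out = new FlowBean();
        out.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(out); //expected: 100	200	300
    }
}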
- Mapper class
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    //reuse the key and value objects across map() calls to avoid object churn
    private final Text k = new Text();
    private final FlowBean v = new FlowBean();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //1. read one line and split it into fields
        String line = value.toString();
        String[] res = line.split("\t");
        //2. extract the phone number and the upstream/downstream traffic
        String phoneNum = res[1];
        long upFlow = Long.parseLong(res[res.length - 3]);
        long downFlow = Long.parseLong(res[res.length - 2]);
        //3. emit the phone number as the key and the bean as the value
        k.set(phoneNum);
        v.set(upFlow, downFlow);
        context.write(k, v);
    }
}
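For example, given a hypothetical tab-separated input line such as 1, 13736230513, 192.196.100.1, 2481, 24681, 200 (field layout assumed from the parsing above), the Mapper would emit the key 13736230513 with upFlow 2481 and downFlow 24681.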
- Reducer class
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class FlowCountReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
        long sum_upFlow = 0;
        long sum_downFlow = 0;
        //iterate over all beans and accumulate the upstream and downstream traffic
        //(one phone number may appear in multiple records)
        for (FlowBean value : values) {
            sum_upFlow += value.getUpFlow();
            sum_downFlow += value.getDownFlow();
        }
        //wrap the totals in a new bean and emit it
        FlowBean flowBean = new FlowBean(sum_upFlow, sum_downFlow);
        context.write(key, flowBean);
    }
}
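One detail worth knowing: Hadoop reuses the same FlowBean instance across iterations of the values loop, which is why the totals are accumulated into plain longs rather than by keeping references to the beans themselves.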
- Driver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowCountDriver {

    public static void main(String[] args) throws Exception {
        String s1 = "E:\\input\\input2";
        String s2 = "E:\\output\\output1";
        //get the configuration and the job object
        Configuration con = new Configuration();
        Job job = Job.getInstance(con);
        //specify where to find the jar
        job.setJarByClass(FlowCountDriver.class);
        //set the Mapper and Reducer classes
        job.setMapperClass(FlowCountMapper.class);
        job.setReducerClass(FlowCountReducer.class);
        //set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        //set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        //specify the job's input path and final output path
        FileInputFormat.setInputPaths(job, new Path(s1));
        FileOutputFormat.setOutputPath(job, new Path(s2));
        //submit and wait for completion
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
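Note that FileOutputFormat refuses to start the job if the output directory already exists, so E:\output\output1 must be deleted (or a fresh path used) before rerunning.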
The result is one line per phone number: the number followed by the tab-separated upstream, downstream, and total traffic.