Serialization plays two roles in a distributed environment: interprocess communication and permanent storage.
A custom data type must implement the Writable interface before Hadoop can serialize it.
Any key or value type in the Hadoop Map-Reduce framework implements this interface.
The source of the Writable interface is shown below:
public interface Writable {
    /**
     * Serialize the fields of this object to <code>out</code>.
     *
     * @param out <code>DataOutput</code> to serialize this object into.
     * @throws IOException
     */
    void write(DataOutput out) throws IOException;

    /**
     * Deserialize the fields of this object from <code>in</code>.
     *
     * <p>For efficiency, implementations should attempt to re-use storage in the
     * existing object where possible.</p>
     *
     * @param in <code>DataInput</code> to deserialize this object from.
     * @throws IOException
     */
    void readFields(DataInput in) throws IOException;
}
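Values only need Writable, as in this example; a type used as a map-output key must instead implement WritableComparable, because the framework sorts intermediate data by key. A minimal sketch of what that adds (the class EmployeeId and its ordering are illustrative, not part of the example below):
package com.demo;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
// Illustrative key type: WritableComparable = Writable + compareTo().
public class EmployeeId implements WritableComparable<EmployeeId> {
    private long id;
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
    }
    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
    }
    @Override
    public int compareTo(EmployeeId other) {
        // Defines the sort order of the shuffle: ascending by id.
        return Long.compare(id, other.id);
    }
}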
The example below uses a custom data type as the value.
1. Requirement
Suppose we have the payroll records below and want, for each employee, the totals of base salary, position salary, performance salary, and post allowance.
Record layout: date, department, name, position, base salary, position salary, performance salary, post allowance, overtime pay, bonus, travel allowance, meal allowance
[hadoop@hadoop1 ~]$ hdfs dfs -cat /salarysummary/input/salarybill.txt
2015-01-01,研发部,张三,软件工程师,2800,9000,3200,1000,0,0,0,0
2015-02-01,研发部,张三,软件工程师,2810,9000,3200,1000,0,0,0,0
2015-03-01,研发部,张三,软件工程师,2820,9000,3200,1000,0,0,0,0
2015-04-01,研发部,张三,软件工程师,2830,9000,3200,1000,0,0,0,0
2015-05-01,研发部,张三,软件工程师,2840,9000,3200,1000,0,0,0,0
2015-01-01,研发部,李四,软件工程师,2800,9010,3200,1000,0,0,0,0
2015-02-01,研发部,李四,软件工程师,2800,9020,3200,1000,0,0,0,0
2015-03-01,研发部,李四,软件工程师,2800,9030,3200,1000,0,0,0,0
2015-04-01,研发部,李四,软件工程师,2800,9040,3200,1000,0,0,0,0
2015-05-01,研发部,李四,软件工程师,2800,9050,3200,1000,0,0,0,0
2015-01-01,研发部,王五,软件工程师,2800,9000,3210,1000,0,0,0,0
2015-02-01,研发部,王五,软件工程师,2800,9000,3220,1000,0,0,0,0
2015-03-01,研发部,王五,软件工程师,2800,9000,3230,1000,0,0,0,0
2015-04-01,研发部,王五,软件工程师,2800,9000,3240,1000,0,0,0,0
2015-05-01,研发部,王五,软件工程师,2800,9000,3250,1000,0,0,0,0
2. Analysis
1. Define a class SalaryBillDetail with fields for base salary, position salary, performance salary, and post allowance.
2. Have SalaryBillDetail implement the Writable interface so that it can be emitted as a value.
3. map input key-value pair: (line offset, one payroll record)
4. map output key-value pair: (name, SalaryBillDetail object)
5. reduce input key-value pair: (name, [SalaryBillDetail object, …])
6. reduce output key-value pair: (name, SalaryBillDetail object)
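To make the flow concrete: 张三's five records map to five (张三, SalaryBillDetail) pairs; the framework groups them into (张三, [five SalaryBillDetail objects]) for a single reduce call, which emits (张三, 14100 45000 16000 5000).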
3. Implementation
SalaryBillDetail.java:
package com.demo;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class SalaryBillDetail implements Writable {

    // Base salary, position salary, performance salary, and post allowance.
    private long jbgz, zwgz, jxgz, gwjt;

    // Hadoop instantiates Writables reflectively during deserialization,
    // so this no-arg constructor is required; omitting it causes a runtime error.
    public SalaryBillDetail() {
    }

    public SalaryBillDetail(long jbgz, long zwgz, long jxgz, long gwjt) {
        this.jbgz = jbgz;
        this.zwgz = zwgz;
        this.jxgz = jxgz;
        this.gwjt = gwjt;
    }

    @Override
    public void write(DataOutput out) throws IOException { // serialization
        out.writeLong(jbgz);
        out.writeLong(zwgz);
        out.writeLong(jxgz);
        out.writeLong(gwjt);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        // Fields must be read in the same order they were written.
        this.jbgz = in.readLong();
        this.zwgz = in.readLong();
        this.jxgz = in.readLong();
        this.gwjt = in.readLong();
    }

    @Override
    public String toString() {
        // TextOutputFormat writes values via toString(), so this determines
        // the on-disk format of the reduce output.
        return this.jbgz + " " + this.zwgz + " " + this.jxgz + " " + this.gwjt;
    }

    public long getJbgz() {
        return jbgz;
    }

    public void setJbgz(long jbgz) {
        this.jbgz = jbgz;
    }

    public long getZwgz() {
        return zwgz;
    }

    public void setZwgz(long zwgz) {
        this.zwgz = zwgz;
    }

    public long getJxgz() {
        return jxgz;
    }

    public void setJxgz(long jxgz) {
        this.jxgz = jxgz;
    }

    public long getGwjt() {
        return gwjt;
    }

    public void setGwjt(long gwjt) {
        this.gwjt = gwjt;
    }
}
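Before wiring the class into a job, write() and readFields() can be sanity-checked locally with a round trip through plain java.io streams. A minimal sketch (the class WritableRoundTrip is just a scratch test, not part of the job):
package com.demo;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        SalaryBillDetail original = new SalaryBillDetail(2800, 9000, 3200, 1000);
        // Serialize into an in-memory buffer.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));
        // Deserialize into a fresh object.
        SalaryBillDetail copy = new SalaryBillDetail();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(original); // 2800 9000 3200 1000
        System.out.println(copy);     // identical output means the round trip works
    }
}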
SalaryBillMapper.java:
package com.demo;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SalaryBillMapper extends Mapper<LongWritable, Text, Text, SalaryBillDetail> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One payroll record per line:
        // date,department,name,position,base,position pay,performance,allowance,...
        String line = value.toString();
        String[] ss = line.split(",");
        // Fields 4-7 are the four salary components to be totaled.
        SalaryBillDetail sbd = new SalaryBillDetail(Long.parseLong(ss[4]), Long.parseLong(ss[5]),
                Long.parseLong(ss[6]), Long.parseLong(ss[7]));
        // Emit (name, detail) so all records for one employee meet in the same reduce call.
        context.write(new Text(ss[2]), sbd);
    }
}
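Allocating a fresh Text and SalaryBillDetail on every call works, but because context.write() serializes its arguments immediately, both output objects can safely be reused across calls, a common Hadoop optimization. A sketch of the reusing variant (the class name SalaryBillMapperReuse is just for illustration):
package com.demo;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Variant of SalaryBillMapper that reuses its output objects.
public class SalaryBillMapperReuse extends Mapper<LongWritable, Text, Text, SalaryBillDetail> {
    private final Text outKey = new Text();
    private final SalaryBillDetail outValue = new SalaryBillDetail();
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] ss = value.toString().split(",");
        outKey.set(ss[2]); // employee name
        outValue.setJbgz(Long.parseLong(ss[4]));
        outValue.setZwgz(Long.parseLong(ss[5]));
        outValue.setJxgz(Long.parseLong(ss[6]));
        outValue.setGwjt(Long.parseLong(ss[7]));
        // context.write() serializes key and value right away, so mutating
        // outKey/outValue on the next call does not corrupt earlier output.
        context.write(outKey, outValue);
    }
}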
SalaryBillReducer.java:
package com.demo;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SalaryBillReducer extends Reducer<Text, SalaryBillDetail, Text, SalaryBillDetail> {

    @Override
    protected void reduce(Text key, Iterable<SalaryBillDetail> values, Context context)
            throws IOException, InterruptedException {
        // Running totals: base salary, position salary, performance salary, post allowance.
        long jbgz = 0, zwgz = 0, jxgz = 0, gwjt = 0;
        // Hadoop reuses one SalaryBillDetail instance across iterations
        // (readFields() overwrites it), so read the fields out immediately
        // rather than holding references to the objects.
        for (SalaryBillDetail sbd : values) {
            jbgz += sbd.getJbgz();
            zwgz += sbd.getZwgz();
            jxgz += sbd.getJxgz();
            gwjt += sbd.getGwjt();
        }
        context.write(key, new SalaryBillDetail(jbgz, zwgz, jxgz, gwjt));
    }
}
JobRunner.java:
package com.demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobRunner {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(JobRunner.class);

        job.setMapperClass(SalaryBillMapper.class);
        job.setReducerClass(SalaryBillReducer.class);
        // The reducer doubles as a combiner: summation is associative, and the
        // reducer's input and output types match, so partial map-side sums are safe.
        job.setCombinerClass(SalaryBillReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(SalaryBillDetail.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(SalaryBillDetail.class);

        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.137.23:9000/salarysummary/input"));
        // The output directory must not exist before the job runs.
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.137.23:9000/salarysummary/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
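To run the job, package the four classes into a jar, submit it on the cluster, and read the result back, along these lines (the jar name salarysummary.jar is assumed; part-r-00000 is the standard name of the single reducer's output file):
[hadoop@hadoop1 ~]$ hadoop jar salarysummary.jar com.demo.JobRunner
[hadoop@hadoop1 ~]$ hdfs dfs -cat /salarysummary/output/part-r-00000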
Output:
张三 14100 45000 16000 5000
李四 14000 45150 16000 5000
王五 14000 45000 16150 5000
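As a spot check, 张三's base-salary column sums to 2800 + 2810 + 2820 + 2830 + 2840 = 14100, which matches the first output line; the remaining columns can be verified the same way against the input records.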