MapReduce Enhancements
1. Partitioning
In MapReduce, a partitioner assigns each map output record to a partition, and all records in the same partition are processed by the same reduce task; the number of partitions must not exceed the number of reduce tasks.
Note: a job that uses custom partitions has to be packaged as a jar and submitted to the cluster; it cannot be run locally.
To partition data, write a custom partitioner class that extends Partitioner with the map output key/value types and override its getPartition method; the return value decides which partition (and therefore which reduce task) a record is sent to. A minimal sketch follows below.
In the main method, register the partitioner class and set the number of reduce tasks, making sure the number of partitions matches the number of reduce tasks.
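A minimal sketch of such a partitioner, assuming Text keys and NullWritable values (the class name AlphabetPartitioner and the two-way split rule are illustrative; the notes' own phone-number partitioner appears in Requirement 3 below):
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class AlphabetPartitioner extends Partitioner<Text, NullWritable> {
    @Override
    public int getPartition(Text key, NullWritable value, int numReduceTasks) {
        // Keys starting with a–m go to partition 0, everything else to partition 1.
        String k = key.toString();
        if (k.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(k.charAt(0));
        return first <= 'm' ? 0 : 1;
    }
}
In the driver, the partitioner and a matching number of reduce tasks would then be registered:
job.setPartitionerClass(AlphabetPartitioner.class);
job.setNumReduceTasks(2);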
2. Sorting and serialization in MapReduce
Serialization: converting a structured object into a byte stream.
Deserialization: converting a byte stream back into a structured object.
When objects need to be passed between processes or persisted, they are serialized into byte streams; conversely, a received byte stream is deserialized back into an object.
Writable is Hadoop's serialization format; a class becomes serializable by implementing the Writable interface. Writable has a sub-interface, WritableComparable, which lets a key be both serialized and compared/sorted.
Example:
Data:
a 1
a 9
b 3
a 7
b 8
b 10
a 5
Requirement: sort the first column in dictionary order; when the first column is equal, sort the second column in ascending order.
Step 1: define a custom data type and comparator
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class PairWritable implements WritableComparable<PairWritable> {
private String first;
private int second;
public String getFirst() {
return first;
}
public void setFirst(String first) {
this.first = first;
}
public int getSecond() {
return second;
}
public void setSecond(int second) {
this.second = second;
}
@Override
public String toString() {
return first + "-----" + second;
}
@Override
public int compareTo(PairWritable o) {
// Sort by the first field in dictionary order; break ties by the second field, ascending.
int i = this.first.compareTo(o.first);
if (i != 0) {
return i;
}
return Integer.compare(this.second, o.second);
}
@Override
public void write(DataOutput output) throws IOException {
output.writeUTF(first);
output.writeInt(second);
}
@Override
public void readFields(DataInput input) throws IOException {
this.first = input.readUTF();
this.second = input.readInt();
}
}
Step 2: write the map logic
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class PairWritableMapper extends Mapper<LongWritable, Text, PairWritable, NullWritable> {
private PairWritable pairWritable = new PairWritable();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] split = value.toString().split("\t");
pairWritable.setFirst(split[0]);
pairWritable.setSecond(Integer.valueOf(split[1]));
context.write(pairWritable, NullWritable.get());
}
}
Step 3: write the reduce class
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class PairWritableReducer extends Reducer<PairWritable, NullWritable, PairWritable, NullWritable> {
@Override
protected void reduce(PairWritable key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
for (NullWritable value : values) {
context.write(key, NullWritable.get());
}
}
}
Step 4: write the main method that runs the job
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class PairWritableJobMain extends Configured implements Tool {
@Override
public int run(String[] strings) throws Exception {
Job job = Job.getInstance(super.getConf(), "job");
job.setJarByClass(PairWritableJobMain.class);
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job, new Path("file:///D:\\排序\\input"));
job.setMapperClass(PairWritableMapper.class);
job.setMapOutputKeyClass(PairWritable.class);
job.setMapOutputValueClass(NullWritable.class);
job.setReducerClass(PairWritableReducer.class);
job.setOutputKeyClass(PairWritable.class);
job.setOutputValueClass(NullWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setOutputPath(job, new Path("file:///D:\\排序\\output"));
boolean b = job.waitForCompletion(true);
return b ? 0 : 1;
}
public static void main(String[] args) throws Exception {
Configuration configuration = new Configuration();
Tool tool = new PairWritableJobMain();
int run = ToolRunner.run(configuration, tool, args);
System.exit(run);
}
}
3. Counters in MapReduce
Hadoop's built-in counter groups:
Counter group | Class |
---|---|
MapReduce task counters | org.apache.hadoop.mapreduce.TaskCounter |
File system counters | org.apache.hadoop.mapreduce.FileSystemCounter |
FileInputFormat counters | org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter |
FileOutputFormat counters | org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter |
Job counters | org.apache.hadoop.mapreduce.JobCounter |
Counters can be used in two ways.
Option 1: obtain a counter from the context object:
Counter counter = context.getCounter("MR_COUNT", "MapRecordCounter");
counter.increment(1L);
Option 2: define counters with an enum type (see the sketch after the snippet):
public static enum Counter{
REDUCE_INPUT_RECORDS, REDUCE_INPUT_VAL_NUMS,
}
context.getCounter(Counter.REDUCE_INPUT_RECORDS).increment(1L);
context.getCounter(Counter.REDUCE_INPUT_VAL_NUMS).increment(1L);
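A minimal sketch of where the enum-based counter might live, assuming a reducer shaped like the PairWritableReducer above (the class name CountingReducer and the counter semantics are illustrative):
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class CountingReducer extends Reducer<PairWritable, NullWritable, PairWritable, NullWritable> {
    // Each enum constant becomes one counter; Hadoop aggregates it across all tasks.
    public static enum Counter {
        REDUCE_INPUT_RECORDS, REDUCE_INPUT_VAL_NUMS
    }
    @Override
    protected void reduce(PairWritable key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // One increment per distinct key.
        context.getCounter(Counter.REDUCE_INPUT_RECORDS).increment(1L);
        for (NullWritable value : values) {
            // One increment per value belonging to that key.
            context.getCounter(Counter.REDUCE_INPUT_VAL_NUMS).increment(1L);
            context.write(key, NullWritable.get());
        }
    }
}
The counter totals are printed along with the job's other counters when the job finishes.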
4. The combiner in MapReduce
Every map task produces a large amount of output. A combiner performs a local merge of the map output before it is transferred, which reduces the amount of data shipped between map and reduce nodes and improves network I/O; it is one of MapReduce's optimization techniques.
1. The combiner is a MapReduce component.
2. The parent class of a combiner is Reducer.
3. The difference between a combiner and a reducer is where they run.
4. The purpose of a combiner is to locally aggregate the output of each map task.
5. Implementation steps:
Write a custom combiner that extends Reducer and override its reduce method (a sketch follows below).
Register it in the job: job.setCombinerClass(CustomCombiner.class)
A combiner may only be used when it does not affect the final business logic, and its output key/value types must match the reducer's input key/value types.
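A minimal word-count-style sketch of such a combiner, assuming the map output is Text keys with LongWritable counts (this pairing is illustrative and not part of the flow example below):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class CustomCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts produced by one map task so only one record per key crosses the network.
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
Because its input and output types are identical, the same class could in principle also be used as the reducer.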
5. A comprehensive MapReduce exercise
Data file:
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 游戏娱乐 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 jd.com 京东购物 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 taobao.com 淘宝购物 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 cnblogs.com 技术门户 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com 视频网站 15 12 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 未知 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 sougou.com 综合门户 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 信息安全 20 20 3156 2936 200
1363157983019 13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82 baidu.com 综合搜索 4 0 240 0 200
1363157984041 13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4 s19.cnzz.com 站点统计 24 9 6960 690 200
1363157973098 15013685858 5C-0E-8B-C7-F7-90:CMCC 120.197.40.4 rank.ie.sogou.com 搜索引擎 28 27 3659 3538 200
1363157986029 15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99 www.umeng.com 站点统计 3 3 1938 180 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 zhilian.com 招聘门户 15 9 918 4938 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 csdn.net 技术门户 3 3 180 180 200
1363157984040 13602846565 5C-0E-8B-8B-B6-00:CMCC 120.197.40.4 2052.flash2-http.qq.com 综合门户 15 12 1938 2910 200
1363157995093 13922314466 00-FD-07-A2-EC-BA:CMCC 120.196.100.82 img.qfc.cn 图片大全 12 12 3008 3720 200
1363157982040 13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99 y0.ifengimg.com 综合门户 57 102 7335 110349 200
1363157986072 18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99 input.shouji.sogou.com 搜索引擎 21 18 9531 2412 200
1363157990043 13925057413 00-1F-64-E1-E6-9A:CMCC 120.196.100.55 t3.baidu.com 搜索引擎 69 63 11058 48243 200
1363157988072 13760778710 00-FD-07-A4-7B-08:CMCC 120.196.100.82 http://youku.com/ 视频网站 2 2 120 120 200
1363157985079 13823070001 20-7C-8F-70-68-1F:CMCC 120.196.100.99 img.qfc.cn 图片浏览 6 3 360 180 200
1363157985069 13600217502 00-1F-64-E2-E8-B1:CMCC 120.196.100.55 www.baidu.com 综合门户 18 138 1080 186852 200
Internet traffic statistics
Requirement 1: sum per phone number
For each phone number, compute the sums of the upstream flow, downstream flow, total upstream flow, and total downstream flow fields.
Step 1: define FlowBean, the custom object used as the map output value
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class FlowBean implements Writable {
private Integer upFlow;
private Integer downFlow;
private Integer upCountFlow;
private Integer downCountFlow;
@Override
public String toString() {
// Tab-separated so that this job's output can be re-parsed with split("\t") by the sort job in Requirement 2.
return upFlow + "\t" + downFlow + "\t" + upCountFlow + "\t" + downCountFlow;
}
public Integer getUpFlow() {
return upFlow;
}
public void setUpFlow(Integer upFlow) {
this.upFlow = upFlow;
}
public Integer getDownFlow() {
return downFlow;
}
public void setDownFlow(Integer downFlow) {
this.downFlow = downFlow;
}
public Integer getUpCountFlow() {
return upCountFlow;
}
public void setUpCountFlow(Integer upCountFlow) {
this.upCountFlow = upCountFlow;
}
public Integer getDownCountFlow() {
return downCountFlow;
}
public void setDownCountFlow(Integer downCountFlow) {
this.downCountFlow = downCountFlow;
}
@Override
public void write(DataOutput output) throws IOException {
output.writeInt(upFlow);
output.writeInt(downFlow);
output.writeInt(upCountFlow);
output.writeInt(downCountFlow);
}
@Override
public void readFields(DataInput input) throws IOException {
// Fields must be read back in exactly the same order they were written.
this.upFlow = input.readInt();
this.downFlow = input.readInt();
this.upCountFlow = input.readInt();
this.downCountFlow = input.readInt();
}
}
Step 2: write the custom map logic
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
private FlowBean flowBean = new FlowBean();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// split[1] is the phone number; split[6]..split[9] are the upstream flow, downstream flow, total upstream flow and total downstream flow columns.
String[] split = value.toString().split("\t");
flowBean.setUpFlow(Integer.parseInt(split[6]));
flowBean.setDownFlow(Integer.parseInt(split[7]));
flowBean.setUpCountFlow(Integer.parseInt(split[8]));
flowBean.setDownCountFlow(Integer.parseInt(split[9]));
context.write(new Text(split[1]), flowBean);
}
}
Step 3: write the custom reducer logic
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
private FlowBean flowBean = new FlowBean();
@Override
protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
Integer upFlow = 0;
Integer downFlow = 0;
Integer upCountFlow = 0;
Integer downCountFlow = 0;
for (FlowBean value : values) {
upFlow += value.getUpFlow();
downFlow += value.getDownFlow();
upCountFlow += value.getUpCountFlow();
downCountFlow += value.getDownCountFlow();
}
flowBean.setUpFlow(upFlow);
flowBean.setDownFlow(downFlow);
flowBean.setUpCountFlow(upCountFlow);
flowBean.setDownCountFlow(downCountFlow);
context.write(key, flowBean);
}
}
Step 4: the main program
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class FlowJobMain extends Configured implements Tool {
@Override
public int run(String[] strings) throws Exception {
Job job = Job.getInstance(super.getConf(), "job");
job.setJarByClass(FlowBean.class);
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job, new Path("file:///D:\\流量统计\\input"));
job.setMapperClass(FlowMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlowBean.class);
job.setReducerClass(FlowReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setOutputPath(job, new Path("file:///D:\\流量统计\\output"));
boolean b = job.waitForCompletion(true);
return b ? 0 : 1;
}
public static void main(String[] args) throws Exception {
Configuration configuration = new Configuration();
Tool tool = new FlowJobMain();
int run = ToolRunner.run(configuration, tool, args);
System.exit(run);
}
}
Requirement 2: sort by upstream flow in descending order
Use the output of Requirement 1 as the input of the sort job.
Step 1: make FlowBean implement WritableComparable so it can be compared and sorted
o1.compareTo(o2): a positive return value places o1 after o2; a negative return value places o1 before o2.
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class FlowBean implements WritableComparable<FlowBean> {
private Integer upFlow;
private Integer downFlow;
private Integer upCountFlow;
private Integer downCountFlow;
@Override
public String toString() {
return upFlow + "-" + downFlow + "-" + upCountFlow + "-" + downCountFlow;
}
public Integer getUpFlow() {
return upFlow;
}
public void setUpFlow(Integer upFlow) {
this.upFlow = upFlow;
}
public Integer getDownFlow() {
return downFlow;
}
public void setDownFlow(Integer downFlow) {
this.downFlow = downFlow;
}
public Integer getUpCountFlow() {
return upCountFlow;
}
public void setUpCountFlow(Integer upCountFlow) {
this.upCountFlow = upCountFlow;
}
public Integer getDownCountFlow() {
return downCountFlow;
}
public void setDownCountFlow(Integer downCountFlow) {
this.downCountFlow = downCountFlow;
}
@Override
public int compareTo(FlowBean o) {
// Descending order by total upstream flow: reverse the natural comparison.
return o.upCountFlow.compareTo(this.upCountFlow);
}
@Override
public void write(DataOutput output) throws IOException {
output.writeInt(upFlow);
output.writeInt(downFlow);
output.writeInt(upCountFlow);
output.writeInt(downCountFlow);
}
@Override
public void readFields(DataInput input) throws IOException {
// Fields must be read back in exactly the same order they were written.
this.upFlow = input.readInt();
this.downFlow = input.readInt();
this.upCountFlow = input.readInt();
this.downCountFlow = input.readInt();
}
}
Step 2: write the custom map logic
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class FlowSortMapper extends Mapper<LongWritable, Text, FlowBean, Text> {
private FlowBean flowBean = new FlowBean();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Each input line is Requirement 1's output: phone \t upFlow \t downFlow \t upCountFlow \t downCountFlow
String[] split = value.toString().split("\t");
flowBean.setUpFlow(Integer.parseInt(split[1]));
flowBean.setDownFlow(Integer.parseInt(split[2]));
flowBean.setUpCountFlow(Integer.parseInt(split[3]));
flowBean.setDownCountFlow(Integer.parseInt(split[4]));
context.write(flowBean, new Text(split[0]));
}
}
Step 3: write the custom reducer logic
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class FlowSortReducer extends Reducer<FlowBean, Text, Text, FlowBean> {
@Override
protected void reduce(FlowBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for (Text value : values) {
context.write(value, key);
}
}
}
Step 4: the main method that runs the job
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class FlowSortJobMain extends Configured implements Tool {
@Override
public int run(String[] strings) throws Exception {
Job job = Job.getInstance(super.getConf(), "job");
job.setJarByClass(FlowSortJobMain.class);
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job, new Path("file:///D:\\流量统计\\inputsort"));
job.setMapperClass(FlowSortMapper.class);
job.setMapOutputKeyClass(FlowBean.class);
job.setMapOutputValueClass(Text.class);
job.setReducerClass(FlowSortReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setOutputPath(job, new Path("file:///D:\\流量统计\\outputsort"));
boolean b = job.waitForCompletion(true);
return b ? 0 : 1;
}
public static void main(String[] args) throws Exception {
Configuration configuration = new Configuration();
Tool tool = new FlowSortJobMain();
int run = ToolRunner.run(configuration, tool, args);
System.exit(run);
}
}
Requirement 3: partition by phone number
Building on Requirement 1, partition the output by phone number, using its first three digits.
Custom partitioning rule:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class FlowPartition extends Partitioner<Text, FlowBean> {
@Override
public int getPartition(Text text, FlowBean flowBean, int i) {
String line = text.toString();
if (line.startsWith("135")) {
return 0;
} else if (line.startsWith("136")) {
return 1;
} else if (line.startsWith("137")) {
return 2;
} else if (line.startsWith("138")) {
return 3;
} else if (line.startsWith("139")) {
return 4;
} else {
return 5;
}
}
}
In the job's main class, add the partitioner setting and the number of reduce tasks:
job.setPartitionerClass(FlowPartition.class);
job.setNumReduceTasks(6);
Change the input and output paths, then package the job into a jar and run it on the cluster:
TextInputFormat.addInputPath(job, new Path("hdfs://node01:8020/input"));
TextOutputFormat.setOutputPath(job, new Path("hdfs://node01:8020/output"));
6. How MapTask works and map-task parallelism
The MapTask execution flow:
- TextInputFormat reads the data.
- The map logic is invoked; by default one input split (that is, one block) corresponds to one map task.
- The map output is written into a circular in-memory buffer; the default buffer size is 100 MB, and the buffer behaves like an array.
- As data keeps flowing into the buffer, it is partitioned, sorted, combined and grouped there.
- When the buffer reaches 80% of its capacity (80 MB), a spill thread starts and writes those 80 MB of in-memory data to disk.
- When the map task finishes there may be many small spill files on disk, each already locally sorted, partitioned and combined; these small files are then merged into one large file.
- The merged file waits for the reduce phase to pull it.
Basic MapTask configuration settings (mapred-site.xml); they can also be overridden per job, as sketched below:
Setting | Property | Default |
---|---|---|
Size of the circular in-memory sort buffer | mapreduce.task.io.sort.mb | 100 MB |
Spill threshold (fraction of the buffer) | mapreduce.map.sort.spill.percent | 0.80 |
Local directory for spill files | mapreduce.cluster.local.dir | ${hadoop.tmp.dir}/mapred/local |
Maximum number of spill files merged at once | mapreduce.task.io.sort.factor | 10 |
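A hedged sketch of overriding these settings for a single job through the Configuration object, following the main-method pattern of the FlowJobMain example above (the values 200 and 0.90 are only examples, not recommendations):
public static void main(String[] args) throws Exception {
    Configuration configuration = new Configuration();
    // Enlarge the in-memory sort buffer from the 100 MB default (example value).
    configuration.set("mapreduce.task.io.sort.mb", "200");
    // Spill when the buffer is 90% full instead of the 0.80 default (example value).
    configuration.set("mapreduce.map.sort.spill.percent", "0.90");
    System.exit(ToolRunner.run(configuration, new FlowJobMain(), args));
}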
7. How ReduceTask works
- Threads are started to copy data from the map tasks, fetching the portion that belongs to each reduce task.
- The fetched data is merged; the merge may happen in memory, on disk, or in both at the same time, and grouping is performed while merging.
- The reduce logic is invoked.
- The results are written out.
Note: the number of map tasks is determined by the number of blocks (input splits); the number of reduce tasks cannot be derived automatically and must be set explicitly with job.setNumReduceTasks().
8. Compression in Hadoop
File compression has two main benefits: it saves disk space and it speeds up data transfer over the network and to and from disk.
Run bin/hadoop checknative to see which compression codecs the local Hadoop installation supports.
Java codec classes for the common compression formats:
Compression format | Java class |
---|---|
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compression.lzo.LzopCodec |
LZ4 | org.apache.hadoop.io.compress.Lz4Codec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
Setting compression in code:
Enable compression for the map output:
configuration.set("mapreduce.map.output.compress","true");
configuration.set("mapreduce.map.output.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");
Enable compression for the job (reduce) output:
configuration.set("mapreduce.output.fileoutputformat.compress","true");
configuration.set("mapreduce.output.fileoutputformat.compress.type","RECORD");
configuration.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");