3.1 InputFormat Data Input
3.1.1 Splits and the MapTask Parallelism Decision Mechanism
MapTask parallelism decision mechanism
- Data block: a Block is how HDFS physically divides data into chunks; the block is HDFS's unit of storage.
- Data split: a split divides the input only logically; the data is not physically cut into pieces on disk. The split is the unit of input that a MapReduce program computes over, and each split launches one MapTask.
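For example, with a 128MB block size on a cluster, a 300MB input file is planned as three splits (128MB, 128MB, 44MB) and therefore launches three MapTasks.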
3.1.2 Job Submission Flow and Split Source Code Summary
// minSize defaults to 1 and maxSize to Long.MAX_VALUE; blockSize is the block size (32MB by default locally, 128MB by default on a cluster)
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}
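With the defaults this evaluates to Math.max(1, Math.min(Long.MAX_VALUE, blockSize)) = blockSize, so the split size equals the block size. Lowering maxSize below the block size shrinks the splits; raising minSize above the block size grows them.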
3.1.3 FileInputFormat Split Mechanism
FileInputFormat split mechanism
FileInputFormat split size parameter configuration
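A minimal sketch of adjusting these parameters in a driver (the sizes below are illustrative, not recommendations; set one or the other, not both):
// make splits smaller than the block size by lowering the maximum
FileInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024); // 64MB
// make splits larger than the block size by raising the minimum
FileInputFormat.setMinInputSplitSize(job, 256 * 1024 * 1024); // 256MB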
3.1.4 TextInputFormat
TextInputFormat is the default FileInputFormat implementation. It reads records line by line: the key is the starting byte offset of the line within the file, of type LongWritable; the value is the line's content, excluding any line terminators (newline and carriage return), of type Text.
For example, a split contains two text records:
I have a happy day
Learning more
Each record is represented as the following key-value pair ("I have a happy day" is 18 bytes plus a newline, so the second record starts at offset 19):
(0, I have a happy day)
(19, Learning more)
3.1.5 CombineTextInputFormat Split Mechanism
Used when there are many small files: multiple small files are logically planned into one split, so they can all be handled by a single MapTask. Split generation has two parts: the virtual storage procedure and the split procedure.
// set the InputFormat
job.setInputFormatClass(CombineTextInputFormat.class);
// set the maximum virtual-storage split size to 4MB
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
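As a worked example (assuming four small input files of 1.7M, 5.1M, 3.4M, and 6.8M with the 4M maximum above): the virtual storage procedure leaves files no larger than 4M whole and halves files between 4M and 8M, yielding virtual blocks of 1.7M, 2.55M, 2.55M, 3.4M, 3.4M, and 3.4M; the split procedure then merges consecutive virtual blocks until each group reaches 4M, producing three splits of (1.7 + 2.55)M, (2.55 + 3.4)M, and (3.4 + 3.4)M.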
3.2 MapReduce Workflow
3.3 Shuffle Mechanism
3.3.1 Shuffle Mechanism
The data processing that takes place after the map() method and before the reduce() method is called Shuffle.
3.3.2 Partition
When submitting a job, the number of ReduceTasks can be set with job.setNumReduceTasks(2). The partition is then computed from the key's hashCode modulo the number of ReduceTasks, so the user cannot control which key lands in which partition. If no custom partitioner is set, the default internal class HashPartitioner is used; with the default single ReduceTask, partition = 0:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
public int getPartition(K key, V value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}
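The bitwise AND with Integer.MAX_VALUE clears the sign bit of the hash code, guaranteeing a non-negative value before the modulo, so the returned partition number always falls in [0, numReduceTasks).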
Steps for a custom Partitioner: extend Partitioner<K, V> and override getPartition(), then register the class in the driver with job.setPartitionerClass() and set a matching number of ReduceTasks, as the following example shows.
3.3.3 Partition Hands-On
package com.atguigu.mapreduce.partitioner;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
/**
* @author
* @date 2021/06/08
**/
public class ProvincePartitioner extends Partitioner<Text, FlowBean> {
@Override
public int getPartition(Text text, FlowBean flowBean, int numPartitions) {
// the key (text) is the phone number; partition by its first three digits
String phone = text.toString();
String prePhone = phone.substring(0, 3);
int partition;
if ("136".equals(prePhone)) {
partition = 0;
} else if ("137".equals(prePhone)) {
partition = 1;
} else if ("138".equals(prePhone)) {
partition = 2;
} else if ("139".equals(prePhone)) {
partition = 3;
} else {
partition = 4;
}
return partition;
}
}
Driver class configuration:
job.setPartitionerClass(ProvincePartitioner.class);
job.setNumReduceTasks(5);
Partitioning summary: if the number of ReduceTasks is greater than the number of partitions returned by getPartition(), the extra part-r-xxxxx output files are simply empty; if it is greater than 1 but smaller than the number of partitions, some data has no partition to land in and the job throws an Exception; if it is 1, the partitioner is bypassed and a single output file is produced. Partition numbers must start at 0 and increase consecutively.
3.3.4 WritableComparable Sorting
Both MapTask and ReduceTask sort data by key, whether or not the application logic requires it; this is Hadoop's default behavior and improves efficiency. The default order is lexicographic, implemented with quicksort.
A MapTask places intermediate results in a ring buffer; once the buffer's usage reaches a threshold, its contents are quicksorted and the sorted data is spilled to disk. When all input has been processed, the MapTask merge-sorts all of the spill files on disk.
A ReduceTask remotely copies its portion of the data from every MapTask. Each piece is spilled to disk if it exceeds a size threshold, and kept in memory otherwise. When the number of files on disk reaches a threshold, they are merge-sorted into one larger file; when the size or count of in-memory data exceeds a threshold, it is merged and spilled to disk. Once all data has been copied, the ReduceTask performs one final merge sort over everything in memory and on disk.
- Sort categories:
- Partial sort: MapReduce sorts the dataset by input key, guaranteeing that each output file is internally ordered.
- Total sort: the final output is a single, internally ordered file. It is implemented by configuring only one ReduceTask, which is extremely inefficient for large inputs because a single machine processes all the data.
- Auxiliary sort (GroupingComparator): groups keys on the Reduce side. Use it when the key is a bean object and you want keys whose fields are partially equal (not all fields equal) to enter the same reduce() call.
- Secondary sort: in a custom sort, comparing on two conditions inside compareTo constitutes a secondary sort.
- Secondary sort case: a file contains phone number, upstream traffic, downstream traffic, and total traffic; sort first by total traffic, then by upstream traffic. Analysis: define a FlowBean (the map-phase key) holding upstream, downstream, and total traffic, implement WritableComparable to get the two-level sort, and output the phone number as the reduce-phase key. FlowBean, FlowMapper, FlowReducer, and FlowDriver are implemented as follows:
package com.atguigu.mapreduce.writableComparable;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
/**
* @author
* @date 2021/06/03
**/
public class FlowBean implements WritableComparable<FlowBean> {
private long upFlow; // upstream traffic
private long downFlow; // downstream traffic
private long sumFlow; // total traffic
public long getUpFlow() {
return upFlow;
}
public void setUpFlow(long upFlow) {
this.upFlow = upFlow;
}
public long getDownFlow() {
return downFlow;
}
public void setDownFlow(long downFlow) {
this.downFlow = downFlow;
}
public long getSumFlow() {
return sumFlow;
}
public void setSumFlow(long sumFlow) {
this.sumFlow = sumFlow;
}
public void setSumFlow() {
this.sumFlow = this.upFlow + this.downFlow;
}
// no-arg constructor, required for reflective instantiation during deserialization
public FlowBean() {
}
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeLong(upFlow);
dataOutput.writeLong(downFlow);
dataOutput.writeLong(sumFlow);
}
@Override
public void readFields(DataInput dataInput) throws IOException {
this.upFlow = dataInput.readLong();
this.downFlow = dataInput.readLong();
this.sumFlow = dataInput.readLong();
}
@Override
public String toString() {
return upFlow + "\t" + downFlow + "\t" + sumFlow;
}
@Override
public int compareTo(FlowBean o) {
// primary sort: descending by total traffic
if (this.sumFlow > o.sumFlow) {
return -1;
} else if (this.sumFlow < o.sumFlow) {
return 1;
} else {
// secondary sort: ascending by upstream traffic
if (this.upFlow > o.upFlow) {
return 1;
} else if (this.upFlow < o.upFlow) {
return -1;
} else {
return 0;
}
}
}
}
package com.atguigu.mapreduce.writableComparable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @author
* @date 2021/06/05
* Records are ordered by total traffic, so the FlowBean carrying the traffic fields is the map output key
*/
public class FlowMapper extends Mapper<LongWritable, Text, FlowBean, Text> {
private FlowBean outK = new FlowBean();
private Text outValue = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// 1. read one line, e.g.:
// phone number  upstream  downstream  total
// 15847512684   1454      245         1699
// 14784125941   4101      142         4243
String line = value.toString();
// split into fields
String[] split = line.split("\t");
// populate the output key (FlowBean) and value (phone number)
outValue.set(split[0]);
outK.setUpFlow(Long.parseLong(split[1]));
outK.setDownFlow(Long.parseLong(split[2]));
outK.setSumFlow();
// emit
context.write(outK, outValue);
}
}
package com.atguigu.mapreduce.writableComparable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author
* @date 2021/06/06
**/
public class FlowReducer extends Reducer<FlowBean, Text, Text, FlowBean> {
private FlowBean outV = new FlowBean();
@Override
protected void reduce(FlowBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
/**
* beans that compare equal (same total and upstream traffic) arrive as one group of values, e.g.:
* key: total up  down    value: phone
*      240   100 140     18715674147
*      240   100 140     15784159324
*      240   100 140     12547821543
* iterating over the values ensures every phone number is written out
*/
for (Text value : values) {
context.write(value,key);
}
}
}
package com.atguigu.mapreduce.writableComparable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author
* @date 2021/06/06
**/
public class FlowDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// 1. get the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. set the jar storage location
job.setJarByClass(FlowDriver.class);
// 3. associate the mapper and reducer
job.setMapperClass(FlowMapper.class);
job.setReducerClass(FlowReducer.class);
// 4. set the mapper output key and value types
job.setMapOutputKeyClass(FlowBean.class);
job.setMapOutputValueClass(Text.class);
// 5. set the final output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
// 6. set the input and output paths
FileInputFormat.setInputPaths(job, new Path("F:\\input"));
FileOutputFormat.setOutputPath(job, new Path("F:\\output"));
// 7. submit the job
boolean result = job.waitForCompletion(true);
System.out.println(result ? 0 : 1);
}
}
- Per-partition sorting case: building on the previous requirement, add a custom partitioner that assigns partitions by the province encoded in the phone-number prefix. Analysis: add a ProvincePartitioner class and configure the partitioner and the number of ReduceTasks in the driver's job.
package com.atguigu.mapreduce.partitionandwritableComparable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
/**
* @author
* @date 2021/06/08
**/
public class ProvincePartitioner extends Partitioner<FlowBean, Text> {
@Override
public int getPartition(FlowBean flowBean, Text text, int numPartitions) {
// flowBean is the key (upstream, downstream, and total traffic)
// text is the phone number; partition by province prefix
String phone = text.toString();
String prePhone = phone.substring(0, 3);
int partition;
if ("136".equals(prePhone)) {
partition = 0;
} else if ("137".equals(prePhone)) {
partition = 1;
} else if ("138".equals(prePhone)) {
partition = 2;
} else if ("139".equals(prePhone)) {
partition = 3;
} else {
partition = 4;
}
return partition;
}
}
job.setPartitionerClass(ProvincePartitioner.class);
job.setNumReduceTasks(5);
3.3.5 Combiner Merging
Whether a Combiner is needed depends on the business logic; if it is, simply specify in the driver which class provides the combiner logic. Often the existing Reducer can be reused:
job.setCombinerClass(WordCountReducer.class);
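When the reducer cannot be reused directly, a standalone combiner can be written. A minimal sketch, assuming a word-count job with Text keys and IntWritable counts (WordCountCombiner is a hypothetical class name, not from the original):
package com.atguigu.mapreduce.combiner;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable outV = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
// pre-aggregate the counts for one key on the map side
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
outV.set(sum);
context.write(key, outV);
}
}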
3.4 OutputFormat Data Output
Case: write URLs to different files depending on their content. The mapper, reducer, driver, OutputFormat, and RecordWriter are implemented as follows:
package com.atguigu.mapreduce.outputformat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @author
* @date 2021/06/11
**/
public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// sample input lines:
// http://www.baidu.com
// http://www.goole.com
// http://www.atguigu.com
// output, e.g.: (http://www.goole.com, NullWritable)
// no processing; pass each line straight through
context.write(value,NullWritable.get());
}
}
package com.atguigu.mapreduce.outputformat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* @author
* @date 2021/06/11
**/
public class LogReducer extends Reducer<Text, NullWritable,Text,NullWritable> {
@Override
protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
// iterate over all values so duplicate lines are not lost
for (NullWritable value : values) {
context.write(key,NullWritable.get());
}
}
}
package com.atguigu.mapreduce.outputformat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author
* @date 2021/06/11
**/
public class LogOutputformat extends FileOutputFormat<Text, NullWritable> {
@Override
public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
LogRecordWriter lrw = new LogRecordWriter(job);
return lrw;
}
}
package com.atguigu.mapreduce.outputformat;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import java.io.IOException;
/**
* @author
* @date 2021/06/11
**/
public class LogRecordWriter extends RecordWriter<Text, NullWritable> {
private FSDataOutputStream atguiguOut;
private FSDataOutputStream other;
public LogRecordWriter(TaskAttemptContext job) {
// open the two output streams
try {
FileSystem fs = FileSystem.get(job.getConfiguration());
atguiguOut = fs.create(new Path("F:\\output\\atguigu.log"));
other = fs.create(new Path("F:\\output\\other.log"));
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public void write(Text key, NullWritable nullWritable) throws IOException, InterruptedException {
// route each record to the stream matching its content
String log = key.toString();
if (log.contains("atguigu")) {
atguiguOut.writeBytes(log + "\n");
} else {
other.writeBytes(log + "\n");
}
}
@Override
public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
IOUtils.closeStream(atguiguOut);
IOUtils.closeStream(other);
}
}
package com.atguigu.mapreduce.outputformat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author
* @date 2021/05/27
**/
public class LogDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
//1. get the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
//2. set the jar storage location
job.setJarByClass(LogDriver.class);
//3. associate the mapper and reducer
job.setMapperClass(LogMapper.class);
job.setReducerClass(LogReducer.class);
//4. set the map output kv types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
//5. set the final output kv types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
// set the custom OutputFormat
job.setOutputFormatClass(LogOutputformat.class);
//6. set the input and output paths
FileInputFormat.setInputPaths(job, new Path("F:\\input"));
// Although the OutputFormat is custom, it extends FileOutputFormat, which must still write a _SUCCESS file, so an output directory is required
FileOutputFormat.setOutputPath(job, new Path("F:\\output"));
//7. submit the job
boolean result = job.waitForCompletion(true);
System.out.println(result ? 0 : 1);
}
}
3.5 MapReduce Core Source Code
3.5.1 MapTask Working Mechanism
Referring to the figure in 3.2, the MapTask workflow has ten steps grouped into five phases:
- Read phase: step 1, prepare the input text; step 2, before the client submits, gather information about the input data and build a task plan according to the configured parameters; step 3, submit the split information, the jar, and the xml configuration; step 4, compute the number of MapTasks;
- Map phase: step 5, read records with the default TextInputFormat; step 6, run the user's logic;
- Collect phase: step 7, write <k, v> pairs into the ring buffer; step 8, partition and sort;
- Spill phase: step 9, spill to files (partitioned, and sorted within each partition);
- Merge phase: step 10, merge-sort the spill files.
3.5.2 ReduceTask Working Mechanism
Referring to the figure in 3.2, the ReduceTask workflow has three phases:
- Copy phase: the ReduceTask remotely copies one piece of data from each MapTask; a piece is written to disk if it exceeds a size threshold, and kept in memory otherwise;
- Sort phase: while copying, the ReduceTask runs two background threads that merge files in memory and on disk, preventing excessive memory use or too many files on disk. By MapReduce semantics, the input to the user's reduce() function is a group of data aggregated by key, and Hadoop implements this grouping by sorting. Since every MapTask has already locally sorted its own output, the ReduceTask only needs one merge sort over all of the data;
- Reduce phase: the reduce() function writes its results to HDFS.
3.5.3 ReduceTask Parallelism Decision Mechanism
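Unlike MapTask parallelism, which is determined by the number of splits, ReduceTask parallelism is set manually in the driver. A minimal illustration (the count of 4 is arbitrary):
// set the number of ReduceTasks directly
job.setNumReduceTasks(4);
Setting it to 0 removes the reduce phase entirely, as the Map Join driver in 3.6.4 does.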
3.5.4 MapTask and ReduceTask Source Code Walkthrough
3.6 Join Applications
3.6.1 Reduce Join
3.6.2 Reduce Join Example
Case: join an order table (id, pid, amount) with a product table (pid, pname) on pid. The bean, mapper, reducer, and driver are implemented as follows:
package com.atguigu.mapreduce.reduceJoin;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
/**
* @author
* @date 2021/06/16
* bean holding the merged fields of the order and product tables
**/
public class TableBean implements Writable {
// order table: id pid amount
// pd table: pid pname
private String id; // order id
private String pid; // product id
private int amount; // quantity
private String pname; // product name
private String flag; // marks the source table: "order" or "pd"
// no-arg constructor
public TableBean() {
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getPid() {
return pid;
}
public void setPid(String pid) {
this.pid = pid;
}
public int getAmount() {
return amount;
}
public void setAmount(int amount) {
this.amount = amount;
}
public String getPname() {
return pname;
}
public void setPname(String pname) {
this.pname = pname;
}
public String getFlag() {
return flag;
}
public void setFlag(String flag) {
this.flag = flag;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(id);
out.writeUTF(pid);
out.writeInt(amount);
out.writeUTF(pname);
out.writeUTF(flag);
}
@Override
public void readFields(DataInput in) throws IOException {
this.id = in.readUTF();
this.pid = in.readUTF();
this.amount = in.readInt();
this.pname = in.readUTF();
this.flag = in.readUTF();
}
@Override
public String toString() {
// id pname amount
return id + "\t" + pname + "\t" + amount;
}
}
package com.atguigu.mapreduce.reduceJoin;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;
/**
* @author
* @date 2021/06/16
**/
public class TableMapper extends Mapper<LongWritable, Text, Text, TableBean> {
private String fileName;
private Text outK = new Text();
private TableBean outV = new TableBean();
@Override
protected void setup(Context context) throws IOException, InterruptedException {
// initialization: get the name of the file backing this split, to tell order lines from pd lines
// order file columns: id pid amount
// pd file columns: pid pname
FileSplit split = (FileSplit) context.getInputSplit();
fileName = split.getPath().getName();
}
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// read one line
String line = value.toString();
// determine which file this line belongs to
if (fileName.contains("order")) {
String[] split = line.split("\t");
// the join key is pid; fill the order-side fields
outK.set(split[1]);
outV.setId(split[0]);
outV.setPid(split[1]);
outV.setAmount(Integer.parseInt(split[2]));
outV.setPname("");
outV.setFlag("order");
} else {
String[] split = line.split("\t");
// the join key is pid; fill the product-side fields
outK.set(split[0]);
outV.setId("");
outV.setPid(split[0]);
outV.setAmount(0);
outV.setPname(split[1]);
outV.setFlag("pd");
}
// emit
context.write(outK, outV);
}
}
package com.atguigu.mapreduce.reduceJoin;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;
/**
* @author
* @date 2021/06/16
**/
public class TableReducer extends Reducer<Text, TableBean, TableBean, NullWritable> {
@Override
protected void reduce(Text key, Iterable<TableBean> values, Context context) throws IOException, InterruptedException {
/**
* example of one pid group: iterate the records flagged "order" and set their pname from the record flagged "pd"
* pid  id    pname   amount  flag
* 01   1001          1       order
* 01   1004          4       order
* 01         Xiaomi  0       pd
*/
// one list for the order records, one bean for the pd record
List<TableBean> orderBeans = new ArrayList<>();
TableBean pdBean = new TableBean();
// iterate over the group, copying each value (Hadoop reuses the value object)
for (TableBean value : values) {
if (value.getFlag().equals("order")) {
TableBean tmpTableBean = new TableBean();
try {
BeanUtils.copyProperties(tmpTableBean, value);
} catch (IllegalAccessException e) {
e.printStackTrace();
} catch (InvocationTargetException e) {
e.printStackTrace();
}
orderBeans.add(tmpTableBean);
} else {
try {
BeanUtils.copyProperties(pdBean, value);
} catch (IllegalAccessException e) {
e.printStackTrace();
} catch (InvocationTargetException e) {
e.printStackTrace();
}
}
}
// fill in pname on every order record and emit
for (TableBean orderBean : orderBeans) {
orderBean.setPname(pdBean.getPname());
context.write(orderBean, NullWritable.get());
}
}
}
package com.atguigu.mapreduce.reduceJoin;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author
* @date 2021/06/16
**/
public class TableDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// 1. get the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. set the jar storage location
job.setJarByClass(TableDriver.class);
// 3. associate the mapper and reducer
job.setMapperClass(TableMapper.class);
job.setReducerClass(TableReducer.class);
// 4. set the mapper output key and value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(TableBean.class);
// 5. set the final output key and value types
job.setOutputKeyClass(TableBean.class);
job.setOutputValueClass(NullWritable.class);
// 6. set the input and output paths
FileInputFormat.setInputPaths(job, new Path("F:\\input"));
FileOutputFormat.setOutputPath(job, new Path("F:\\output"));
// 7. submit the job
boolean result = job.waitForCompletion(true);
System.out.println(result ? 0 : 1);
}
}
Summary: because the join happens entirely in the Reduce phase, the reduce side bears heavy processing pressure while the map nodes stay lightly loaded, so resource utilization is low and the reduce phase is highly prone to data skew. The remedy is to perform the join on the map side instead.
3.6.3 Map Join
3.6.4 Map Join Hands-On
Case: the same join, performed on the map side by caching the small pd table. The mapper and driver are implemented as follows:
package com.atguigu.mapreduce.mapJoin;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
/**
* @author
* @date 2021/06/17
**/
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
private HashMap<String, String> pdMap = new HashMap<String, String>();
private Text outK = new Text();
@Override
protected void setup(Context context) throws IOException, InterruptedException {
/**
* pd file contents:
* pid pname
* 01 Xiaomi
* 02 Huawei
* 03 Gree
*/
// open the cached file (pd.txt) and load its contents into pdMap
URI[] cacheFiles = context.getCacheFiles();
FileSystem fs = FileSystem.get(context.getConfiguration());
FSDataInputStream fis = fs.open(new Path(cacheFiles[0]));
// read the stream line by line
BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line;
while (StringUtils.isNotEmpty(line = reader.readLine())) {
// split the line into pid and pname
String[] fields = line.split("\t");
// cache pid -> pname
pdMap.put(fields[0], fields[1]);
}
IOUtils.closeStream(reader);
}
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
/**
* order file contents:
* id pid amount
* 1001 01 1
*/
// process one line of order.txt
String line = value.toString();
String[] fields = line.split("\t");
// look up the product name by pid
String pname = pdMap.get(fields[1]);
// assemble order id, product name, and amount
outK.set(fields[0] + "\t" + pname + "\t" + fields[2]);
context.write(outK, NullWritable.get());
}
}
package com.atguigu.mapreduce.mapJoin;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
/**
* @author
* @date 2021/06/17
**/
public class MapJoinDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
// 1. get the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. set the jar storage location
job.setJarByClass(MapJoinDriver.class);
// 3. associate the mapper
job.setMapperClass(MapJoinMapper.class);
// 4. set the mapper output key and value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
// 5. set the final output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
// load the small table into the distributed cache
job.addCacheFile(new URI("file:///F:/tablecache/pd.txt"));
// a map-side join needs no reduce phase, so set the number of ReduceTasks to 0
job.setNumReduceTasks(0);
// 6. set the input and output paths
FileInputFormat.setInputPaths(job, new Path("F:\\input\\inputtable2"));
FileOutputFormat.setOutputPath(job, new Path("F:\\hadoop\\output"));
// 7. submit the job
boolean result = job.waitForCompletion(true);
System.out.println(result ? 0 : 1);
}
}
3.7 Data Cleaning (ETL)
ETL describes the process of extracting data from a source (Extract), transforming it (Transform), and loading it into a destination (Load). The term is most common in data warehousing, but its scope is not limited to data warehouses.
Before running the core business MapReduce program, data that does not meet the requirements must be cleaned out. Cleaning usually needs only a Mapper, with no Reducer.
Example: drop log lines with 11 or fewer fields. The mapper and driver code follows:
package com.atguigu.mapreduce.etl;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @author
* @date 2021/06/19
**/
public class WebLogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// read one line
String line = value.toString();
// ETL filtering
boolean result = parseLog(line, context);
if (!result) {
return;
}
// emit the lines that pass the filter
context.write(value, NullWritable.get());
}
private boolean parseLog(String line, Context context) {
// split on spaces
String[] fields = line.split(" ");
// keep the line only if it has more than 11 fields
return fields.length > 11;
}
}
package com.atguigu.mapreduce.etl;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.net.URISyntaxException;
/**
* @author
* @date 2021/06/19
**/
public class WebLogDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
// hard-coded local input and output paths for testing
args = new String[]{"F:\\input\\inputtable2", "F:\\hadoop\\output"};
// 1. get the Job instance
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// 2. set the jar storage location
job.setJarByClass(WebLogDriver.class);
// 3. associate the mapper
job.setMapperClass(WebLogMapper.class);
// 4. set the final output key and value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
// no reduce phase is needed, so set the number of ReduceTasks to 0
job.setNumReduceTasks(0);
// 5. set the input and output paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// 6. submit the job
boolean result = job.waitForCompletion(true);
System.out.println(result ? 0 : 1);
}
}
3.8 MapReduce Development Summary
- InputFormat
- The default is TextInputFormat: the key is the line's byte offset, the value is the line's content;
- For many small files, CombineTextInputFormat merges multiple files into unified splits
- Mapper
- setup(): initialization;
- map(): the user's business logic;
- cleanup(): releasing resources
- Partitioning
- The default is HashPartitioner, which partitions by the key's hash value modulo the number of ReduceTasks;
- Custom partitioners are supported
- Sorting
- Partial sort: each output file is internally ordered;
- Total sort: a single reducer globally sorts all the data;
- Secondary sort: within custom sorting, implement the WritableComparable interface and override compareTo
- Combiner
- Premise: it must not affect the final business logic (summing is safe; averaging is not);
- Pre-aggregates on the map side; one way to mitigate data skew
- Reducer
- setup(): initialization;
- reduce(): the user's business logic;
- cleanup(): releasing resources
- OutputFormat
- The default is TextOutputFormat, which writes each record as a line;
- Custom OutputFormats are supported