一、The OutputFormat Interface
OutputFormat is the base class for MapReduce output; every class that implements MapReduce output implements the OutputFormat interface.
1. Text output: TextOutputFormat
The default output format is TextOutputFormat, which writes each record as a line of text. Its keys and values may be of any type, because it converts them to strings by calling toString().
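The toString() conversion can be illustrated with a minimal plain-Java sketch (no Hadoop dependency; the tab separator matches TextOutputFormat's default key/value delimiter):

```java
// Minimal sketch of how TextOutputFormat renders one record:
// key and value are converted with toString() and joined by a tab.
public class TextLineDemo {
    static String formatRecord(Object key, Object value) {
        return key.toString() + "\t" + value.toString();
    }

    public static void main(String[] args) {
        // Any types work, since only toString() is required.
        System.out.println(formatRecord(1001, 3.14));
        System.out.println(formatRecord("word", 42));
    }
}
```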
2. SequenceFileOutputFormat
SequenceFileOutputFormat writes its output as a sequence file. If the output is to be consumed as the input of a subsequent MapReduce job, this is a good choice, because its format is compact and easily compressed.
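Selecting it in a driver is a one-line configuration (a fragment only; the surrounding Job setup and imports are assumed to exist elsewhere):

```java
// In the driver, after creating the Job (assumed elsewhere):
job.setOutputFormatClass(SequenceFileOutputFormat.class);
// Optional: enable block compression for the sequence file output.
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
```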
3. Custom OutputFormat
二、Custom OutputFormat
To control the final output file paths, you can define a custom OutputFormat.
When a single MapReduce program must write two kinds of results to different directories according to the data, this flexible output requirement is implemented with a custom OutputFormat.
1. Steps to define a custom OutputFormat
- Define a class that extends FileOutputFormat
- Write a custom RecordWriter and override write(), the method that emits the data
三、Filtering Log Content and Customizing Output Paths (custom OutputFormat)
1. Requirement
Filter the URLs in the input log by whether they contain com:
- URLs containing com are written to d:/com.log
- all other URLs are written to d:/other.log
2. Input data
A log file named log.txt containing one URL per line.
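The routing rule can be sketched in plain Java before wiring it into Hadoop (the sample URLs below are hypothetical):

```java
import java.util.List;

// Plain-Java sketch of the filter rule used in this example: a line
// containing "com" goes to com.log, anything else goes to other.log.
public class RouteDemo {
    static String targetFile(String url) {
        return url.contains("com") ? "d:/com.log" : "d:/other.log";
    }

    public static void main(String[] args) {
        // Hypothetical sample lines from log.txt
        List<String> urls = List.of("http://www.baidu.com", "http://www.sina.cn");
        for (String url : urls) {
            System.out.println(url + " -> " + targetFile(url));
        }
    }
}
```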
3. Custom OutputFormat
public class FilterOutputFormat extends FileOutputFormat<Text, NullWritable> {
    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        return new FilterRecordWriter(taskAttemptContext);
    }
}
4. Custom RecordWriter
public class FilterRecordWriter extends RecordWriter<Text, NullWritable> {
    private Configuration configuration;
    private FSDataOutputStream comFs = null;
    private FSDataOutputStream otherFs = null;

    public FilterRecordWriter() {
    }

    public FilterRecordWriter(TaskAttemptContext job) {
        configuration = job.getConfiguration();
        // Get the file system
        FileSystem fileSystem = null;
        try {
            fileSystem = FileSystem.get(configuration);
            // Create the two output streams
            comFs = fileSystem.create(new Path("d:/com.log"));
            otherFs = fileSystem.create(new Path("d:/other.log"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void write(Text text, NullWritable nullWritable) throws IOException, InterruptedException {
        // Route the record by whether it contains "com".
        // Note: Text.getBytes() returns the backing buffer, which may be
        // longer than the content, so bound the write by getLength().
        if (text.toString().contains("com")) {
            comFs.write(text.getBytes(), 0, text.getLength());
        } else {
            otherFs.write(text.getBytes(), 0, text.getLength());
        }
    }

    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Close the streams
        if (comFs != null) {
            comFs.close();
        }
        if (otherFs != null) {
            otherFs.close();
        }
    }
}
5. Mapper code
public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get one line of input
        String line = value.toString();
        // Set the key
        k.set(line);
        // Emit
        context.write(k, NullWritable.get());
    }
}
6. Reducer code
public class FilterReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Append a line separator here, because the custom RecordWriter
        // writes raw bytes without one
        Text out = new Text(key.toString() + "\r\n");
        // Emit once per occurrence so duplicate URLs are not collapsed
        for (NullWritable ignored : values) {
            context.write(out, NullWritable.get());
        }
    }
}
7. Driver code
public class FilterDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get configuration information
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Set the jar load path
        job.setJarByClass(FilterDriver.class);
        // Load the Mapper and Reducer classes
        job.setMapperClass(FilterMapper.class);
        job.setReducerClass(FilterReducer.class);
        // Set the custom OutputFormat
        job.setOutputFormatClass(FilterOutputFormat.class);
        // Set the map output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Set the input and output paths. The output path is still required
        // even with a custom OutputFormat, because the _SUCCESS marker is written there.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
四、Reduce-Side Join (ReduceJoin)
1. Principle
Map side: tag each key/value pair with a label identifying which table (file) it came from; use the join field as the key and the remaining fields plus the tag as the value; then emit.
Reduce side: records sharing the join field arrive at the reducer already grouped by key, so within each group separate the records by source table, then combine them.
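The tag-then-group idea can be sketched in plain Java, without Hadoop (the rows below are hypothetical; the "O:"/"P:" tags stand in for the source-table flag):

```java
import java.util.*;

// Plain-Java sketch of a reduce-side join: the "map" output is a list of
// (joinKey, taggedRecord) pairs, group() simulates the shuffle, and
// join() splits each group by tag and combines the records.
public class ReduceJoinSketch {
    public static Map<String, List<String>> group(List<String[]> tagged) {
        // Simulates the shuffle: group tagged records by join key
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] kv : tagged) {
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return groups;
    }

    public static List<String> join(Map<String, List<String>> groups) {
        List<String> out = new ArrayList<>();
        for (List<String> values : groups.values()) {
            String productName = null;
            List<String> orders = new ArrayList<>();
            // Separate the group by the tag prepended on the map side
            for (String v : values) {
                if (v.startsWith("P:")) productName = v.substring(2);
                else orders.add(v.substring(2));
            }
            for (String order : orders) out.add(order + "\t" + productName);
        }
        return out;
    }

    public static void main(String[] args) {
        // Map phase output: key = pid, value = tag + payload (hypothetical rows)
        List<String[]> tagged = List.of(
            new String[]{"01", "O:1001"},   // order 1001 references pid 01
            new String[]{"01", "P:XiaoMi"}, // product-table row for pid 01
            new String[]{"02", "O:1002"},
            new String[]{"02", "P:HuaWei"});
        System.out.println(join(group(tagged)));
    }
}
```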
2. Drawback
The drawback is obvious: all the join work happens on the reduce side, so the shuffle stage transfers a large amount of data and efficiency is low. A skewed join key also concentrates records on a few reducers (data skew); caching the small table in memory on the map side (a map-side join) is the usual way to avoid this.
五、Multi-Table Join Example in MapReduce (data skew)
1. Requirement
Order table t_order (file name order.txt):
id | pid | amount
1001 | 01 | 1
1002 | 02 | 2
1003 | 03 | 3
Product table t_product (file name pd.txt):
pid | pname
01 | 小米
02 | 华为
03 | 格力
Merge the product name from the product table into the order table by pid.
Expected result:
id | pname | amount
1001 | 小米 | 1
1002 | 华为 | 2
1003 | 格力 | 3
2. Program analysis
- Mapper logic:
  - determine which input file the record comes from
  - read the input data
  - process the two files differently, tagging each record with its source
  - wrap the fields in a bean object and emit it with pid as the key
  - the framework sorts by product id (the key) by default
- Reducer logic: the reduce method caches the collection of order records and the product record of each group, then merges them
3. Bean code
public class TableBean implements Writable {
    /**
     * Order id
     */
    private String orderId;
    /**
     * Product id
     */
    private String pid;
    /**
     * Product quantity
     */
    private int amount;
    /**
     * Product name
     */
    private String pName;
    /**
     * Flag: order table ("0") or product table ("1")
     */
    private String flag;

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(orderId);
        dataOutput.writeUTF(pid);
        dataOutput.writeInt(amount);
        dataOutput.writeUTF(pName);
        dataOutput.writeUTF(flag);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        // Must read the fields in exactly the order write() wrote them
        this.orderId = dataInput.readUTF();
        this.pid = dataInput.readUTF();
        this.amount = dataInput.readInt();
        this.pName = dataInput.readUTF();
        this.flag = dataInput.readUTF();
    }

    @Override
    public String toString() {
        return orderId + "\t" + pName + "\t" + amount;
    }

    public String getOrderId() {
        return orderId;
    }

    public void setOrderId(String orderId) {
        this.orderId = orderId;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }

    public String getpName() {
        return pName;
    }

    public void setpName(String pName) {
        this.pName = pName;
    }

    public String getFlag() {
        return flag;
    }

    public void setFlag(String flag) {
        this.flag = flag;
    }
}
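The write()/readFields() pair must mirror each other field for field. The round trip can be checked with plain java.io streams (a sketch using the same writeUTF/readUTF calls, outside Hadoop):

```java
import java.io.*;

// Sketch of the Writable round trip: serialize fields with DataOutputStream
// in one order, then read them back with DataInputStream in the same order.
public class RoundTripDemo {
    public static String[] roundTrip(String orderId, String pid, int amount) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        // Mirrors TableBean.write()
        out.writeUTF(orderId);
        out.writeUTF(pid);
        out.writeInt(amount);
        // Mirrors TableBean.readFields(): same field order
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        return new String[]{in.readUTF(), in.readUTF(), String.valueOf(in.readInt())};
    }

    public static void main(String[] args) throws IOException {
        String[] fields = roundTrip("1001", "01", 1);
        System.out.println(String.join("\t", fields));
    }
}
```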
4. Mapper code
public class TableMapper extends Mapper<LongWritable, Text, Text, TableBean> {
    TableBean v = new TableBean();
    Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Tell the two tables apart by the input file name
        FileSplit split = (FileSplit) context.getInputSplit();
        String name = split.getPath().getName();
        // Get one line of input
        String line = value.toString();
        // Split the fields
        String[] fields = line.split("\t");
        if (name.startsWith("order")) {
            // Order table: id, pid, amount
            v.setOrderId(fields[0]);
            v.setPid(fields[1]);
            v.setAmount(Integer.parseInt(fields[2]));
            v.setpName("");
            v.setFlag("0");
            k.set(fields[1]);
        } else {
            // Product table: pid, pname
            v.setOrderId("");
            v.setPid(fields[0]);
            v.setAmount(0);
            v.setpName(fields[1]);
            v.setFlag("1");
            k.set(fields[0]);
        }
        // Emit with pid as the key so matching records meet in one reduce group
        context.write(k, v);
    }
}
5. Reducer code
public class TableReducer extends Reducer<Text, TableBean, TableBean, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<TableBean> values, Context context) throws IOException, InterruptedException {
        // One group per pid: cache the order records and the single product record
        List<TableBean> orderBeans = new ArrayList<>();
        TableBean pdBean = new TableBean();
        // Copy the data out: the framework reuses the value object,
        // so each cached bean must be a copy
        for (TableBean value : values) {
            if ("0".equals(value.getFlag())) {
                // Order-table record
                TableBean tableBean = new TableBean();
                try {
                    BeanUtils.copyProperties(tableBean, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    e.printStackTrace();
                }
                orderBeans.add(tableBean);
            } else {
                // Product-table record
                try {
                    BeanUtils.copyProperties(pdBean, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    e.printStackTrace();
                }
            }
        }
        // Join: fill the product name into every order record and emit
        for (TableBean orderBean : orderBeans) {
            orderBean.setpName(pdBean.getpName());
            context.write(orderBean, NullWritable.get());
        }
    }
}
6. Driver code
public class TableDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get configuration information and a Job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Specify where the program's jar is
        job.setJarByClass(TableDriver.class);
        // Specify the Mapper and Reducer the job uses
        job.setMapperClass(TableMapper.class);
        job.setReducerClass(TableReducer.class);
        // Specify the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(TableBean.class);
        // Specify the final output types
        job.setOutputKeyClass(TableBean.class);
        job.setOutputValueClass(NullWritable.class);
        // Specify the job's input and output directories
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Run
        boolean flag = job.waitForCompletion(true);
        System.exit(flag ? 0 : 1);
    }
}