MapReduce Concepts
MapReduce is a programming framework for distributed computation. It is the core framework users rely on to develop "Hadoop-based data analysis applications".
The core job of MapReduce is to combine the business-logic code written by the user with its built-in default components into a complete distributed computing program that runs on a Hadoop cluster.
Advantages of MapReduce (simple)
1. MapReduce is easy to program
By implementing a few interfaces you get a distributed program that can be spread across a large number of cheap PC machines. In other words, writing a distributed program feels just like writing a simple serial program. This is the feature that made MapReduce programming so popular.
2. Good scalability
When your computing resources are no longer sufficient, you can scale the computing capacity simply by adding machines.
3. High fault tolerance
MapReduce was designed to be deployed on cheap PC machines, which requires high fault tolerance. For example, **if one machine goes down, the computation tasks running on it are moved to another node so the job does not fail.** This happens without any manual intervention; Hadoop handles it entirely on its own.
4. Suited to offline processing of massive data sets at the PB scale and above
Thousands of servers in a cluster can work concurrently, providing the needed data-processing capacity.
Disadvantages of MapReduce (slow)
1. Not good at real-time computation
MapReduce cannot return results within milliseconds or seconds the way MySQL can.
2. Not good at stream processing
The input of stream processing is dynamic, whereas the input data set of a MapReduce job is static and cannot change while the job runs. This is a consequence of MapReduce's design: the data source must be static.
3. Not good at DAG (directed-graph) computation
When several applications depend on one another, with each application's input being the previous one's output, MapReduce can still do the job, but the output of every MapReduce job is written to disk, which causes heavy disk IO and very poor performance.
The core idea of MapReduce
Map: maps the data into the form we need
Reduce: processes the data once it is in that form
For WordCount, for example, Map turns every line into (word, 1) pairs and Reduce adds up the 1s for each word.
MapReduce Processes
A complete MapReduce program running in distributed mode consists of three kinds of processes:
1. MrAppMaster: responsible for scheduling the whole program and coordinating its state
2. MapTask: responsible for the entire data-processing flow of the Map phase
3. ReduceTask: responsible for the entire data-processing flow of the Reduce phase
Common data serialization types
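For reference, the Java types and the Hadoop Writable types commonly paired with them are roughly:
boolean -> BooleanWritable
byte -> ByteWritable
int -> IntWritable
long -> LongWritable
float -> FloatWritable
double -> DoubleWritable
String -> Text
map -> MapWritable
array -> ArrayWritable
null -> NullWritable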
MapReduce Programming Conventions
A user program is made up of three parts: Mapper, Reducer, and Driver.
Case study
Implement a requirement yourself (the WordCount example)
Input file:
Count and output the total number of occurrences of each word in the given text file.
pom.xml
If the Java version is not 1.8, the jar cannot be compiled/run on the Linux cluster.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>MapReduceWordCount</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>RELEASE</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.2</version>
</dependency>
</dependencies>
<properties>
<java.version>1.8</java.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
</project>
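A note on the Java-version remark above: in a plain Maven build the <java.version> property alone is not consumed by the compiler plugin. A sketch of the standard Maven compiler properties that usually pin the build to 1.8 (these lines are not part of the original pom):
<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>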
Mapper class:
package com.yyx.wordcount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private Text word = new Text();
private IntWritable one = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//What Map has to do: turn the data into (word, 1) pairs
//get one line of data
String line = value.toString();
//split on spaces
String[] words = line.split(" ");
// iterate and hand each word back to the framework as (word, 1)
for (String word : words) {
this.word.set(word);// reuse the word object (converts the String word into Text)
context.write(this.word,this.one);
/*context.write(new Text(word),new IntWritable(1));
With big data this would create a huge number of new objects (and JVM garbage collection would eat into memory)
//the form we finally want is (Text, IntWritable)
//write() corresponds to the last two type parameters of Mapper<LongWritable, Text, Text, IntWritable>
//while the first two describe the input
*/
}
}
}
Reducer class:
package com.yyx.wordcount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
//the Iterable is supplied by the framework
import java.io.IOException;
// in <Text, IntWritable, Text, IntWritable> the first two type parameters match the Mapper's output types
public class WcReducer extends Reducer<Text, IntWritable,Text,IntWritable> {
private IntWritable total= new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
// accumulate
int sum = 0;
for (IntWritable value : values) {
sum+=value.get();
}
// wrap and write out
total.set(sum);
context.write(key,total);
}
}
Driver class:
package com.yyx.wordcount;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class WcDriver{
public static void main(String[] args) {
//configure the job in the Driver
try {
//1. get a Job instance (via reflection)
Job job = Job.getInstance(new Configuration());
//2. set the jar path
job.setJarByClass(WcDriver.class);
//3. set the Mapper and Reducer
job.setMapperClass(WcMapper.class);
job.setReducerClass(WcReducer.class);
//4. set the output types of the Mapper and Reducer
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 5. set the input and output paths
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
//6. submit the job
try {
boolean b = job.waitForCompletion(true);
System.exit(b ? 0:1);
//exit code 0 on success, 1 otherwise
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Run
Set the program arguments (input and output paths)
View the output:
Package the jar and run it on the cluster
Copy the jar to the desktop, rename it, and upload it to the hadoop directory
Start the cluster
First upload the file WordCount.txt to the cluster
Run:
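A typical invocation looks roughly like the line below; the jar name and HDFS paths are placeholders, not taken from the original screenshots:
hadoop jar wc.jar com.yyx.wordcount.WcDriver /user/input/WordCount.txt /user/output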
Result:
Hadoop Serialization
Serialization overview
Serialization: converting objects in memory into a sequence of bytes (or another data-transfer format) so they can be persisted to disk and sent over the network.
Deserialization: converting a received byte sequence (or another data-transfer format), or data persisted on disk, back into objects in memory.
Why serialize
Generally speaking, objects only live in memory: they are gone once the machine powers off, and they can only be used by the local process, not sent to another computer over the network. Serialization lets us store objects and ship them to remote machines.
Why not use Java serialization
Java serialization is a heavyweight framework: a serialized object carries a lot of extra baggage (checksums, headers, the inheritance hierarchy, and so on), which makes it inefficient to transfer over the network. That is why Hadoop developed its own serialization mechanism.
Characteristics of Hadoop serialization
Compact: uses storage space efficiently
Fast: little extra overhead when reading and writing data
Extensible: can evolve as the communication protocol is upgraded
Interoperable: supports interaction across multiple languages
Making a custom class serializable (the Writable interface)
First, the class must implement the Writable interface.
Deserialization creates the object via reflection by calling the no-argument constructor, so the class must have one.
The deserialization order must exactly match the serialization order.
Use toString to render the object in the output file.
Hands-on
Input data:
Goal:
That is, output each phone number together with its upstream traffic, downstream traffic, and total traffic.
First, create the FlowBean class:
Note that it must implement the interface
package com.yyx.sumflow;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
// Implement the Writable interface so Hadoop can serialize this class
public class FlowBean implements Writable {
// upstream traffic, downstream traffic, total traffic
private long upFlow;
private long downFlow;
private long sumFlow;
// getters and setters
public long getUpFlow() {
return upFlow;
}
public void setUpFlow(long upFlow) {
this.upFlow = upFlow;
}
public long getDownFlow() {
return downFlow;
}
public void setDownFlow(long downFlow) {
this.downFlow = downFlow;
}
public long getSumFlow() {
return sumFlow;
}
public void setSumFlow(long sumFlow) {
this.sumFlow = sumFlow;
}
// no-argument constructor (required for deserialization)
public FlowBean() {
}
/**
* Convenience set method that makes serialization in the Mapper a bit easier; not required
* @param upFlow
* @param downFlow
*/
public void set(long upFlow,long downFlow){
this.downFlow = downFlow;
this.upFlow = upFlow;
this.sumFlow = upFlow + downFlow;
}
@Override
public String toString() {
return upFlow+"\t"+downFlow+"\t"+sumFlow;
}
/**
* Serialization method
* @param out the data sink the framework hands us (we write to the framework)
* @throws IOException
*/
@Override
public void write(DataOutput out) throws IOException {
out.writeLong(upFlow);
out.writeLong(downFlow);
out.writeLong(sumFlow);
}
/**
* Deserialization method
* @param in the data source the framework hands us (the framework gives the data back to us)
* @throws IOException
*/
@Override
public void readFields(DataInput in) throws IOException {
// Note: must read the fields in exactly the same order as out.write above
upFlow = in.readLong();
downFlow = in.readLong();
sumFlow = in.readLong();
}
}
Mapper class:
package com.yyx.sumflow;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
// input types are LongWritable and Text; output types are Text and our FlowBean object
public class FlowMapper extends Mapper<LongWritable, Text,Text,FlowBean> {
private Text phone = new Text();
private FlowBean flowBean = new FlowBean();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] line = value.toString().split("\t");
phone.set(line[1]);// the phone number is field 1
// in the split array, the third-to-last field is the upstream traffic and the second-to-last is the downstream traffic; Long.parseLong converts the string into a long
flowBean.set(Long.parseLong(line[line.length-3]),Long.parseLong(line[line.length-2]));
context.write(phone,flowBean);
}
}
Reducer class:
package com.yyx.sumflow;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class FlowReducer extends Reducer<Text,FlowBean,Text,FlowBean> {
private FlowBean sum = new FlowBean();// the accumulated traffic
@Override
protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
long sumUpFlow = 0;
long sumDownFlow = 0;
for (FlowBean flowBean : values) {
// accumulate instead of overwriting, otherwise only the last record would survive
sumUpFlow += flowBean.getUpFlow();
sumDownFlow += flowBean.getDownFlow();
}
sum.set(sumUpFlow,sumDownFlow);
context.write(key,sum);
}
}
Driver class:
package com.yyx.sumflow;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class FlowDriver{
public static void main(String[] args) throws IOException {
// First, get a job instance
Job job = Job.getInstance(new Configuration());
// Set the jar path
job.setJarByClass(FlowDriver.class);
// Set the Mapper and Reducer
job.setReducerClass(FlowReducer.class);
job.setMapperClass(FlowMapper.class);
// Set the output types of the Mapper and Reducer
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlowBean.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
// Set the input and output paths
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
// Submit the job
try {
boolean b = job.waitForCompletion(true);
System.exit(b ? 0:1);
//exit code 0 on success, 1 otherwise
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
}
}
How the MapReduce Framework Works
Input data
InputFormat: responsible for turning our input files into (k, v) pairs
OutputFormat: responsible for turning (k, v) pairs into output files
Shuffle: made up of the second half of the MapTask and the first half of the ReduceTask
Splits and MapTasks
The split size defaults to the block size. A different size is allowed, but it causes extra network transfer when the data is read, so it is not recommended.
Also note (point 4 in the figure above) that splitting is done per file, not across the data set as a whole.
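For reference, the split size used by FileInputFormat boils down to the following logic (simplified paraphrase of the Hadoop source; minSize defaults to 1 and maxSize to Long.MAX_VALUE, so the result is normally just the block size):
// simplified sketch of FileInputFormat's split-size calculation
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    // clamp the block size between minSize and maxSize
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
Raising mapreduce.input.fileinputformat.split.minsize above the block size makes splits larger; lowering mapreduce.input.fileinputformat.split.maxsize below it makes them smaller.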
Job submission and split planning (source walkthrough)
waitForCompletion()
submit();
// 1. establish a connection
connect();
// 1) create the proxy used to submit the Job
new Cluster(getConfiguration());
// (1) decide whether this runs locally or on a remote YARN cluster
initialize(jobTrackAddr, conf);
// 2. submit the job
submitter.submitJobInternal(Job.this, cluster)
// 1) create the staging path used to hand data to the cluster
Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
// 2) get the jobid and create the Job path
JobID jobId = submitClient.getNewJobID();
// 3) copy the jar to the cluster
copyAndConfigureFiles(job, submitJobDir);
rUploader.uploadFiles(job, jobSubmitDir);
// 4) compute the splits and generate the split plan files
writeSplits(job, submitJobDir);
maps = writeNewSplits(job, jobSubmitDir);
input.getSplits(job);
// 5) write the XML configuration to the staging path
writeConf(conf, submitJobFile);
conf.writeXml(out);
// 6) submit the Job and return its submission status
status = submitClient.submitJob(jobId, submitJobDir.toString(), job.getCredentials());
InputFormat
InputFormat:
First, it computes the splits.
Then, it creates a RecordReader for each split.
FileInputFormat
TextInputFormat
TextInputFormat's split method simply calls the parent class's split method.
Its method for reading k/v pairs returns a LineRecordReader:
public RecordReader<LongWritable, Text> getRecordReader(
InputSplit genericSplit, JobConf job,
Reporter reporter)
throws IOException {
reporter.setStatus(genericSplit.toString());
String delimiter = job.get("textinputformat.record.delimiter");
byte[] recordDelimiterBytes = null;
if (null != delimiter) {
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
}
return new LineRecordReader(job, (FileSplit) genericSplit,
recordDelimiterBytes);
}
KeyValueTextInputFormat
KeyValueTextInputFormat also uses the parent class's split method.
Its k/v reading method is different: it returns a KeyValueLineRecordReader.
public RecordReader<Text, Text> getRecordReader(InputSplit genericSplit,
JobConf job,
Reporter reporter)
throws IOException {
reporter.setStatus(genericSplit.toString());
return new KeyValueLineRecordReader(job, (FileSplit) genericSplit);
}
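A minimal Driver-side sketch of switching to KeyValueTextInputFormat with the new (mapreduce) API; the tab separator and the class name are assumed examples, not something stated in the original:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KvDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // each line is split at the first separator: the left part becomes the map key, the rest the value
        conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");
        Job job = Job.getInstance(conf);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // ... set Mapper/Reducer, output types and paths as in the Drivers above ...
    }
}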
NLineInputFormat
It overrides the split method:
The split logic is its own: every N lines form one split.
Its k/v reading method is the same as TextInputFormat's.
The method that reads the k/v pairs returns a LineRecordReader.
public InputSplit[] getSplits(JobConf job, int numSplits)
throws IOException {
ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
for (FileStatus status : listStatus(job)) {
for (org.apache.hadoop.mapreduce.lib.input.FileSplit split :
org.apache.hadoop.mapreduce.lib.input.
NLineInputFormat.getSplitsForFile(status, job, N)) {
splits.add(new FileSplit(split));
}
}
return splits.toArray(new FileSplit[splits.size()]);
}
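A Driver-side sketch of enabling NLineInputFormat (the value 3 is an arbitrary example; job is the Job instance created as in the Drivers above):
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// every 3 input lines become one split, and therefore one MapTask
NLineInputFormat.setNumLinesPerSplit(job, 3);
job.setInputFormatClass(NLineInputFormat.class);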
Custom InputFormat
Whether in HDFS or in MapReduce, handling small files is very inefficient, yet dealing with large numbers of small files is sometimes unavoidable, so a solution is needed. One option is a custom InputFormat that merges the small files.
1. Requirement
Merge several small files into one SequenceFile (a SequenceFile is Hadoop's file format for storing binary key-value pairs). The SequenceFile holds the original files, with each file's path + name as the key and its contents as the value.
(1) The input is three text files
1.txt:
yongpeng weidong weinan
sanfeng luozong xiaoming
2.txt:
longlong fanfan
mazong kailun yuhang yixin
longlong fanfan
mazong kailun yuhang yixin
3.txt:
shuaige changmo zhenqiang
dongli lingu xuanxuan
(2) The expected output is a single binary SequenceFile (meant to be read by MapReduce, not by humans). Its header names the key and value types, org.apache.hadoop.io.Text and org.apache.hadoop.io.BytesWritable, and it contains one record per input file, e.g. key file:/F:/MyInput/1.txt with the full contents of 1.txt as the value, and likewise for file:/F:/MyInput/2.txt and file:/F:/MyInput/3.txt. (The raw binary dump is not reproduced here.)
InputFormat class:
package com.yyx.newinputformat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;
public class MyInputFormat extends FileInputFormat<Text, BytesWritable> {
@Override
protected boolean isSplitable(JobContext context, Path filename) { // makes sure a single file is never split
return false;
}
@Override
public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
return new MyFileRecordReader();
}
}
RecordReader class:
package com.yyx.newinputformat;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;
/**
* Each RecordReader handles one file and reads the whole file as a single KV pair
*/
public class MyFileRecordReader extends RecordReader<Text, BytesWritable> {
private boolean notRead= true;
private Text key = new Text();
private BytesWritable value = new BytesWritable();
private FSDataInputStream fsDataInputStream;
private FileSplit fs;
/**
* Initialization method; the framework calls it once at the start
* @param split the input split
* @param context all the information about the task
* @throws IOException
* @throws InterruptedException
*/
@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
fs = (FileSplit) split; //cast to FileSplit (the concrete type the framework passes in)
Path path = fs.getPath(); // get the path
FileSystem fileSystem = path.getFileSystem(context.getConfiguration()); //get the file system from the path
// open the input stream
fsDataInputStream = fileSystem.open(path);
}
/**
* Read the next K/V pair
* @return true if a pair was read, false otherwise
* @throws IOException
* @throws InterruptedException
*/
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
// the whole file is read as one KV pair in a single call
// so this returns true only the first time; once the file has been read it returns false
if (notRead){// actually read the file
// set the key
key.set(fs.getPath().toString());
// read the value
byte[] buf = new byte[(int) fs.getLength()];
fsDataInputStream.read(buf);
value.set(buf,0,buf.length);
notRead = false;
return true; //we read data, so return true
}else {
// the file has already been read
return false;
}
}
/**
* Get the key that was just read
* @return the current key
* @throws IOException
* @throws InterruptedException
*/
@Override
public Text getCurrentKey() throws IOException, InterruptedException {
return key;
}
/**
* Get the value that was just read
* @return the current value
* @throws IOException
* @throws InterruptedException
*/
@Override
public BytesWritable getCurrentValue() throws IOException, InterruptedException {
return value;
}
/**
* Get the current reading progress
* @return a number between 0 and 1 (the progress)
* @throws IOException
* @throws InterruptedException
*/
@Override
public float getProgress() throws IOException, InterruptedException {// either nothing has been read yet, or everything has
return notRead? 0 : 1;
}
/**
* Release resources
* @throws IOException
*/
@Override
public void close() throws IOException {
//close the stream
IOUtils.closeStream(fsDataInputStream);
}
}
Driver class:
package com.yyx.newinputformat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import java.io.IOException;
public class MyDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance(new Configuration());
job.setJarByClass(MyDriver.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(BytesWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(BytesWritable.class);
job.setInputFormatClass(MyInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileInputFormat.setInputPaths(job,new Path("F:\\MyInput")); //use the setInputPaths of whichever parent class your custom InputFormat extends
FileOutputFormat.setOutputPath(job,new Path("F:\\MyOutput"));
boolean b = job.waitForCompletion(true);
System.exit(b ? 0 : 1);
}
}
Detailed MapReduce Workflow
1. A file is submitted for processing. On the client, the file is split according to the split rules, and the related information (split metadata, the jar itself, the configuration) is generated for YARN.
2. YARN uses the split information to work out the number of MapTasks.
3. Taking MapTask1 as an example: it takes a split, creates an InputFormat object and obtains a RecordReader from it. The RecordReader turns the split into KV pairs.
4. The KV pairs are fed to the Mapper's map method, which processes them (your own code) and calls context.write(K,V).
5. The K,V pairs are collected, in serialized form, into the outputCollector (the circular in-memory buffer). The buffer keeps index metadata on one side and the (K,V) data on the other. When the buffer is 80% full, a spill happens: writing continues from the other end, and the spilled 80% is sorted (by partition, then by key), producing a file that is partitioned and sorted within each partition, which is written to disk. As long as data keeps arriving, the buffer keeps spilling, so several spill files may be produced.
6. All spill files are merge-sorted into a single file, partitioned and sorted within each partition. This is the final output of the Map phase; multiple MapTasks produce multiple such files.
7. Reduce phase: ReduceTasks are started (one per partition), and each ReduceTask downloads its own partition from every MapTask (ReduceTask1 downloads partition 1 from all MapTasks, and so on).
8. Taking ReduceTask1 as an example: a merge sort combines the downloaded files into one file, which is the Reducer's input; it is read one group at a time and handed to the Reducer.
9. The groups (the grouping rule can be customised with a GroupingComparator) are processed by the Reducer, and the results go through the OutputFormat's RecordWriter to the final output.
The number of MapTasks is unrelated to the number of ReduceTasks; it is the number of partitions that is tied to the number of ReduceTasks.
Partitioning
Case:
Similar to the earlier traffic-sum example: phone numbers starting with 136, 137, 138 and 139 each go into their own file (four separate files), and numbers with any other prefix go into a fifth file.
package com.yyx.partition;
import com.yyx.sumflow.FlowBean;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
// the generic type parameters must match the Mapper's output types
public class MyPartitioner extends Partitioner<Text, FlowBean> {
@Override
public int getPartition(Text text, FlowBean flowBean, int numPartitions) {
String phone = text.toString();
switch (phone.substring(0,3)){
case "136":
return 0; // goes to partition 0
case "137":
return 1;
case "138":
return 2;
case "139":
return 3;
default: return 4;
}
}
}
package com.yyx.partition;
import com.yyx.sumflow.FlowBean;
import com.yyx.sumflow.FlowMapper;
import com.yyx.sumflow.FlowReducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class PartitionerFlowDriver {
public static void main(String[] args) throws IOException {
// First, get a job instance
Job job = Job.getInstance(new Configuration());
// Set the jar path
job.setJarByClass(PartitionerFlowDriver.class);
// Set the Mapper and Reducer
job.setReducerClass(FlowReducer.class);
job.setMapperClass(FlowMapper.class);
// Set the output types of the Mapper and Reducer
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlowBean.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
// Set the Partitioner class
job.setPartitionerClass(MyPartitioner.class);
// And set a matching number of reduce tasks
job.setNumReduceTasks(5);
// Set the input and output paths
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
// Submit the job
try {
boolean b = job.waitForCompletion(true);
System.exit(b ? 0:1);
//exit code 0 on success, 1 otherwise
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
}
}
Summary:
The parallelism of the Reduce phase is something we set by hand.
Partitioning decides which ReduceTask processes each record. If there are more partitions than ReduceTasks, some data has no ReduceTask to go to and the job fails with an error; if there are fewer partitions than ReduceTasks, the job does not fail, but resources are wasted. Partition numbers cannot skip values: they must start at zero and increase one by one.
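For comparison, when no custom Partitioner is set, the default HashPartitioner spreads keys over the ReduceTasks essentially like this (paraphrased from the Hadoop source):
// paraphrase of org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
public class HashPartitionerSketch<K, V> extends org.apache.hadoop.mapreduce.Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // hash the key, drop the sign bit, then take it modulo the number of ReduceTasks
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}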
Sorting
Hands-on: sort the result of the previous exercise (the traffic sums), ordering the final output by total traffic.
Input file:
Output file:
FlowBean class (the implemented interface has changed compared with before):
package com.yyx.descendingflow;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
// Implement WritableComparable so Hadoop can both serialize and sort this class
public class FlowBean implements WritableComparable<FlowBean> {
// upstream traffic, downstream traffic, total traffic
private long upFlow;
private long downFlow;
private long sumFlow;
// getters and setters
public long getUpFlow() {
return upFlow;
}
public void setUpFlow(long upFlow) {
this.upFlow = upFlow;
}
public long getDownFlow() {
return downFlow;
}
public void setDownFlow(long downFlow) {
this.downFlow = downFlow;
}
public long getSumFlow() {
return sumFlow;
}
public void setSumFlow(long sumFlow) {
this.sumFlow = sumFlow;
}
// no-argument constructor (required for deserialization)
public FlowBean() {
}
/**
* Convenience set method that makes serialization in the Mapper a bit easier; not required
* @param upFlow
* @param downFlow
*/
public void set(long upFlow,long downFlow){
this.downFlow = downFlow;
this.upFlow = upFlow;
this.sumFlow = upFlow + downFlow;
}
@Override
public String toString() {
return upFlow+"\t"+downFlow+"\t"+sumFlow;
}
/**
* Serialization method
* @param out the data sink the framework hands us (we write to the framework)
* @throws IOException
*/
@Override
public void write(DataOutput out) throws IOException {
out.writeLong(upFlow);
out.writeLong(downFlow);
out.writeLong(sumFlow);
}
/**
* Deserialization method
* @param in the data source the framework hands us (the framework gives the data back to us)
* @throws IOException
*/
@Override
public void readFields(DataInput in) throws IOException {
// Note: must read the fields in exactly the same order as out.write above
upFlow = in.readLong();
downFlow = in.readLong();
sumFlow = in.readLong();
}
@Override
public int compareTo(FlowBean flowBean) { //sort by total traffic in descending order
return Long.compare(flowBean.sumFlow,this.sumFlow);
}
}
Mapper:
package com.yyx.descendingflow;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class DescendingMapper extends Mapper<LongWritable, Text,FlowBean,Text> {// because we sort by traffic, the output key/value are swapped
// First, create the reusable objects
private Text phone = new Text();
private FlowBean flowBean = new FlowBean();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] line = value.toString().split("\t"); // get the data
phone.set(line[0]); // store the phone number
flowBean.set(Long.parseLong(line[1]),Long.parseLong(line[2])); // store the traffic; same result as the commented-out lines below
// flowBean.setDownFlow(Long.parseLong(line[2]));
// flowBean.setUpFlow(Long.parseLong(line[1]));
// flowBean.setSumFlow(Long.parseLong(line[3]));
// write out (we sort by traffic, so the FlowBean is the key)
context.write(flowBean,phone);
}
}
Reducer:
package com.yyx.descendingflow;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class DescendingReducer extends Reducer<FlowBean, Text, Text,FlowBean> { //the input (the Mapper's output) is a FlowBean and the phone number
@Override
protected void reduce(FlowBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for (Text value : values) {
context.write(value,key); // the data was already sorted between Mapper and Reducer; just swap key and value when writing out
}
}
}
Driver:
package com.yyx.descendingflow;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class DescendingDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance(new Configuration());
job.setJarByClass(DescendingDriver.class);
job.setMapperClass(DescendingMapper.class);
job.setReducerClass(DescendingReducer.class);
job.setMapOutputKeyClass(FlowBean.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
// paths are hard-coded to keep things simple
FileInputFormat.setInputPaths(job,new Path("f://output"));
FileOutputFormat.setOutputPath(job,new Path("f://output2"));
boolean b = job.waitForCompletion(true);
System.exit(b ? 0:1);
}
}
Optimization: combine this with the partitioning above, so the output is split into several files (partitioned by the leading digits of the phone number).
The output then becomes:
Create the Partitioner class:
package com.yyx.descendingflow;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class DescendingPartition extends Partitioner<FlowBean, Text> { //the Mapper's output types (traffic bean and phone number)
@Override
public int getPartition(FlowBean flowBean, Text text, int numPartitions) {
switch (text.toString().substring(0,3)){
case "136":
return 0;
case "137":
return 1;
case "138":
return 2;
case "139":
return 3;
default:
return 4;
}
}
}
Add to the Driver:
// Set the number of ReduceTasks
job.setNumReduceTasks(5);
// Set the Partitioner class
job.setPartitionerClass(DescendingPartition.class);
Combiner
A Combiner is a component of a MapReduce program besides the Mapper and Reducer.
The parent class of a Combiner is Reducer.
The difference between a Combiner and the Reducer is where they run: the Combiner runs on the node of each individual MapTask, while the Reducer receives the output of all Mappers globally.
The point of the Combiner is to pre-aggregate the output of each MapTask locally, so less data has to travel over the network.
The precondition for using a Combiner is that it must not affect the final business logic. In addition, the Combiner's output KV must match the Reducer's input KV: it cannot change the Mapper's output types, nor the Reducer's input types.
The Combiner takes effect inside the MapTask.
Benefit: less IO to move around (redundant records are merged early).
It is not enabled by default (and enabling it is subject to the constraints above).
Steps to implement a Combiner
1. Extend the Reducer class and implement the reduce method
2. Register it on the job:
job.setCombinerClass(WordcountCombiner.class);
Hands-on:
Input data
Output data:
The Combiner receives a large number of input records and merges them, so the amount of data it outputs drops sharply.
Mapper:
package com.yyx.combinertest;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class CombinerMapper extends Mapper<LongWritable, Text,Text,IntWritable> {
private Text word = new Text();
private IntWritable i = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] line = value.toString().split(" ");
for (String data : line
) {
word.set(data);
context.write(word,i);
}
}
}
Reducer:
package com.yyx.combinertest;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class CombinerReducer extends Reducer<Text, IntWritable,Text,IntWritable> {
private IntWritable total = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
total.set(sum);
context.write(key,total);
}
}
Driver:
package com.yyx.combinertest;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class CombinerDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// Get the Job
Job job = Job.getInstance(new Configuration());
job.setJarByClass(CombinerDriver.class);
job.setCombinerClass(CombinerReducer.class); // the reduce logic doubles as the combiner here
job.setMapperClass(CombinerMapper.class);
job.setReducerClass(CombinerReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job,new Path("F:\\MyInput\\combiner"));
FileOutputFormat.setOutputPath(job,new Path("F:\\MyOutput\\combiner"));
boolean b = job.waitForCompletion(true);
System.exit(b ? 0:1);
}
}
Without the Combiner:
With the Combiner:
Result:
Review:
Mapper input key: identifies each line, i.e. the byte offset at which the line starts; type LongWritable
Mapper input value: the content of one line; type Text
Mapper output key: a single word; type Text
Mapper output value: the word count
Combiner input key: matches the Mapper's output key
Combiner input value: matches the Mapper's output value
Combiner output key: matches the Mapper's output key
Combiner output value: matches the Mapper's output value
Reducer input key: matches the Mapper's output key
Reducer input value: matches the Mapper's output value
Reducer output key: depends on the desired result
Reducer output value: depends on the desired result
GroupingComparator (auxiliary sorting)
If you don't want records grouped by the default key comparison, you plug in a GroupingComparator. Because MapReduce sorts before it groups, the sort must be at least as fine-grained as the grouping.
It groups the data in the Reduce phase by one or several chosen fields.
Steps:
1. Create a class that extends the WritableComparator class
2. Override the compare method
3. Add a constructor that passes the class of the objects being compared to the parent class
Hands-on
Given the following order data, return the highest-priced item in each order.
Analysis: MapReduce sorts first and groups afterwards, but this task looks like "group first, then sort". So we sort by order id (and by price, descending, within the same order id), and then group by order id.
OrderBean class (pay special attention to the compare method)
package com.yyx.groupingcomparator;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class OrderBean implements WritableComparable<OrderBean> {
private String orderId;// order ID
private String productId;//product ID
private double price;// product price
@Override
public String toString() {
return "OrderBean{" +
"orderId='" + orderId + '\'' +
", productId='" + productId + '\'' +
", price=" + price +
'}';
}
public String getOrderId() {
return orderId;
}
public void setOrderId(String orderId) {
this.orderId = orderId;
}
public String getProductId() {
return productId;
}
public void setProductId(String productId) {
this.productId = productId;
}
public double getPrice() {
return price;
}
public void setPrice(double price) {
this.price = price;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(orderId);
out.writeUTF(productId);
out.writeDouble(price);
}
@Override
public void readFields(DataInput in) throws IOException {
orderId = in.readUTF();
productId = in.readUTF();
price = in.readDouble();
}
@Override
public int compareTo(OrderBean orderBean) {
// compare the order ids first
int compare = this.orderId.compareTo(orderBean.orderId);
if (compare == 0){
// if the order ids are equal, compare prices (descending)
return Double.compare(orderBean.price,this.price);
} else {
return compare;
}
}
}
GroupingComparator class:
package com.yyx.groupingcomparator;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class OrderGroupComparator extends WritableComparator {
protected OrderGroupComparator() { //this constructor is mandatory; without it you get a NullPointerException
super(OrderBean.class,true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
OrderBean oa = (OrderBean) a;
OrderBean ob = (OrderBean) b;
// compare only the order ID, not the price
return oa.getOrderId().compareTo(ob.getOrderId());
}
}
Map class:
package com.yyx.groupingcomparator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class OrderMap extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
OrderBean orderBean = new OrderBean();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// First, populate the bean's fields from the input line
String[] datas = value.toString().split("\t");
orderBean.setOrderId(datas[0]);
orderBean.setProductId(datas[1]);
orderBean.setPrice(Double.parseDouble(datas[2]));
// write out
context.write(orderBean,NullWritable.get());
}
}
Reduce:
package com.yyx.groupingcomparator;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class OrderReduce extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
@Override
protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
context.write(key,NullWritable.get());
}
}
Driver:
package com.yyx.groupingcomparator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class OrderDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance(new Configuration());
// Set the classes
job.setJarByClass(OrderDriver.class);
job.setMapperClass(OrderMap.class);
job.setReducerClass(OrderReduce.class);
job.setGroupingComparatorClass(OrderGroupComparator.class);
//Set the map/reduce output key/value types
job.setMapOutputKeyClass(OrderBean.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(OrderBean.class);
job.setOutputValueClass(NullWritable.class);
//Set the paths
FileInputFormat.setInputPaths(job,new Path("F:\\MyInput\\grouping"));
FileOutputFormat.setOutputPath(job,new Path("F:\\MyInput\\grouping\\output"));
boolean b = job.waitForCompletion(true);
System.exit(b ? 0 : 1);
}
}
Result (the formatting comes from toString):
OrderBean{orderId='0000001', productId='Pdt_01', price=222.8}
OrderBean{orderId='0000002', productId='Pdt_05', price=722.4}
OrderBean{orderId='0000003', productId='Pdt_06', price=232.8}
OutputFormat data output
OutputFormat interface implementations
OutputFormat is the base class for MapReduce output; every MapReduce output implementation implements the OutputFormat interface. A few common implementations:
TextOutputFormat (text output)
The default output format is TextOutputFormat. It writes each record as a line of text. Its keys and values may be of any type, because TextOutputFormat calls toString to turn them into strings.
SequenceFileOutputFormat
Using SequenceFileOutputFormat when the output will feed a subsequent MapReduce job is a good choice, because the format is compact and compresses easily.
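A minimal Driver-side sketch of switching the output to a compressed SequenceFile (the compression settings are illustrative assumptions, not taken from the original; job is the Job instance as in the Drivers above):
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// write the job's output as a block-compressed SequenceFile
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
SequenceFileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);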
Custom OutputFormat (important)
Customize the output to your needs, e.g. to control the final output path and format.
Steps:
1. Create a class that extends FileOutputFormat
2. Write your own RecordWriter, in particular the write() method that actually outputs the data
Example:
Requirement: filter an input log file. Lines that contain bilibili go to e:/bilibili.log; lines that do not contain bilibili go to e:/other.log.
Code:
package com.yyx.newoutputformat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class MyDriver {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance(new Configuration());
job.setJarByClass(MyDriver.class);
job.setOutputFormatClass(MyOutputFormat.class);
FileInputFormat.setInputPaths(job,new Path("F:\\testMyOutputFormat"));
FileOutputFormat.setOutputPath(job,new Path("F:\\testMyOutputFormat\\output"));
boolean b = job.waitForCompletion(true);
System.exit(b ? 0:1);
}
}
package com.yyx.newoutputformat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class MyOutputFormat extends FileOutputFormat<LongWritable, Text> {
@Override
public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
MyRecordWriter myRecordWriter = new MyRecordWriter();
myRecordWriter.initialize(job); //call our initialization method by hand
return myRecordWriter;
}
}
package com.yyx.newoutputformat;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
public class MyRecordWriter extends RecordWriter<LongWritable, Text> {
private FSDataOutputStream bilibili;
private FSDataOutputStream others;
/*
Using plain FileOutputStream here would be flawed (the target files would not end up in the job's output directory):
private FileOutputStream bilibili;
private FileOutputStream others;
*/
/**
* Initialization method
* @param job carries the job information
*/
public void initialize(TaskAttemptContext job) throws IOException {
// First get the job's output directory
String dic = job.getConfiguration().get(FileOutputFormat.OUTDIR);
// Get the file system
FileSystem fileSystem = FileSystem.get(job.getConfiguration());
// Create the two output streams
bilibili = fileSystem.create(new Path(dic + "\\bilibili.log"));
others = fileSystem.create(new Path(dic + "\\others.log"));
}
/**
* Write one KV pair; called once per KV pair
* @param key
* @param value
* @throws IOException
* @throws InterruptedException
*/
@Override
public void write(LongWritable key, Text value) throws IOException, InterruptedException {
String out = value.toString() + "\n"; //add the newline back, because each value arrives as a single line without it
if (out.contains("bilibili")){
bilibili.write(out.getBytes());
}else {
others.write(out.getBytes());
}
}
/**
* Release resources
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
IOUtils.closeStream(bilibili);
IOUtils.closeStream(others);
}
}