MapReduce: Principles and Programming
Hadoop Architecture
- HDFS: distributed file system
- MapReduce: distributed computing framework
- YARN: distributed resource management system
- Common
MapReduce
What is MapReduce?
MapReduce is a distributed computing framework
- It decomposes a large data-processing job into individual tasks that run in parallel across a cluster of servers
Suited to large-scale data processing
- Each node processes the data stored on that node
Every job has a Map part and a Reduce part
MapReduce design ideas
Divide and conquer
- A programming model that simplifies parallel computation
Two abstractions: Map and Reduce
- Developers focus on implementing the Mapper and Reducer functions
System-level details are hidden
- Developers focus on the business logic
MapReduce characteristics
Strengths
- Easy to program
- Scalable
- Highly fault tolerant
- High throughput
Poor fits
- Real-time computation is difficult
- Not suited to stream processing
Implementing WordCount with MapReduce
MapReduce execution flow
Data type signatures
- map: (K1, V1) -> list(K2, V2)
- reduce: (K2, list(V2)) -> list(K3, V3)
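The two signatures above can be traced on a toy input without Hadoop at all. The sketch below is a minimal pure-Java simulation of the map, shuffle/group, and reduce steps for WordCount; the class and method names are illustrative, not Hadoop API:

```java
import java.util.*;

public class WordCountFlow {
    // map: (K1 = line offset, V1 = line) -> list(K2 = word, V2 = 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // reduce: (K2 = word, list(V2) = counts) -> (K3 = word, V3 = total)
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return new AbstractMap.SimpleEntry<>(word, sum);
    }

    public static void main(String[] args) {
        String[] lines = {"hello world", "hello mapreduce"};
        // shuffle: group the map output by key (TreeMap also sorts the keys)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> r = reduce(e.getKey(), e.getValue());
            System.out.println(r.getKey() + "\t" + r.getValue());
        }
    }
}
```

The grouping step in main stands in for what the framework's shuffle and sort do between the two phases.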
MapReduce execution stages
- Mapper
- Combiner
- Partitioner
- Shuffle and Sort
- Reducer
The Hadoop V1 MapReduce engine
JobTracker
- Runs on the NameNode host
- Accepts job requests from clients
- Dispatches tasks to TaskTrackers
TaskTracker
- Receives task assignments from the JobTracker
- Executes the map and reduce operations
- Sends heartbeats back to the JobTracker
Hadoop V2: YARN
What YARN changes
- Supports more compute engines while staying compatible with MapReduce
- Better resource management; shrinks the load the JobTracker used to carry
- The JobTracker's resource management moves into the ResourceManager
- The JobTracker's per-job scheduling moves into the ApplicationMaster
- The NodeManager becomes the per-node resource and task manager
Hadoop and YARN architecture
How a Hadoop 2 MapReduce job runs on YARN
InputSplit
InputSplits are created from the input files before the Map phase
- Each InputSplit is processed by one Mapper task
- An InputSplit stores the split length and an array of the locations of the records
block vs. split
- A block is the physical representation of the data
- A split is a logical representation of the data in blocks
- Splits are cut at record boundaries
- The number of splits should be no greater than the number of blocks (they are usually equal)
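A rough sketch of why splits usually equal blocks: FileInputFormat sizes each split as max(minSize, min(maxSize, blockSize)), so with the default minimum (1) and maximum (Long.MAX_VALUE) the split size is exactly the block size. The constants below are illustrative:

```java
public class SplitSize {
    // mirrors the shape of FileInputFormat's split-size computation
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // a 128 MB HDFS block
        // with the default minSize = 1 and maxSize = Long.MAX_VALUE, split == block
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE) == blockSize);
        // raising minSize above the block size forces larger splits (fewer mappers)
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE));
    }
}
```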
Shuffle phase
The process that moves data from the Map output to the Reduce input
Key & Value types
Must be serializable
- Needed for network transfer and persistent storage
- IntWritable, LongWritable, FloatWritable, Text, DoubleWritable, BooleanWritable, NullWritable, etc.
All implement the Writable interface
- which requires the write() and readFields() methods
Keys must implement the WritableComparable interface
- The Reduce phase sorts by key
- so keys must be comparable
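Because the framework sorts keys before they reach the Reducer, a custom key must define its ordering. A real Hadoop key would implement WritableComparable; stripped of the Writable half, the comparison contract looks like the hypothetical composite key below (ascending by year, then descending by temperature, the classic secondary-sort shape):

```java
import java.util.*;

public class YearTempKey implements Comparable<YearTempKey> {
    final int year;
    final int temperature;

    YearTempKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    // sort ascending by year, then descending by temperature
    @Override
    public int compareTo(YearTempKey o) {
        int cmp = Integer.compare(year, o.year);
        return cmp != 0 ? cmp : Integer.compare(o.temperature, temperature);
    }

    public static void main(String[] args) {
        List<YearTempKey> keys = new ArrayList<>(Arrays.asList(
                new YearTempKey(2020, 31), new YearTempKey(2019, 12), new YearTempKey(2020, 35)));
        Collections.sort(keys);  // what the shuffle's sort phase would do with these keys
        for (YearTempKey k : keys) System.out.println(k.year + " " + k.temperature);
    }
}
```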
The MapReduce programming model
The InputFormat interface
Defines how data is read into the Mapper
- InputSplit[] getSplits
An InputSplit is the unit of data handled by a single Mapper; getSplits logically partitions a large dataset into InputSplits
Common InputFormat implementations
- TextInputFormat
- FileInputFormat
- KeyValueTextInputFormat
The Mapper class
Key Mapper methods
- void setup(Context context)
org.apache.hadoop.mapreduce.Mapper.Context
- void map(KEY key, VALUE value, Context context)
Called once for each key/value pair in the input split
- void cleanup(Context context)
- void run(Context context)
Override this method for complete control over how the Mapper executes
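The default run() is just setup, then map over every record in the split, then cleanup. A simplified non-Hadoop rendering of that loop (the Record type here is a stand-in for what Context delivers; names are illustrative):

```java
import java.util.*;

public class MapperRunSketch {
    // stand-in for one key/value pair the Context would deliver
    static class Record {
        final long key; final String value;
        Record(long key, String value) { this.key = key; this.value = value; }
    }

    static List<String> log = new ArrayList<>();

    static void setup() { log.add("setup"); }
    static void map(long key, String value) { log.add("map:" + value); }
    static void cleanup() { log.add("cleanup"); }

    // mirrors the shape of Mapper.run(Context)
    static void run(Iterator<Record> split) {
        setup();
        while (split.hasNext()) {     // Context.nextKeyValue()
            Record r = split.next();  // Context.getCurrentKey()/getCurrentValue()
            map(r.key, r.value);
        }
        cleanup();
    }

    public static void main(String[] args) {
        run(Arrays.asList(new Record(0, "a"), new Record(2, "b")).iterator());
        System.out.println(log);
    }
}
```

Overriding run() is how you would, for example, process records in batches instead of one call per record.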
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context ctx)
throws IOException, InterruptedException
{
StringTokenizer itr = new StringTokenizer( value.toString() );
while ( itr.hasMoreTokens() )
{
word.set( itr.nextToken() );
ctx.write( word, one);
}
}
}
The Combiner class
A Combiner is essentially a local Reduce
- Performs local aggregation before the shuffle
- Optional; a performance optimization
- Its input and output types must be identical
When a Reducer can also serve as the Combiner
- Its operation must be commutative and associative
Wiring in a Combiner
- job.setCombinerClass(WCReducer.class)
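A quick sanity check of the commutative/associative rule: summing partial sums gives the same answer as one big sum, but averaging partial averages does not, which is why a sum Reducer can double as a Combiner while a mean Reducer cannot. A minimal illustration:

```java
import java.util.*;

public class CombinerRule {
    static int sum(List<Integer> xs) {
        int s = 0;
        for (int x : xs) s += x;
        return s;
    }

    static double avg(List<Integer> xs) {
        return (double) sum(xs) / xs.size();
    }

    public static void main(String[] args) {
        // pretend a and b are the value lists two different mappers produced for one key
        List<Integer> a = Arrays.asList(1, 2), b = Arrays.asList(3, 4, 5);
        List<Integer> all = new ArrayList<>(a);
        all.addAll(b);

        // sum is associative and commutative: combining per-mapper sums is safe
        System.out.println(sum(all) == sum(Arrays.asList(sum(a), sum(b))));  // true

        // average is not: averaging the two partial averages gives the wrong mean
        double naive = (avg(a) + avg(b)) / 2;   // 2.75, but the true mean of all five is 3.0
        System.out.println(avg(all) == naive);  // false
    }
}
```

(The standard workaround for averages is to have the combiner emit (sum, count) pairs, which do combine associatively.)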
The Partitioner class
Partitions keys on the Map side
- The default is HashPartitioner: take the key's hash value and mod it by the number of Reduce tasks
- Decides which Reducer each record is sent to
Custom Partitioner
- Extend the abstract class Partitioner and override getPartition
- job.setPartitionerClass(MyPartitioner.class)
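The HashPartitioner rule from the bullets above fits in one line; the sign-bit mask keeps the result non-negative even for keys whose hashCode is negative. A standalone sketch over plain Strings (class name is illustrative):

```java
public class HashPartitionSketch {
    // the same arithmetic HashPartitioner applies to the key's hash
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // every occurrence of the same word lands on the same reducer
        for (String word : new String[]{"wish", "the", "you"}) {
            System.out.println(word + " -> reducer " + getPartition(word, 4));
        }
    }
}
```

Because the mapping is deterministic, all values for one key meet at a single Reducer, which is exactly what the reduce contract requires.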
The Reducer class
Key Reducer methods
- void setup(Context context)
org.apache.hadoop.mapreduce.Reducer.Context
- void reduce(KEY key, Iterable values, Context context)
Called once per key
- void cleanup(Context context)
- void run(Context context)
Override this method to control how the reduce task works
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
throws IOException, InterruptedException
{
int sum = 0;
for ( IntWritable value : values )
{
sum += value.get();
}
result.set( sum );
ctx.write( key, result );
}
}
The OutputFormat interface
Defines how data is written out from the Reducer
- RecordWriter<K,V> getRecordWriter
Writes the Reducer's <key, value> pairs to the target file
- checkOutputSpecs
Validates the output spec, e.g. that the output directory does not already exist
Common OutputFormat implementations
- TextOutputFormat
- SequenceFileOutputFormat
- MapFileOutputFormat
Writing an M/R Job
Job job = Job.getInstance(getConf(), "WordCountMR" );
//InputFormat
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]) );
job.setInputFormatClass(TextInputFormat.class);
//OutputFormat
FileOutputFormat.setOutputPath( job, new Path(args[1]) );
job.setOutputFormatClass(TextOutputFormat.class);
//Mapper
job.setMapperClass( WCMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
//Reducer
job.setReducerClass(WCReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//Submit the job and wait for it to finish
System.exit( job.waitForCompletion(true) ? 0 : 1 );
Implementing WordCount with MapReduce
Installing and configuring Hadoop on a local Windows machine
- Unpack hadoop-2.6.0-cdh5.14.2.tar.gz
- Extract the contents of hadoopBin.rar into the bin directory of the unpacked hadoop-2.6.0-cdh5.14.2 (the extracted contents themselves, not the enclosing folder)
- Copy hadoop.dll from the extracted hadoopBin into C:/windows/system32/
- Set the Hadoop environment variables
Writing the Java code
- Mapper
- Reducer
- Job
Running the M/R job
hadoop jar WCMR.jar cn.kgc.WCDriver /user/data /user/out
//WCMR.jar is the jar package
//WCDriver is the driver (job) class
Setting M/R parameters
Implementing WordCount in Java
- Pull in the Hadoop Maven dependencies
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.6.0</version>
</dependency>
<!--<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.2.0</version>
</dependency>-->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
<version>2.6.0</version>
</dependency>
a.txt, at D:/test/a.txt:
i wish to wish the wish you wish to wish,but
if you wish the wish the wish wishes,i won't
wish the wish you wish to wish
WCMapper
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
//Mapper for word count
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] words = line.split(" ");
for (String word : words) {
context.write(new Text(word), new IntWritable(1));
}
}
}
WCReducer
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int total=0;
for (IntWritable value : values) {
total += value.get();
}
context.write(key,new IntWritable(total));
}
}
WCPartitioner
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class WCPartitioner extends Partitioner<Text, IntWritable> {
@Override
public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
//mask the sign bit: Math.abs(hashCode) is still negative when hashCode == Integer.MIN_VALUE
return (text.hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}
WCDriver
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WCDriver {
public static void main(String[] args)throws Exception {
//1. Create the job
Configuration cfg = new Configuration();
Job job = Job.getInstance(cfg, "job_wc");
job.setJarByClass(WCDriver.class);
//2. Set the Mapper and Reducer
job.setMapperClass(WCMapper.class);
job.setReducerClass(WCReducer.class);
//Map output types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
//Partitioner and number of reduce tasks
job.setNumReduceTasks(4);
job.setPartitionerClass(WCPartitioner.class);
//Reduce output types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//Input and output paths
FileInputFormat.setInputPaths(job, new Path("D:/test/a.txt"));
FileOutputFormat.setOutputPath(job, new Path("D:/test/wcResult"));//this directory must not already exist
//3. Run
boolean result = job.waitForCompletion(true);
System.out.println(result ? "success" : "failure");
System.exit(result ? 0 : 1);
}
}
The output:
part-r-00000
part-r-00001
part-r-00002
part-r-00003
Although the run reports success, a warning appears. It is not an error; it only means log4j is not configured, so the logs cannot be viewed. A simple setup step removes the warning.
- Create a package named resources under the project.
- Mark resources as a resources root:
Step 1: open Project Structure (top left of IDEA).
Step 2: click Modules.
Step 3: click Sources.
Step 4: select the resources package you created.
Step 5: mark it with Resources at the top.
Step 6: click OK.
- Put a log4j.properties file into the resources root.
log4j.properties
### Root logger ###
log4j.rootLogger = debug,stdout,D,E
### Console appender ###
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = [%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n
### DEBUG-and-above log output to D://logs/log.log ###
log4j.appender.D = org.apache.log4j.DailyRollingFileAppender
log4j.appender.D.File = D://logs/log.log
log4j.appender.D.Append = true
log4j.appender.D.Threshold = DEBUG
log4j.appender.D.layout = org.apache.log4j.PatternLayout
log4j.appender.D.layout.ConversionPattern = %-d{yyyy-MM-dd HH:mm:ss} [ %t:%r ] - [ %p ] %m%n
### ERROR-and-above log output to D://logs/error.log ###
log4j.appender.E = org.apache.log4j.DailyRollingFileAppender
log4j.appender.E.File =D://logs/error.log
log4j.appender.E.Append = true
log4j.appender.E.Threshold = ERROR
log4j.appender.E.layout = org.apache.log4j.PatternLayout
log4j.appender.E.layout.ConversionPattern = %-d{yyyy-MM-dd HH:mm:ss} [ %t:%r ] - [ %p ] %m%n
- Run it again (delete the previous wcResult directory first): the warning is gone and plenty of log output appears.
WordCount against HDFS
- Modify the WCDriver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WCDriver {
public static void main(String[] args)throws Exception {
//1. Create the job
Configuration cfg = new Configuration();
Job job = Job.getInstance(cfg, "job_wc");
job.setJarByClass(WCDriver.class);
//2. Set the Mapper and Reducer
job.setMapperClass(WCMapper.class);
job.setReducerClass(WCReducer.class);
//Map output types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
//Partitioner and number of reduce tasks
job.setNumReduceTasks(4);
job.setPartitionerClass(WCPartitioner.class);
//Reduce output types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//Input and output paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//3. Run
boolean result = job.waitForCompletion(true);
System.out.println(result ? "success" : "failure");
System.exit(result ? 0 : 1);
}
}
- Build the jar:
- open Project Structure (top left of IDEA)
- select WCDriver
- build the jar
- copy the jar into your home directory on the Linux host
- create a test directory on HDFS:
hdfs dfs -mkdir /test/
- create a file a.txt:
vi a.txt
- with these contents:
i wish to wish the wish you wish to wish,but
if you wish the wish the wish wishes,i won't
wish the wish you wish to wish
i love java
i love mysql
i love linux
i love python
i love hadoop
hadoop hdfs mapreduce yarn hbase hive
these things are very
- upload a.txt to the test directory on HDFS:
hdfs dfs -put a.txt /test/a.txt
- run wordcount on a.txt:
hadoop jar testhdfs.jar cn.kgc.kb09.mr.WCDriver /test/a.txt /test/result
- inspect the output files:
hdfs dfs -cat /test/result/part-r-00000
hdfs dfs -cat /test/result/part-r-00001
hdfs dfs -cat /test/result/part-r-00002
hdfs dfs -cat /test/result/part-r-00003
Implementing joins with MapReduce
Map-side join
- a large file joined with a small file
Example:
COJoinMapper
import cn.kgc.kb09.join.CustomOrder;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
//Map-side join of the two files
public class COJoinMapper extends Mapper<LongWritable, Text, Text, CustomOrder> {
Map<String,String> map = new HashMap();
@Override
protected void setup(Context context) throws IOException, InterruptedException {
URI[] cacheFiles = context.getCacheFiles();
if (cacheFiles != null) {
String filePath = cacheFiles[0].getPath();
FileReader fr = new FileReader(filePath);
BufferedReader br = new BufferedReader(fr);
String line;
while ((line = br.readLine()) != null && !"".equals(line)) {
String[] columns = line.split(" ");
map.put(columns[0],columns[1]);
}
}
}
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] columns = line.split(" ");
CustomOrder co = new CustomOrder();
String orderId = columns[0];
String orderStatus = columns[2];
String custId = columns[3];
co.setCustomId(custId);
String custName = map.get(custId);
co.setCustomName(custName);
co.setOrderId(orderId);
co.setOrderStatus(orderStatus);
//To collect the customers that never matched an order, remove each matched id:
//map.remove(custId);
context.write(new Text(custId),co);
}
/* @Override
protected void cleanup(Context context) throws IOException, InterruptedException {
Set<String> keys = map.keySet();
for (String key : keys) {
CustomOrder co = new CustomOrder();
co.setCustomId(key);
co.setCustomName(map.get(key));
context.write(new Text(key),co);
}
}*/
}
COJoinDriver
import cn.kgc.kb09.join.CustomOrder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.net.URI;
//Driver for the map-side join
public class COJoinDriver {
public static void main(String[] args)throws Exception {
Job job = Job.getInstance(new Configuration(), "mapjoinJob");
job.setJarByClass(COJoinDriver.class);
job.setMapperClass(COJoinMapper.class);
job.setOutputKeyClass(Text.class);
job.setMapOutputValueClass(CustomOrder.class);
String inPath = "file:///D:/ideashuju/testhdfs/data/order.csv";
String outPath = "file:///D:/test/b";
String cachePath = "file:///D:/ideashuju/testhdfs/data/customers.csv";
job.addCacheFile(new URI(cachePath));
FileInputFormat.setInputPaths(job, new Path(inPath));
FileOutputFormat.setOutputPath(job, new Path(outPath));
boolean result = job.waitForCompletion(true);
System.out.println(result ? "success" : "failure");
System.exit(result?0:1);
}
}
Reduce-side join
Output:
Example:
CustomOrder
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class CustomOrder implements Writable {
private String customId;
private String customName;
private String orderId;
private String orderStatus;
private String tableFlag;//"0" marks a row from the customers table, "1" a row from the orders table
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(customId);
out.writeUTF(customName);
out.writeUTF(orderId);
out.writeUTF(orderStatus);
out.writeUTF(tableFlag);
}
@Override
public void readFields(DataInput in) throws IOException {
this.customId = in.readUTF();
this.customName = in.readUTF();
this.orderId = in.readUTF();
this.orderStatus = in.readUTF();
this.tableFlag = in.readUTF();
}
public String getCustomId() {
return customId;
}
public void setCustomId(String customId) {
this.customId = customId;
}
public String getCustomName() {
return customName;
}
public void setCustomName(String customName) {
this.customName = customName;
}
public String getOrderId() {
return orderId;
}
public void setOrderId(String orderId) {
this.orderId = orderId;
}
public String getOrderStatus() {
return orderStatus;
}
public void setOrderStatus(String orderStatus) {
this.orderStatus = orderStatus;
}
public String getTableFlag() {
return tableFlag;
}
public void setTableFlag(String tableFlag) {
this.tableFlag = tableFlag;
}
@Override
public String toString() {
return "customId='" + customId + '\'' +
", customName='" + customName + '\'' +
", orderId='" + orderId + '\'' +
", orderStatus='" + orderStatus + '\'';
}
}
COMapperJoin
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class COMapperJoin extends Mapper<LongWritable, Text,Text, CustomOrder> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] columns = line.split(",");
for (int i = 0; i < columns.length; i++) {
columns[i]=columns[i].split("\"")[1];
}
CustomOrder co = new CustomOrder();
if (columns.length == 4) {//orders row
co.setCustomId(columns[2]);
co.setCustomName("");
co.setOrderId(columns[0]);
co.setOrderStatus(columns[3]);
co.setTableFlag("1");
} else if (columns.length == 9) {//customers row
co.setCustomId(columns[0]);
co.setCustomName(columns[1] + "." + columns[2]);
co.setOrderId("");
co.setOrderStatus("");
co.setTableFlag("0");
}
context.write(new Text(co.getCustomId()), co);
//{1,{CustomOrder(1,xxx,,,0),CustomOrder(1,,20,closed,1)}}
}
}
COReducerJoin
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
public class COReducerJoin extends Reducer<Text, CustomOrder, CustomOrder, NullWritable> {
// List<CustomOrder> coList = new ArrayList<>();
@Override
protected void reduce(Text key, Iterable<CustomOrder> values, Context context) throws IOException, InterruptedException {
StringBuffer orderIds = new StringBuffer();
StringBuffer statuses = new StringBuffer();
CustomOrder customOrder = new CustomOrder();
for (CustomOrder co : values) {
if (co.getCustomName().equals("")) {
orderIds.append(co.getOrderId() + "|");
statuses.append(co.getOrderStatus() + "|");
} else {
customOrder.setCustomId(co.getCustomId());
customOrder.setCustomName(co.getCustomName());
}
}
String orderId = "";
String status = "";
if(orderIds.length()>0) {
orderId = orderIds.substring(0, orderIds.length() - 1);
}
if(statuses.length()>0) {
status = statuses.substring(0, statuses.length() - 1);
}
customOrder.setOrderId(orderId);
customOrder.setOrderStatus(status);
context.write(customOrder, NullWritable.get());
}
}
CODriver
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class CODriver {
public static void main(String[] args) throws Exception {
Configuration cfg = new Configuration();
Job job = Job.getInstance(cfg, "co_job");
job.setJarByClass(CODriver.class);
job.setMapperClass(COMapperJoin.class);
job.setReducerClass(COReducerJoin.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(CustomOrder.class);
job.setOutputKeyClass(CustomOrder.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.setInputPaths(job, new Path("file:///D:/ideashuju/testhdfs/data"));
FileOutputFormat.setOutputPath(job, new Path("file:///D:/test/coResult"));
boolean result = job.waitForCompletion(true);
System.out.println(result ? "success" : "failure");
System.exit(result?0:1);
}
}
Output: