MapReduce execution flow:
(1) The client submits the application to the YARN master node (ResourceManager, RM):
bin/yarn jar MainClass args
(2) The RM starts a Container on some NodeManager (NM) node to run the AppMaster, the manager of the application.
(3) The AppMaster requests resources from the RM in order to run all the Tasks of the MapReduce job; the RM decides which NMs will supply the resources and tells the AppMaster.
(4) The AppMaster contacts those NMs and launches the Tasks (Map Tasks and Reduce Tasks) inside Containers.
(5) Running Tasks report their status to the AppMaster in real time; this is how the whole application is monitored.
(6) When all Tasks (including the Reduce Tasks) have finished, the AppMaster notifies the RM and the AppMaster is torn down.
(7) The RM returns the result to the client.
Key point:
Container: isolates resources (CPU and memory) so that a single Task can use them exclusively.
An Alibaba Cloud service instance is essentially such a container.
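As a sketch of how container sizes are requested, the standard MapReduce properties below set the memory each Map/Reduce Task container asks YARN for (the values here are only illustrative):

```xml
<!-- mapred-site.xml: per-task container resource requests (illustrative values) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value> <!-- memory for each Map Task container -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value> <!-- memory for each Reduce Task container -->
</property>
```

The RM will only grant containers that fit within what each NM advertises, so these requests are bounded by the NM's own resource configuration.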
MapReduce principles:
MapReduce programming
Project import
MapReduce data-processing flow
Throughout a MapReduce program, data flows between stages as key-value pairs.
Input → Map → Shuffle → Reduce → Output
(1) For Input and Output, normally no code needs to be written; just point the job at the right directories.
(2) The core work to focus on is map and reduce.
MapReduce execution stages
Input stage
Input: reads the data from HDFS.
Output (the key is the byte offset at which each line starts):
key     value
0       hadoop java spring springmvc
29      java spring java
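The input keys above can be reproduced outside Hadoop: each line's key is the byte offset where the line begins, i.e. the previous offset plus the previous line's length plus one byte for the newline. A minimal sketch (plain Java, not the Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetDemo {
    // TextInputFormat keys each line by the byte offset at which the line starts
    static List<Integer> lineOffsets(String fileContent) {
        List<Integer> offsets = new ArrayList<>();
        int offset = 0;
        for (String line : fileContent.split("\n")) {
            offsets.add(offset);
            offset += line.length() + 1; // +1 for the '\n' that ends the line
        }
        return offsets;
    }

    public static void main(String[] args) {
        String file = "hadoop java spring springmvc\njava spring java\n";
        int i = 0;
        for (String line : file.split("\n")) {
            System.out.println(lineOffsets(file).get(i++) + "\t" + line);
        }
    }
}
```

This assumes Unix `\n` line endings; with `\r\n` the offsets shift by one more byte per line.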
Mapper stage
class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
<input key, input value, output key, output value>
<line offset, line content, XX, YY>
protected void map(KEYIN key, VALUEIN value, Context context)
What map does:
splits each line on spaces and extracts the individual words.
Output:
key        value
hadoop     1
java       1
spring     1
springmvc  1
java       1
spring     1
java       1
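The map step above can be sketched as plain Java (a toy in-memory stand-in for the Mapper, not the Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

public class MapStage {
    // Emit a (word, 1) pair for every space-separated word in a line
    static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Entry<String, Integer> kv : map("hadoop java spring springmvc")) {
            System.out.println(kv.getKey() + "\t" + kv.getValue());
        }
    }
}
```

Note that map emits one pair per occurrence; it never aggregates. Summing is left entirely to the reduce side.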
Shuffle stage
What it does:
Partitioning
Grouping: collects all the values that share the same key into one collection.
Sorting: keys are sorted in dictionary order.
Output:
key        value
hadoop     {1}
java       {1,1,1}
spring     {1,1}
springmvc  {1}
Reduce stage
class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
<word, 1, word, count>
void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
What it does: takes the values out of each key's collection and adds them up.
Output:
key        value
word       count
hadoop     1
java       3
spring     2
springmvc  1
Output stage
Input:
key        value
word       count
hadoop     1
java       3
spring     2
springmvc  1
Output: writes the content to files on HDFS.
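The shuffle and reduce steps can be sketched together in plain Java (a toy in-memory model; the real shuffle runs across the network with sorting and partitioning handled by the framework):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleReduceStage {
    // Shuffle: group the 1s by word; a TreeMap keeps keys in dictionary order
    static Map<String, List<Integer>> shuffle(List<String> mapOutputWords) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mapOutputWords) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Reduce: sum each word's collection of 1s to get its frequency
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = List.of("hadoop", "java", "spring", "springmvc",
                                     "java", "spring", "java");
        System.out.println(reduce(shuffle(words)));
    }
}
```

Running this prints the grouped-and-summed counts in sorted key order, mirroring the tables above.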
The full program:
package com.huadian.bigdata.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.Arrays;

/**
 * @author 徐苗
 * @create 2019-07-02 18:13
 */
public class WordCountMapReduce {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        System.out.println(args.length);
        System.out.println(Arrays.toString(args)); // print the arguments themselves, not the array reference

        // 1. Read the configuration files
        Configuration configuration = new Configuration();

        // 2. Create the Job
        //    Job getInstance(Configuration conf, String jobName)
        Job job = Job.getInstance(configuration, "WordCountMapReduce");
        // Set the main class of the job
        job.setJarByClass(WordCountMapReduce.class);

        // 3. Configure the job
        // 3.1 input
        Path inputPath = new Path(args[0]);
        FileInputFormat.setInputPaths(job, inputPath);

        // 3.2 map
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 3.3 shuffle (the framework defaults are used here)

        // 3.4 reduce
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 3.5 output
        Path outputPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputPath);

        // 4. Submit the job and wait for it to run,
        //    printing the progress to the user
        boolean isSuccess = job.waitForCompletion(true);
        System.exit(isSuccess ? 0 : 1);
    }

    /*
     * map method
     * KeyIn:    type of the input key
     *           the byte offset of the line in the file, represented as a Long
     * ValueIn:  type of the input value
     *           the content of one line of text, represented as a String
     * KeyOut:   type of the output key: the word
     * ValueOut: type of the output value: the count for that word
     */
    private static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private Text mapOutKey = new Text();
        private final static IntWritable mapOutValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Turn the line content into individual words
            String row = value.toString(); // line content
            String[] strs = row.split(" ");
            for (String str : strs) {
                mapOutKey.set(str);
                // Emit the map result through the context
                context.write(mapOutKey, mapOutValue);
            }
        }
    }

    private static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable outputValue = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            // Add up the values in the collection
            for (IntWritable value : values) {
                sum += value.get();
            }
            outputValue.set(sum);
            context.write(key, outputValue);
        }
    }
}
Package the program:
mvn package
Upload the jar to the Hadoop node and run it to test:
bin/yarn jar hadoop-1.0-SNAPSHOT.jar com.huadian.bigdata.mapreduce.WordCountMapReduce /datas/tmp /datas/mapreduce/output5