The MapReduce programming model: map + reduce.
Programming pattern
----------------
extends Mapper {
    map() {
        ...
    }
}
extends Reducer {
    reduce() {
        ...
    }
}
Partitioning
-----------------
Work done on the Map side: the key-value pairs produced by the mapper are assigned to groups in advance. The number of groups equals the number of reducers; this step is completed before the shuffle.
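The grouping rule can be sketched in plain Java with no Hadoop dependency. Hadoop's default HashPartitioner uses the formula below (hash the key, clear the sign bit, take it modulo the reducer count), which is why the group count is tied to the reducer count:

```java
public class PartitionDemo {
    // Same formula as Hadoop's default HashPartitioner:
    // hash the key, clear the sign bit, modulo the number of reducers.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition, so every
        // ("tom", 1) pair is routed to the same reducer.
        System.out.println(partitionFor("tom", 2));
        System.out.println(partitionFor("tom", 2));
    }
}
```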
shuffle
----------------
Distributes data between the Map and Reduce stages.
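What the shuffle delivers to a reducer can be sketched with plain Java collections (a simplification: the real shuffle also spills to disk and merges sorted runs). For the example file above, all of a word's 1s end up grouped under that word, and keys arrive in sorted order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ShuffleDemo {
    public static void main(String[] args) {
        // Map output for "hello world tom" / "tom world hello":
        String[] mapOut = {"hello", "world", "tom", "tom", "world", "hello"};
        // The shuffle groups values by key and presents keys in sorted
        // order, so each reduce() call sees one key plus all its values.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mapOut) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        System.out.println(grouped); // prints {hello=[1, 1], tom=[1, 1], world=[1, 1]}
    }
}
```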
Writing an MR job
----------------
1. Write the Mapper
package mr;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
 * Mapper: the input key is the byte offset of the line, the value is the line text.
 * Take [a.txt] as an example, with the contents:
 * hello world tom ----->0:hello world tom
 * tom world hello ----->17:tom world hello
 * (the first line ends with a carriage return plus a line feed, one character each)
 */
public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Text keyOut = new Text();
        IntWritable valueOut = new IntWritable();
        String[] arr = value.toString().split(" ");
        for (String s : arr) {
            keyOut.set(s);      // the word itself becomes the output key
            valueOut.set(1);    // every occurrence counts as 1
            context.write(keyOut, valueOut);
        }
    }
}
2. Write the Reducer
package mr;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    /**
     * Sums the counts collected for each word.
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable iw : values) {
            count += iw.get();
        }
        context.write(key, new IntWritable(count));
    }
}
3. Write the Job
package mr;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WCApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Local debugging on Windows: use the local filesystem
        conf.set("fs.defaultFS", "file:///");
        // If the output path already exists, delete it first
        if (args.length > 1) {
            FileSystem.get(conf).delete(new Path(args[1]), true);
        }
        Job job = Job.getInstance(conf);
        // Configure the job
        job.setJobName("WCApp");                        // job name
        job.setJarByClass(WCApp.class);                 // class used to locate the jar
        job.setInputFormatClass(TextInputFormat.class); // input format
        FileInputFormat.addInputPath(job, new Path(args[0]));   // add input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // set output path
        job.setMapperClass(WCMapper.class);             // mapper class
        job.setReducerClass(WCReducer.class);           // reducer class
        job.setNumReduceTasks(1);                       // number of reducers
        job.setMapOutputKeyClass(Text.class);           // map-side output key type
        job.setMapOutputValueClass(IntWritable.class);  // map-side output value type
        job.setOutputKeyClass(Text.class);              // reduce-side output key type
        job.setOutputValueClass(IntWritable.class);     // reduce-side output value type
        job.waitForCompletion(true);
    }
}
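With a.txt as input, the three classes above produce one line per word in the output file. The end-to-end counting logic can be checked with a plain-Java sketch that collapses map, shuffle, and reduce into one grouping pass (a model of the pipeline, not Hadoop code):

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountCheck {
    public static void main(String[] args) {
        String[] lines = {"hello world tom", "tom world hello"};
        // map (split into words) + shuffle (group by key, sorted)
        // + reduce (sum the 1s) collapsed into a single merge pass
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        // Same shape as the job's output: key, tab, count, in key order
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```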
How an MR job runs in Local mode
-------------------------
1. Create the client-side Job (mapreduce.Job) and set its configuration.
2. The JobSubmitter writes job.xml, the split metadata, and related files to a staging directory.
3. The JobSubmitter submits the job to the LocalJobRunner.
4. The LocalJobRunner converts the client-side Job into an internal Job.
5. The internal Job thread spawns a separate thread to execute the job.
6. The job-execution thread computes the Map and Reduce task descriptions and hatches new threads from a thread pool to run the MR tasks.
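Which runner is used is decided by configuration. A minimal sketch of selecting local mode explicitly (assuming the Hadoop client libraries are on the classpath; `local` is also the default when no cluster configuration is present):

```java
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "file:///");            // read/write the local filesystem
conf.set("mapreduce.framework.name", "local");   // use LocalJobRunner instead of YARN
```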