30.MapReduce
A MapReduce job runs in two processing phases: the map phase and the reduce phase. Each phase takes key-value pairs as its input and output, and the developer chooses their types.
The input to the map phase is the raw NCDC data. We use the text input format, which hands each line of the dataset to the mapper as a text value.
1.Writing the MR program
【Create the Mapper】
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Extracts (year, air temperature) pairs from NCDC fixed-width records.
 */
public class MyMaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // the value is one line of the input file
        String line = value.toString();
        // extract the year (columns 15-18)
        String year = line.substring(15, 19);
        // extract the air temperature; a leading '+' must be skipped,
        // since Integer.parseInt() on older JDKs rejects a plus sign
        int airTemperature;
        if (line.charAt(87) == '+') {
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        // quality code (column 92)
        String quality = line.substring(92, 93);
        // emit only valid, non-missing readings
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
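As a quick sanity check of the extraction logic outside Hadoop, the same substring offsets can be exercised against a made-up record. This is only a sketch: the class name and the record are hypothetical, and only the year, sign, temperature, and quality columns carry meaning; everything else is padding.

public class ParseDemo {
    public static void main(String[] args) {
        // build a 93-character dummy record: year "1950" in columns 15-18,
        // sign '+' in column 87, temperature "0022" in columns 88-91,
        // quality flag '1' in column 92
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 93; i++) {
            sb.append('0');
        }
        sb.replace(15, 19, "1950");
        sb.setCharAt(87, '+');
        sb.replace(88, 92, "0022");
        sb.setCharAt(92, '1');
        String line = sb.toString();

        // same extraction logic as MyMaxTempMapper.map()
        String year = line.substring(15, 19);
        int airTemperature = Integer.parseInt(line.substring(88, 92));
        String quality = line.substring(92, 93);
        System.out.println(year + " -> " + airTemperature + " (quality " + quality + ")");
        // prints: 1950 -> 22 (quality 1)
    }
}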
【Create the Reducer】
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Reduces each year's temperatures to the annual maximum.
 */
public class MyMaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // start from the smallest possible value
        int maxValue = Integer.MIN_VALUE;
        // scan all temperatures for this year and keep the maximum
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        // write out (year, annual maximum temperature)
        context.write(key, new IntWritable(maxValue));
    }
}
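To make the reduce step concrete: the framework sorts and groups the map output by key before calling reduce(), so each call receives one year together with all of that year's valid temperatures. With made-up values:

(1949, [111, 78])    ->  (1949, 111)
(1950, [0, 22, -11]) ->  (1950, 22)

Each reduce() invocation scans one such list and writes a single maximum.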
【Create the application to run the job】
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyMaxTempApp {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MyMaxTempApp <input path> <output path>");
            System.exit(1);
        }
        Job job = Job.getInstance();
        job.setJarByClass(MyMaxTempApp.class);
        // set the job name
        job.setJobName("Max temp");
        // input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // output path (must not exist yet)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // set the mapper class
        job.setMapperClass(MyMaxTempMapper.class);
        // set the reducer class
        job.setReducerClass(MyMaxTempReducer.class);
        // output key type
        job.setOutputKeyClass(Text.class);
        // output value type
        job.setOutputValueClass(IntWritable.class);
        // run the job, exiting with 0 on success and 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
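One detail worth noting: setOutputKeyClass()/setOutputValueClass() declare the reduce output types, and Hadoop assumes the map output uses the same types unless told otherwise. Here both phases emit Text/IntWritable, so nothing more is needed; if the mapper emitted different types, they would be declared separately with the standard Job API calls:

// only needed when the map output types differ from the reduce output types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);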
31.Analysis of the job submission process
【Programming model】
map (mapping) + reduce (reduction)
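In type terms the model is:

map:    (K1, V1)       -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)

For the job above: K1 = LongWritable (line offset), V1 = Text (record line), K2 = K3 = Text (year), V2 = V3 = IntWritable (temperature).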
【Call flow】
1.job.waitForCompletion()
2.submit(): submits the job to the cluster and waits for completion
a)ensureState(JobState.DEFINE): ensures the job is still in the DEFINE state
b)setUseNewAPI(): switches to the new API
c)connect(): creates the Cluster object
d)creates a JobSubmitter
3.submitter.submitJobInternal(Job.this, cluster)
a)checkSpecs(job): checks the output specification and throws an exception if the output directory already exists
b)JobSubmissionFiles.getStagingDir(): creates the staging (temporary) directory on HDFS
c)InetAddress.getLocalHost(): obtains the local IP address
d)submitClient.getNewJobID(): creates the job ID
e)copyAndConfigureFiles(): applies the command-line arguments to the conf
f)writeSplits(job, submitJobDir): generates the split files in the staging directory
g)conf.setInt(MRJobConfig.NUM_MAPS, maps): sets the number of map tasks
h)writeConf(conf, submitJobFile): writes job.xml to the submission directory
i)submitClient.submitJob(): submits the job through the runner (step 4)
4.submitClient.submitJob() submits the job through the runner
a)Job job = new Job(): creates a LocalJobRunner.Job inner-class object
5.Job job = new Job()
a)creates a JobConf from the job.xml in the staging directory
b)this.start(): starts the job thread, i.e. invokes the run() method
6.this.start()
a)TaskSplitMetaInfo[]: reads the task split metadata
b)getMapTaskRunnables(): obtains the runnables for the map tasks
c)runTasks(mapRunnables, mapService, "map")
d)getReduceTaskRunnables(): obtains the runnables for the reduce tasks
e)runTasks(reduceRunnables, reduceService, "reduce")
7.runTasks() (see the sketch after this list)
for (Runnable r : runnables) {
    service.submit(r);
}
8.LocalJobRunner$Job$MapTaskRunnable
a)creates the map task's TaskAttemptID
b)creates a MapTask
c)creates the MapOutputFile
d)map.setXXX(): sets the task's properties
e)map.run()
9.org.apache.hadoop.mapred.MapTask$run()
a)runNewMapper()
10.runNewMapper()
a)creates the taskContext
b)taskContext.getMapperClass(): obtains the Mapper instance via reflection
c)creates the InputFormat
d)creates the split
e)creates the NewOutputCollector, i.e. the context object
11.mapper.run(mapperContext)
12.MyMaxTempMapper$run() (run() is inherited from Mapper)
setup(context);
try {
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
} finally {
    cleanup(context);
}
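Conceptually, runTasks() in step 7 drains the task runnables through a thread pool and waits for the phase to finish. A minimal sketch of that submit-and-await pattern follows; it is simplified, and the real LocalJobRunner adds error tracking and progress reporting.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

public class RunTasksSketch {
    // simplified form of LocalJobRunner's per-phase task execution
    static void runTasks(List<Runnable> runnables, ExecutorService service,
                         String taskType) throws InterruptedException {
        for (Runnable r : runnables) {
            service.submit(r);  // each map/reduce task runs on a pool thread
        }
        service.shutdown();     // no further tasks for this phase
        // block until every submitted task of this phase has finished
        service.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
    }
}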
【Submitting the job to the cluster】
After compiling and packaging the program, the job is submitted on the cluster with
hadoop jar jarFile classname arg1 arg2 ..
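For example, with a hypothetical jar name and HDFS paths:

hadoop jar mymaxtemp.jar MyMaxTempApp /user/hadoop/ncdc /user/hadoop/out

The second argument (the output path) must not exist before the job runs; checkSpecs() in step 3 enforces this.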