《Hadoop实战》(Hadoop in Action) Reading Notes=========Chaining MapReduce Jobs
1. Linear MapReduce job flow: write the driver so that each job is submitted only after the previous job has finished, and set each job's input path to the previous job's output path.
Pros: simple and intuitive.
Cons: hard to express non-linear job flows (e.g., job3 needing the combined outputs of job1 and job2 as its input).
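A minimal driver sketch of this linear pattern (the class name, paths, and the elided mapper/reducer setup are illustrative placeholders, not code from the book):

```java
// Sketch of a linear two-job chain: job2 is submitted only after job1
// succeeds, and reads job1's output directory as its input.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LinearChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path temp = new Path(args[1]);   // job1's output = job2's input
        Path output = new Path(args[2]);

        Job job1 = new Job(conf, "job1");
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, temp);
        // ... set jar/mapper/reducer classes for job1 ...
        if (!job1.waitForCompletion(true)) { // block until job1 finishes
            System.exit(1);                  // abort the chain on failure
        }

        Job job2 = new Job(conf, "job2");
        FileInputFormat.addInputPath(job2, temp); // chained to job1's output
        FileOutputFormat.setOutputPath(job2, output);
        // ... set jar/mapper/reducer classes for job2 ...
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```

The blocking `waitForCompletion(true)` call is what makes the chain linear: control does not reach job2's submission until job1 has terminated.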
A technique you may need for filtering records by their source file inside a mapper:
FileSplit fs = (FileSplit) context.getInputSplit(); // getInputSplit() returns an InputSplit, so a cast is required
fs.getPath().getName().contains("");
fs.getPath().getName().startsWith("");
fs.getPath().getName().endsWith("");
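Those checks are ordinary java.lang.String methods, so the filtering logic can be exercised stand-alone. The helper below is a made-up illustration (the method names and file-name prefixes are assumptions, not from the book):

```java
// Stand-alone illustration of filtering by input-file name.
// In a real mapper the name would come from
// ((FileSplit) context.getInputSplit()).getPath().getName();
// here we just exercise the same String tests on literals.
public class SplitNameFilter {
    // true if the file name carries a hypothetical "job1-" prefix,
    // e.g. to recognize records that came from job1's output
    public static boolean isFromJob1(String fileName) {
        return fileName.startsWith("job1-");
    }

    // true for standard reducer output files such as part-r-00000
    public static boolean isReduceOutput(String fileName) {
        return fileName.startsWith("part-r-");
    }

    public static void main(String[] args) {
        System.out.println(isFromJob1("job1-part-r-00000")); // true
        System.out.println(isReduceOutput("part-r-00000"));  // true
        System.out.println(isFromJob1("part-r-00000"));      // false
    }
}
```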
2. Complex MapReduce flows: for complex, non-linear job flows, the MapReduce API provides the ControlledJob and JobControl classes. The concrete steps:
1. Configure each job as usual:
Configuration conf = new Configuration();
// configure job1
Job job1 = new Job(conf);
job1.setJarByClass();
job1.setMapperClass();
job1.setReducerClass();
job1.setMapOutputKeyClass();
job1.setMapOutputValueClass();
// configure job2
Job job2 = new Job(conf);
job2.setJarByClass();
job2.setMapperClass();
job2.setReducerClass();
job2.setMapOutputKeyClass();
job2.setMapOutputValueClass();
2. Wrap each Job in a ControlledJob object:
ControlledJob cJob1 = new ControlledJob(conf);
cJob1.setJob(job1);
ControlledJob cJob2 = new ControlledJob(conf);
cJob2.setJob(job2);
3. Add the dependency relationships:
/***
 * Set the dependencies between the jobs.
 * As written below, this means job2 will
 * start only after job1 has completed.
 * **/
cJob2.addDependingJob(cJob1);
4. Instantiate a JobControl object, register every Job with its addJob() method, and start its run() method in a separate thread:
JobControl jc = new JobControl("jobChain"); // the constructor takes an arbitrary group name
jc.addJob(cJob1);
jc.addJob(cJob2);
Thread jcThread = new Thread(jc); // JobControl implements Runnable
jcThread.start();
while (true) {
    if (jc.allFinished()) {
        System.out.println(jc.getSuccessfulJobList());
        jc.stop();
        return 0;
    }
    if (jc.getFailedJobList().size() > 0) {
        System.out.println(jc.getFailedJobList());
        jc.stop();
        return 1;
    }
    try {
        Thread.sleep(500); // avoid a busy-wait while polling
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}
3. Adding pre- and post-processing steps to a Job
This is done with the ChainMapper and ChainReducer classes (used through their static methods) in the org.apache.hadoop.mapred.lib package. This approach produces a single, self-contained Job rather than a job flow: disk I/O happens only at the Job's overall input and output, and the MapReduce framework automatically wires the intermediate outputs between the chained stages.
Configuration conf = new Configuration();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
......
// add Map1
JobConf map1conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class, LongWritable.class, Text.class, Text.class, Text.class, true, map1conf);
// add the Reducer
JobConf reduceconf = new JobConf(false);
ChainReducer.setReducer(job, Reduce.class, LongWritable.class, Text.class, Text.class, Text.class, true, reduceconf);
// add Map2
JobConf map2conf = new JobConf(false);
ChainReducer.addMapper(job, Map2.class, LongWritable.class, Text.class, Text.class, Text.class, true, map2conf);
JobClient.runJob(job); // JobConf has no waitForCompletion(); the old API submits through JobClient
Note that, as of this writing, ChainMapper and ChainReducer only support the old API, so the Map and Reduce stages must be static classes implementing the org.apache.hadoop.mapred.Mapper or org.apache.hadoop.mapred.Reducer interface.
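A sketch of what such an old-API static class could look like (Map1 here is a hypothetical pass-through stage written for illustration, not the book's code):

```java
// Hypothetical old-API mapper suitable for ChainMapper.addMapper():
// a static class implementing org.apache.hadoop.mapred.Mapper with
// the LongWritable/Text -> Text/Text types used in the chain above.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ChainStages {
    // MapReduceBase supplies empty configure()/close() implementations
    public static class Map1 extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // trivial example stage: emit the input line keyed by itself
            output.collect(value, value);
        }
    }
}
```

Because the class is static, the framework can instantiate it by reflection without an enclosing instance, which is why the chain classes require this shape.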