The book's explanation here is not very clear or thorough. Below is a solution from StackOverflow that I think is quite good:

(1) Cascading jobs

Create the JobConf object "job1" for the first job and set all the parameters, with "input" as the input directory and "temp" as the output directory. Execute this job:

- JobClient.runJob(job1);

Immediately below it, create the JobConf object "job2" for the second job and set all the parameters, with "temp" as the input directory and "output" as the output directory. Execute this job:

- JobClient.runJob(job2);

(2) Create two JobConf objects and set all the parameters in them just like in (1), except that you don't use JobClient.runJob. Then create two Job objects with the jobconfs as parameters:

- Job job1=new Job(jobconf1);
- Job job2=new Job(jobconf2);

Using the JobControl object, you specify the job dependencies and then run the jobs:

- JobControl jbcntrl=new JobControl("jbcntrl");
- jbcntrl.addJob(job1);
- jbcntrl.addJob(job2);
- job2.addDependingJob(job1);
- jbcntrl.run();

(3) If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that come with Hadoop version 0.19 and onwards. Note that in this case you can use only one reducer, but any number of mappers before or after it.

Below I expand on this answer. (Note: the "original text" labels below are only for highlighting in this diary and are not verbatim quotations; my apologies to the author and readers, and please point out any copyright issues.)
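The "set all the parameters" step glossed over in (1) might look like the following sketch against the old org.apache.hadoop.mapred API. Note this is an illustration under assumptions, not the answerer's actual code: the class names CascadeDriver, MyMapper1/MyReducer1, MyMapper2/MyReducer2 are placeholders for your own classes.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CascadeDriver {
    public static void main(String[] args) throws Exception {
        // First job: reads "input", writes intermediate results to "temp"
        JobConf job1 = new JobConf(CascadeDriver.class);
        job1.setJobName("step1");
        job1.setMapperClass(MyMapper1.class);     // placeholder mapper
        job1.setReducerClass(MyReducer1.class);   // placeholder reducer
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job1, new Path("input"));
        FileOutputFormat.setOutputPath(job1, new Path("temp"));
        JobClient.runJob(job1);   // blocks until job1 completes

        // Second job: reads "temp", writes final results to "output"
        JobConf job2 = new JobConf(CascadeDriver.class);
        job2.setJobName("step2");
        job2.setMapperClass(MyMapper2.class);
        job2.setReducerClass(MyReducer2.class);
        FileInputFormat.setInputPaths(job2, new Path("temp"));
        FileOutputFormat.setOutputPath(job2, new Path("output"));
        JobClient.runJob(job2);
    }
}
```

The only coupling between the two jobs is the "temp" path: job1's output directory is job2's input directory.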
Chaining the Maps and Reduces of MapReduce jobs can involve several kinds of relationships. Following the book, we mainly consider:

A. Sequential cascading, similar to a Unix/Linux pipe: Map1/Reduce1 -> Map2/Reduce2 -> ...
B. More complex dependencies, e.g. Map1/Reduce1 && Map2/Reduce2 -> Map3/Reduce3, an inverted tree structure;
C. Combining/chaining with pre- and post-processing stages: Map1 -> Map2 -> Map3 -> Reduce -> Map4 -> Map5

Case A is simple, because Hadoop already provides the JobClient class; we only need to build multiple jobs. Be careful to use JobClient's runJob(JobConf) method and not accidentally submitJob(JobConf). The documentation says:

runJob(JobConf): submits the job and returns only after the job has completed.
submitJob(JobConf): only submits the job, then poll the returned handle to the RunningJob to query status and make scheduling decisions.

runJob waits until the job finishes before returning, so we can write concise code like the following:

- /* some configuration code for jobconf1 goes here */
- JobClient.runJob(jobconf1);
- // jobconf1's output directory serves as jobconf2's input directory
- /* some configuration code for jobconf2 goes here */
- JobClient.runJob(jobconf2);
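If you do use submitJob, the polling becomes your responsibility. A minimal sketch of what that looks like, assuming jobconf1 is an already-configured JobConf:

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Assumes jobconf1 is an already-configured JobConf
JobClient jc = new JobClient(jobconf1);
RunningJob rj = jc.submitJob(jobconf1);   // returns immediately
while (!rj.isComplete()) {                // poll the RunningJob handle ourselves
    Thread.sleep(5000);
}
if (!rj.isSuccessful()) {
    throw new RuntimeException("job failed: " + rj.getJobName());
}
```

For plain sequential cascading this buys you nothing over runJob, which is why the text above recommends runJob for case A; submitJob only pays off when you want to make scheduling decisions while jobs are in flight.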
(Note: the figure that belonged here, a job-dependency graph for case B, comes from http://www.cnblogs.com/xuqiang/archive/2011/06/05/2073155.html; many thanks.) Tangled dependencies like this are exactly where JobControl comes into its own; see the code below:
- Job job1=new Job(jobconf1);
- Job job2=new Job(jobconf2);
- Job job3=new Job(jobconf3);
- Job job4=new Job(jobconf4);
- Job job5=new Job(jobconf5);
- JobControl jbcntrl=new JobControl("MyJobCtrl"); // takes a string as the group name
- jbcntrl.addJob(job1);
- jbcntrl.addJob(job2);
- jbcntrl.addJob(job3);
- jbcntrl.addJob(job4);
- jbcntrl.addJob(job5);
- job2.addDependingJob(job1); // job2 depends on job1: job2 will not start until job1 has completed
- job4.addDependingJob(job3);
- job5.addDependingJob(job2);
- job5.addDependingJob(job4);
- jbcntrl.run();
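One caveat the quoted answer skips over: JobControl's run() polls its jobs in a loop and does not return on its own, so the usual pattern is to run it in a background thread (JobControl implements Runnable), wait for allFinished(), and then call stop(). A sketch, assuming jbcntrl is the JobControl built above:

```java
// Assumes jbcntrl is the JobControl configured above
Thread theController = new Thread(jbcntrl); // JobControl implements Runnable
theController.start();
while (!jbcntrl.allFinished()) {            // true once every job succeeded or failed
    Thread.sleep(1000);
}
System.out.println("failed jobs: " + jbcntrl.getFailedJobs());
jbcntrl.stop();                             // ends the polling loop
```

Checking getFailedJobs() matters because a job whose dependency failed is never launched; without the check the driver exits silently with missing output.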
Chaining Mappers and Reducers
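The Map+ | Reduce | Map* structure from (3), our case C, is built by registering every stage on a single JobConf, so the whole chain runs as one MapReduce job. A sketch against the old 0.19-era mapred API; Map1, Map2, Reduce1, Map3, and ChainDriver are placeholder classes, and the key/value types are assumptions for illustration:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

JobConf conf = new JobConf(ChainDriver.class);
conf.setJobName("chain");

// Mappers that run before the (single) reducer: Map1 -> Map2
ChainMapper.addMapper(conf, Map1.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));
ChainMapper.addMapper(conf, Map2.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

// Exactly one reducer is allowed in the chain
ChainReducer.setReducer(conf, Reduce1.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

// Mappers that run after the reducer, still inside the same job
ChainReducer.addMapper(conf, Map3.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));
```

The output types of each stage must match the input types of the next, and the boolean argument (byValue) controls whether key/value objects are copied between stages or passed by reference.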