Hadoop Source Code Analysis: Job Submission

执于代码

Last modified 2022-06-06 18:26:29

Category: hadoop. Tags: hadoop, big data, hdfs

First published 2022-06-06 18:01:57

Original link: http://www.atguigu.com/jsjj/34366.html

Hadoop Source Code Analysis - Yarn Source Code Analysis.doc

Contents
Relevant code:
Summary:
References and recommended reading


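5.1 Job submission flow and split source code in detail

1) Job submission flow source code in detail
waitForCompletion()
submit();
// 1 Establish the connection
connect();
// 1) Create the proxy used to submit the Job
new Cluster(getConfiguration());
// (1) Determine whether the runtime is local or a YARN cluster
initialize(jobTrackAddr, conf);
// 2 Submit the job
submitter.submitJobInternal(Job.this, cluster);
// 1) Create the staging path used to submit data to the cluster
Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
// 2) Get the job ID and create the Job path
JobID jobId = submitClient.getNewJobID();
// 3) Copy the jar to the cluster
copyAndConfigureFiles(job, submitJobDir);
    rUploader.uploadFiles(job, jobSubmitDir);
// 4) Compute the splits and generate the split plan file
writeSplits(job, submitJobDir);
    maps = writeNewSplits(job, jobSubmitDir);
    input.getSplits(job);
// 5) Write the XML configuration file to the staging path
writeConf(conf, submitJobFile);
    conf.writeXml(out);
// 6) Submit the Job and return the submission status
status = submitClient.submitJob(jobId, submitJobDir.toString(),
    job.getCredentials());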

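1. job.submit();
2. Submission goes through the Cluster proxy: either to YARN, or to the local simulator in which the MR program runs.
3. The proxy holds the staging directory as a member: stagingDir, file://…/staging
4. The job ID is appended to it, giving file://staging/jobid
5. FileInputFormat.getSplits() is called to obtain the split plan, which is serialized to a file:
job.split is written under file://staging/jobid
6. If the runner is YarnRunner, the Job's jar is also needed: xxx.jar goes to hdfs://staging/jobid/job.jar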
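The staging directory contents described in the steps above can be modelled in a few lines of plain Java. This is only an illustrative sketch, not Hadoop's JobSubmitter: the class StagingLayout, the method filesFor, and the sample directory and job ID are hypothetical. The file name job.splitmetainfo is not mentioned in this article; it is included on the assumption that it follows Hadoop's usual naming alongside job.split, job.xml and job.jar.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; it does not call any Hadoop API. */
class StagingLayout {

    /**
     * Lists the files a client would place under stagingDir/jobId,
     * in the order used by the submission steps above.
     * stagingDir and jobId are caller-supplied placeholders.
     */
    static List<String> filesFor(String stagingDir, String jobId, boolean onYarn) {
        String jobDir = stagingDir + "/" + jobId;
        List<String> files = new ArrayList<>();
        if (onYarn) {
            files.add(jobDir + "/job.jar");        // step 3: jar copied only for a cluster run
        }
        files.add(jobDir + "/job.split");          // step 4: serialized split plan
        files.add(jobDir + "/job.splitmetainfo");  // step 4: split metadata (assumed name)
        files.add(jobDir + "/job.xml");            // step 5: output of conf.writeXml(out)
        return files;
    }

    public static void main(String[] args) {
        // Hypothetical local staging directory and job ID.
        for (String f : filesFor("file:///tmp/staging", "job_local_0001", false)) {
            System.out.println(f);
        }
    }
}
```

Passing onYarn = true adds the job.jar entry, which mirrors step 6: the jar is only needed when the job is handed to a cluster.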

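FileInputFormat split source code analysis:
(1) The program first locates the directory where your data is stored.
(2) It then iterates over (plans) every file in that directory.
(3) It processes the first file, ss.txt:
(a) Get the file size: fs.sizeof(ss.txt)
(b) Compute the split size: computeSplitSize(Math.max(minSize, Math.min(maxSize, blocksize))) = blocksize = 128M
(c) By default, split size = blocksize
(d) Start cutting: the first split is 0:128M, the second is 128:256M, and the third runs from 256M to the end of the file.
Before each cut, check whether the remaining part is larger than 1.1 times the split size (the block size by default); if it is not, the whole remainder becomes a single split.
(e) Write the split information into a split plan file.
(f) The core of the splitting process is carried out in the getSplits() method.
(g) An InputSplit only records the split's metadata, such as the start offset, the length, and the list of nodes that hold it.
(4) The split plan file is submitted to YARN, and the MrAppMaster on YARN uses it to work out how many MapTasks to start.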
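The cutting loop in step (3) can be sketched in plain Java. This is a simplified model under stated assumptions, not the FileInputFormat source itself: the class SplitPlanner and the method planSplits are hypothetical names, the 300 MB file length is only an example value, and block locations are ignored. The 1.1 factor and the max/min formula are the ones described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the split planning loop; it does not use Hadoop classes. */
class SplitPlanner {
    // A remainder of up to 1.1 x splitSize is kept as one split.
    static final double SPLIT_SLOP = 1.1;

    /** Clamps the block size between the configured minimum and maximum. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Returns {offset, length} pairs covering one file of the given length. */
    static List<long[]> planSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        // Keep cutting full-size splits while the rest exceeds 1.1 x splitSize.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        // Whatever is left (possibly slightly larger than splitSize) is the last split.
        if (remaining != 0) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        // Example: a hypothetical 300 MB file.
        for (long[] s : planSplits(300 * mb, splitSize)) {
            System.out.println("offset " + (s[0] / mb) + "M, length " + (s[1] / mb) + "M");
        }
    }
}
```

Because of the 1.1 check, a file only slightly larger than one block (for example 140 MB against a 128 MB split size) stays a single split instead of producing a tiny second one.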

5.2 MapTask & ReduceTask source code analysis

1) MapTask source code walkthrough
context.write(k, NullWritable.get());
output.write(key, value);
collector.collect(key, value, partitioner.getPartition(key, value, partitions));

HashPartitioner(); // default partitioner
collect()
close()
collector.flush()
sortAndSpill() // sort and spill
sorter.sort()
mergeParts(); // merge the spill files; MapTask line 1527, step into it
collector.close();
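To make the collect and sortAndSpill steps in the trace above more concrete, here is a minimal sketch in plain Java. It is a toy model, not the MapTask implementation: the class MapSideSketch and its Rec type are hypothetical, and a simple list stands in for the real in-memory buffer. The partition formula (clear the sign bit of the hash, then take the modulus) is the one used by the default HashPartitioner.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of the map-side collect and spill ordering; no Hadoop classes involved. */
class MapSideSketch {

    /** One buffered record; its partition is decided when it is collected. */
    static final class Rec {
        final String key;
        final String value;
        final int partition;

        Rec(String key, String value, int partition) {
            this.key = key;
            this.value = value;
            this.partition = partition;
        }
    }

    /** Default hash partitioning: mask off the sign bit, then mod by the reducer count. */
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    private final List<Rec> buffer = new ArrayList<>();
    private final int partitions;

    MapSideSketch(int partitions) {
        this.partitions = partitions;
    }

    /** Mirrors collector.collect(key, value, partition). */
    void collect(String key, String value) {
        buffer.add(new Rec(key, value, getPartition(key, partitions)));
    }

    /** Orders buffered records by partition, then by key, and empties the buffer. */
    List<Rec> sortAndSpill() {
        List<Rec> spill = new ArrayList<>(buffer);
        spill.sort(Comparator.comparingInt((Rec r) -> r.partition).thenComparing(r -> r.key));
        buffer.clear();
        return spill;
    }

    public static void main(String[] args) {
        MapSideSketch collector = new MapSideSketch(2);
        collector.collect("b", "1");
        collector.collect("a", "1");
        collector.collect("c", "1");
        for (Rec r : collector.sortAndSpill()) {
            System.out.println("partition " + r.partition + " key " + r.key);
        }
    }
}
```

Sorting by partition first keeps each reducer's records contiguous in the spill, so a later merge can hand every reducer one sorted run.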

2) ReduceTask source code walkthrough:
if (isMapOrReduce())
initialize()
init(shuffleContext);
totalMaps = job.getNumMapTasks();
merger = createMergeManager(context);
this.inMemoryMerger = createInMemoryMerger(); // in-memory merge
rIter = shuffleConsumerPlugin.run();
eventFetcher.start();
eventFetcher.shutDown(); // fetching is finished; Shuffle line 141, set a breakpoint here in advance

copyPhase.complete();
taskStatus.setPhase(TaskStatus.Phase.SORT);
sortPhase.complete();
cleanup(context); // before reduce finishes, the Reducer's cleanup method is called one last time
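The copy, sort and reduce phases in the trace above amount to merging already-sorted map outputs and then walking the merged stream group by group. The following is a minimal sketch of that idea in plain Java, under stated assumptions: the class ReduceSideSketch and its methods are hypothetical, keys are plain strings, the "reducer" just counts occurrences, and a literal "cleanup" marker stands in for the final cleanup(context) call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model of the reduce-side merge and grouping; no Hadoop classes involved. */
class ReduceSideSketch {

    /** Cursor over one sorted map-output segment. */
    static final class Cursor {
        final List<String> keys;
        int pos = 0;

        Cursor(List<String> keys) {
            this.keys = keys;
        }

        String head() {
            return keys.get(pos);
        }
    }

    /** K-way merge of already-sorted segments into one sorted list. */
    static List<String> merge(List<List<String>> segments) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.head().compareTo(b.head()));
        for (List<String> seg : segments) {
            if (!seg.isEmpty()) {
                heap.add(new Cursor(seg));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();       // segment with the smallest current key
            merged.add(c.head());
            c.pos++;
            if (c.pos < c.keys.size()) {
                heap.add(c);              // re-insert while the segment has keys left
            }
        }
        return merged;
    }

    /** Emits "key=count" per group of equal adjacent keys, then a single cleanup marker. */
    static List<String> reduce(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sortedKeys.size()) {
            String key = sortedKeys.get(i);
            int count = 0;
            while (i < sortedKeys.size() && sortedKeys.get(i).equals(key)) {
                count++;
                i++;
            }
            out.add(key + "=" + count);
        }
        out.add("cleanup");               // runs once, after the last group
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("a", "c"),
                Arrays.asList("a", "b"));
        System.out.println(reduce(merge(segments)));
    }
}
```

Grouping only works because the merged stream is sorted: equal keys are adjacent, so one pass is enough and the cleanup step can run exactly once at the end.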

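Design approach

Summary:

This post mainly records some of my own takeaways. It has many shortcomings, and corrections are welcome.

References and recommended reading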

深度开源: link
