MapReduce in Hadoop (1)

References:
Difference between hadoop block Size and Input Splits in hadoop and why two parameter are there?
A Very Brief Introduction to MapReduce


The first reference discusses the file-splitting process in detail.

  1. If no input split size is specified and the start and end positions of every record fall within a single block, the HDFS block size becomes the split size; for example, a file occupying 10 blocks gets 10 mappers, each loading one block.
  2. If the start and end positions of a record are not in the same block, this is exactly the problem input splits solve: an input split supplies the start and end offsets of the records, ensuring each split hands complete records as key/value pairs to its mapper; the mapper then loads the block data according to those offsets.
  3. If splitting is disabled (e.g. isSplitable() returns false), the whole file forms one input split and is processed by a single map task, which takes much longer when the file is big.
  4. If your resources are limited and you want to cap the number of maps, you can set the split size to, say, 256 MB; logical groupings of 256 MB are then formed and only 5 map tasks run (see the sketch after this list).
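
As an illustration of item 4, here is a minimal sketch assuming 128 MB HDFS blocks. FileInputFormat.setMinInputSplitSize and the mapreduce.input.fileinputformat.split.minsize key are standard Hadoop MapReduce APIs; the class name SplitSizeDemo and the command-line input path are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");

        // Raise the minimum split size to 256 MB so that consecutive 128 MB
        // blocks are grouped into one logical split: a 10-block (1.25 GB)
        // file is then covered by 5 map tasks instead of 10.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // Equivalent configuration key:
        // conf.setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0])); // input path (placeholder)
        // ... set mapper/reducer classes and the output path as usual, then submit.
    }
}
```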

In short, you could process the whole file with a single map task, but that is very inefficient; you can also set the split size yourself to control the number of maps.
If Hadoop decides the number of maps on its own, there are two cases:

  1. The file fits entirely within one block: the split size is simply the block size.
  2. The file is too large for one block and spans several blocks: it is split according to the file's actual size, with record start/end offsets keeping each record whole (the sizing rule is sketched below).
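
How Hadoop arrives at the split size can be captured in one formula: splitSize = max(minSize, min(maxSize, blockSize)). Below is a minimal sketch of that rule; the formula matches the one in Hadoop's FileInputFormat, while the demo class around it is hypothetical.

```java
public class SplitSizeFormula {
    // The sizing rule used by Hadoop's FileInputFormat:
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Defaults (minSize = 1 byte, maxSize = Long.MAX_VALUE):
        // the split size equals the block size, one mapper per block.
        System.out.println(computeSplitSize(128 * mb, 1, Long.MAX_VALUE) / mb); // 128
        // minSize raised to 256 MB: splits grow beyond the block size,
        // halving the number of map tasks for the same file.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb); // 256
    }
}
```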