MapReduce in Hadoop (1)

References:
Difference between hadoop block Size and Input Splits in hadoop and why two parameter are there?
A Very Brief Introduction to MapReduce


The first reference discusses the file-splitting process in detail.

  1. If no input split size is specified and the start and end positions of every record fall within a single block, the HDFS block size becomes the split size; for example, a file occupying 10 blocks gets 10 mappers, each loading one block.
  2. If the start and end positions of a record are not in the same block, this is exactly the problem input splits solve: an input split supplies the start and end offsets of the records, ensuring each split hands complete records as key/value pairs to its mapper; the mapper then loads the block data according to those offsets.
  3. If splitting is disabled (e.g. isSplitable() returns false), the whole file forms one input split and is processed by a single map task, which takes much longer when the file is big.
  4. If your resources are limited and you want to cap the number of maps, you can set the split size to, say, 256 MB; logical groupings of 256 MB are then formed and only 5 map tasks run (see the sketch after this list).
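
As an illustration of item 4, here is a minimal sketch assuming 128 MB HDFS blocks. FileInputFormat.setMinInputSplitSize and the mapreduce.input.fileinputformat.split.minsize key are standard Hadoop MapReduce APIs; the class name SplitSizeDemo and the command-line input path are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");

        // Raise the minimum split size to 256 MB so that consecutive 128 MB
        // blocks are grouped into one logical split: a 10-block (1.25 GB)
        // file is then covered by 5 map tasks instead of 10.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // Equivalent configuration key:
        // conf.setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0])); // input path (placeholder)
        // ... set mapper/reducer classes and the output path as usual, then submit.
    }
}
```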

In short, you could process the whole file with a single map task, but that is very inefficient; you can also set the split size yourself to control the number of maps.
If Hadoop decides the number of maps on its own, there are two cases:

  1. The file fits entirely within one block: the split size is simply the block size.
  2. The file is too large for one block and spans several blocks: it is split according to the file's actual size, with record start/end offsets keeping each record whole (the sizing rule is sketched below).
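
How Hadoop arrives at the split size can be captured in one formula: splitSize = max(minSize, min(maxSize, blockSize)). Below is a minimal sketch of that rule; the formula matches the one in Hadoop's FileInputFormat, while the demo class around it is hypothetical.

```java
public class SplitSizeFormula {
    // The sizing rule used by Hadoop's FileInputFormat:
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Defaults (minSize = 1 byte, maxSize = Long.MAX_VALUE):
        // the split size equals the block size, one mapper per block.
        System.out.println(computeSplitSize(128 * mb, 1, Long.MAX_VALUE) / mb); // 128
        // minSize raised to 256 MB: splits grow beyond the block size,
        // halving the number of map tasks for the same file.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb); // 256
    }
}
```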