InputFormat Splits

旧时光中的旅人

Category: hadoop. Tags: hadoop

Article link: https://blog.csdn.net/weixin_46266718/article/details/107391930

**How are splits computed when data is processed?**

input -> InputFormat -> map -> shuffle -> reduce -> OutputFormat -> local file
TextInputFormat is used by default.

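getSplits(): the method that computes the splits.
isSplitable(job, path): checks whether a file can be split. It looks up the compression codec from the file path; a compressed file is splittable only if its codec supports splitting, and an uncompressed file always returns true.
file.getBlockSize(): returns the block size of the file.
Math.max(minSize, Math.min(maxSize, blockSize)): computes the split size.
To make the split size smaller than the block size, lower maxSize below the block size.
To make the split size larger than the block size, raise minSize above the block size.
With the default settings the split size equals the block size, which is 128 MB by default on HDFS in Hadoop 2.x and later.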
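
The effect of minSize and maxSize is easier to see with numbers. The following is a minimal sketch that only reproduces the formula quoted above in plain Java; the class name `SplitSizeDemo` and the helper `computeSplitSize` are made up for illustration, and the default values used (minSize = 1, maxSize = Long.MAX_VALUE) are an assumption about a job with no split-size settings.

```java
// Illustrative only: reproduces the split-size formula, not Hadoop's own class.
class SplitSizeDemo {
    static final long MB = 1024L * 1024L;

    // Clamp the block size between minSize and maxSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;

        // Assumed defaults: the split size equals the block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE) / MB); // 128

        // Lowering maxSize to 64 MB makes the split smaller than a block.
        System.out.println(computeSplitSize(blockSize, 1L, 64 * MB) / MB); // 64

        // Raising minSize to 256 MB makes the split larger than a block.
        System.out.println(computeSplitSize(blockSize, 256 * MB, Long.MAX_VALUE) / MB); // 256
    }
}
```

In a real job these two bounds are normally set through `FileInputFormat.setMinInputSplitSize` and `FileInputFormat.setMaxInputSplitSize` rather than computed by hand.
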
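// Fragment of FileInputFormat.getSplits(): carve the file into splits of splitSize bytes.
long bytesRemaining = length;
// SPLIT_SLOP is 1.1: keep cutting only while more than 1.1 x splitSize is left,
// so a tail of up to 10% over splitSize is not cut into a tiny extra split.
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
  // Find the block that contains the start offset of this split.
  int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
  // The split starts at (length - bytesRemaining) and is splitSize bytes long.
  splits.add(makeSplit(path, length-bytesRemaining, splitSize,
      blkLocations[blkIndex].getHosts(),
      blkLocations[blkIndex].getCachedHosts()));
  bytesRemaining -= splitSize;
}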
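
To see what the 1.1 slop factor does in practice, the loop can be simulated without a cluster. This is a sketch in plain Java under some assumptions: `SplitLoopDemo` and `planSplits` are hypothetical names, splits are represented as simple {offset, length} pairs instead of Hadoop split objects, and the final `if` that emits the leftover bytes is how FileInputFormat is understood to handle the tail after the loop shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative simulation of the split loop; no Hadoop classes are used.
class SplitLoopDemo {
    static final double SPLIT_SLOP = 1.1; // 10% slop, as in the loop above

    // Returns one {offset, length} pair per simulated split.
    static List<long[]> planSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Whatever is left (at most 1.1 x splitSize) becomes the last split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 129 MB / 128 MB is below 1.1, so the file stays in one split.
        System.out.println(planSplits(129 * mb, 128 * mb).size()); // 1
        // 300 MB yields 128 MB + 128 MB + 44 MB.
        System.out.println(planSplits(300 * mb, 128 * mb).size()); // 3
    }
}
```
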

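#### InputFormat types bundled with Hadoop
InputFormat: converts file contents into key/value pairs.
TextInputFormat is used by default.
// The framework instantiates the job's configured InputFormat class by reflection:
InputFormat<?, ?> input =
  ReflectionUtils.newInstance(job.getInputFormatClass(), conf);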

TextInputFormat: TextInputFormat extends FileInputFormat<LongWritable, Text>
TextInputFormat does not override getSplits(); it uses the split logic inherited from FileInputFormat.
Record reader: LineRecordReader, producing <LongWritable, Text> records.
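
The <LongWritable, Text> pair here is the starting byte offset of a line and the text of that line. A rough plain-Java sketch of that record shape follows; `OffsetKeyDemo` and `readRecords` are hypothetical names, the input is assumed to be UTF-8 with `\n` line endings, and none of the real reader's handling of split boundaries is modelled.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: mimics the (byte offset, line) records of a line-based reader.
class OffsetKeyDemo {
    // Maps each line's starting byte offset to the line's text.
    static Map<Long, String> readRecords(String content) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.put(offset, line);
            // Advance by the line's byte length plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(readRecords("hello\nhadoop\nsplit"));
        // prints {0=hello, 6=hadoop, 13=split}
    }
}
```
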

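NLineInputFormat: NLineInputFormat extends FileInputFormat<LongWritable, Text>
NLineInputFormat overrides getSplits() and splits by line count: each split holds N lines of input (N is 1 by default).
Record reader: LineRecordReader, producing <LongWritable, Text> records.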
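
Splitting by line count instead of by bytes can be pictured as chunking a list of lines. The sketch below is plain Java with hypothetical names (`NLineSplitDemo`, `splitByLines`); it works on lines already held in memory, whereas the real input format computes byte ranges over files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: groups lines into chunks of N, one chunk per "split".
class NLineSplitDemo {
    static List<List<String>> splitByLines(List<String> lines, int linesPerSplit) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSplit) {
            int end = Math.min(i + linesPerSplit, lines.size());
            splits.add(new ArrayList<>(lines.subList(i, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e");
        // Five lines with N = 2 give three splits; the last one is short.
        System.out.println(splitByLines(lines, 2)); // [[a, b], [c, d], [e]]
    }
}
```

In an actual job, N is typically set with `NLineInputFormat.setNumLinesPerSplit(job, n)`.
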

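CombineFileInputFormat: CombineFileInputFormat<K, V> extends FileInputFormat<K, V>
CombineTextInputFormat extends CombineFileInputFormat<LongWritable, Text>
CombineFileInputFormat overrides getSplits() so that several files can be packed into one split, with a configurable maximum split size.
Record reader (for CombineTextInputFormat): CombineFileRecordReader, producing <LongWritable, Text> records.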
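
The idea of packing many small files into fewer splits can be sketched with a simple greedy grouping. This is deliberately not Hadoop's algorithm, which also takes node and rack locality into account; `CombineDemo` and `combine` are hypothetical names, and files are represented only by their sizes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a simplified greedy packing, not CombineFileInputFormat's logic.
class CombineDemo {
    // Groups file sizes in order so that each group stays within maxSplitSize
    // (a single file larger than maxSplitSize gets a group of its own).
    static List<List<Long>> combine(List<Long> fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            // Start a new group when adding this file would exceed the limit.
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Five small files (sizes in MB) with a 6 MB limit end up in three splits.
        System.out.println(combine(Arrays.asList(1L, 2L, 3L, 4L, 5L), 6L));
        // prints [[1, 2, 3], [4], [5]]
    }
}
```

Without combining, each of the five files would produce at least one split and therefore one map task of its own.
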

KeyValueTextInputFormat: KeyValueTextInputFormat extends FileInputFormat<Text, Text>
KeyValueTextInputFormat does not override getSplits(); it uses the split logic inherited from FileInputFormat.
Record reader: KeyValueLineRecordReader, producing <Text, Text> records.
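
With this format each line is cut at the first separator (a tab unless configured otherwise): the part before it becomes the key and the rest becomes the value. A minimal plain-Java sketch of that rule follows; `KeyValueLineDemo` and `splitLine` are hypothetical names, and the behaviour for a line without a separator (whole line as key, empty value) reflects how the record reader is understood to work.

```java
// Illustrative only: splits a line into key and value at the first separator.
class KeyValueLineDemo {
    // Returns {key, value}; a line without the separator becomes {line, ""}.
    static String[] splitLine(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        String[] kv = splitLine("name\tAlice\tBob", '\t');
        // Only the first tab separates key and value; later tabs stay in the value.
        System.out.println(kv[0]); // name
        System.out.println(kv[1].equals("Alice\tBob")); // true
    }
}
```
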

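SequenceFileInputFormat: SequenceFileInputFormat<K, V> extends FileInputFormat<K, V>
SequenceFileInputFormat does not override getSplits(); it uses the split logic inherited from FileInputFormat.
Record reader: SequenceFileRecordReader<K, V>. The key and value types are whatever types are stored in the sequence file, not necessarily Text.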
